In this iPython notebook, I will go over the data cleaning and exploration process to prepare the Olympics Medal Count dataset for visualization with d3
There are a number of steps that are involved in creating a final visualization. I first read in the data, do some preliminary data cleaning, and then go back and forth between transforming and visualizing until I reach a final product. As one gets closer and closer to the final product, the types of tools used will be less agile and allow for additional detail fine-tuning.
%cd ~/Google\ Drive/Datasets/
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import image
import brewer2mpl
# Set up some better defaults for matplotlib
from matplotlib import rcParams
#colorbrewer2 Dark2 qualitative color table
dark2_colors = brewer2mpl.get_map('Dark2', 'Qualitative', 7).mpl_colors
rcParams['figure.figsize'] = (14, 10)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'
def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
"""
Minimize chartjunk by stripping out unnecesasry plot borders and axis ticks
The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
"""
ax = axes or plt.gca()
ax.spines['top'].set_visible(top)
ax.spines['right'].set_visible(right)
ax.spines['left'].set_visible(left)
ax.spines['bottom'].set_visible(bottom)
#turn off all ticks
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('none')
#now re-enable visibles
if top:
ax.xaxis.tick_top()
if bottom:
ax.xaxis.tick_bottom()
if left:
ax.yaxis.tick_left()
if right:
ax.yaxis.tick_right()
%matplotlib inline
olympic_medalists_file = 'Summer Olympic medallists 1896 to 2008 - ALL MEDALISTS.csv'
df = pd.read_csv(olympic_medalists_file, sep=',')
df.rename(columns={'Edition': 'Year'}, inplace=True)
df.head()
This is a rich data set. The first thing that jumps out at me, though, is the gender column. Let's take a look at the gender data and create some visualizations before formulating the question we are hoping to answer in our final visualization.
gender_counts = df['Gender'].value_counts()
gender_counts
df['Event_gender'].value_counts()
It seems that there are some differences across the two gender columns. The 'Gender' column appears to be the most complete of the two, so we will proceed with that one.
gender_group = df.groupby('Gender')
gender_group.size()
gender_counts_df = pd.DataFrame(gender_counts, columns=['Number of Medals'])
rcParams['figure.figsize'] = (14, 10)
fig = gender_counts_df.plot(kind='bar')
remove_border()
Interesting! It seems that there is a significant differenc in the number of medals available to men vs. women. I think that it would be interesting take a look at this difference over time, and see how the gender breakdown has evolved over the Olympics' lifespan.
year_and_gender = df.groupby(['Year', 'Gender'])
year_and_gender_df = pd.DataFrame(year_and_gender.size(), columns=['Medal Count'])
year_and_gender_df['Year'] = year_and_gender_df.index.get_level_values('Year')
year_and_gender_df['Gender'] = year_and_gender_df.index.get_level_values('Gender')
year_and_gender_df = year_and_gender_df.reset_index(drop=True)
year_and_gender_df.head()
year_and_gender_men_df = year_and_gender_df.ix[year_and_gender_df['Gender'] == 'Men',]
year_and_gender_women_df = year_and_gender_df.ix[year_and_gender_df['Gender'] == 'Women',]
years = year_and_gender_df['Year']
x = np.arange(min(years), max(years), step=4)
y_men = year_and_gender_men_df['Medal Count']
y_women = year_and_gender_women_df['Medal Count']
width=1.5
fig, ax = plt.subplots()
rects1 = ax.bar(year_and_gender_men_df.Year, year_and_gender_men_df['Medal Count'], width=width)
rects2 = ax.bar(year_and_gender_women_df.Year, year_and_gender_women_df['Medal Count'], color='y', width=width)
ax.set_ylabel('Number of Medals')
ax.set_xlabel('Year')
ax.set
ax.legend((rects1[0], rects2[0]), ('Men', 'Women'))
remove_border()
plt.show()
This visualization looks great. We can already see an interesting trend, as well as some interesting points on our graph. We can see the period of time when the Olympics did not happen due to the World Wars, as well as a spike in the medal count during 1920. I make a mental note to do some more research on these topics for our final visualization, and will now write our dataset to disk, which we will next be reading into d3 for our final visualization.
year_and_gender_df.to_csv('Year_and_Gender_Olympics_Dataset_All_Years.csv', index=False)