The Tour de France is undoubtedly the most prestigious of the many cycling races held around the world. I was blissfully unaware of the magnitude and scale of this event; in fact, I knew little about it beyond the Lance Armstrong doping scandal and the occasional post that popped up on my social media feed. I was surprised to find out that the race is over 100 years old! The first race was held in 1903 to boost the sales of a newspaper in France. What started off as a marketing gimmick grew into one of the most followed sporting events around the globe.
My cousin happens to be a huge fan of the sport, and while visiting him I not only ended up watching multiple stages of the 2016 race, but our dinner-table conversations revolved around it too. The more interested I got, the more questions started popping into my head. Was it the human, the bicycle, or a combination of both? Had better equipment made cycling longer distances possible? Were people cycling faster?
Since the first race in 1903, the Tour has been held every year except during the two World Wars. A lot has changed since then: the speed, the technology, the duration and the fan base. The number of participants has gone up since the race started, and so has the percentage of riders who finish. The longest race was held in 1926, covering 5,745 km over 39 days. Except for 2003, the race distance has been under 3,600 km for the last 10 years.
Finding the data
While hunting for race statistics I came across this site ( http://www.bikeraceinfo.com/index.html ), which has some great Tour de France data. I used Python with pandas, NumPy and Matplotlib for parsing, cleaning, editing and plotting. The first step was to extract the table from the webpage. The output of read_html was a list of tables, and the required table was picked out of that list. After that, the data had to be cleaned and normalized before plotting. All plots were created using Matplotlib.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
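# read_html returns a list of DataFrames, one per table found on the page
tdf = pd.read_html('http://bikeraceinfo.com/tdf/tdfstats.html')
# the statistics table is the third entry in that list
tdf_final = tdf[2]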
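The cleaning step is not shown above, so here is a minimal sketch of what it could look like. It assumes the scraped cells arrive as strings with thousands separators or unit suffixes (something like "5,745 km"); that format, and the helper name parse_number, are my assumptions rather than what the site actually serves.

```python
import re

# Matches an optional minus sign, digits, and an optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_number(cell):
    """Return the first number found in a cell as a float, or None if there is none.

    Assumes commas are thousands separators (as in "5,745"), not decimal commas.
    """
    if cell is None:
        return None
    text = str(cell).replace(",", "")  # "5,745 km" -> "5745 km"
    match = _NUMBER.search(text)
    return float(match.group()) if match else None
```

A helper like this could then be mapped over each raw column of tdf_final (for example with Series.map) to build the numeric columns used in the plots.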
Graphs
The scatter plot below shows a logical trend: the average speed decreases as the distance covered in a race increases. The outliers at the bottom left of the graph are the races from 1903-1905, which had both shorter distances and lower speeds.

The scatter plot for average speed of the winner vs Total distance
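For reference, a scatter plot like this one can be drawn with a few lines of Matplotlib. This is a rough sketch, not the exact code behind the figure; the function name, marker size and figure size are illustrative choices of mine.

```python
import matplotlib.pyplot as plt


def plot_speed_vs_distance(distance_km, avg_speed_kmh):
    """Scatter the winner's average speed against the total race distance."""
    fig, ax = plt.subplots(figsize=(8, 5))
    # One point per race; slight transparency keeps overlapping points visible.
    ax.scatter(distance_km, avg_speed_kmh, s=20, alpha=0.7)
    ax.set_xlabel("Total distance (km)")
    ax.set_ylabel("Winner's average speed (km/h)")
    ax.set_title("Average speed of the winner vs total distance")
    return fig, ax
```

With the cleaned table, the two arguments would be the distance and speed columns, presumably tdf_final["N_Length"] and tdf_final["N_Avg Speed"] if those columns hold what their names suggest.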

Race distance by the year

Race Distance and Speed by the year
The graph shows that the average speed of the winners has been increasing, from 25.27 km/h in 1903 to 39.64 km/h in 2015. While the average speed has been increasing, the race distance has been decreasing. The last 10 years have seen race distances of under 3,600 km.
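To put the speed increase in perspective, here is a quick back-of-the-envelope calculation using only the two figures quoted above. The per-year number simply divides by the calendar span and ignores the years without a race, so treat it as a rough guide.

```python
# Winner's average speed in km/h, as quoted in the text.
speed_1903 = 25.27
speed_2015 = 39.64

gain = speed_2015 - speed_1903          # absolute increase in km/h
pct_gain = 100 * gain / speed_1903      # increase relative to 1903
per_year = gain / (2015 - 1903)         # naive average over calendar years

print(f"Gain: {gain:.2f} km/h ({pct_gain:.1f}%), about {per_year:.2f} km/h per year")
```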

Race Duration (in Days) and Number of Stages by the year

Entrants and Finishers by the year
The number of participants in the race has been increasing, from 60 in 1903 to 198 in 2015. The proportion of riders who complete the race has also increased over the years.
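The proportion of finishers is just finishers divided by entrants. A small sketch of that calculation, with a hypothetical helper name and some basic sanity checks that I added myself:

```python
def finish_rate(entrants, finishers):
    """Percentage of starters who completed the race."""
    if entrants <= 0:
        raise ValueError("entrants must be positive")
    if not 0 <= finishers <= entrants:
        raise ValueError("finishers must be between 0 and entrants")
    return 100.0 * finishers / entrants
```

On the cleaned table the same thing can be done for every year at once by dividing one column by the other, presumably N_Finished by N_Entrant.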

This should be obvious, and the scatter plot agrees: races with a longer duration cover more distance.
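One way to put a number on that relationship is the Pearson correlation coefficient, which is close to +1 when two quantities rise together. I did not compute this for the post; the function below is only a plain-Python sketch of the standard formula.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

pandas offers the same measure through Series.corr, so on the cleaned table it could be applied directly to the duration and distance columns.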
We can obtain pair plots using pairplot in seaborn:
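import seaborn as sns
sns.set_style("whitegrid")
plt.figure()
# tdf is the list returned by read_html, so select columns from the tdf_final DataFrame;
# the N_* columns come from the cleaning step (not shown)
sns.pairplot(data=tdf_final[["N_Duration","N_Length","N_Entrant","N_Finished","N_Avg Speed"]], dropna=True)
plt.savefig("C://Python//1_seaborn_pair_plot.png")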

The pair plot

3D Scatter Plot of Distance, Duration and Avg Speed of the winner