**Example: Exploring Marathon Finishing Times**

Here we’ll look at using Seaborn to help visualize and understand finishing results from a marathon. I’ve scraped the data from sources on the Web, aggregated it, removed any identifying information, and put it on GitHub, where it can be downloaded. (If you are interested in using Python for web scraping, I would recommend *Web Scraping with Python* by Ryan Mitchell.) We will start by downloading the data from the Web and loading it into Pandas:
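As a rough sketch of the loading step, here is what reading such a file looks like. The rows below are made up and stand in for the downloaded CSV, and the column names (`age`, `gender`, `split`, `final`) are assumptions for illustration only:

```python
import io

import pandas as pd

# Made-up rows standing in for the downloaded file; the column names
# (age, gender, split, final) are assumed for illustration.
csv_text = """age,gender,split,final
33,M,01:05:38,02:08:51
32,M,01:06:26,02:09:28
41,W,01:58:10,04:12:03
"""

# With the real data you would pass the file path or URL instead.
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
print(data.dtypes)  # check how each column was parsed
```

Read this way, the two time columns come back as plain strings.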

Loaded this way, the time columns come in as plain strings rather than as time values. Let’s fix this by providing a converter for the times:
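One way to write such a converter, as a sketch: parse each `HH:MM:SS` string into a standard-library `datetime.timedelta` and hand the function to `read_csv` through its `converters` argument. The sample rows and the column names `split` and `final` are assumptions, not the real data:

```python
import datetime
import io

import pandas as pd

# Made-up rows; the column names are assumed for illustration.
csv_text = """age,gender,split,final
33,M,01:05:38,02:08:51
32,M,01:06:26,02:09:28
41,W,01:58:10,04:12:03
"""

def convert_time(s):
    """Turn an 'HH:MM:SS' string into a datetime.timedelta."""
    h, m, sec = map(int, s.split(":"))
    return datetime.timedelta(hours=h, minutes=m, seconds=sec)

# Apply the converter to both time columns while reading.
data = pd.read_csv(
    io.StringIO(csv_text),
    converters={"split": convert_time, "final": convert_time},
)
print(data.dtypes)
```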

Some of the functions used here are deprecated, and ignoring the warnings as usual will not do, so I have corrected Jake’s code.

That looks much better. For the purpose of our Seaborn plotting utilities, let’s next add columns that give the times in seconds:
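A minimal sketch of that step, assuming the times are already stored as timedeltas in columns called `split` and `final` (the names, and the `_sec` suffix for the new columns, are my assumption). The `.dt.total_seconds()` accessor avoids the integer-cast trick that newer Pandas versions complain about:

```python
import pandas as pd

# Made-up times; the column names are assumed for illustration.
data = pd.DataFrame({
    "split": pd.to_timedelta(["01:05:38", "01:06:26", "01:58:10"]),
    "final": pd.to_timedelta(["02:08:51", "02:09:28", "04:12:03"]),
})

# Convert each timedelta column to a plain float number of seconds.
data["split_sec"] = data["split"].dt.total_seconds()
data["final_sec"] = data["final"].dt.total_seconds()
print(data)
```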

To get an idea of what the data looks like, we can plot a **jointplot** over the data:

The dotted line shows where someone’s time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon. If you have run competitively, you’ll know that those who do the opposite—run faster during the second half of the race—are said to have “negative-split” the race.

Let’s create another column in the data, the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race:
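A sketch of one such definition, with made-up numbers and assumed column names: the fraction is `1 - 2 * split / final`, which is zero for an even pace, positive when the second half is slower, and negative when it is faster.

```python
import pandas as pd

# Three made-up runners, all finishing in 8000 seconds.
data = pd.DataFrame({
    "split_sec": [3600.0, 4000.0, 4100.0],  # time at the halfway point
    "final_sec": [8000.0, 8000.0, 8000.0],
})

# 0 for an even pace, > 0 for a slower second half (positive split),
# < 0 for a faster second half (negative split).
data["split_frac"] = 1 - 2 * data["split_sec"] / data["final_sec"]
print(data)
```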

Where this split difference is less than zero, the person negative-split the race by that fraction. Let’s do a distribution plot of this split fraction:

Out of nearly 40,000 participants, there were only 250 people who negative-split their marathon.
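A count like this is just a boolean mask summed up. A toy illustration, with made-up values and an assumed `split_frac` column:

```python
import pandas as pd

# Five made-up split fractions; two of them are negative.
data = pd.DataFrame({"split_frac": [0.10, 0.02, -0.01, 0.25, -0.03]})

# True counts as 1, so summing the mask counts the negative splits.
n_negative = int((data["split_frac"] < 0).sum())
print(n_negative, "of", len(data), "runners negative-split")
```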

Let’s see whether there is any correlation between this split fraction and other variables. We’ll do this using a **PairGrid**, which draws plots of all these correlations:
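Roughly, this could look like the following. The data is synthetic and the column names are assumed; coloring the points by gender with `hue` is my choice here:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic runners; every column name here is an assumption.
rng = np.random.default_rng(0)
n = 300
final_sec = rng.normal(16000, 2500, n).clip(8000, None)
data = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "gender": rng.choice(["M", "W"], n),
    "split_sec": final_sec * rng.normal(0.45, 0.03, n),
    "final_sec": final_sec,
})
data["split_frac"] = 1 - 2 * data["split_sec"] / data["final_sec"]

# One scatter panel per pair of variables, colored by gender.
g = sns.PairGrid(
    data, vars=["age", "split_sec", "final_sec", "split_frac"], hue="gender"
)
g.map(plt.scatter, alpha=0.8)
g.add_legend()
```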

It looks like the split fraction does not correlate particularly with age, but does correlate with the final time: faster runners tend to have closer to even splits on their marathon time. (We see here that Seaborn is no panacea for Matplotlib’s ills when it comes to plot styles: in particular, the x-axis labels overlap. Because the output is a simple Matplotlib plot, however, the methods in Customizing Ticks [here] can be used to adjust such things if desired.)

The difference between men and women here is interesting. Let’s look at the histogram of split fractions for these two groups:

The interesting thing here is that there are many more men than women who are running close to an even split! This almost looks like some kind of bimodal distribution among the men and women. Let’s see if we can suss out what’s going on by looking at the distributions as a function of age.

A nice way to compare distributions is to use a **violin** plot:
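A bare-bones version, again on synthetic data with assumed column names, puts one violin per gender side by side:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic split fractions for two groups.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "gender": ["M"] * 400 + ["W"] * 400,
    "split_frac": np.concatenate([
        rng.normal(0.08, 0.06, 400),
        rng.normal(0.12, 0.05, 400),
    ]),
})

# One violin per category on the x axis.
ax = sns.violinplot(data=data, x="gender", y="split_frac")
```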

This is yet another way to compare the distributions between men and women.

Let’s look a little deeper, and compare these **violin** plots as a function of age. We’ll start by creating a new column in the array that specifies the decade of age that each person is in:
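A sketch of both steps, with synthetic data and an assumed column name `age_dec`: integer division by ten gives the decade, and `split=True` draws the two genders as the two halves of each violin.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic runners aged 20 to 79; column names are assumed.
rng = np.random.default_rng(0)
n = 600
data = pd.DataFrame({
    "age": rng.integers(20, 80, n),
    "gender": rng.choice(["M", "W"], n),
    "split_frac": rng.normal(0.1, 0.05, n),
})

# Decade of age: 27 -> 20, 43 -> 40, and so on.
data["age_dec"] = 10 * (data["age"] // 10)

# Split violins: one half per gender within each decade.
ax = sns.violinplot(
    data=data, x="age_dec", y="split_frac",
    hue="gender", split=True, inner="quartile",
)
```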

Looking at this, we can see where the distributions of men and women differ: the split distributions of men in their 20s to 50s show a pronounced over-density toward lower splits when compared to women of the same age (or of any age, for that matter).

Also surprisingly, the 80-year-old women seem to outperform everyone in terms of their split time. This is probably due to the fact that we’re estimating the distribution from small numbers, as there are only a handful of runners in that range:
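Checking how many runners fall in that range is a one-liner. The ages below are made up and the `age` column name is assumed:

```python
import pandas as pd

# Six made-up ages, two of them above 80.
data = pd.DataFrame({"age": [25, 34, 47, 62, 81, 86]})

# Sum the boolean mask to count the runners older than 80.
n_over_80 = int((data["age"] > 80).sum())
print(n_over_80)
```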

Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We’ll use **regplot**, which will automatically fit a linear regression to the data:
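Here is a sketch of that plot on synthetic data built so that slower finishers have larger split fractions. The column names and the dotted reference line at 0.1 are my own choices for illustration:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic runners whose split fraction grows with finishing time.
rng = np.random.default_rng(0)
n = 500
final_sec = rng.normal(16000, 2500, n).clip(8000, None)
data = pd.DataFrame({
    "gender": rng.choice(["M", "W"], n),
    "final_sec": final_sec,
    "split_frac": 0.02 + 1e-5 * (final_sec - 8000) + rng.normal(0, 0.03, n),
})

# Keep only the men, then scatter the points and fit a straight line.
men = data[data["gender"] == "M"]
ax = sns.regplot(
    data=men, x="final_sec", y="split_frac",
    scatter_kws={"s": 5, "alpha": 0.3}, line_kws={"color": "k"},
)
ax.axhline(0.1, color="k", linestyle=":")  # arbitrary reference level
```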

Apparently the people with fast splits are the elite runners who are finishing within ~15,000 seconds, or about 4 hours. People slower than that are much less likely to have a fast second split.
