Example: Birthrate Data
As a more interesting example, let’s take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC). This data can be found here (this dataset has been analyzed rather extensively by Andrew Gelman and his group; see, for example, this blog post):
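As a minimal sketch of the loading step, the snippet below reads a tiny CSV from an in-memory string. The rows are made up for illustration, and the column names (year, month, day, gender, births) are assumptions inferred from the discussion that follows; with the real file you would pass its path or URL to pd.read_csv instead.

```python
import io
import pandas as pd

# Made-up sample rows, for illustration only; the real file is far larger.
# The column names are assumed from the surrounding text.
sample_csv = io.StringIO(
    "year,month,day,gender,births\n"
    "1969,1,1,F,4000\n"
    "1969,1,1,M,4400\n"
    "1969,1,2,F,4500\n"
    "1969,1,2,M,4600\n"
)

births = pd.read_csv(sample_csv)
print(births.head())
```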
We can start to understand this data a bit more by using a pivot table. Let’s add a decade column, and take a look at male and female births as a function of decade:
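The decade pivot described above can be sketched as follows. The DataFrame here is a small made-up stand-in for the CDC data, with assumed column names year, gender, and births.

```python
import pandas as pd

# Made-up stand-in data; column names are assumptions.
births = pd.DataFrame({
    "year":   [1969, 1969, 1975, 1975, 1983, 1983],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "births": [100, 105, 110, 116, 120, 126],
})

# Integer-divide the year by 10 to drop the last digit, then scale back up.
births["decade"] = 10 * (births["year"] // 10)

# Total births for each (decade, gender) pair.
by_decade = births.pivot_table(
    values="births", index="decade", columns="gender", aggfunc="sum"
)
print(by_decade)
```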
We immediately see that male births outnumber female births in every decade. To see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to visualize the total number of births by year (see Introduction to Matplotlib for a discussion of plotting with Matplotlib [coming soon]):
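A sketch of the per-year table behind such a plot, again on made-up data with assumed column names. The plotting call itself is left out so the snippet runs without Matplotlib.

```python
import pandas as pd

# Made-up stand-in data; column names are assumptions.
births = pd.DataFrame({
    "year":   [1969, 1969, 1969, 1969, 1970, 1970],
    "gender": ["F", "F", "M", "M", "F", "M"],
    "births": [100, 102, 105, 107, 110, 116],
})

# One row per year, one column per gender, summed over all days.
by_year = births.pivot_table(
    values="births", index="year", columns="gender", aggfunc="sum"
)

# Ratio of male to female births in each year.
ratio = by_year["M"] / by_year["F"]
print(by_year)
print(ratio)
```

With Matplotlib installed, calling by_year.plot() on this table draws one line per gender.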
With a simple pivot table and the plot() method, we can immediately see the annual trend in births by gender. By eye, it appears that over the past 50 years male births have outnumbered female births by around 5%.
Further Data Exploration
Though this doesn’t necessarily relate to the pivot table, there are a few more interesting features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th). One easy way to remove these all at once is to cut outliers; we’ll do this via a robust sigma-clipping operation:
The last step of this operation yields a robust estimate of the sample standard deviation: 0.74 times the interquartile range, where the 0.74 comes from the interquartile range of a Gaussian distribution. (You can learn more about sigma-clipping operations in a book I coauthored with Željko Ivezić, Andrew J. Connolly, and Alexander Gray: “Statistics, Data Mining, and Machine Learning in Astronomy” (Princeton University Press, 2014).)
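The robust estimates described above can be sketched as follows, on made-up numbers that include two obvious outliers. The variable names mu and sig are illustrative. For a Gaussian, the interquartile range is about 1.349 standard deviations, so multiplying it by 1/1.349 ≈ 0.74 recovers the standard deviation.

```python
import numpy as np
import pandas as pd

# Made-up daily counts with two outliers (1 and 99999).
births = pd.DataFrame(
    {"births": [4000, 4100, 4200, 4300, 4400, 4500, 4600, 1, 99999]}
)

# 25th, 50th, and 75th percentiles of the counts.
quartiles = np.percentile(births["births"], [25, 50, 75])

mu = quartiles[1]                            # median: robust center
sig = 0.74 * (quartiles[2] - quartiles[0])   # robust standard deviation
print(mu, sig)
```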
With this we can use the query() method (discussed further in High-Performance Pandas: eval() and query() [coming soon]) to filter out rows with births outside these values:
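A sketch of such a filter, using made-up counts and assumed values for the robust center mu and spread sig. The five-sigma threshold is an illustrative choice. Inside the query string, the @ prefix refers to local Python variables.

```python
import pandas as pd

# Made-up daily counts with two outliers (1 and 99999).
births = pd.DataFrame(
    {"births": [4000, 4100, 4200, 4300, 4400, 4500, 4600, 1, 99999]}
)

# Assumed robust estimates of center and spread for these counts.
mu, sig = 4300.0, 296.0

# Keep only rows within five robust standard deviations of the center.
clipped = births.query("(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)")
print(clipped)
```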
Next we set the day column to integers; previously it had been a string because some columns in the dataset contained a non-numeric value:
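A minimal sketch of the conversion, assuming the non-numeric rows have already been removed so that every remaining day value parses as an integer. The data is made up.

```python
import pandas as pd

# Made-up rows in which the day column was read as strings.
births = pd.DataFrame({"day": ["1", "2", "15"], "births": [4000, 4400, 4500]})

# Convert the column in place; this raises if any value is not an integer string.
births["day"] = births["day"].astype(int)
print(births.dtypes)
```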
Finally, we can combine the day, month, and year to create a Date index (see Working with Time Series [coming soon]). This allows us to quickly compute the weekday corresponding to each row:
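One way to build such an index is sketched below: pd.to_datetime can assemble dates from a DataFrame whose columns are named year, month, and day. This is one option among several and may differ from the original code; the data is made up.

```python
import pandas as pd

# Made-up rows; column names are assumptions.
births = pd.DataFrame({
    "year":   [1969, 1969, 1970],
    "month":  [1, 1, 7],
    "day":    [1, 2, 4],
    "births": [4000, 4400, 3800],
})

# Assemble one timestamp per row from the year/month/day columns.
dates = pd.to_datetime(births[["year", "month", "day"]])
births.index = pd.DatetimeIndex(dates)

# Monday is 0 and Sunday is 6.
births["dayofweek"] = births.index.dayofweek
print(births)
```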
Using this we can plot births by weekday for several decades:
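The table behind such a plot can be sketched as a pivot on weekday and decade. The data below is made up (two Saturdays and two Mondays), and the plotting call is left out so the snippet runs without Matplotlib.

```python
import pandas as pd

# Made-up counts: two Saturdays and two Mondays in different decades.
dates = pd.to_datetime(["1969-01-04", "1969-01-06", "1975-03-01", "1975-03-03"])
births = pd.DataFrame({"births": [4000, 4800, 3900, 4700]}, index=dates)

births["decade"] = 10 * (births.index.year // 10)
births["dayofweek"] = births.index.dayofweek  # Monday is 0, Sunday is 6

# Mean births for each weekday, one column per decade.
by_weekday = births.pivot_table(
    values="births", index="dayofweek", columns="decade", aggfunc="mean"
)
print(by_weekday)
```

With Matplotlib installed, by_weekday.plot() draws one line per decade.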
Note: the plt.ylabel('mean births by day') statement was omitted.
Apparently births are slightly less common on weekends than on weekdays! Note that the 1990s and 2000s are missing because the CDC data contains only the month of birth starting in 1989.
Another interesting view is to plot the mean number of births by the day of the year. Let’s first group the data by month and day separately:
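A sketch of this grouping on made-up data follows. It uses groupby with a mean aggregate, which serves the same purpose here as a pivot table. To make the result easy to plot, the (month, day) pairs are then mapped onto dates in a dummy year; the choice of 2012 is an assumption, picked because it is a leap year and so February 29 is a valid date.

```python
import pandas as pd

# Made-up counts indexed by date.
dates = pd.to_datetime(
    ["1969-07-04", "1970-07-04", "1969-07-05", "1970-07-05", "1972-02-29"]
)
births = pd.DataFrame({"births": [3800, 4000, 4600, 4800, 4500]}, index=dates)

# Mean births for each (month, day) pair, averaged across years.
by_date = births.groupby([births.index.month, births.index.day])["births"].mean()

# Replace the (month, day) pairs with dates in a dummy leap year.
by_date.index = [
    pd.Timestamp(year=2012, month=int(m), day=int(d)) for m, d in by_date.index
]
print(by_date)
```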
Focusing on the month and day only, we now have a time series reflecting the average number of births by date of the year. From this, we can use the plot method to plot the data. It reveals some interesting trends:
In particular, the striking feature of this graph is the dip in birthrate on US holidays (e.g., Independence Day, Labor Day, Thanksgiving, Christmas, New Year’s Day), although this likely reflects trends in scheduled/induced births rather than some deep psychosomatic effect on natural births. For more discussion of this trend, see the analysis and links in [same link as above]. We’ll return to this figure in Example: Effect of Holidays on US Births [coming soon], where we will use Matplotlib’s tools to annotate this plot.
Looking at this short example, you can see that many of the Python and Pandas tools we’ve seen to this point can be combined and used to gain insight from a variety of datasets. We will see some more sophisticated applications of these data manipulations in future sections!