An example of using Pandas on a topic Jake knows a great deal about 🚀. Since it is long, I am splitting it across several posts.

An essential piece of analysis of large data is efficient summarization: computing aggregations like `sum()`, `mean()`, `median()`, `min()`, and `max()`, in which a single number gives insight into the nature of a potentially large dataset. In this section, we’ll explore aggregations in Pandas, from simple operations akin to what we’ve seen on NumPy arrays, to more sophisticated operations based on the concept of a groupby.

For convenience, we’ll use the same display magic function that we’ve seen in previous sections (I won’t copy it again, it is always the same).

**Planets data**

Here we will use the Planets dataset, available via the Seaborn package (see Visualization With Seaborn [coming soon]). It gives information on planets that astronomers have discovered around other stars (known as extrasolar planets, or exoplanets for short). It can be downloaded with a simple Seaborn command:

This has some details on the 1,000+ extrasolar planets discovered up to 2014.

**Simple aggregation in Pandas**

Earlier, we explored some of the data aggregations available for NumPy arrays (“Aggregations: Min, Max, and Everything In Between” [here]). As with a one-dimensional NumPy array, for a Pandas `Series` the aggregates return a single value:
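A minimal sketch of this, assuming a small random `Series` (the seed, the size, and the names `rng` and `ser` are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (seed value is arbitrary)
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))

total = ser.sum()      # one scalar for the whole Series
average = ser.mean()   # likewise a single scalar
print(total, average)
```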

For a `DataFrame`, by default the aggregates return results within each column:
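As a sketch, assuming a two-column frame of random numbers (the column names `A` and `B` are made up for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})

# With no axis argument, mean() is computed down each column,
# giving a Series indexed by column name
col_means = df.mean()
print(col_means)
```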

By specifying the axis argument, you can instead aggregate within each row:
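For instance, with the same kind of made-up two-column frame (names `A` and `B` are assumptions), passing `axis='columns'` gives one value per row:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})

# axis='columns' aggregates across the columns, i.e. within each row
row_means = df.mean(axis='columns')
print(row_means)
```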

Pandas `Series` and `DataFrame`s include all of the common aggregates mentioned in Aggregations: Min, Max, and Everything In Between [same link as above]; in addition, there is a convenience method `describe()` that computes several common aggregates for each column and returns the result. Let’s use this on the Planets data, for now dropping rows with missing values:
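To show the shape of the call without downloading anything, here is a sketch on a tiny hypothetical stand-in for the Planets table. The values are invented for illustration and only a few columns are included; with the real data the call would be the same `dropna().describe()` chain.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Planets data: made-up values, not real measurements
planets = pd.DataFrame({
    'method': ['Radial Velocity', 'Transit', 'Transit', 'Imaging'],
    'orbital_period': [10.0, 3.5, np.nan, 400.0],
    'mass': [1.2, np.nan, 0.8, 5.0],
    'year': [2006, 2010, 2012, 2008],
})

# Drop rows with any missing value, then summarize each numeric column
summary = planets.dropna().describe()
print(summary)
```

Only the numeric columns appear in the summary, and the `count` row reflects the rows that survived `dropna()`.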

This can be a useful way to begin understanding the overall properties of a dataset. For example, we see in the `year` column that although exoplanets were discovered as far back as 1989, half of all known exoplanets were not discovered until 2010 or after. This is largely thanks to the Kepler mission, a space-based telescope specifically designed for finding eclipsing planets around other stars.

The following table summarizes some other built-in Pandas aggregations:

| Aggregation | Description |
|---|---|
| `count()` | Total number of items |
| `first()`, `last()` | First and last item |
| `mean()`, `median()` | Mean and median |
| `min()`, `max()` | Minimum and maximum |
| `std()`, `var()` | Standard deviation and variance |
| `mad()` | Mean absolute deviation (removed in pandas 2.0) |
| `prod()` | Product of all items |
| `sum()` | Sum of all items |

These are all methods of `DataFrame` and `Series` objects.

To go deeper into the data, however, simple aggregates are often not enough. The next level of data summarization is the groupby operation, which allows you to quickly and efficiently compute aggregates on subsets of data.

**Grouping, GroupBy: `split`, `apply`, `combine`**

Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby operation. The name “group by” comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: `split`, `apply`, `combine`.

A canonical example of this split-apply-combine operation, where the “`apply`” is a summation aggregation, is illustrated in this figure:

This makes clear what the groupby accomplishes:

- The `split` step involves breaking up and grouping a `DataFrame` depending on the value of the specified key.

- The `apply` step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.

- The `combine` step merges the results of these operations into an output array.
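The three steps can be spelled out by hand. This is only a sketch on a small made-up frame (the `key`/`data` columns and their values are assumptions), and it is deliberately the slow, explicit way of doing it:

```python
import pandas as pd

# Small example frame; column names and values are assumed for illustration
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# split: select the rows belonging to each distinct key with a boolean mask
# apply: sum the 'data' values within each piece
pieces = {k: df.loc[df['key'] == k, 'data'].sum()
          for k in df['key'].unique()}

# combine: gather the per-group results into a single Series
manual = pd.Series(pieces, name='data')
manual.index.name = 'key'
print(manual)
```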

While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that the intermediate splits do not need to be explicitly instantiated. Rather, the GroupBy can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way. The power of the GroupBy is that it abstracts away these steps: the user need not think about how the computation is done under the hood, but rather thinks about the operation as a whole.

As a concrete example, let’s take a look at using Pandas for the computation shown in this diagram. We’ll start by creating the input `DataFrame`:
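A sketch of such an input, assuming a `key` column with three repeated labels and a `data` column of small integers (the exact values are my assumption, since the diagram is not reproduced here):

```python
import pandas as pd

# Six rows, three distinct keys; values are assumed for illustration
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)},
                  columns=['key', 'data'])
print(df)
```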

The most basic split-apply-combine operation can be computed with the `groupby()` method of `DataFrame`s, passing the name of the desired key column:
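In code this might look as follows, assuming a frame with a `key` column as the grouping key (the frame itself is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Nothing is aggregated yet: groupby() only records how to split the rows
grouped = df.groupby('key')
print(type(grouped).__name__)
```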

Notice that what is returned is not a set of `DataFrame`s, but a `DataFrameGroupBy` object. This object is where the magic is: you can think of it as a special view of the `DataFrame`, which is poised to dig into the groups but does no actual computation until the aggregation is applied. This “lazy evaluation” approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.

To produce a result, we can apply an aggregate to this `DataFrameGroupBy` object, which will perform the appropriate apply/combine steps to produce the desired result:
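A sketch with the same kind of made-up `key`/`data` frame (column names and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# The aggregate triggers the actual computation: one row per key
result = df.groupby('key').sum()
print(result)
```

The result is a new `DataFrame` indexed by the distinct values of `key`.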

The `sum()` method is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid `DataFrame` operation, as we will see in the following discussion.
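As a small taste, here is a sketch that swaps in other aggregates on a made-up `key`/`data` frame. Selecting the `data` column and passing a list of names to `agg()` are my choices for illustration:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Any of the usual aggregates can replace sum()
means = df.groupby('key')['data'].mean()

# agg() accepts several aggregate names at once, one output column per name
spread = df.groupby('key')['data'].agg(['min', 'max'])
print(means)
print(spread)
```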

To be continued 😉
