I need to restore the processing from the previous post; done, and I won’t repeat it here. (The interactivity of the REPL is handy, but sometimes, like now…).
The GroupBy object is a very flexible abstraction. In many ways, you can simply treat it as if it’s a collection of DataFrames, and it does the difficult things under the hood. Let’s see some examples using the Planets data.
Perhaps the most important operations made available by a GroupBy are aggregate, filter, transform, and apply. We’ll discuss each of these more fully in “Aggregate, Filter, Transform, Apply” [previous post], but before that let’s introduce some of the other functionality that can be used with the basic GroupBy operation.
column indexing
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a modified GroupBy object. For example:
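A minimal sketch of column indexing on a GroupBy. The small `planets` frame below is a made-up stand-in with assumed column names (`method`, `orbital_period`, `year`), not the real Planets data:

```python
import pandas as pd

# Made-up stand-in for the Planets data (values are illustrative only)
planets = pd.DataFrame({
    'method': ['Radial Velocity', 'Radial Velocity', 'Transit', 'Transit', 'Imaging'],
    'orbital_period': [269.3, 874.8, 1.5, 3.2, 730000.0],
    'year': [2006, 2008, 2011, 2012, 2010],
})

grouped = planets.groupby('method')      # groups of DataFrame rows
selected = grouped['orbital_period']     # groups of a single column

print(type(grouped).__name__)
print(type(selected).__name__)
```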
Here we’ve selected a particular Series group from the original DataFrame group by reference to its column name. As with the GroupBy object, no computation is done until we call some aggregate on the object:
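As a sketch, calling an aggregate such as `median()` on the selected column is what triggers the computation. The `planets` frame is again a made-up stand-in, so the numbers are not those of the real dataset:

```python
import pandas as pd

# Made-up stand-in for the Planets data (values are illustrative only)
planets = pd.DataFrame({
    'method': ['Radial Velocity', 'Radial Velocity', 'Transit', 'Transit', 'Imaging'],
    'orbital_period': [269.3, 874.8, 1.5, 3.2, 730000.0],
    'year': [2006, 2008, 2011, 2012, 2010],
})

# Nothing is computed until the aggregate is called
medians = planets.groupby('method')['orbital_period'].median()
print(medians)
```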
This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.
iterating over groups
The GroupBy object supports direct iteration over the groups, returning each group as a Series or DataFrame:
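A small sketch of iterating over groups, using a made-up stand-in for the Planets data. Each iteration yields the group key and the corresponding sub-frame:

```python
import pandas as pd

# Made-up stand-in for the Planets data (values are illustrative only)
planets = pd.DataFrame({
    'method': ['Radial Velocity', 'Radial Velocity', 'Transit', 'Transit', 'Imaging'],
    'orbital_period': [269.3, 874.8, 1.5, 3.2, 730000.0],
    'year': [2006, 2008, 2011, 2012, 2010],
})

shapes = {}
for method, group in planets.groupby('method'):
    # group is a DataFrame holding only the rows for this method
    shapes[method] = group.shape
    print(f"{method:20s} shape={group.shape}")
```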
This can be useful for doing certain things manually, though it is often much faster to use the built-in apply functionality, which we will discuss momentarily.
dispatch methods
Through some Python class magic, any method not explicitly implemented by the GroupBy object will be passed through and called on the groups, whether they are DataFrame or Series objects. For example, you can use the describe() method of DataFrames to perform a set of aggregations that describe each group in the data:
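A sketch of calling `describe()` through a GroupBy, on a made-up stand-in for the Planets data. The resulting table has one row per group and one column per summary statistic:

```python
import pandas as pd

# Made-up stand-in for the Planets data (values are illustrative only)
planets = pd.DataFrame({
    'method': ['Radial Velocity', 'Radial Velocity', 'Transit', 'Transit', 'Imaging'],
    'orbital_period': [269.3, 874.8, 1.5, 3.2, 730000.0],
    'year': [2006, 2008, 2011, 2012, 2010],
})

# describe() is applied to each group, then the results are combined
summary = planets.groupby('method')['year'].describe()
print(summary)
```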
Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade. The newest methods seem to be Transit Timing Variation and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.
This is just one example of the utility of dispatch methods. Notice that they are applied to each individual group, and the results are then combined within GroupBy and returned. Again, any valid DataFrame/Series method can be used on the corresponding GroupBy object, which allows for some very flexible and powerful operations!
The preceding discussion focused on aggregation for the combine operation, but there are more options available. In particular, GroupBy objects have aggregate(), filter(), transform(), and apply() methods that efficiently implement a variety of useful operations before combining the grouped data.
For the purpose of the following subsections, we’ll use this DataFrame:
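The original frame is not reproduced here, so the following is a stand-in with assumed column names (`key`, `data1`, `data2`) and assumed values, chosen so that group A has a `data2` standard deviation below 4:

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})
print(df)
```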
We’re now familiar with GroupBy aggregations with median() and the like, but the aggregate() method allows for even more flexibility. It can take a string, a function, or a list thereof, and compute all the aggregates at once. Here is a quick example combining all these:
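A sketch mixing string names and a function in one `aggregate()` call. The frame and the helper `value_range` are illustrative assumptions, not taken from the original:

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})

def value_range(s):
    """Hypothetical helper: spread between largest and smallest value."""
    return s.max() - s.min()

# Strings and a function in one list; each is applied to every column
agg = df.groupby('key')[['data1', 'data2']].aggregate(['min', 'median', value_range])
print(agg)
```

The result has a two-level column index: the original column name on top and the aggregate name below it.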
Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:
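A sketch of the dictionary form, on the same kind of illustrative frame (assumed names and values):

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})

# One operation per column: minimum of data1, maximum of data2
per_column = df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})
print(per_column)
```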
A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value:
The filter function should return a Boolean value specifying whether the group passes the filtering. Here, because group A does not have a standard deviation greater than 4, it is dropped from the result.
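The filtering just described can be sketched as follows. The frame and the name `filter_func` are illustrative assumptions; the values are chosen so that only group A falls below the threshold:

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})

def filter_func(x):
    # x is the sub-frame for one group; return True to keep the whole group
    return x['data2'].std() > 4

filtered = df.groupby('key').filter(filter_func)
print(filtered)
```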
While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean:
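A sketch of group-wise centering with `transform()`, on an illustrative frame (assumed names and values). The output keeps the six input rows:

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})

# Subtract each group's mean from its own rows
centered = df.groupby('key')[['data1', 'data2']].transform(lambda x: x - x.mean())
print(centered)
```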
The apply() method lets you apply an arbitrary function to the group results. The function should take a DataFrame, and return either a Pandas object (e.g., a DataFrame or Series) or a scalar; the combine operation will be tailored to the type of output returned.
For example, here is an apply() that normalizes the first column by the sum of the second:
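A sketch of such an `apply()`, on an illustrative frame (assumed names and values). Selecting the data columns explicitly and passing `group_keys=False` are my own choices, meant to keep the result indexed like the input across pandas versions; the original may well have done it differently:

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})

def norm_by_data2(x):
    # x is the sub-frame for one group; return a new frame rather than
    # modifying x in place
    return x.assign(data1=x['data1'] / x['data2'].sum())

result = df.groupby('key', group_keys=False)[['data1', 'data2']].apply(norm_by_data2)
print(result)
```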
apply() within a GroupBy is quite flexible: the only criterion is that the function takes a DataFrame and returns a Pandas object or scalar; what you do in the middle is up to you!
Specifying the split key
In the simple examples presented before, we split the DataFrame on a single column name. This is just one of many options by which the groups can be defined, and we’ll go through some other options for group specification here.
a list, array, series, or index providing the grouping keys
The key can be any series or list with a length matching that of the DataFrame. For example:
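A sketch with a plain list as the key. The frame and the labels in `labels` are illustrative assumptions; the list has one entry per row:

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})

labels = [0, 1, 0, 1, 2, 0]   # one group label per row of df

# Only the numeric columns are summed here
summed = df[['data1', 'data2']].groupby(labels).sum()
print(summed)
```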
Of course, this means there’s another, more verbose way of accomplishing the df.groupby('key') from before:
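A sketch comparing the two spellings on an illustrative frame (assumed names and values). Passing the column itself as the key should give the same groups as passing its name:

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})

verbose = df.groupby(df['key'])[['data1', 'data2']].sum()   # key given as a Series
concise = df.groupby('key')[['data1', 'data2']].sum()       # key given by name

print(verbose)
print(concise)
```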
a dictionary or series mapping index to group
Another method is to provide a dictionary that maps index values to the group keys:
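A sketch of grouping by a dictionary. The names `df2` and `mapping`, and the vowel/consonant labels, are illustrative assumptions; the key column is moved into the index so the dictionary can map index values to groups:

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})

df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

by_kind = df2.groupby(mapping).sum()
print(by_kind)
```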
any Python function
Similar to mapping, you can pass any Python function that will input the index value and output the group:
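A sketch using `str.lower` as the grouping function, on an illustrative frame indexed by `key` (assumed names and values). The function is called on each index value:

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})
df2 = df.set_index('key')

# 'A' -> 'a', 'B' -> 'b', 'C' -> 'c'
means = df2.groupby(str.lower).mean()
print(means)
```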
a list of valid keys
Further, any of the preceding key choices can be combined to group on a multi-index:
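A sketch combining a function and a dictionary as keys, which yields a two-level index. The names `df2` and `mapping` are illustrative assumptions:

```python
import pandas as pd

# Illustrative stand-in; column names and values are assumptions
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': range(6),
    'data2': [5, 0, 3, 3, 7, 9],
})
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

# First level from str.lower, second level from the dictionary
combined = df2.groupby([str.lower, mapping]).mean()
print(combined)
```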
Grouping example
As an example of this, in a couple of lines of Python code we can put all these together and count discovered planets by method and by decade:
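A sketch of that computation on a made-up stand-in for the Planets data. The column names (`method`, `number`, `year`) and all values are assumptions, so the table below says nothing about real discoveries:

```python
import pandas as pd

# Made-up stand-in for the Planets data (values are illustrative only)
planets = pd.DataFrame({
    'method': ['Radial Velocity', 'Radial Velocity', 'Radial Velocity',
               'Transit', 'Transit', 'Imaging'],
    'number': [1, 2, 1, 1, 3, 1],
    'year': [1995, 2003, 2011, 2008, 2012, 2010],
})

# Build a decade label such as '2000s' from the year
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'

# Group on a column name and a Series at once, then pivot decades into columns
table = planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)
print(table)
```

Evaluating `decade`, the grouped sum, and the `unstack()` result one at a time is a good way to see what each step contributes.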
This shows the power of combining many of the operations we’ve discussed up to this point when looking at realistic datasets. We immediately gain a coarse understanding of when and how planets have been discovered over the past several decades!
Here I would suggest digging into these few lines of code, and evaluating the individual steps to make sure you understand exactly what they are doing to the result. It’s certainly a somewhat complicated example, but understanding these pieces will give you the means to similarly explore your own data.
Did I already mention that Jake 🚀 rockz!?