Category Archives: Python

NumPy – 55 – aggregating and grouping – 2

Continuing from here, copying from here.

I had to redo the processing from the previous post; done, and I won’t repeat it here. (The interactivity of the REPL is handy, but sometimes, like now…).

The GroupBy object
The GroupBy object is a very flexible abstraction. In many ways, you can simply treat it as if it’s a collection of DataFrames, and it does the difficult things under the hood. Let’s see some examples using the Planets data.

Perhaps the most important operations made available by a GroupBy are aggregate, filter, transform, and apply. We’ll discuss each of these more fully in “Aggregate, Filter, Transform, Apply” [previous post], but before that let’s introduce some of the other functionality that can be used with the basic GroupBy operation.

column indexing
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a modified GroupBy object. For example:

Here we’ve selected a particular Series group from the original DataFrame group by reference to its column name. As with the GroupBy object, no computation is done until we call some aggregate on the object:
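A minimal sketch of both steps, run on a tiny stand-in for the Planets data. The rows are made up for illustration; only the column names method and orbital_period follow the real dataset.

```python
import pandas as pd

# Made-up stand-in for the Planets data (not real measurements)
planets = pd.DataFrame({
    'method': ['Radial Velocity', 'Radial Velocity',
               'Transit', 'Transit', 'Transit'],
    'orbital_period': [269.3, 874.8, 2.5, 3.9, 4.1],
})

grouped = planets.groupby('method')    # DataFrameGroupBy, nothing computed yet
periods = grouped['orbital_period']    # SeriesGroupBy, still lazy
median_period = periods.median()       # the aggregate triggers the computation
print(median_period)
```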

This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.

iterating over groups
The GroupBy object supports direct iteration over the groups, returning each group as a Series or DataFrame:
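As a rough sketch (again with invented rows standing in for the Planets data), the loop below visits each group in turn and records its shape:

```python
import pandas as pd

# Made-up stand-in for the Planets data
planets = pd.DataFrame({
    'method': ['Radial Velocity', 'Radial Velocity', 'Transit'],
    'year': [2006, 2008, 2011],
})

shapes = {}
for method, group in planets.groupby('method'):
    # each group is a DataFrame holding only the rows for that method
    shapes[method] = group.shape
    print("{0:20s} shape={1}".format(method, group.shape))
```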

This can be useful for doing certain things manually, though it is often much faster to use the built-in apply functionality, which we will discuss momentarily.

dispatch methods
Through some Python class magic, any method not explicitly implemented by the GroupBy object will be passed through and called on the groups, whether they are DataFrame or Series objects. For example, you can use the describe() method of DataFrames to perform a set of aggregations that describe each group in the data:

Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade. The newest methods seem to be Transit Timing Variation and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.

This is just one example of the utility of dispatch methods. Notice that they are applied to each individual group, and the results are then combined within GroupBy and returned. Again, any valid DataFrame/Series method can be used on the corresponding GroupBy object, which allows for some very flexible and powerful operations!

aggregate, filter, transform, and apply
The preceding discussion focused on aggregation for the combine operation, but there are more options available. In particular, GroupBy objects have aggregate(), filter(), transform(), and apply() methods that efficiently implement a variety of useful operations before combining the grouped data.

For the purpose of the following subsections, we’ll use this DataFrame:

We’re now familiar with GroupBy aggregations with sum(), median(), and the like, but the aggregate() method allows for even more flexibility. It can take a string, a function, or a list thereof, and compute all the aggregates at once. Here is a quick example combining all these:

Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:
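A small sketch of both forms of aggregate(). The six-row DataFrame below is an assumed stand-in for the one used in these subsections, and for simplicity the aggregates are given only as strings:

```python
import pandas as pd

# Assumed stand-in for the DataFrame used in these subsections
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# A list of aggregates, all computed at once for every column
stats = df.groupby('key').aggregate(['min', 'median', 'max'])

# A dictionary mapping each column to its own aggregate
per_column = df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})

print(stats)
print(per_column)
```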

A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value:
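A sketch of such a filter, on the same assumed six-row DataFrame; filter_func is an illustrative name:

```python
import pandas as pd

# Assumed stand-in for the DataFrame used in these subsections
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def filter_func(x):
    # x holds the rows of one group; return True to keep the group
    return x['data2'].std() > 4

kept = df.groupby('key').filter(filter_func)
print(kept)
```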

The filter function should return a Boolean value specifying whether the group passes the filtering. Here, because group A does not have a standard deviation greater than 4, it is dropped from the result.

While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean:
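A sketch of the centering transformation, on the same assumed six-row DataFrame:

```python
import pandas as pd

# Assumed stand-in for the DataFrame used in these subsections
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Subtract each group's mean; the output keeps one row per input row
centered = df.groupby('key').transform(lambda x: x - x.mean())
print(centered)
```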

the apply() method
The apply() method lets you apply an arbitrary function to the group results. The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned.

For example, here is an apply() that normalizes the first column by the sum of the second:
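One way to sketch this, on the same assumed DataFrame. The name norm_by_data2 is illustrative. Selecting the two data columns and passing group_keys=False are my own choices, made so that the result keeps the original row labels across pandas versions.

```python
import pandas as pd

# Assumed stand-in for the DataFrame used in these subsections
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def norm_by_data2(x):
    # x is the DataFrame of values for one group
    x = x.copy()
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

grouped = df.groupby('key', group_keys=False)[['data1', 'data2']]
normed = grouped.apply(norm_by_data2)
print(normed)
```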

apply() within a GroupBy is quite flexible: the only criterion is that the function takes a DataFrame and returns a Pandas object or scalar; what you do in the middle is up to you!

Specifying the split key
In the simple examples presented before, we split the DataFrame on a single column name. This is just one of many options by which the groups can be defined, and we’ll go through some other options for group specification here.

a list, array, series, or index providing the grouping keys
The key can be any series or list with a length matching that of the DataFrame. For example:

Of course, this means there’s another, more verbose way of accomplishing the df.groupby('key') from before:
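A sketch of both variants, on the same assumed DataFrame. The list L is an arbitrary example key, and only the numeric columns are summed:

```python
import pandas as pd

# Assumed stand-in for the DataFrame used in these subsections
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})
cols = ['data1', 'data2']

# Any list as long as the DataFrame can act as the key
L = [0, 1, 0, 1, 2, 0]
by_list = df[cols].groupby(L).sum()

# The verbose spelling of df.groupby('key'): pass the Series itself
by_series = df.groupby(df['key'])[cols].sum()
same = by_series.equals(df.groupby('key')[cols].sum())
print(by_list)
print(same)
```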

a dictionary or series mapping index to group
Another method is to provide a dictionary that maps index values to the group keys:

any Python function
Similar to mapping, you can pass any Python function that will input the index value and output the group:

a list of valid keys
Further, any of the preceding key choices can be combined to group on a multi-index:
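The last three kinds of key (a dictionary, a function, and a list combining them) can be sketched together. The DataFrame is the same assumed one, and the vowel/consonant mapping is just an illustrative choice:

```python
import pandas as pd

# Assumed stand-in for the DataFrame used in these subsections
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})
df2 = df.set_index('key')

# Dictionary: index value -> group label
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
by_mapping = df2.groupby(mapping).sum()

# Function: called on each index value, returns the group label
by_func = df2.groupby(str.lower).mean()

# List of valid keys: the result is grouped on a multi-index
by_both = df2.groupby([str.lower, mapping]).mean()

print(by_mapping)
print(by_func)
print(by_both)
```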

Grouping example
As an example of this, in a couple lines of Python code we can put all these together and count discovered planets by method and by decade:
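A sketch of that count, on a few invented rows standing in for the Planets data; the column names method, year and number are assumed to match the real dataset:

```python
import pandas as pd

# Made-up stand-in for the Planets data
planets = pd.DataFrame({
    'method': ['Radial Velocity', 'Radial Velocity', 'Radial Velocity',
               'Transit', 'Transit'],
    'year': [1995, 2003, 2011, 2008, 2012],
    'number': [1, 2, 1, 1, 3],
})

# Turn each year into a decade label such as '1990s'
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'

# Group by method and decade, sum, then pivot decades into columns
counts = (planets.groupby(['method', decade])['number']
          .sum().unstack().fillna(0))
print(counts)
```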

This shows the power of combining many of the operations we’ve discussed up to this point when looking at realistic datasets. We immediately gain a coarse understanding of when and how planets have been discovered over the past several decades!

Here I would suggest digging into these few lines of code, and evaluating the individual steps to make sure you understand exactly what they are doing to the result. It’s certainly a somewhat complicated example, but understanding these pieces will give you the means to similarly explore your own data.

Did I already say that Jake 🚀 rockz!?


NumPy – 54 – aggregating and grouping – 1

Continuing from here, copying from here.

An example of using Pandas on a topic Jake knows inside out 🚀. Since it is long, I’m splitting it across several posts.

An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset. In this section, we’ll explore aggregations in Pandas, from simple operations akin to what we’ve seen on NumPy arrays, to more sophisticated operations based on the concept of a groupby.

For convenience, we’ll use the same display magic function that we’ve seen in previous sections (I won’t copy it again, it’s always the same).

Planets data
Here we will use the Planets dataset, available via the Seaborn package (see Visualization With Seaborn [coming soon]). It gives information on planets that astronomers have discovered around other stars (known as extrasolar planets or exoplanets for short). It can be downloaded with a simple Seaborn command:

This has some details on the 1,000+ extrasolar planets discovered up to 2014.

Simple aggregation in Pandas
Earlier, we explored some of the data aggregations available for NumPy arrays (“Aggregations: Min, Max, and Everything In Between” [here]). As with a one-dimensional NumPy array, for a Pandas Series the aggregates return a single value:

For a DataFrame, by default the aggregates return results within each column:

By specifying the axis argument, you can instead aggregate within each row:
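A quick sketch of the three cases, on random data; the seed and the column names A and B are arbitrary choices:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

# Series: each aggregate returns a single value
ser = pd.Series(rng.rand(5))
total = ser.sum()
average = ser.mean()

# DataFrame: by default, one result per column
df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})
col_means = df.mean()

# With the axis argument, one result per row instead
row_means = df.mean(axis='columns')

print(total, average)
print(col_means)
print(row_means)
```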

Pandas Series and DataFrames include all of the common aggregates mentioned in Aggregations: Min, Max, and Everything In Between [same link as above]; in addition, there is a convenience method describe() that computes several common aggregates for each column and returns the result. Let’s use this on the Planets data, for now dropping rows with missing values:

This can be a useful way to begin understanding the overall properties of a dataset. For example, we see in the year column that although exoplanets were discovered as far back as 1989, half of all known exoplanets were not discovered until 2010 or after. This is largely thanks to the Kepler mission, which is a space-based telescope specifically designed for finding eclipsing planets around other stars.

The following table summarizes some other built-in Pandas aggregations:

Aggregation       Description
count()           Total number of items
first(), last()   First and last item
mean(), median()  Mean and median
min(), max()      Minimum and maximum
std(), var()      Standard deviation and variance
mad()             Mean absolute deviation
prod()            Product of all items
sum()             Sum of all items

These are all methods of DataFrame and Series objects.

To go deeper into the data, however, simple aggregates are often not enough. The next level of data summarization is the groupby operation, which allows you to quickly and efficiently compute aggregates on subsets of data.

Grouping, GroupBy: split, apply, combine
Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby operation. The name “group by” comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: split, apply, combine.

split, apply, combine
A canonical example of this split-apply-combine operation, where the “apply” is a summation aggregation, is illustrated in this figure:

This makes clear what the groupby accomplishes:

  • The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
  • The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
  • The combine step merges the results of these operations into an output array.

While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that the intermediate splits do not need to be explicitly instantiated. Rather, the GroupBy can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way. The power of the GroupBy is that it abstracts away these steps: the user need not think about how the computation is done under the hood, but rather thinks about the operation as a whole.

As a concrete example, let’s take a look at using Pandas for the computation shown in this diagram. We’ll start by creating the input DataFrame:

The most basic split-apply-combine operation can be computed with the groupby() method of DataFrames, passing the name of the desired key column:

Notice that what is returned is not a set of DataFrames, but a DataFrameGroupBy object. This object is where the magic is: you can think of it as a special view of the DataFrame, which is poised to dig into the groups but does no actual computation until the aggregation is applied. This “lazy evaluation” approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.

To produce a result, we can apply an aggregate to this DataFrameGroupBy object, which will perform the appropriate apply/combine steps to produce the desired result:
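A sketch of the whole sequence on a six-row input (the values are assumed to match the diagram), with the same sums recomputed by hand through masking for comparison:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])

grouped = df.groupby('key')   # a DataFrameGroupBy: no computation yet
result = grouped.sum()        # apply + combine happen here

# The same sums done manually: mask each key, then aggregate
manual = {k: df.loc[df['key'] == k, 'data'].sum() for k in ['A', 'B', 'C']}

print(result)
print(manual)
```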

The sum() method is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid DataFrame operation, as we will see in the following discussion.

To be continued 😉


NumPy – 53 – combining datasets with merge and join – 4

Continuing from here, copying from here.

To wrap up the chapter, an example that puts into practice what we saw in the previous posts.

Example: US states data
Merge and join operations come up most often when combining data from different sources. Here we will consider an example of some data about US states and their populations. The data files can be found here.

Jake, rockz! 🚀 also tells us how to download the data:

# Following are shell commands to download the data
curl -O
curl -O
curl -O

OK, I move the files into the data sub-directory and off I go:

Given this information, say we want to compute a relatively straightforward result: rank US states and territories by their 2010 population density. We clearly have the data here to find this result, but we’ll have to combine the datasets to find the result.

We’ll start with a many-to-one merge that will give us the full state name within the population DataFrame. We want to merge based on the state/region column of pop, and the abbreviation column of abbrevs. We’ll use how='outer' to make sure no data is thrown away due to mismatched labels.

Let’s double-check whether there were any mismatches here, which we can do by looking for rows with nulls:

Some of the population info is null; let’s figure out which these are!

It appears that all the null population values are from Puerto Rico prior to the year 2000; this is likely due to this data not being available from the original source.

More importantly, we see also that some of the new state entries are also null, which means that there was no corresponding entry in the abbrevs key! Let’s figure out which regions lack this match:

We can quickly infer the issue: our population data includes entries for Puerto Rico (PR) and the United States as a whole (USA), while these entries do not appear in the state abbreviation key. We can fix these quickly by filling in appropriate entries:
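The steps so far can be sketched on tiny made-up tables. The numbers are invented, and apart from state/region and abbreviation the column names are my assumptions about the CSV files:

```python
import pandas as pd

# Made-up stand-ins for the population and abbreviation files
pop = pd.DataFrame({
    'state/region': ['AL', 'AL', 'PR', 'USA'],
    'ages': ['total', 'under18', 'total', 'total'],
    'year': [2010, 2010, 2010, 2010],
    'population': [100.0, 25.0, 80.0, 1000.0],
})
abbrevs = pd.DataFrame({'state': ['Alabama'], 'abbreviation': ['AL']})

# Outer merge so that unmatched labels are kept rather than dropped
merged = pd.merge(pop, abbrevs, how='outer',
                  left_on='state/region', right_on='abbreviation')
merged = merged.drop('abbreviation', axis=1)   # redundant key column

# Which columns contain nulls, and which regions had no match?
has_nulls = merged.isnull().any()
unmatched = merged.loc[merged['state'].isnull(), 'state/region'].unique()

# Fill in the missing state names by hand
merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'

print(has_nulls)
print(unmatched)
print(merged)
```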

No more nulls in the state column: we’re all set!

Now we can merge the result with the area data using a similar procedure. Examining our results, we will want to join on the state column in both:

Again, let’s check for nulls to see if there were any mismatches:

There are nulls in the area column; we can take a look to see which regions were ignored here:

We see that our areas DataFrame does not contain the area of the United States as a whole. We could insert the appropriate value (using the sum of all state areas, for instance), but in this case we’ll just drop the null values because the population density of the entire United States is not relevant to our current discussion:

Now we have all the data we need. To answer the question of interest, let’s first select the portion of the data corresponding to the year 2010 and the total population. We’ll use the query() function to do this quickly (this requires the numexpr package to be installed; see High-Performance Pandas: eval() and query() [coming soon]):

Now let’s compute the population density and display it in order. We’ll start by re-indexing our data on the state, and then compute the result:
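A sketch of the selection and the density computation on an invented merged table. The state names, numbers and column names are placeholders, not the real data:

```python
import pandas as pd

# Made-up stand-in for the fully merged table
final = pd.DataFrame({
    'state': ['Alpha', 'Alpha', 'Beta', 'Gamma'],
    'ages': ['total', 'total', 'total', 'total'],
    'year': [2000, 2010, 2010, 2010],
    'population': [900.0, 1000.0, 500.0, 60.0],
    'area (sq. mi)': [10.0, 10.0, 50.0, 60.0],
})

# Keep only the 2010 rows for the total population
data2010 = final.query("year == 2010 & ages == 'total'")

# Re-index on the state, compute the density, sort from densest down
data2010 = data2010.set_index('state')
density = data2010['population'] / data2010['area (sq. mi)']
density = density.sort_values(ascending=False)

print(density.head())
print(density.tail())
```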

The result is a ranking of US states plus Washington, DC, and Puerto Rico in order of their 2010 population density, in residents per square mile. We can see that by far the densest region in this dataset is Washington, DC (i.e., the District of Columbia); among states, the densest is New Jersey.

We can also check the end of the list:

We see that the least dense state, by far, is Alaska, averaging slightly over one resident per square mile.

This type of messy data merging is a common task when trying to answer questions using real-world data sources. I hope that this example has given you an idea of the ways you can combine tools we’ve covered in order to gain insight from your data!

Pandas & Jake rockzs 🚀


NumPy – 52 – combining datasets with merge and join – 3

Continuing from here, copying from here.

Specifying set arithmetic for joins
In all the preceding examples we have glossed over one important consideration in performing a join: the type of set arithmetic used in the join. This comes up when a value appears in one key column but not the other. Consider this example:

Here we have merged two datasets that have only a single “name” entry in common: Mary. By default, the result contains the intersection of the two sets of inputs; this is what is known as an inner join. We can specify this explicitly using the how keyword, which defaults to "inner":

Other options for the how keyword are 'outer', 'left', and 'right'. An outer join returns a join over the union of the input columns, and fills in all missing values with NAs:

The left join and right join return joins over the left entries and right entries, respectively. For example:
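A sketch of the join types side by side, ending with the left join; the two small tables are assumed stand-ins for the ones used in this section:

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])

inner = pd.merge(df6, df7)                # default, same as how='inner'
outer = pd.merge(df6, df7, how='outer')   # union of keys, gaps become NaN
left = pd.merge(df6, df7, how='left')     # every row of the left input

print(inner)
print(outer)
print(left)
```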

The output rows now correspond to the entries in the left input. Using how='right' works in a similar manner.

All of these options can be applied straightforwardly to any of the preceding join types.

Overlapping column names: the suffixes keyword
Finally, you may end up in a case where your two input DataFrames have conflicting column names. Consider this example:

Because the output would have two conflicting column names, the merge function automatically appends a suffix _x or _y to make the output columns unique. If these defaults are inappropriate, it is possible to specify a custom suffix using the suffixes keyword:
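A sketch with two assumed tables that share a rank column, first with the default suffixes and then with custom ones:

```python
import pandas as pd

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})

# Default: conflicting columns get _x and _y
default = pd.merge(df8, df9, on='name')

# Custom suffixes for the left and right inputs
custom = pd.merge(df8, df9, on='name', suffixes=('_L', '_R'))

print(default)
print(custom)
```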

These suffixes work in any of the possible join patterns, and work also if there are multiple overlapping columns.

For more information on these patterns, see Aggregation and Grouping [coming soon] where we dive a bit deeper into relational algebra. Also see the Pandas “Merge, Join and Concatenate” documentation for further discussion of these topics.


NumPy – 51 – combining datasets with merge and join – 2

Continuing from here, copying from here.

Specifying the merge key
We’ve already seen the default behavior of pd.merge(): it looks for one or more matching column names between the two inputs, and uses this as the key. However, often the column names will not match so nicely, and pd.merge() provides a variety of options for handling this.

A recap for myself, from the previous post:

the on keyword
Most simply, you can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names:
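A sketch with two assumed employee tables, like the ones from the previous post:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR']},
                   columns=['employee', 'group'])
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]},
                   columns=['employee', 'hire_date'])

# Name the key column explicitly
merged = pd.merge(df1, df2, on='employee')
print(merged)
```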

This option works only if both the left and right DataFrames have the specified column name.

the left_on and right_on keywords
At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as “name” rather than “employee”. In this case, we can use the left_on and right_on keywords to specify the two column names:

The result has a redundant column that we can drop if desired–for example, by using the drop() method of DataFrames:
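A sketch of both steps; the salary table and its values are invented for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR']},
                   columns=['employee', 'group'])
# Same people, but the key column is called 'name' here
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]},
                   columns=['name', 'salary'])

merged = pd.merge(df1, df3, left_on='employee', right_on='name')
tidy = merged.drop('name', axis=1)   # remove the redundant key column

print(merged)
print(tidy)
```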

the left_index and right_index keywords
Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, your data might look like this:

You can use the index as the key for merging by specifying the left_index and/or right_index flags in pd.merge():

For convenience, DataFrames implement the join() method, which performs a merge that defaults to joining on indices:

If you’d like to mix indices and columns, you can combine left_index with right_on or left_on with right_index to get the desired behavior:
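The index-based variants can be sketched on the same assumed employee tables, with employee moved into the index:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')

# Merge on the index of both inputs
by_index = pd.merge(df1a, df2a, left_index=True, right_index=True)

# join() merges on the indices by default
joined = df1a.join(df2a)

# Mix: index on the left, a column on the right
mixed = pd.merge(df1a, df3, left_index=True, right_on='name')

print(by_index)
print(joined)
print(mixed)
```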

All of these options also work with multiple indices and/or multiple columns; the interface for this behavior is very intuitive. For more information on this, see the “Merge, Join, and Concatenate” section of the Pandas documentation.


NumPy – 50 – combining datasets with merge and join – 1

Continuing from here, copying from here.

One essential feature offered by Pandas is its high-performance, in-memory join and merge operations. If you have ever worked with databases, you should be familiar with this type of data interaction. The main interface for this is the pd.merge function, and we’ll see a few examples of how this can work in practice.

For convenience, we will start by redefining the display() functionality from the previous section:

Relational algebra
The behavior implemented in pd.merge() is a subset of what is known as relational algebra, which is a formal set of rules for manipulating relational data, and forms the conceptual foundation of operations available in most databases. The strength of the relational algebra approach is that it proposes several primitive operations, which become the building blocks of more complicated operations on any dataset. With this lexicon of fundamental operations implemented efficiently in a database or other program, a wide range of fairly complicated composite operations can be performed.

Pandas implements several of these fundamental building blocks in the pd.merge() function and the related join() method of Series and DataFrames. As we will see, these let you efficiently link data from different sources.

Categories of joins
The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one, and many-to-many joins. All three types of joins are accessed via an identical call to the pd.merge() interface; the type of join performed depends on the form of the input data. Here we will show simple examples of the three types of merges, and discuss detailed options further below.

one-to-one joins
Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation seen in Combining Datasets: Concat & Append [here]. As a concrete example, consider the following two DataFrames which contain information on several employees in a company:

To combine this information into a single DataFrame, we can use the pd.merge() function:
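A sketch of such a one-to-one merge; the two employee tables below are assumed stand-ins with illustrative values:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR']},
                   columns=['employee', 'group'])
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]},
                   columns=['employee', 'hire_date'])

# The shared 'employee' column is picked up as the key automatically
df3 = pd.merge(df1, df2)
print(df3)
```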

The pd.merge() function recognizes that each DataFrame has an “employee” column, and automatically joins using this column as a key. The result of the merge is a new DataFrame that combines the information from the two inputs. Notice that the order of entries in each column is not necessarily maintained: in this case, the order of the “employee” column differs between df1 and df2, and the pd.merge() function correctly accounts for this. Additionally, keep in mind that the merge in general discards the index, except in the special case of merges by index (see the left_index and right_index keywords, discussed momentarily).

many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate. Consider the following example of a many-to-one join:
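A sketch of a many-to-one join; the employee and supervisor tables are assumed stand-ins:

```python
import pandas as pd

df3 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR'],
                    'hire_date': [2008, 2012, 2004, 2014]},
                   columns=['employee', 'group', 'hire_date'])
# One supervisor per group: 'group' repeats in df3 but not here
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']},
                   columns=['group', 'supervisor'])

merged = pd.merge(df3, df4)
print(merged)
```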

The resulting DataFrame has an additional column with the “supervisor” information, where the information is repeated in one or more locations as required by the inputs.

many-to-many joins
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge. This will be perhaps most clear with a concrete example. Consider the following, where we have a DataFrame showing one or more skills associated with a particular group. By performing a many-to-many join, we can recover the skills associated with any individual person:
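A sketch of a many-to-many join; the skills table is an assumed stand-in with illustrative values:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR']},
                   columns=['employee', 'group'])
# 'group' has duplicates on both sides
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']},
                   columns=['group', 'skills'])

# Every employee is paired with every skill of their group
merged = pd.merge(df1, df5)
print(merged)
```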

These three types of joins can be used with other Pandas tools to implement a wide array of functionality. But in practice, datasets are rarely as clean as the one we’re working with here. In the following section we’ll consider some of the options provided by pd.merge() that enable you to tune how the join operations work.


NumPy – 49 – combining data – concat and append – 2

Continuing from here, copying from here.

Duplicate indices
One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices! Consider this simple example:
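A sketch of the situation, using a small helper (here called make_df, as in the previous post) to build labelled test frames:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose cells are column + index labels."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index            # force duplicate index labels

combined = pd.concat([x, y])
print(combined)
```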

Notice the repeated indices in the result. While this is valid within DataFrames, the outcome is often undesirable. pd.concat() gives us a few ways to handle it.

Catching the repeats as an error
If you’d like to simply verify that the indices in the result of pd.concat() do not overlap, you can specify the verify_integrity flag. With this set to True, the concatenation will raise an exception if there are duplicate indices. Here is an example, where for clarity we’ll catch and print the error message:

try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

Ignoring the index
Sometimes the index itself does not matter, and you would prefer it to simply be ignored. This option can be specified using the ignore_index flag. With this set to True, the concatenation will create a new integer index for the resulting Series:
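A sketch on the same kind of frames with clashing labels; make_df is the small helper from the previous post:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose cells are column + index labels."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

x = make_df('AB', [0, 1])
y = make_df('AB', [0, 1])    # same index labels as x

# The old labels are thrown away and replaced by 0, 1, 2, ...
renumbered = pd.concat([x, y], ignore_index=True)
print(renumbered)
```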

Didn’t quite get it 😯

Adding MultiIndex keys
Another option is to use the keys option to specify a label for the data sources; the result will be a hierarchically indexed series containing the data:
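A sketch with the same helper; the labels 'x' and 'y' are arbitrary names for the two sources:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose cells are column + index labels."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

x = make_df('AB', [0, 1])
y = make_df('AB', [0, 1])

# Each source gets its own outer index level
keyed = pd.concat([x, y], keys=['x', 'y'])
print(keyed)
```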

The result is a multiply indexed DataFrame, and we can use the tools discussed in Hierarchical Indexing [here] to transform this data into the representation we’re interested in.

Concatenation with joins
In the simple examples we just looked at, we were mainly concatenating DataFrames with shared column names. In practice, data from different sources might have different sets of column names, and pd.concat offers several options in this case. Consider the concatenation of the following two DataFrames, which have some (but not all!) columns in common:

By default, the entries for which no data is available are filled with NA values. To change this, we can specify one of several options for the join and join_axes parameters of the concatenate function. By default, the join is a union of the input columns (join='outer'), but we can change this to an intersection of the columns using join='inner':

Another option is to directly specify the index of the remaining columns using the join_axes argument, which takes a list of index objects. Here we’ll specify that the returned columns should be the same as those of the first input:
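A sketch of the outer and inner joins, plus a stand-in for join_axes. That argument belongs to the v0.18 signature and, as far as I know, is no longer accepted by recent pandas, so the last line reindexes the columns instead; this substitution is my assumption, not Jake’s code.

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose cells are column + index labels."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])

outer = pd.concat([df5, df6])                 # union of columns, NaN gaps
inner = pd.concat([df5, df6], join='inner')   # only the shared columns

# Keep exactly the columns of the first input (stand-in for join_axes)
like_first = pd.concat([df5, df6]).reindex(columns=df5.columns)

print(outer)
print(inner)
print(like_first)
```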

The combination of options of the pd.concat function allows a wide range of possible behaviors when joining two datasets; keep these in mind as you use these tools for your own data.

The append() method
Because direct array concatenation is so common, Series and DataFrame objects have an append method that can accomplish the same thing in fewer keystrokes. For example, rather than calling pd.concat([df1, df2]), you can simply call df1.append(df2):

Keep in mind that unlike the append() and extend() methods of Python lists, the append() method in Pandas does not modify the original object–instead it creates a new object with the combined data. It also is not a very efficient method, because it involves creation of a new index and data buffer. Thus, if you plan to do multiple append operations, it is generally better to build a list of DataFrames and pass them all at once to the concat() function.
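The recommended pattern can be sketched as below: collect the pieces in a plain list and concatenate once. This also has the advantage of not depending on append(), which I understand was dropped from DataFrame in pandas 2.0.

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose cells are column + index labels."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

# Build the pieces first, one single-row frame per index value
pieces = [make_df('AB', [i]) for i in range(3)]

# One concat call: one new index and data buffer in total
result = pd.concat(pieces)
print(result)
```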

In the next section, we’ll look at another more powerful approach to combining data from multiple sources, the database-style merges/joins implemented in pd.merge. For more information on concat(), append(), and related functionality, see the “Merge, Join, and Concatenate” section of the Pandas documentation.


NumPy – 48 – combining data – concat and append – 1

Continuing from here, new chapter here.

Some of the most interesting studies of data come from combining different data sources. These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets. Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.

Here we’ll take a look at simple concatenation of Series and DataFrames with the pd.concat function; later we’ll dive into more sophisticated in-memory merges and joins implemented in Pandas.

For convenience, we’ll define this function which creates a DataFrame of a particular form that will be useful below:
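A sketch of such a helper, as far as I can reconstruct it: each cell is simply the column letter followed by the index label.

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose cells are column + index labels."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

# Example: a 3x3 frame with cells A0 ... C2
example = make_df('ABC', range(3))
print(example)
```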

In addition, we’ll create a quick class that allows us to display multiple DataFrames side by side. The code makes use of the special _repr_html_ method, which IPython uses to implement its rich object display:

Note: I inserted the code as an image because WordPress interprets the HTML codes.
I made a bit of a mess here. I pasted the code wrongly into the IPython REPL and the message confused me. As a result I didn’t understand what this code was supposed to do. It’s not essential anyway. I need to be more careful 👿

The use of this will become clearer as we continue our discussion in the following section.

Recall: concatenation of NumPy arrays
Concatenation of Series and DataFrame objects is very similar to concatenation of NumPy arrays, which can be done via the np.concatenate function as discussed in The Basics of NumPy Arrays [here]. Recall that with it, you can combine the contents of two or more arrays into a single array:

The first argument is a list or tuple of arrays to concatenate. Additionally, it takes an axis keyword that allows you to specify the axis along which the result will be concatenated:
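A quick sketch of both calls, on small arbitrary lists:

```python
import numpy as np

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]

# Default: join end to end along the first axis
joined = np.concatenate([x, y, z])

# Two-dimensional input, joined side by side along axis 1
grid = [[1, 2],
        [3, 4]]
wide = np.concatenate([grid, grid], axis=1)

print(joined)
print(wide)
```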

Simple concatenation with pd.concat
Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a number of options that we’ll discuss momentarily:

Jake uses the following code, which gives me the error name 'objs' is not defined. Who knows whether…

# Signature in Pandas v0.18
pd.concat(objs, axis=0, join='outer', join_axes=None, 
          ignore_index=False, keys=None, levels=None, 
          names=None, verify_integrity=False, copy=True)

pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate() can be used for simple concatenations of arrays:

It also works to concatenate higher-dimensional objects, such as DataFrames:
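A sketch of both cases, with arbitrary example values; make_df is the helper defined earlier in this post:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose cells are column + index labels."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

# Series: stacked end to end, index labels preserved
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
ser = pd.concat([ser1, ser2])

# DataFrames: rows of the second follow the rows of the first
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
df = pd.concat([df1, df2])

print(ser)
print(df)
```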

By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0). Like np.concatenate, pd.concat allows specification of an axis along which concatenation will take place. Consider the following example:

Jake’s version, which is nicer, gives me a zillion errors 😡 Oh! I’ve just found out that they are all due to axis='col' instead of axis=1. So here is Jake’s version:
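A sketch of that column-wise concatenation, written with axis=1; the two frames are built with the make_df helper defined earlier in this post:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose cells are column + index labels."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])

# axis=1 places the inputs side by side, aligned on the index
side_by_side = pd.concat([df3, df4], axis=1)
print(side_by_side)
```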


NumPy – 47 – hierarchical indexing – 4

Continuing from here, copying from here.

Rearranging multi-indices
One of the keys to working with multiply indexed data is knowing how to effectively transform the data. There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations. We saw a brief example of this in the stack() and unstack() methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we’ll explore them here.

sorted and unsorted indices
Earlier, we briefly mentioned a caveat, but we should emphasize it more here. Many of the MultiIndex slicing operations will fail if the index is not sorted. Let’s take a look at this here.

We’ll start by creating some simple multiply indexed data where the indices are not lexicographically sorted:

If we try to take a partial slice of this index, it will result in an error:

Although it is not entirely clear from the error message, this is the result of the MultiIndex not being sorted. For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order. Pandas provides a number of convenience routines to perform this type of sorting; examples are the sort_index() and sortlevel() methods of the DataFrame. We’ll use the simplest, sort_index(), here:

With the index sorted in this way, partial slicing will work as expected:
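Sorting and then slicing could be sketched as follows, reusing the same placeholder data as above (values 0.0 to 5.0 so the result is easy to follow):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6.0), index=index)

# sort_index() returns a new Series whose MultiIndex is lexicographically sorted
data = data.sort_index()

# The partial slice on the first level now works
part = data.loc['a':'b']
print(part)
```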

stacking and unstacking indices
As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:
Note: data from the previous post.
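A sketch of unstack() on a (state, year) population Series. The figures below are illustrative stand-ins for the data of the previous post, and the variable names are assumptions:

```python
import pandas as pd

# Illustrative population figures standing in for the earlier post's data
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=pd.MultiIndex.from_product(
        [['California', 'New York', 'Texas'], [2000, 2010]],
        names=['state', 'year']),
)

by_year = pop.unstack(level=0)    # states become the columns
by_state = pop.unstack(level=1)   # years become the columns
print(by_year)
print(by_state)
```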

The opposite of unstack() is stack(), which here can be used to recover the original series:
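The round trip might look like this sketch (illustrative figures again; depending on the pandas version, stack() may emit a warning about its newer implementation):

```python
import pandas as pd

# Illustrative population figures standing in for the earlier post's data
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=pd.MultiIndex.from_product(
        [['California', 'New York', 'Texas'], [2000, 2010]],
        names=['state', 'year']),
)

# unstack() moves the 'year' level into the columns; stack() moves it back
restored = pop.unstack().stack()
print(restored)
```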

setting and resetting indices
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method. Calling this on the population Series will result in a DataFrame with a state and year column holding the information that was formerly in the index. For clarity, we can optionally specify the name of the data for the column representation:
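As a sketch, with illustrative figures and 'population' as the chosen column name:

```python
import pandas as pd

# Illustrative population figures standing in for the earlier post's data
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=pd.MultiIndex.from_product(
        [['California', 'New York', 'Texas'], [2000, 2010]],
        names=['state', 'year']),
)

# Index levels become ordinary columns; `name` labels the values column
pop_flat = pop.reset_index(name='population')
print(pop_flat)
```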

Often when working with data in the real world, the raw input data looks like this and it’s useful to build a MultiIndex from the column values. This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:
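Going the other way, a flat table can be turned into a multiply indexed DataFrame. A sketch with illustrative figures:

```python
import pandas as pd

# A flat table, the way raw input data often arrives (illustrative figures)
pop_flat = pd.DataFrame({
    'state': ['California', 'California', 'New York',
              'New York', 'Texas', 'Texas'],
    'year': [2000, 2010, 2000, 2010, 2000, 2010],
    'population': [33871648, 37253956, 18976457,
                   19378102, 20851820, 25145561],
})

# Passing a list of column names builds a MultiIndex from their values
pop_indexed = pop_flat.set_index(['state', 'year'])
print(pop_indexed)
```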

In practice, I find this type of reindexing to be one of the more useful patterns when encountering real-world datasets.

Data aggregation on multi-indices
We’ve previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data, these can be passed a level parameter that controls which subset of the data the aggregate is computed on.

For example, let’s return to our health data:

Perhaps we’d like to average out the measurements in the two visits each year. We can do this by naming the index level we’d like to explore, in this case the year:

By further making use of the axis keyword, we can take the mean among levels on the columns as well:
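A sketch of both aggregations on seeded mock data (not real measurements). Older pandas versions accepted `mean(level='year')` and `mean(axis=1, level='type')` directly; since that `level` argument is no longer available in recent releases, this version spells out the equivalent groupby calls, and transposes instead of relying on an axis keyword:

```python
import numpy as np
import pandas as pd

# Mock health data: (year, visit) rows and (subject, type) columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)
health_data = pd.DataFrame(np.round(rng.normal(37, 5, size=(4, 6)), 1),
                           index=index, columns=columns)

# Average the two visits within each year
data_mean = health_data.groupby(level='year').mean()

# Average over subjects for each measurement type (column level 'type')
type_mean = data_mean.T.groupby(level='type').mean().T
print(type_mean)
```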

Thus in two lines, we’ve been able to find the average heart rate and temperature measured among all subjects in all visits each year. This syntax is actually a shortcut to the GroupBy functionality, which we will discuss in Aggregation and Grouping [coming soon]. While this is a toy example, many real-world datasets have similar hierarchical structure.

Aside: Panel Data
Pandas has a few other fundamental data structures that we have not yet discussed, namely the pd.Panel and pd.Panel4D objects. These can be thought of, respectively, as three-dimensional and four-dimensional generalizations of the (one-dimensional) Series and (two-dimensional) DataFrame structures. Once you are familiar with indexing and manipulation of data in a Series and DataFrame, Panel and Panel4D are relatively straightforward to use. In particular, the ix, loc, and iloc indexers discussed in Data Indexing and Selection [here] extend readily to these higher-dimensional structures. [Note: Panel, Panel4D and the ix indexer have since been removed from recent pandas releases.]

We won’t cover these panel structures further in this text, as I’ve found in the majority of cases that multi-indexing is a more useful and conceptually simpler representation for higher-dimensional data. Additionally, panel data is fundamentally a dense data representation, while multi-indexing is fundamentally a sparse data representation. As the number of dimensions increases, the dense representation can become very inefficient for the majority of real-world datasets. For the occasional specialized application, however, these structures can be useful. If you’d like to read more about the Panel and Panel4D structures, see the references listed in Further Resources.


NumPy – 46 – hierarchical indexing – 3

Continuing from here, copying from here.

Indexing and slicing a MultiIndex
Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions. We’ll first look at indexing multiply indexed Series, and then multiply-indexed DataFrames.

Multiply indexed Series
Consider the multiply indexed Series of state populations we saw earlier:

We can access single elements by indexing with multiple terms:
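A sketch of the Series and of a single-element lookup. The population figures are illustrative stand-ins for the data shown earlier:

```python
import pandas as pd

# Illustrative state populations indexed by (state, year)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=pd.MultiIndex.from_product(
        [['California', 'New York', 'Texas'], [2000, 2010]],
        names=['state', 'year']),
)

# One label per index level selects a single value
ca_2000 = pop['California', 2000]
print(ca_2000)
```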

The MultiIndex also supports partial indexing, or indexing just one of the levels in the index. The result is another Series, with the lower-level indices maintained:
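Partial indexing on the first level, sketched with the same illustrative figures:

```python
import pandas as pd

# Illustrative state populations indexed by (state, year)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=pd.MultiIndex.from_product(
        [['California', 'New York', 'Texas'], [2000, 2010]],
        names=['state', 'year']),
)

# Only the first level is given: the result is a Series indexed by year
california = pop['California']
print(california)
```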

Partial slicing is available as well, as long as the MultiIndex is sorted:

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:
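Both partial slicing and lower-level selection can be sketched as below (illustrative figures; the book writes the second lookup as `pop[:, 2000]`, while this sketch uses the explicit `.loc` form):

```python
import pandas as pd

# Illustrative state populations; this index is already sorted
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=pd.MultiIndex.from_product(
        [['California', 'New York', 'Texas'], [2000, 2010]],
        names=['state', 'year']),
)

# Partial slice over the first level (both endpoints included)
ca_to_ny = pop.loc['California':'New York']

# Empty slice on the first level, a label on the second
year_2000 = pop.loc[:, 2000]
print(ca_to_ny)
print(year_2000)
```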

Other types of indexing and selection (discussed in Data Indexing and Selection [here]) work as well; for example, selection based on Boolean masks:

Selection based on fancy indexing also works:
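A sketch of mask-based and fancy selection on the same illustrative Series; the threshold of 22 million is just an example value:

```python
import pandas as pd

# Illustrative state populations indexed by (state, year)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=pd.MultiIndex.from_product(
        [['California', 'New York', 'Texas'], [2000, 2010]],
        names=['state', 'year']),
)

# Boolean mask: keep only the entries above the threshold
large = pop[pop > 22000000]

# Fancy indexing: a list of first-level labels
ca_and_tx = pop.loc[['California', 'Texas']]
print(large)
print(ca_and_tx)
```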

Multiply indexed DataFrames
A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical DataFrame from before:

Remember that columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns. For example, we can recover Guido’s heart rate data with a simple operation:
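A sketch with seeded mock data in place of the toy medical DataFrame (the values are random, not real measurements):

```python
import numpy as np
import pandas as pd

# Mock health data: (year, visit) rows and (subject, type) columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)
health_data = pd.DataFrame(np.round(rng.normal(37, 5, size=(4, 6)), 1),
                           index=index, columns=columns)

# Column lookup with one label per column level
guido_hr = health_data['Guido', 'HR']
print(guido_hr)
```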

Also, as with the single-index case, we can use the loc, iloc, and ix indexers introduced in Data Indexing and Selection [same link as before] (ix is no longer available in recent pandas releases). For example:

These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices. For example:
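Positional and label-based access might be sketched like this, again on seeded mock data:

```python
import numpy as np
import pandas as pd

# Mock health data: (year, visit) rows and (subject, type) columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)
health_data = pd.DataFrame(np.round(rng.normal(37, 5, size=(4, 6)), 1),
                           index=index, columns=columns)

# iloc treats the frame as a plain two-dimensional array
corner = health_data.iloc[:2, :2]

# loc accepts a tuple of labels for a multiply indexed axis
bob_hr = health_data.loc[:, ('Bob', 'HR')]
print(corner)
print(bob_hr)
```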

Working with slices within these index tuples is not especially convenient; trying to create a slice within a tuple will lead to a syntax error:
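Since a real syntax error would stop a script before it runs, this sketch hands the offending expression to `compile()` as a string, so the error can be caught and shown:

```python
# The expression is never evaluated: compile() only parses it
expression = "health_data.loc[(:, 1), (:, 'HR')]"

try:
    compile(expression, "<example>", "eval")
    tuple_slice_is_valid = True
except SyntaxError as err:
    tuple_slice_is_valid = False
    print("SyntaxError:", err.msg)
```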

You could get around this by building the desired slice explicitly using Python’s built-in slice() function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation. For example:
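Both workarounds, sketched on seeded mock data. The selection picks the first visit of each year and the heart-rate column of every subject:

```python
import numpy as np
import pandas as pd

# Mock health data: (year, visit) rows and (subject, type) columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)
health_data = pd.DataFrame(np.round(rng.normal(37, 5, size=(4, 6)), 1),
                           index=index, columns=columns)

# IndexSlice lets the usual colon syntax appear inside the index tuples
idx = pd.IndexSlice
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]

# The same selection spelled out with the built-in slice()
same_with_slice = health_data.loc[(slice(None), 1), (slice(None), 'HR')]
print(first_visit_hr)
```

Both axes here are already lexicographically sorted, which this kind of slicing relies on.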

There are so many ways to interact with data in multiply indexed Series and DataFrames, and as with many tools in this book the best way to become familiar with them is to try them out!