Category Archives: Python

NumPy – 66 – alte prestazioni di Pandas: eval() e query() – 2

Continuo da qui, copio qui.

DataFrame.eval() per operazioni sulle colonne
Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method that works in similar ways. The benefit of the eval() method is that columns can be referred to by name. We’ll use this labeled array as an example:

Using pd.eval() as above, we can compute expressions with the three columns like this:

The DataFrame.eval() method allows much more succinct evaluation of expressions with the columns:

Notice here that we treat column names as variables within the evaluated expression, and the result is what we would wish.

Assegnazioni in DataFrame.eval()
In addition to the options just discussed, DataFrame.eval() also allows assignment to any column. Let’s use the DataFrame from before, which has columns ‘A‘, ‘B‘, and ‘C‘:

We can use df.eval() to create a new column ‘D‘ and assign to it a value computed from the other columns:

In the same way, any existing column can be modified:

Variabili locali in DataFrame.eval()
The DataFrame.eval() method supports an additional syntax that lets it work with local Python variables. Consider the following:

The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate expressions involving the two “namespaces”: the namespace of columns, and the namespace of Python objects. Notice that this @ character is only supported by the DataFrame.eval() method, not by the pandas.eval() function, because the pandas.eval() function only has access to the one (Python) namespace.

Metodi di DataFrame.query()
The DataFrame has another method based on evaluated strings, called the query() method. Consider the following:

As with the example used in our discussion of DataFrame.eval(), this is an expression involving columns of the DataFrame. It cannot be expressed using the DataFrame.eval() syntax, however! Instead, for this type of filtering operation, you can use the query() method:

In addition to being a more efficient computation, compared to the masking expression this is much easier to read and understand. Note that the query() method also accepts the @ flag to mark local variables:

Performance: quando usare queste funzioni
When considering whether to use these functions, there are two considerations: computation time and memory use. Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas DataFrames will result in implicit creation of temporary arrays: For example, this:

Is roughly equivalent to this:

If the size of the temporary DataFrames is significant compared to your available system memory (typically several gigabytes) then it’s a good idea to use an eval() or query() expression. You can check the approximate size of your array in bytes using this:

On the performance side, eval() can be faster even when you are not maxing-out your system memory. The issue is how your temporary DataFrames compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if they are much bigger, then eval() can avoid some potentially slow movement of values between the different memory caches. In practice, I find that the difference in computation time between the traditional methods and the eval/query method is usually not significant–if anything, the traditional method is faster for smaller arrays! The benefit of eval/query is mainly in the saved memory, and the sometimes cleaner syntax they offer.

We’ve covered most of the details of eval() and query() here; for more information on these, you can refer to the Pandas documentation. In particular, different parsers and engines can be specified for running these queries; for details on this, see the discussion within the “Enhancing Performance” section.


NumPy – 65 – alte prestazioni di Pandas: eval() e query() – 1

Continuo da qui, copio qui.

As we’ve already seen in previous sections, the power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into C via an intuitive syntax: examples are vectorized/broadcasted operations in NumPy, and grouping-type operations in Pandas. While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.

As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays. These are the eval() and query() functions, which rely on the Numexpr package. In this notebook we will walk through their use and give some rules-of-thumb about when you might think about using them.

Perché query() e eval(): espressioni composte
We’ve seen previously that NumPy and Pandas support fast vectorized operations; for example, when adding the elements of two arrays:

As discussed in Computation on NumPy Arrays: Universal Functions [qui], this is much faster than doing the addition via a Python loop or comprehension:

But this abstraction can become less efficient when computing compound expressions. For example, consider the following expression:

mask = (x > 0.5) & (y < 0.5)

Because NumPy evaluates each subexpression, this is roughly equivalent to the following:

tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2

In other words, every intermediate step is explicitly allocated in memory. If the x and y arrays are very large, this can lead to significant memory and computational overhead. The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays. The Numexpr documentation has more details, but for the time being it is sufficient to say that the library accepts a string giving the NumPy-style expression you’d like to compute:

The benefit here is that Numexpr evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays. The Pandas eval() and query() tools that we will discuss here are conceptually similar, and depend on the Numexpr package.

pandas.eval() per operazioni efficienti
The eval() function in Pandas uses string expressions to efficiently compute operations using DataFrames. For example, consider the following DataFrames:

To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum:

The same result can be computed via pd.eval by constructing the expression as a string:

The eval() version of this expression is about 50% faster (and uses much less memory), while giving the same result:

Operazioni supportate da pd.eval()
As of Pandas v0.16, pd.eval() supports a wide range of operations. To demonstrate these, we’ll use the following integer DataFrames:

operatori aritmetici
pd.eval() supports all arithmetic operators. For example:

operatori di comparazione
pd.eval() supports all comparison operators, including chained expressions:

operatori bitwise
pd.eval() supports the & and | bitwise operators:

In addition, it supports the use of the literal and and or in Boolean expressions:

attributi di oggetti e indici
pd.eval() supports access to object attributes via the obj.attr syntax, and indexes via the obj[index] syntax:

altre operazioni
Other operations such as function calls, conditional statements, loops, and other more involved constructs are currently not implemented in pd.eval(). If you’d like to execute these more complicated types of expressions, you can use the Numexpr library itself.


NumPy – 64 – lavorare con Series temporali – 5

Continuo da qui, copio qui.

Esempio: Seattle in bicicletta
As a more involved example of working with some time series data, let’s take a look at bicycle counts on Seattle’s Fremont Bridge. This data comes from an automated bicycle counter, installed in late 2012, which has inductive sensors on the east and west sidewalks of the bridge. The hourly bicycle counts can be downloaded from; here is the direct link to the dataset.

As of summer 2016, the CSV can be downloaded as follows:

Once this dataset is downloaded, we can use Pandas to read the CSV output into a DataFrame. We will specify that we want the Date as an index, and we want these dates to be automatically parsed:

For convenience, we’ll further process this dataset by shortening the column names and adding a “Total” column:

Now let’s take a look at the summary statistics for this data:

Visualizzare i dati
We can gain some insight into the dataset by visualizing it. Let’s start by plotting the raw data:

OK, un errore e un ; dimenticato ma ecco

The ~25,000 hourly samples are far too dense for us to make much sense of. We can gain more insight by resampling the data to a coarser grid. Let’s resample by week:


This shows us some interesting seasonal trends: as you might expect, people bicycle more in the summer than in the winter, and even within a particular season the bicycle use varies from week to week (likely dependent on weather; see In Depth: Linear Regression [prossimamente] where we explore this further).

Another way that comes in handy for aggregating the data is to use a rolling mean, utilizing the pd.rolling_mean() function. Here we’ll do a 30 day rolling mean of our data, making sure to center the window:


The jaggedness of the result is due to the hard cutoff of the window. We can get a smoother version of a rolling mean using a window function–for example, a Gaussian window. The following code specifies both the width of the window (we chose 50 days) and the width of the Gaussian within the window (we chose 10 days):


Investigare i dati
While these smoothed data views are useful to get an idea of the general trend in the data, they hide much of the interesting structure. For example, we might want to look at the average traffic as a function of the time of day. We can do this using the GroupBy functionality discussed in Aggregation and Grouping [qui]:


The hourly traffic is a strongly bimodal distribution, with peaks around 8:00 in the morning and 5:00 in the evening. This is likely evidence of a strong component of commuter traffic crossing the bridge. This is further evidenced by the differences between the western sidewalk (generally used going toward downtown Seattle), which peaks more strongly in the morning, and the eastern sidewalk (generally used going away from downtown Seattle), which peaks more strongly in the evening.

We also might be curious about how things change based on the day of the week. Again, we can do this with a simple groupby:


This shows a strong distinction between weekday and weekend totals, with around twice as many average riders crossing the bridge on Monday through Friday than on Saturday and Sunday.

With this in mind, let’s do a compound GroupBy and look at the hourly trend on weekdays versus weekends. We’ll start by grouping by both a flag marking the weekend, and the time of day:

Now we’ll use some of the Matplotlib tools described in Multiple Subplots [prossimamente] to plot two panels side by side:


The result is very interesting: we see a bimodal commute pattern during the work week, and a unimodal recreational pattern during the weekends. It would be interesting to dig through this data in more detail, and examine the effect of weather, temperature, time of year, and other factors on people’s commuting patterns; for further discussion, see my blog post “Is Seattle Really Seeing an Uptick In Cycling?“, which uses a subset of this data. We will also revisit this dataset in the context of modeling in In Depth: Linear Regression [prossimamente].


NumPy – 63 – lavorare con Series temporali – 4

Continuo da qui, copio qui.

Riorganizzare, spostare e visualizzare
The ability to use dates and times as indices to intuitively organize and access data is an important piece of the Pandas time series tools. The benefits of indexed data in general (automatic alignment during operations, intuitive data slicing and access, etc.) still apply, and Pandas provides several additional time series-specific operations.

We will take a look at a few of those here, using some stock price data as an example. Because Pandas was developed largely in a finance context, it includes some very specific tools for financial data. For example, the accompanying pandas-datareader package (installable via conda install pandas-datareader), knows how to import financial data from a number of available sources, including Yahoo finance, Google Finance, and others. Here we will load Google’s closing price history:

Uh! installo:

ed ecco:

For simplicity, we’ll use just the closing price:

We can visualize this using the plot() method, after the normal Matplotlib setup boilerplate [prossimamente]:


Riorganizzare e convertire le frequenze
One common need for time series data is resampling at a higher or lower frequency. This can be done using the resample() method, or the much simpler asfreq() method. The primary difference between the two is that resample() is fundamentally a data aggregation, while asfreq() is fundamentally a data selection.

Taking a look at the Google closing price, let’s compare what the two return when we down-sample the data. Here we will resample the data at the end of business year:


Notice the difference: at each point, resample reports the average of the previous year, while asfreq reports the value at the end of the year.

For up-sampling, resample() and asfreq() are largely equivalent, though resample has many more options available. In this case, the default for both methods is to leave the up-sampled points empty, that is, filled with NA values. Just as with the pd.fillna() function discussed previously, asfreq() accepts a method argument to specify how values are imputed. Here, we will resample the business day data at a daily frequency (i.e., including weekends):


The top panel is the default: non-business days are left as NA values and do not appear on the plot. The bottom panel shows the differences between two strategies for filling the gaps: forward-filling and backward-filling.

Spostamento del tempo
Another common time series-specific operation is shifting of data in time. Pandas has two closely related methods for computing this: shift() and tshift() In short, the difference between them is that shift() shifts the data, while tshift() shifts the index. In both cases, the shift is specified in multiples of the frequency.

Here we will both shift() and tshift() by 900 days;

Nella REPL di IPython posso controllare passo-passo mentre apprendo, cosa che viene meno facendo uno script; certo che questo screenshot è abnorme.

Sì, dimenticato un import ma rimediato al volo 😜

We see here that shift(900) shifts the data by 900 days, pushing some of it off the end of the graph (and leaving NA values at the other end), while tshift(900) shifts the index values by 900 days.

A common context for this type of shift is in computing differences over time. For example, we use shifted values to compute the one-year return on investment for Google stock over the course of the dataset:


This helps us to see the overall trend in Google stock: thus far, the most profitable times to invest in Google have been (unsurprisingly, in retrospect) shortly after its IPO, and in the middle of the 2009 recession.

Rolling windows
yep, non so tradurlo
Rolling statistics are a third type of time series-specific operation implemented by Pandas. These can be accomplished via the rolling() attribute of Series and DataFrame objects, which returns a view similar to what we saw with the groupby operation (see Aggregation and Grouping [qui]). This rolling view makes available a number of aggregation operations by default.

For example, here is the one-year centered rolling mean and standard deviation of the Google stock prices:


As with group-by operations, the aggregate() and apply() methods can be used for custom rolling computations.

Per saperne di più
This section has provided only a brief summary of some of the most essential features of time series tools provided by Pandas; for a more complete discussion, you can refer to the “Time Series/Date” section of the Pandas online documentation.

Another excellent resource is the textbook Python for Data Analysis by Wes McKinney (OReilly, 2012). Although it is now a few years old, it is an invaluable resource on the use of Pandas. In particular, this book emphasizes time series tools in the context of business and finance, and focuses much more on particular details of business calendars, time zones, and related topics.

As always, you can also use the IPython help functionality to explore and try further options available to the functions and methods discussed here. I find this often is the best way to learn a new Python tool.

Non so voi ma per me Pandas & Jake rockzs! 🚀 assay 💥


NumPy – 62 – lavorare con Series temporali – 3

Continuo da qui, copio qui.

Frequenze e offsets
Fundamental to these Pandas time series tools is the concept of a frequency or date offset. Just as we saw the D (day) and H (hour) codes above [post pecedente], we can use such codes to specify any desired frequency spacing. The following table summarizes the main codes available:

Code Description
D    Calendar day
W    Weekly   
M    Month end  
Q    Quarter end  
A    Year end  
H    Hours  
T    Minutes   
S    Seconds   
L    Milliseonds   
U    Microseconds   
N    nanoseconds   
B    Business day
BM   Business month end
BQ   Business quarter end
BA   Business year end
BH   Business hours

The monthly, quarterly, and annual frequencies are all marked at the end of the specified period. By adding an S suffix to any of these, they instead will be marked at the beginning:

Code Description
MS   Month start   
QS   Quarter start
AS   Year start   
BMS  Business month start
BQS  Business quarter start
BAS  Business year start

Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix:



In the same way, the split-point of the weekly frequency can be modified by adding a three-letter weekday code:

W-SUN, W-MON, W-TUE, W-WED, etc.

On top of this, codes can be combined with numbers to specify other frequencies. For example, for a frequency of 2 hours 30 minutes, we can combine the hour (H) and minute (T) codes as follows:

All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the pd.tseries.offsets module. For example, we can create a business day offset directly as follows:

For more discussion of the use of frequencies and offsets, see the “DateOffset” section of the Pandas documentation.


NumPy – 61 – lavorare con Series temporali – 2

Continuo da qui, copio qui.

Series Pandas per indicizzare con il tempo
Where the Pandas time series tools really become useful is when you begin to index data by timestamps. For example, we can construct a Series object that has time indexed data:

Now that we have this data in a Series, we can make use of any of the Series indexing patterns we discussed in previous sections, passing values that can be coerced into dates:

There are additional special date-only indexing operations, such as passing a year to obtain a slice of all data from that year:

Later, we will see additional examples of the convenience of dates-as-indices. But first, a closer look at the available time series data structures.

Strutture Pandas per Series di tempo
This section will introduce the fundamental Pandas data structures for working with time series data:

  • For time stamps, Pandas provides the Timestamp type. As mentioned before, it is essentially a replacement for Python’s native datetime, but is based on the more efficient numpy.datetime64 data type. The associated Index structure is DatetimeIndex.
  • For time Periods, Pandas provides the Period type. This encodes a fixed-frequency interval based on numpy.datetime64. The associated index structure is PeriodIndex.
  • For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a more efficient replacement for Python’s native datetime.timedelta type, and is based on numpy.timedelta64. The associated index structure is TimedeltaIndex.

The most fundamental of these date/time objects are the Timestamp and DatetimeIndex objects. While these class objects can be invoked directly, it is more common to use the pd.to_datetime() function, which can parse a wide variety of formats. Passing a single date to pd.to_datetime() yields a Timestamp; passing a series of dates by default yields a DatetimeIndex:

Nota: sono solo io o Jake non ha detto che ha importato datetime?

Any DatetimeIndex can be converted to a PeriodIndex with the to_period() function with the addition of a frequency code; here we’ll use ‘D’ to indicate daily frequency:

A TimedeltaIndex is created, for example, when a date is subtracted from another:

Sequenze regolari: pd.date_range()
To make the creation of regular date sequences more convenient, Pandas offers a few functions for this purpose: pd.date_range() for timestamps, pd.period_range() for periods, and pd.timedelta_range() for time deltas. We’ve seen that Python’s range() and NumPy’s np.arange() turn a startpoint, endpoint, and optional stepsize into a sequence. Similarly, pd.date_range() accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates. By default, the frequency is one day:

Alternatively, the date range can be specified not with a start and endpoint, but with a startpoint and a number of periods:

The spacing can be modified by altering the freq argument, which defaults to D. For example, here we will construct a range of hourly timestamps:

To create regular sequences of Period or Timedelta values, the very similar pd.period_range() and pd.timedelta_range() functions are useful. Here are some monthly periods:

And a sequence of durations increasing by an hour:

All of these require an understanding of Pandas frequency codes, which we’ll summarize in the next section.


NumPy – 59 – operazioni con stringhe vettorializzate – 2

Continuo da qui, copio qui.

Esempio: un database di ricette
These vectorized string operations become most useful in the process of cleaning up messy, real-world data. Here I’ll walk through an example of that, using an open recipe database compiled from various sources on the Web. Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.

The scripts used to compile this can be found [on GitHub], and the link to the current version of the database is found there as well.

As of Spring 2016, this database is about 30 MB, and can be downloaded and unzipped with these commands:

OOPS! vuoto; googlando l’ho trovato qua.

The database is in JSON format, so we will try pd.read_json to read it:

Oops! We get a ValueError mentioning that there is “trailing data.” Searching for the text of this error on the Internet, it seems that it’s due to using a file in which each line is itself a valid JSON, but the full file is not. Let’s check if this interpretation is true:

Yes, apparently each line is a valid JSON, so we’ll need to string them together. One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with pd.read_json:

We see there are nearly 200,000 recipes, and 17 columns. Let’s take a look at one row to see what we have:

There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web. In particular, the ingredient list is in string format; we’re going to have to carefully extract the information we’re interested in. Let’s start by taking a closer look at the ingredients:

The ingredient lists average 250 characters long, with a minimum of 0 and a maximum of nearly 10,000 characters!

Just out of curiousity, let’s see which recipe has the longest ingredient list:

That certainly looks like an involved recipe.

We can do other aggregate explorations; for example, let’s see how many of the recipes are for breakfast food:

Or how many of the recipes list cinnamon as an ingredient:

This is the type of essential data exploration that is possible with Pandas string tools. It is data munging like this that Python really excels at.

Un semplice suggeritore di ricette
Let’s go a bit further, and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients. While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row. So we will cheat a bit: we’ll start with a list of common ingredients, and simply search to see whether they are in each recipe’s ingredient list. For simplicity, let’s just stick with herbs and spices for the time being:

We can then build a Boolean DataFrame consisting of True and False values, indicating whether this ingredient appears in the list:

Now, as an example, let’s say we’d like to find a recipe that uses parsley, paprika, and tarragon. We can compute this very quickly using the query() method of DataFrames, discussed in High-Performance Pandas: eval() and query() [prossimamente]:

We find only 10 recipes with this combination; let’s use the index returned by this selection to discover the names of the recipes that have this combination:

Now that we have narrowed down our recipe selection by a factor of almost 20,000, we are in a position to make a more informed decision about what we’d like to cook for dinner.

Avanti con le ricette!
Hopefully this example has given you a bit of a flavor (ba-dum!) for the types of data cleaning operations that are efficiently enabled by Pandas string methods. Of course, building a very robust recipe recommendation system would require a lot more work! Extracting full ingredient lists from each recipe would be an important piece of the task; unfortunately, the wide variety of formats used makes this a relatively time-consuming process. This points to the truism that in data science, cleaning and munging of real-world data often comprises the majority of the work, and Pandas provides the tools that can help you do this efficiently.


NumPy – 58 – operazioni con stringhe vettorializzate – 1

Continuo da qui, copio qui.

One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data. In this section, we’ll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.

Introuzione alle operazioni sulle stringhe di Pandas
We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:

This vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done. For arrays of strings, NumPy does not provide such simple access, and thus you’re stuck using a more verbose loop syntax:

This is perhaps sufficient to work with some data, but it will break if there are any missing values. For example:

e pensa te che mio nonno era di None (TO) 😜

Pandas includes features to address both this need for vectorized string operations and for correctly handling missing data via the str attribute of Pandas Series and Index objects containing strings. So, for example, suppose we create a Pandas Series with this data:

We can now call a single method that will capitalize all the entries, while skipping over any missing values:

Using tab completion on this str attribute will list all the vectorized string methods available to Pandas

Tabella dei metodi di Pandas per le stringhe
If you have a good understanding of string manipulation in Python, most of Pandas string syntax is intuitive enough that it’s probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties. The examples in this section use the following series of names:

metodi simili ai metodi di Python per le stringhe
Nearly all Python’s built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods that mirror Python string methods:

len()     lower()      translate()  islower()
ljust()   upper()      startswith() isupper()
rjust()   find()       endswith()   isnumeric()
center()  rfind()      isalnum()    isdecimal()
zfill()   index()      isalpha()    split()
strip()   rindex()     isdigit()    rsplit()
rstrip()  capitalize() isspace()    partition()
lstrip()  swapcase()   istitle()    rpartition()

Notice that these have various return values. Some, like lower(), return a series of strings:

But some others return numbers:

Or Boolean values:

Still others return lists or other compound values for each element:

We’ll see further manipulations of this kind of series-of-lists object as we continue our discussion.

metodi usanti espressioni regolari
In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python’s built-in re module:

Method     Description
match()    Call re.match() on each element, returning a boolean.
extract()  Call re.match() on each element, returning matched groups as strings.
findall()  Call re.findall() on each element
replace()  Replace occurrences of pattern with some other string
contains() Call on each element, returning a boolean
count()    Count occurrences of pattern
split()    Equivalent to str.split(), but accepts regexps
rsplit()   Equivalent to str.rsplit(), but accepts regexps

With these, you can do a wide range of interesting operations. For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:

Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (^) and end-of-string ($) regular expression characters:

The ability to concisely apply regular expressions across Series or Dataframe entries opens up many possibilities for analysis and cleaning of data.

metodi miscellanei
Finally, there are some miscellaneous methods that enable other convenient operations:

Method          Description
get()           Index each element
slice()         Slice each element
slice_replace() Replace slice in each element with passed value
cat()           Concatenate strings
repeat()        Repeat values
normalize()     Return Unicode form of string
pad()           Add whitespace to left, right, or both sides of strings
wrap()          Split long strings into lines with length less than a given width
join()          Join strings in each element of the Series with passed separator
get_dummies()   extract dummy variables as a dataframe

accedere e suddividere elementi vettorializzati
The get() and slice() operations, in particular, enable vectorized element access from each array. For example, we can get a slice of the first three characters of each array using str.slice(0, 3). Note that this behavior is also available through Python’s normal indexing syntax–for example, df.str.slice(0, 3) is equivalent to df.str[0:3]:

Indexing via df.str.get(i) and df.str[i] is likewise similar.

These get() and slice() methods also let you access elements of arrays returned by split(). For example, to extract the last name of each entry, we can combine split() and get():

variabili indicatrici
Another method that requires a bit of extra explanation is the get_dummies() method. This is useful when your data has a column containing some sort of coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A="born in America", B="born in the United Kingdom", C="likes cheese", D="likes spam":

The get_dummies() routine lets you quickly split-out these indicator variables into a DataFrame:

With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.

We won’t dive further into these methods here, but I encourage you to read through “Working with Text Data” in the Pandas online documentation, or to refer to the resources listed in Further Resources [prossimamente].


NumPy – 57 – tabelle con pivot – 2

Continuo da qui, copio qui.

Esempio: data di nascita
As a more interesting example, let’s take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC). This data can be found here  (this dataset has been analyzed rather extensively by Andrew Gelman and his group; see, for example, this blog post):

e poi…

We can start to understand this data a bit more by using a pivot table. Let’s add a decade column, and take a look at male and female births as a function of decade:

We immediately see that male births outnumber female births in every decade. To see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to visualize the total number of births by year (see Introduction to Matplotlib for a discussion of plotting with Matplotlib [prossimamente]):

ed ecco:

With a simple pivot table and plot() method, we can immediately see the annual trend in births by gender. By eye, it appears that over the past 50 years male births have outnumbered female births by around 5%.

Ulteriori esplorazione dei dati
Though this doesn’t necessarily relate to the pivot table, there are a few more interesting features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th). One easy way to remove these all at once is to cut outliers; we’ll do this via a robust sigma-clipping operation:

This final line is a robust estimate of the sample mean, where the 0.74 comes from the interquartile range of a Gaussian distribution (You can learn more about sigma-clipping operations in a book I coauthored with Željko Ivezić, Andrew J. Connolly, and Alexander Gray: “Statistics, Data Mining, and Machine Learning in Astronomy” (Princeton University Press, 2014)).

With this we can use the query() method (discussed further in High-Performance Pandas: eval() and query()) [prossimamente] to filter-out rows with births outside these values:

Next we set the day column to integers; previously it had been a string because some columns in the dataset contained the value ‘null‘:

Finally, we can combine the day, month, and year to create a Date index (see Working with Time Series [prossimamente]). This allows us to quickly compute the weekday corresponding to each row:

Using this we can plot births by weekday for several decades:


Nota: dimenticato l’istruzione plt.ylabel('mean births by day').

Apparently births are slightly less common on weekends than on weekdays! Note that the 1990s and 2000s are missing because the CDC data contains only the month of birth starting in 1989.

Another intersting view is to plot the mean number of births by the day of the year. Let’s first group the data by month and day separately:

Focusing on the month and day only, we now have a time series reflecting the average number of births by date of the year. From this, we can use the plot method to plot the data. It reveals some interesting trends:

ed ecco:

In particular, the striking feature of this graph is the dip in birthrate on US holidays (e.g., Independence Day, Labor Day, Thanksgiving, Christmas, New Year’s Day) although this likely reflects trends in scheduled/induced births rather than some deep psychosomatic effect on natural births. For more discussion on this trend, see the analysis and links in [stesso link pecedente] on the subject. We’ll return to this figure in Example:-Effect-of-Holidays-on-US-Births [prossimamente], where we will use Matplotlib’s tools to annotate this plot.

Looking at this short example, you can see that many of the Python and Pandas tools we’ve seen to this point can be combined and used to gain insight from a variety of datasets. We will see some more sophisticated applications of these data manipulations in future sections!


NumPy – 56 – tabelle con pivot – 1

Continuo da qui, copio qui.

We have seen how the GroupBy abstraction lets us explore relationships within a dataset. A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. The difference between pivot tables and GroupBy can sometimes cause confusion; it helps me to think of pivot tables as essentially a multidimensional version of GroupBy aggregation. That is, you split-apply-combine, but both the split and the combine happen across not a one-dimensional index, but across a two-dimensional grid.

Illustrare le tabelle pivot
For the examples in this section, we’ll use the database of passengers on the Titanic, available through the Seaborn library (see Visualization With Seaborn [prossimamente]):

This contains a wealth of information on each passenger of that ill-fated voyage, including gender, age, class, fare paid, and much more.

Tabelle pivot a mano
To start learning more about this data, we might begin by grouping according to gender, survival status, or some combination thereof. If you have read the previous section, you might be tempted to apply a GroupBy operation–for example, let’s look at survival rate by gender:

This immediately gives us some insight: overall, three of every four females on board survived, while only one in five males survived!

This is useful, but we might like to go one step deeper and look at survival by both sex and, say, class. Using the vocabulary of GroupBy, we might proceed using something like this: we group by class and gender, select survival, apply a mean aggregate, combine the resulting groups, and then unstack the hierarchical index to reveal the hidden multidimensionality. In code:

This gives us a better idea of how both gender and class affected survival, but the code is starting to look a bit garbled. While each step of this pipeline makes sense in light of the tools we’ve previously discussed, the long string of code is not particularly easy to read or use. This two-dimensional GroupBy is common enough that Pandas includes a convenience routine, pivot_table, which succinctly handles this type of multi-dimensional aggregation.

Sintassi delle tabelle pivot
Here is the equivalent to the preceding operation using the pivot_table method of DataFrames:

This is eminently more readable than the groupby approach, and produces the same result. As you might expect of an early 20th-century transatlantic cruise, the survival gradient favors both women and higher classes. First-class women survived with near certainty (hi, Rose!), while only one in ten third-class men survived (sorry, Jack!).

Tabelle pivot multilivello
Just as in the GroupBy, the grouping in pivot tables can be specified with multiple levels, and via a number of options. For example, we might be interested in looking at age as a third dimension. We’ll bin the age using the pd.cut function:

We can apply the same strategy when working with the columns as well; let’s add info on the fare paid using pd.qcut to automatically compute quantiles:

The result is a four-dimensional aggregation with hierarchical indices (see Hierarchical Indexing [qui]), shown in a grid demonstrating the relationship between the values.

Opzioni addizionali per le tabelle pivot
The full call signature of the pivot_table method of DataFrames is as follows:

# call signature as of Pandas 0.18
DataFrame.pivot_table(data, values=None, index=None, 
                      columns=None, aggfunc='mean', 
                      fill_value=None, margins=False,
                      dropna=True, margins_name='All')

uhmmmm… mi da errore name ‘DataFrame’ is not defined.

We’ve already seen examples of the first three arguments; here we’ll take a quick look at the remaining ones. Two of the options, fill_value and dropna, have to do with missing data and are fairly straightforward; we will not show examples of them here.

The aggfunc keyword controls what type of aggregation is applied, which is a mean by default. As in the GroupBy, the aggregation specification can be a string representing one of several common choices (e.g., ‘sum’, ‘mean’, ‘count’, ‘min’, ‘max’, etc.) or a function that implements an aggregation (e.g., np.sum(), min(), sum(), etc.). Additionally, it can be specified as a dictionary mapping a column to any of the above desired options:

Notice also here that we’ve omitted the values keyword; when specifying a mapping for aggfunc, this is determined automatically.

At times it’s useful to compute totals along each grouping. This can be done via the margins keyword:

Here this automatically gives us information about the class-agnostic survival rate by gender, the gender-agnostic survival rate by class, and the overall survival rate of 38%. The margin label can be specified with the margins_name keyword, which defaults to “All“.