Category Archives: Python

NumPy – 37 – introduction to Pandas objects – 3

Continuing from here, copying from here.

The Pandas Index object
We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Those views have some interesting consequences in the operations available on Index objects. As a simple example, let’s construct an Index from a list of integers:
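The snippet from the original post is omitted here; a minimal sketch of what it likely showed:

```python
import pandas as pd

# construct an Index from a list of integers
ind = pd.Index([2, 3, 5, 7, 11])
print(ind)
```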

Index as an immutable array
The Index in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:

Index objects also have many of the attributes familiar from NumPy arrays:
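A sketch of the omitted examples, showing both the array-style access and the familiar attributes:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# standard Python indexing notation retrieves values or slices
print(ind[1])     # a single value
print(ind[::2])   # a sliced Index

# NumPy-style attributes
print(ind.size, ind.shape, ind.ndim, ind.dtype)
```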

One difference between Index objects and NumPy arrays is that indices are immutable–that is, they cannot be modified via the normal means:
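The omitted demonstration presumably triggered the error deliberately; something like:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Index objects are immutable: item assignment raises a TypeError
try:
    ind[1] = 0
except TypeError as err:
    print("TypeError:", err)
```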

This immutability makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification.

Index as an ordered set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

These operations may also be accessed via object methods, for example:
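A sketch of the set operations via methods. (In older pandas the `&`, `|`, and `^` operators also performed set arithmetic on Index objects; in pandas 2.0+ they are element-wise, so the method form shown here is the portable one.)

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

inter = indA.intersection(indB)        # elements in both
union = indA.union(indB)               # all elements
sym = indA.symmetric_difference(indB)  # elements in exactly one
print(inter, union, sym)
```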


NumPy – 36 – introduction to Pandas objects – 2

Continuing from here, copying from here.

The Pandas DataFrame object
The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We’ll now take a look at each of these perspectives.

DataFrame as a generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by “aligned” we mean that they share the same index.

To demonstrate this, let’s first construct a new Series listing the area of each of the five states discussed in the previous section:

Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:
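A sketch of the construction, using the state data from the book's example (the figures are the ones the book uses):

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Florida': 170312,
                  'Illinois': 149995, 'New York': 141297,
                  'Texas': 695662})
population = pd.Series({'California': 38332521, 'Florida': 19552860,
                        'Illinois': 12882135, 'New York': 19651127,
                        'Texas': 26448193})

# a dictionary of Series becomes a two-dimensional DataFrame
states = pd.DataFrame({'population': population, 'area': area})

print(states.index)    # the shared row labels
print(states.columns)  # the column labels, itself an Index object
```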

Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:

Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

DataFrame as a specialized dictionary
Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the ‘area’ attribute returns the Series object containing the areas we saw earlier:
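A minimal sketch of that column access (a trimmed-down version of the states example):

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
states = pd.DataFrame({'population': pop, 'area': area})

# dictionary-style access returns the column as a Series
col = states['area']
print(col)
```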

Notice the potential point of confusion here: in a two-dimensional NumPy array, data[0] will return the first row. For a DataFrame, data['col0'] will return the first column. Because of this, it is probably better to think about DataFrames as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful. We’ll explore more flexible means of indexing DataFrames in Data Indexing and Selection.

Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways. Here we’ll give several examples.

From a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:

From a list of dictionaries
Any list of dictionaries can be made into a DataFrame. We’ll use a simple list comprehension to create some data:

Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., “not a number”) values:
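A sketch of both omitted snippets, including the NaN fill for missing keys:

```python
import pandas as pd

# a simple list comprehension to create some data
data = [{'a': i, 'b': 2 * i} for i in range(3)]
df = pd.DataFrame(data)
print(df)

# missing keys are filled in with NaN
df2 = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
print(df2)
```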

From a dictionary of Series objects
As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:

From a two-dimensional NumPy array
Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:
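A sketch, with column and index names chosen for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((3, 2)),
                  columns=['foo', 'bar'],
                  index=['a', 'b', 'c'])
print(df)
```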

From a structured NumPy array
We covered structured arrays in Structured Data: NumPy’s Structured Arrays. A Pandas DataFrame operates much like a structured array, and can be created directly from one:
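A minimal sketch of this last constructor:

```python
import numpy as np
import pandas as pd

# a structured array with integer and float fields
A = np.zeros(3, dtype=[('A', '<i8'), ('B', '<f8')])
df = pd.DataFrame(A)
print(df)
```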


NumPy – 35 – introduction to Pandas objects – 1

Continuing from here, starting Pandas; copying from here.

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are. Thus, before we go any further, let’s introduce these three fundamental Pandas data structures: the Series, DataFrame, and Index.

We will start our code sessions with the standard NumPy and Pandas imports:

The Series object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:

The index is an array-like object of type pd.Index, which we’ll discuss in more detail momentarily.

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:
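A sketch of the omitted snippets, covering creation, the values and index attributes, and square-bracket access:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

print(data.values)   # a plain NumPy array
print(data.index)    # a pd.Index (a RangeIndex by default)

# square-bracket access by the associated index
print(data[1])       # a single value
print(data[1:3])     # a slice, still a Series
```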

As we will see, though, the Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.

Series as a generalized NumPy array
From what we’ve seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. For example, if we wish, we can use strings as an index, and the item access works as expected:

We can even use non-contiguous or non-sequential indices:
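A sketch of both examples: string indices, and a non-sequential integer index (lookups are by label):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data['b'])

# non-contiguous, non-sequential indices also work
data2 = pd.Series([0.25, 0.5, 0.75, 1.0],
                  index=[2, 5, 3, 7])
print(data2[5])   # label-based lookup, not positional
```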

Series as a specialized dictionary
In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:

By default, a Series will be created where the index is drawn from the sorted keys. From here, typical dictionary-style item access can be performed:

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:
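A sketch of the dict construction, item access, and label slicing (note that label-based slices include the endpoint):

```python
import pandas as pd

population_dict = {'California': 38332521, 'Florida': 19552860,
                   'Illinois': 12882135, 'New York': 19651127,
                   'Texas': 26448193}
population = pd.Series(population_dict)

# dictionary-style item access
print(population['California'])

# array-style slicing by label, inclusive of the endpoint
print(population['California':'Illinois'])
```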

We’ll discuss some of the quirks of Pandas indexing and slicing in Data Indexing and Selection.

Constructing Series objects
We’ve already seen a few ways of constructing a Pandas Series from scratch; all of them are some version of the following:

pd.Series(data, index=index)

where index is an optional argument, and data can be one of many entities.

For example, data can be a list or NumPy array, in which case index defaults to an integer sequence:

data can be a scalar, which is repeated to fill the specified index:

data can be a dictionary, in which index defaults to the sorted dictionary keys:

In each case, the index can be explicitly set if a different result is preferred:
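A sketch covering the constructor variants above. (One hedge: recent pandas keeps dictionary keys in insertion order rather than sorting them, so the dict example may differ from the older behavior the text describes.)

```python
import pandas as pd

# data as a list: index defaults to an integer sequence
s1 = pd.Series([2, 4, 6])

# scalar data, repeated to fill the specified index
s2 = pd.Series(5, index=[100, 200, 300])

# dictionary data (modern pandas preserves insertion order)
s3 = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# an explicit index selects only the identified keys, in order
s4 = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
print(s4)
```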

Notice that in this case, the Series is populated only with the explicitly identified keys.


NumPy – 34 – manipulating data with Pandas

Continuing from here, starting to copy from a new chapter, here.

So far we have seen the ndarray objects provided by NumPy. But there is also the Pandas package: Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

As we saw, NumPy’s ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s time.

We will focus on the mechanics of using Series, DataFrame, and related structures effectively. We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.

Installing Pandas
Just follow the instructions in the documentation.
Once Pandas is installed, you can import it and check the version:

Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:
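A sketch of the version check and the conventional imports:

```python
import pandas
print(pandas.__version__)

# the conventional aliases used throughout
import numpy as np
import pandas as pd
```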

A note on documentation
IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ? character). (Refer back to Help and Documentation in IPython if you need a refresher on this.)
For example, to display all the contents of the pandas namespace, you can type pd.<TAB>:

This is just the beginning; the list is extremely long.
And to display Pandas’s built-in documentation, you can use this:

More detailed documentation, along with tutorials and other resources, can be found here.


NumPy – 33 – structured data – NumPy structured arrays – 2

Continuing from here, copying here.

Creating structured arrays
Structured array data types can be specified in a number of ways. Earlier, we saw the dictionary method:

For clarity, numerical types can be specified using Python types or NumPy dtypes instead:

A compound type can also be specified as a list of tuples:

If the names of the types do not matter to you, you can specify the types alone in a comma-separated string:
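A sketch of the three dtype-specification styles described above:

```python
import numpy as np

# the dictionary method, with Python types / NumPy dtypes
dt1 = np.dtype({'names': ('name', 'age', 'weight'),
                'formats': ((np.str_, 10), int, np.float32)})

# a list of tuples
dt2 = np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

# types alone in a comma-separated string; fields get default names f0, f1, ...
dt3 = np.dtype('S10,i4,f8')
print(dt1, dt2, dt3, sep='\n')
```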

The shortened string format codes may seem confusing, but they are built on simple principles. The first (optional) character is < or >, which means “little endian” or “big endian,” respectively, and specifies the ordering convention for significant bits. The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below). The last character or characters represents the size of the object in bytes.

Character Description            Example
'b'       Byte                   np.dtype('b')
'i'       Signed integer         np.dtype('i4') == np.int32
'u'       Unsigned integer       np.dtype('u1') == np.uint8
'f'       Floating point         np.dtype('f8') == np.float64
'c'       Complex floating point np.dtype('c16') == np.complex128
'S', 'a'  String                 np.dtype('S5')
'U'       Unicode string         np.dtype('U') == np.str_
'V'       Raw data (void)        np.dtype('V') == np.void

More on advanced compound types
It is possible to define even more advanced compound types. For example, you can create a type where each element contains an array or matrix of values. Here, we’ll create a data type with a mat component consisting of a 3×3 floating-point matrix:
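A sketch of that compound type:

```python
import numpy as np

# each element holds an id and a 3x3 floating-point matrix
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)
print(X[0])
print(X['mat'][0])   # the 3x3 matrix of the first element
```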

Now each element in the X array consists of an id and a 3×3 matrix. Why would you use this rather than a simple multidimensional array, or perhaps a Python dictionary? The reason is that this NumPy dtype directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program. If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you’ll probably find structured arrays quite useful!

RecordArrays: structured arrays with a twist
NumPy also provides the np.recarray class, which is almost identical to the structured arrays just described, but with one additional feature: fields can be accessed as attributes rather than as dictionary keys. Recall that we previously accessed the ages by writing:

I had to rebuild the array, of course 😉

If we view our data as a record array instead, we can access this with slightly fewer keystrokes:
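A sketch, rebuilding the people array and viewing it as a record array:

```python
import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

# view as a record array: fields become attributes
data_rec = data.view(np.recarray)
print(data_rec.age)   # equivalent to data['age']
```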

The downside is that for record arrays, there is some extra overhead involved in accessing the fields, even when using the same syntax. We can see this here:

Whether the more convenient notation is worth the additional overhead will depend on your own application.

But there’s Pandas
This section on structured and record arrays is purposely at the end of this chapter, because it leads so well into the next package we will cover: Pandas. Structured arrays like the ones discussed here are good to know about for certain situations, especially in case you’re using NumPy arrays to map onto binary data formats in C, Fortran, or another language. For day-to-day use of structured data, the Pandas package is a much better choice, and we’ll dive into a full discussion of it in the chapter that follows.

OK Jake; we’ll wait for Pandas 😀


NumPy – 32 – structured data – NumPy structured arrays – 1

Continuing from here, copying here.

While often our data can be well represented by a homogeneous array of values, sometimes this is not the case. This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient storage for compound, heterogeneous data. While the patterns shown here are useful for simple operations, scenarios like this often lend themselves to the use of Pandas DataFrames, which we’ll explore [coming soon].

Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we’d like to store these values for use in a Python program. It would be possible to store these in three separate arrays:

But this is a bit clumsy. There’s nothing here that tells us that the three arrays are related; it would be more natural if we could use a single structure to store all of this data. NumPy can handle this through structured arrays, which are arrays with compound data types.

Recall that previously we created a simple array using an expression like this:

We can similarly create a structured array using a compound data type specification:

Here 'U10' translates to “Unicode string of maximum length 10,” 'i4' translates to “4-byte (i.e., 32-bit) integer,” and 'f8' translates to “8-byte (i.e., 64-bit) float.” We’ll discuss other options for these type codes in the following section.

Now that we’ve created an empty container array, we can fill the array with our lists of values:

As we had hoped, the data is now arranged together in one convenient block of memory.

The handy thing with structured arrays is that you can now refer to values either by index or by name:

Using Boolean masking, this even allows you to do some more sophisticated operations such as filtering on age:
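A sketch pulling the whole example together: the separate lists, the empty container with a compound dtype, the fill, and the two access styles plus masking.

```python
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# an empty container array with a compound data type
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight

# refer to values either by index or by name
print(data['name'])      # all names
print(data[0])           # first row
print(data[-1]['name'])  # name of the last row

# Boolean masking: names of people under 30
print(data[data['age'] < 30]['name'])
```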

Note to self: notice how the data is specified, in two steps, not the way I would instinctively expect; I come from Fortran 😜

Note that if you’d like to do any operations that are any more complicated than these, you should probably consider the Pandas package, covered [coming soon]. As we’ll see, Pandas provides a DataFrame object, which is a structure built on NumPy arrays that offers a variety of useful data manipulation functionality similar to what we’ve shown here, as well as much, much more.


NumPy – 31 – Big-O notation

Continuing from here; today a topic of general background, here.

Big-O notation is a means of describing how the number of operations required for an algorithm scales as the input grows in size. To use it correctly is to dive deeply into the realm of computer science theory, and to carefully distinguish it from the related small-o notation, big-θ notation, big-Ω notation, and probably many mutant hybrids thereof. While these distinctions add precision to statements about algorithmic scaling, outside computer science theory exams and the remarks of pedantic blog commenters, you’ll rarely see such distinctions made in practice. Far more common in the data science world is a less rigid use of big-O notation: as a general (if imprecise) description of the scaling of an algorithm. With apologies to theorists and pedants, this is the interpretation we’ll use throughout this book.

Big-O notation, in this loose sense, tells you how much time your algorithm will take as you increase the amount of data. If you have an O[N] (read “order N”) algorithm that takes 1 second to operate on a list of length N=1,000, then you should expect it to take roughly 5 seconds for a list of length N=5,000. If you have an O[N²] (read “order N squared”) algorithm that takes 1 second for N=1,000, then you should expect it to take about 25 seconds for N=5,000.

For our purposes, the N will usually indicate some aspect of the size of the dataset (the number of points, the number of dimensions, etc.). When trying to analyze billions or trillions of samples, the difference between O[N] and O[N²] can be far from trivial!

Notice that the big-O notation by itself tells you nothing about the actual wall-clock time of a computation, but only about its scaling as you change N. Generally, for example, an O[N] algorithm is considered to have better scaling than an O[N²] algorithm, and for good reason. But for small datasets in particular, the algorithm with better scaling might not be faster. For example, in a given problem an O[N²] algorithm might take 0.01 seconds, while a “better” O[N] algorithm might take 1 second. Scale up N by a factor of 1,000, though, and the O[N] algorithm will win out.

Even this loose version of Big-O notation can be very useful when comparing the performance of algorithms, and we’ll use this notation throughout the book when talking about how algorithms scale.


NumPy – 30 – sorting arrays – 2

Continuing from here, copying from here.

Partial sorting: partitioning
Sometimes we’re not interested in sorting the entire array, but simply want to find the k smallest values in the array. NumPy provides this in the np.partition function. np.partition takes an array and a number K; the result is a new array with the smallest K values to the left of the partition, and the remaining values to the right, in arbitrary order:
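A sketch of the one-dimensional case:

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])

# the 3 smallest values land to the left of the partition,
# the rest to the right; within each group the order is arbitrary
print(np.partition(x, 3))
```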

Note that the first three values in the resulting array are the three smallest in the array, and the remaining array positions contain the remaining values. Within the two partitions, the elements have arbitrary order.

Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array:
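A sketch of partitioning along an axis (the random data is my own stand-in for the omitted example):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.integers(0, 10, (4, 6))

# the two smallest values of each row end up in the first two slots
print(np.partition(X, 2, axis=1))
```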

The result is an array where the first two slots in each row contain the smallest values from that row, with the remaining values filling the remaining slots.

Finally, just as there is a np.argsort that computes indices of the sort, there is a np.argpartition that computes indices of the partition. We’ll see this in action in the following section.

Example: k-nearest neighbors
Let’s quickly see how we might use this argsort function along multiple axes to find the nearest neighbors of each point in a set. We’ll start by creating a random set of 10 points on a two-dimensional plane. Using the standard convention, we’ll arrange these in a 10×2 array: X = rand.rand(10, 2). To get an idea of how these points look, let’s quickly scatter plot them:

import numpy as np
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # Plot styling

X = np.random.rand(10, 2)
plt.scatter(X[:, 0], X[:, 1], s=100);

Now we’ll compute the distance between each pair of points. Recall that the squared-distance between two points is the sum of the squared differences in each dimension; using the efficient broadcasting (Computation on Arrays: Broadcasting [qui]) and aggregation (Aggregations: Min, Max, and Everything In Between [qui]) routines provided by NumPy we can compute the matrix of square distances in a single line of code:

This operation has a lot packed into it, and it might be a bit confusing if you’re unfamiliar with NumPy’s broadcasting rules. When you come across code like this, it can be useful to break it down into its component steps:

Just to double-check what we are doing, we should see that the diagonal of this matrix (i.e., the set of distances between each point and itself) is all zero:

It checks out! With the pairwise square-distances converted, we can now use np.argsort to sort along each row. The leftmost columns will then give the indices of the nearest neighbors:

Notice that the first column gives the numbers 0 through 9 in order: this is due to the fact that each point’s closest neighbor is itself, as we would expect.

By using a full sort here, we’ve actually done more work than we need to in this case. If we’re simply interested in the nearest k neighbors, all we need is to partition each row so that the smallest k+1 squared distances come first, with larger distances filling the remaining positions of the array. We can do this with the np.argpartition function:
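A sketch reconstructing the steps above in one place: pairwise squared distances via broadcasting, the zero-diagonal sanity check, the full argsort, and the cheaper argpartition. (The generator and seed are my choice.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 2))

# pairwise squared distances in a single broadcast expression
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)

# sanity check: each point is at distance zero from itself
assert np.allclose(dist_sq.diagonal(), 0)

# full sort along each row: column 0 is each point's own index
nearest = np.argsort(dist_sq, axis=1)
print(nearest[:, 0])   # 0 through 9, in order

# cheaper: partition so the K+1 smallest distances come first
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
```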

In order to visualize this network of neighbors, let’s quickly plot the points along with lines representing the connections from each point to its two nearest neighbors:

Each point in the plot has lines drawn to its two nearest neighbors. At first glance, it might seem strange that some of the points have more than two lines coming out of them: this is due to the fact that if point A is one of the two nearest neighbors of point B, this does not necessarily imply that point B is one of the two nearest neighbors of point A.

Although the broadcasting and row-wise sorting of this approach might seem less straightforward than writing a loop, it turns out to be a very efficient way of operating on this data in Python. You might be tempted to do the same type of operation by manually looping through the data and sorting each set of neighbors individually, but this would almost certainly lead to a slower algorithm than the vectorized version we used. The beauty of this approach is that it’s written in a way that’s agnostic to the size of the input data: we could just as easily compute the neighbors among 100 or 1,000,000 points in any number of dimensions, and the code would look the same.

Finally, I’ll note that when doing very large nearest neighbor searches, there are tree-based and/or approximate algorithms that can scale as O[N log N] or better rather than the O[N²] of the brute-force algorithm. One example of this is the KD-Tree, implemented in Scikit-learn.


NumPy – 29 – sorting arrays – 1

Continuing from here, I start a new chapter, here.

Sorting arrays
Up to this point we have been concerned mainly with tools to access and operate on array data with NumPy. This section covers algorithms related to sorting values in NumPy arrays. These algorithms are a favorite topic in introductory computer science courses: if you’ve ever taken one, you probably have had dreams (or, depending on your temperament, nightmares) about insertion sorts, selection sorts, merge sorts, quick sorts, bubble sorts, and many, many more. All are means of accomplishing a similar task: sorting the values in a list or array.

For example, a simple selection sort repeatedly finds the minimum value from a list, and makes swaps until the list is sorted. We can code this in just a few lines of Python:

As any first-year computer science major will tell you, the selection sort is useful for its simplicity, but is much too slow to be useful for larger arrays. For a list of N values, it requires N loops, each of which does on order ∼N comparisons to find the swap value. In terms of the “big-O” notation often used to characterize these algorithms (see Big-O Notation [coming soon]), selection sort averages O[N²]: if you double the number of items in the list, the execution time will go up by about a factor of four.

Even selection sort, though, is much better than my all-time favorite sorting algorithm, the bogosort:
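A sketch of the bogosort (seeded here so the demo terminates quickly; the book shuffles with np.random.shuffle instead):

```python
import numpy as np

def bogosort(x):
    # shuffle at random until the array happens to be sorted
    rng = np.random.default_rng(0)
    while np.any(x[:-1] > x[1:]):
        x = rng.permutation(x)
    return x

x = np.array([2, 1, 4, 3, 5])
print(bogosort(x))
```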

This silly sorting method relies on pure chance: it repeatedly applies a random shuffling of the array until the result happens to be sorted. With an average scaling of O[N×N!] (that’s N times N factorial), this should, quite obviously, never be used for any real computation.

Fortunately, Python contains built-in sorting algorithms that are much more efficient than either of the simplistic algorithms just shown. We’ll start by looking at the Python built-ins, and then take a look at the routines included in NumPy and optimized for NumPy arrays.

Fast sorting in NumPy: np.sort and np.argsort
Although Python has built-in sort and sorted functions to work with lists, we won’t discuss them here because NumPy’s np.sort function turns out to be much more efficient and useful for our purposes. By default np.sort uses an O[N log N] quicksort algorithm, though mergesort and heapsort are also available. For most applications, the default quicksort is more than sufficient.

To return a sorted version of the array without modifying the input, you can use np.sort:

If you prefer to sort the array in-place, you can instead use the sort method of arrays:

A related function is argsort, which instead returns the indices of the sorted elements:

The first element of this result gives the index of the smallest element, the second value gives the index of the second smallest, and so on. These indices can then be used (via fancy indexing) to construct the sorted array if desired:
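A sketch of the omitted snippets: the sorted copy, the in-place sort, and argsort with fancy indexing to reconstruct the sorted array:

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])

# sorted copy; the input is left unchanged
print(np.sort(x))

# in-place sort via the array method
y = x.copy()
y.sort()

# indices of the sorted order
i = np.argsort(x)
print(i)       # index of the smallest first, and so on

# fancy indexing with those indices reconstructs the sorted array
print(x[i])
```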

Sorting along rows or columns
A useful feature of NumPy’s sorting algorithms is the ability to sort along specific rows or columns of a multidimensional array using the axis argument. For example:
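A sketch with random data (the generator and seed are my choice):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.integers(0, 10, (4, 6))

# sort each column of X independently
print(np.sort(X, axis=0))

# sort each row of X independently
print(np.sort(X, axis=1))
```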

Keep in mind that this treats each row or column as an independent array, and any relationships between the row or column values will be lost!

A break 😊 then it will continue.


NumPy – 28 – fancy indexing – 3

Continuing, copying from here.

Modifying values with fancy indexing
Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array. For example, imagine we have an array of indices and we’d like to set the corresponding items in an array to some value:

We can use any assignment-type operator for this. For example:

Notice, though, that repeated indices with these operations can cause some potentially unexpected results. Consider the following:

Where did the 4 go? The result of this operation is to first assign x[0] = 4, followed by x[0] = 6. The result, of course, is that x[0] contains the value 6.

Fair enough, but consider this operation:

You might expect that x[3] would contain the value 2, and x[4] would contain the value 3, as this is how many times each index is repeated. Why is this not the case? Conceptually, this is because x[i] += 1 is meant as a shorthand of x[i] = x[i] + 1. x[i] + 1 is evaluated, and then the result is assigned to the indices in x. With this in mind, it is not the augmentation that happens multiple times, but the assignment, which leads to the rather nonintuitive results.

So what if you want the other behavior where the operation is repeated? For this, you can use the at() method of ufuncs (available since NumPy 1.8), and do the following:
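A sketch walking through the whole sequence above: assignment through fancy indexing, the repeated-index surprise, and np.add.at for the accumulating behavior:

```python
import numpy as np

x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99          # set several items at once
x[i] -= 10         # any assignment-type operator works

# with repeated indices, only the last assignment survives
x = np.zeros(10)
x[[0, 0]] = [4, 6]
print(x[0])        # 6.0

# x[i] += 1 does NOT accumulate over repeated indices ...
i = [2, 3, 3, 4, 4, 4]
x[i] += 1
print(x[3])        # 1.0, not 2.0

# ... but the unbuffered np.add.at does
x = np.zeros(10)
np.add.at(x, i, 1)
print(x[3])        # 2.0
```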

Example: binning data
You can use these ideas to efficiently bin data to create a histogram by hand. For example, imagine we have 1,000 values and would like to quickly find where they fall within an array of bins. We could compute it like this:

import numpy as np

x = np.random.randn(100)

# compute a histogram by hand
bins = np.linspace(-5, 5, 20)
counts = np.zeros_like(bins)

# find the appropriate bin for each x
i = np.searchsorted(bins, x)

# add 1 to each of these bins
np.add.at(counts, i, 1)

# plot the results
import matplotlib.pyplot as plt
plt.plot(bins, counts, linestyle='steps');

Of course, it would be silly to have to do this each time you want to plot a histogram. This is why Matplotlib provides the plt.hist() routine, which does the same in a single line:

plt.hist(x, bins, histtype='step')

This function will create a nearly identical plot to the one seen here. To compute the binning, matplotlib uses the np.histogram function, which does a very similar computation to what we did before. Let’s compare the two here:

Our own one-line algorithm is several times faster than the optimized algorithm in NumPy! How can this be? If you dig into the np.histogram source code (you can do this in IPython by typing np.histogram??), you’ll see that it’s quite a bit more involved than the simple search-and-count that we’ve done; this is because NumPy’s algorithm is more flexible, and particularly is designed for better performance when the number of data points becomes large:

What this comparison shows is that algorithmic efficiency is almost never a simple question. An algorithm efficient for large datasets will not always be the best choice for small datasets, and vice versa. But the advantage of coding this algorithm yourself is that with an understanding of these basic methods, you could use these building blocks to extend this to do some very interesting custom behaviors. The key to efficiently using Python in data-intensive applications is knowing about general convenience routines like np.histogram and when they’re appropriate, but also knowing how to make use of lower-level functionality when you need more pointed behavior.