Category Archives: NumPy

NumPy – 39 – data indexing and selection – 2

Continuing, copying from here.

Selection in DataFrame
Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

DataFrame as dictionary
The first analogy we will consider is the DataFrame as a dictionary of related Series objects. Let’s return to our example of areas and populations of states:
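
The original code was omitted; here is a minimal sketch of the setup (the state figures are illustrative, in the spirit of the book's example):

import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': population})
print(data)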

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

Equivalently, we can use attribute-style access with column names that are strings:

This attribute-style column access actually accesses the exact same object as the dictionary-style access:
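
A sketch of both access styles and the identity check (note that very recent Pandas versions may hand back a fresh Series on each column access, so the identity comparison is version-dependent):

data['area']               # dictionary-style access to the 'area' column
data.area                  # attribute-style access to the same column
data.area is data['area']  # -> True (may differ in very recent Pandas)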

Though this is a useful shorthand, keep in mind that it does not work in all cases! If the column names are not strings, or if they conflict with methods of the DataFrame, attribute-style access is not possible. For example, the DataFrame has a pop() method, so data.pop will point to this method rather than to the “pop” column:
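
A quick illustration:

data.pop                  # the bound DataFrame.pop method, not a column
data.pop is data['pop']   # -> False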

In particular, you should avoid the temptation to try column assignment via attribute (i.e., use data['pop'] = z rather than data.pop = z).

Like with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
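
Continuing the running example, a density column can be added like this:

data['density'] = data['pop'] / data['area']   # element-by-element division of two Series
print(data)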

This shows a preview of the straightforward syntax of element-by-element arithmetic between Series objects; we’ll dig into this further [later on].

DataFrame as two-dimensional array
As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:

With this picture in mind, many familiar array-like observations can be done on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:
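
Both views are one attribute away:

data.values   # the raw two-dimensional NumPy array behind the DataFrame
data.T        # the transposed DataFrame, rows and columns swapped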

When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:

and passing a single “index” to a DataFrame accesses a column:
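
Side by side:

data.values[0]   # a single index into the array gives the first row
data['area']     # a single "index" into the DataFrame gives a column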

Thus for array-style indexing, we need another convention. Here Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names:
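
A sketch of the two indexers on the running example:

data.iloc[:3, :2]               # implicit NumPy-style integer positions
data.loc[:'Illinois', :'pop']   # explicit index and column labels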

The ix indexer allows a hybrid of these two approaches (note that ix was deprecated in later Pandas releases and has since been removed; loc and iloc cover its use cases):

Keep in mind that for integer indices, the ix indexer is subject to the same potential sources of confusion as discussed for integer-indexed Series objects.

Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, in the loc indexer we can combine masking and fancy indexing as in the following:

Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:
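
For example, continuing the states example (where the density column was added above):

data.loc[data.density > 100, ['pop', 'density']]   # row mask plus column fancy indexing
data.iloc[0, 2] = 90                               # NumPy-style assignment
print(data)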

To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple DataFrame and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.

Additional indexing conventions
There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice. First, while indexing refers to columns, slicing refers to rows:

Such slices can also refer to rows by number rather than by index:

Similarly, direct masking operations are also interpreted row-wise rather than column-wise:
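
The three row-wise conventions in one sketch:

data['Florida':'Illinois']   # slicing by explicit index selects rows
data[1:3]                    # slicing by implicit position also selects rows
data[data.density > 100]     # direct masking is applied row-wise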

These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.

:mrgreen:

NumPy – 38 – data indexing and selection – 1

Continuing from here, today I’m here.

Previously we looked in detail at methods and tools to access, set, and modify values in NumPy arrays. These included indexing (e.g., arr[2, 1]), slicing (e.g., arr[:, 1:5]), masking (e.g., arr[arr > 0]), fancy indexing (e.g., arr[0, [1, 5]]), and combinations thereof (e.g., arr[:, [1, 5]]). Here we’ll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects. If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.

We’ll start with the simple case of the one-dimensional Series object, and then move on to the more complicated two-dimensional DataFrame object.

Selection in Series
As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:
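
A minimal sketch of both the mapping and the dict-style inspection:

import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data['b']            # -> 0.5, key-style lookup
'a' in data          # -> True, like testing a dict key
data.keys()          # the index, playing the role of the keys
list(data.items())   # -> [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]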

Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
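
For example:

data['e'] = 1.25   # assigning to a new index label extends the Series
print(data)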

This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.

Series as one-dimensional array
A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, slices, masking, and fancy indexing.

Among these, slicing may be the source of the most confusion. Notice that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.
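
All four mechanisms in one sketch:

data['a':'c']                      # slicing by explicit index: 'c' IS included
data[0:2]                          # slicing by implicit index: position 2 is EXCLUDED
data[(data > 0.3) & (data < 0.8)]  # masking
data[['a', 'e']]                   # fancy indexing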

Indexers: loc, iloc, and ix
These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

First, the loc attribute allows indexing and slicing that always references the explicit index:

The iloc attribute allows indexing and slicing that always references the implicit Python-style index:
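
A sketch with an explicit integer index, showing both the pitfall and the two indexers:

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data[1]         # plain indexing uses the explicit index -> 'a'
data[1:3]       # plain slicing uses the implicit index -> the rows labeled 3 and 5
data.loc[1]     # loc is always explicit -> 'a'
data.loc[1:3]   # explicit slice over labels 1..3 -> 'a' and 'b'
data.iloc[1]    # iloc is always implicit -> 'b'
data.iloc[1:3]  # implicit positions 1 and 2 -> 'b' and 'c'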

A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing. The purpose of the ix indexer will become more apparent in the context of DataFrame objects, which we will discuss [in the next post].

One guiding principle of Python code is that “explicit is better than implicit.” The explicit nature of loc and iloc make them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using these both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.

:mrgreen:

NumPy – 37 – introduction to Pandas objects – 3

Continuing from here, copying from here.

The Pandas Index object
We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Those views have some interesting consequences in the operations available on Index objects. As a simple example, let’s construct an Index from a list of integers:
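
For example:

import pandas as pd
ind = pd.Index([2, 3, 5, 7, 11])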

Index as immutable array
The Index in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:

Index objects also have many of the attributes familiar from NumPy arrays:

One difference between Index objects and NumPy arrays is that indices are immutable–that is, they cannot be modified via the normal means:
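
A sketch of the array-like behavior, ending with the operation that fails:

ind[1]    # -> 3
ind[::2]  # -> every second element: 2, 5, 11
print(ind.size, ind.shape, ind.ndim, ind.dtype)   # the familiar NumPy attributes
ind[1] = 0   # raises TypeError: Index does not support mutable operations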

This immutability makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification.

Index as ordered set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

These operations may also be accessed via object methods, for example
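
A sketch using the method forms (the book shows the &, |, and ^ operators, which later Pandas releases deprecated for this set-arithmetic use):

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA.intersection(indB)          # -> 3, 5, 7
indA.union(indB)                 # -> 1, 2, 3, 5, 7, 9, 11
indA.symmetric_difference(indB)  # -> 1, 2, 9, 11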

:mrgreen:

NumPy – 36 – introduction to Pandas objects – 2

Continuing from here, copying from here.

The Pandas DataFrame object
The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We’ll now take a look at each of these perspectives.

DataFrame as generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by “aligned” we mean that they share the same index.

To demonstrate this, let’s first construct a new Series listing the area of each of the five states discussed in the previous section:

Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:
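
A minimal reconstruction of both Series and the resulting DataFrame (the figures are illustrative):

import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})
print(states)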

Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:
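
Both attributes in one sketch:

states.index     # the shared row labels
states.columns   # -> Index(['population', 'area'], dtype='object')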

Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

DataFrame as specialized dictionary
Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the ‘area’ attribute returns the Series object containing the areas we saw earlier:
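
That is:

states['area']   # the area column, as a Series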

Notice the potential point of confusion here: in a two-dimensional NumPy array, data[0] will return the first row. For a DataFrame, data['col0'] will return the first column. Because of this, it is probably better to think about DataFrames as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful. We’ll explore more flexible means of indexing DataFrames in Data Indexing and Selection.

Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways. Here we’ll give several examples.

From a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:
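
For example, reusing the population Series from above:

pd.DataFrame(population, columns=['population'])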

From a list of dictionaries
Any list of dictionaries can be made into a DataFrame. We’ll use a simple list comprehension to create some data:

Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., “not a number”) values:
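
Both cases, as a sketch:

data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)   # columns 'a' and 'b', rows 0, 1, 2

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])   # the absent 'c' and 'a' entries become NaN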

From a dictionary of Series objects
As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:
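
Reusing the two Series from above:

pd.DataFrame({'population': population, 'area': area})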

From a two-dimensional NumPy array
Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:
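
For example:

import numpy as np

pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])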

From a NumPy structured array
We covered structured arrays in Structured Data: NumPy’s Structured Arrays. A Pandas DataFrame operates much like a structured array, and can be created directly from one:
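
A minimal sketch:

A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)   # columns 'A' and 'B', default integer row index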

:mrgreen:

NumPy – 35 – introduction to Pandas objects – 1

Continuing from here and starting Pandas; I’m copying from here.

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are. Thus, before we go any further, let’s introduce these three fundamental Pandas data structures: the Series, DataFrame, and Index.

We will start our code sessions with the standard NumPy and Pandas imports:
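
That is:

import numpy as np
import pandas as pd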

The Series object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:
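
A sketch of the creation and both attributes:

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data.values   # -> array([0.25, 0.5 , 0.75, 1.  ])
data.index    # -> RangeIndex(start=0, stop=4, step=1)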

The index is an array-like object of type pd.Index, which we’ll discuss in more detail momentarily.

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:
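
For example:

data[1]     # -> 0.5
data[1:3]   # positions 1 and 2, with their index labels attached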

As we will see, though, the Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.

Series as generalized NumPy array
From what we’ve seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. If we wish, we can use strings as an index, and item access works as expected:

We can even use non-contiguous or non-sequential indices:
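
Sketches of both:

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data['b']   # -> 0.5

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data[5]     # -> 0.5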

Series as specialized dictionary
In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:

By default, a Series will be created where the index is drawn from the sorted keys (note that recent Pandas versions preserve the dictionary’s insertion order instead). From here, typical dictionary-style item access can be performed:

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:
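
A sketch of all three steps (the figures are illustrative):

population_dict = {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127, 'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population['California']              # dictionary-style item access
population['California':'Illinois']   # array-style slicing by explicit index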

We’ll discuss some of the quirks of Pandas indexing and slicing in Data Indexing and Selection.

Constructing Series objects
We’ve already seen a few ways of constructing a Pandas Series from scratch; all of them are some version of the following:

pd.Series(data, index=index)

where index is an optional argument, and data can be one of many entities.

For example, data can be a list or NumPy array, in which case index defaults to an integer sequence:

data can be a scalar, which is repeated to fill the specified index:

data can be a dictionary, in which case index defaults to the sorted dictionary keys:

In each case, the index can be explicitly set if a different result is preferred:
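
One sketch per constructor variant:

pd.Series([2, 4, 6])                    # list data: implicit integer index 0, 1, 2
pd.Series(5, index=[100, 200, 300])     # scalar data, repeated to fill the index
pd.Series({2: 'a', 1: 'b', 3: 'c'})     # dict data: the keys become the index
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])   # only the listed keys are kept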

Notice that in this case, the Series is populated only with the explicitly identified keys.

:mrgreen:

NumPy – 34 – manipulating data with Pandas

Continuing from here, starting to copy from a new chapter, here.

So far we have seen the ndarray objects provided by NumPy. But there is also the Pandas package: Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

As we saw, NumPy’s ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s time.

We will focus on the mechanics of using Series, DataFrame, and related structures effectively. We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.

Installing Pandas
Just follow the instructions in the documentation.
Once Pandas is installed, you can import it and check the version:

Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:
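
That is:

import pandas
print(pandas.__version__)   # whatever version is installed

import pandas as pd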

A note on the documentation
IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ? character). (Refer back to Help and Documentation in IPython if you need a refresher on this.)
For example, to display all the contents of the pandas namespace, you can type pd.<TAB>:

This is just the beginning; the list is very, very long.
And to display Pandas’s built-in documentation, you can use this:
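
That is, in IPython:

pd?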

More detailed documentation, along with tutorials and other resources, can be found here.

:mrgreen:

NumPy – 33 – structured data – NumPy’s structured arrays – 2

Continuing from here; I’m copying from here.

Creating structured arrays
Structured array data types can be specified in a number of ways. Earlier, we saw the dictionary method:

For clarity, numerical types can be specified using Python types or NumPy dtypes instead:

A compound type can also be specified as a list of tuples:

If the names of the types do not matter to you, you can specify the types alone in a comma-separated string:
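
All four spellings side by side, as a sketch:

import numpy as np

# 1. The dictionary method:
np.dtype({'names': ('name', 'age', 'weight'),
          'formats': ('U10', 'i4', 'f8')})

# 2. Python types / NumPy dtypes in place of format strings:
np.dtype({'names': ('name', 'age', 'weight'),
          'formats': ((np.str_, 10), int, np.float32)})

# 3. A list of tuples:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

# 4. Types alone, as a comma-separated string:
np.dtype('S10,i4,f8')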

The shortened string format codes may seem confusing, but they are built on simple principles. The first (optional) character is < or >, which means “little endian” or “big endian,” respectively, and specifies the ordering convention for significant bits. The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below). The last character or characters represents the size of the object in bytes.

Character Description            Example
'b'       Byte                   np.dtype('b')
'i'       Signed integer         np.dtype('i4') == np.int32
'u'       Unsigned integer       np.dtype('u1') == np.uint8
'f'       Floating point         np.dtype('f8') == np.float64
'c'       Complex floating point np.dtype('c16') == np.complex128
'S', 'a'  String                 np.dtype('S5')
'U'       Unicode string         np.dtype('U') == np.str_
'V'       Raw data (void)        np.dtype('V') == np.void

More on advanced compound types
It is possible to define even more advanced compound types. For example, you can create a type where each element contains an array or matrix of values. Here, we’ll create a data type with a mat component consisting of a 3×3 floating-point matrix:
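
A minimal sketch:

tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)
print(X[0])          # one record: an id plus a 3x3 matrix of zeros
print(X['mat'][0])   # the matrix component alone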

Now each element in the X array consists of an id and a 3×3 matrix. Why would you use this rather than a simple multidimensional array, or perhaps a Python dictionary? The reason is that this NumPy dtype directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program. If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you’ll probably find structured arrays quite useful!

RecordArrays: structured arrays with a twist
NumPy also provides the np.recarray class, which is almost identical to the structured arrays just described, but with one additional feature: fields can be accessed as attributes rather than as dictionary keys. Recall that we previously accessed the ages by writing:

I had to rebuild the array, of course 😉

If we view our data as a record array instead, we can access this with slightly fewer keystrokes:
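
Assuming the data structured array from the previous post has been rebuilt, a sketch of both spellings:

data['age']   # field access on the plain structured array

data_rec = data.view(np.recarray)
data_rec.age  # the same values, attribute-style (slightly slower to access)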

The downside is that for record arrays, there is some extra overhead involved in accessing the fields, even when using the same syntax. We can see this here:

Whether the more convenient notation is worth the additional overhead will depend on your own application.

But there’s Pandas
This section on structured and record arrays is purposely at the end of this chapter, because it leads so well into the next package we will cover: Pandas. Structured arrays like the ones discussed here are good to know about for certain situations, especially in case you’re using NumPy arrays to map onto binary data formats in C, Fortran, or another language. For day-to-day use of structured data, the Pandas package is a much better choice, and we’ll dive into a full discussion of it in the chapter that follows.

OK Jake; we’ll wait for Pandas 😀

:mrgreen:

NumPy – 32 – structured data – NumPy’s structured arrays – 1

Continuing from here, copying from here.

While often our data can be well represented by a homogeneous array of values, sometimes this is not the case. This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient storage for compound, heterogeneous data. While the patterns shown here are useful for simple operations, scenarios like this often lend themselves to the use of Pandas DataFrames, which we’ll explore [later].

Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we’d like to store these values for use in a Python program. It would be possible to store these in three separate arrays:

But this is a bit clumsy. There’s nothing here that tells us that the three arrays are related; it would be more natural if we could use a single structure to store all of this data. NumPy can handle this through structured arrays, which are arrays with compound data types.

Recall that previously we created a simple array using an expression like this:

We can similarly create a structured array using a compound data type specification:
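
A sketch of the whole setup, separate lists first (the names and figures are illustrative):

import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
print(data.dtype)   # [('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]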

Here ‘U10’ translates to “Unicode string of maximum length 10,” ‘i4’ translates to “4-byte (i.e., 32-bit) integer,” and ‘f8’ translates to “8-byte (i.e., 64-bit) float.” We’ll discuss other options for these type codes in the following section.

Now that we’ve created an empty container array, we can fill the array with our lists of values:
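
Filling in the three fields:

data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)   # all three lists now live in one structured block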

As we had hoped, the data is now arranged together in one convenient block of memory.

The handy thing with structured arrays is that you can now refer to values either by index or by name:

Using Boolean masking, this even allows you to do some more sophisticated operations such as filtering on age:
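
Sketches of both, ending with the age filter:

data['name']                     # every name
data[0]                          # the first record
data[-1]['name']                 # the name field of the last record
data[data['age'] < 30]['name']   # names of the people under 30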

Note to self: notice how data is specified, twice, and not in the way I would instinctively expect; I come from Fortran 😜

Note that if you’d like to do any operations that are any more complicated than these, you should probably consider the Pandas package, covered [later]. As we’ll see, Pandas provides a DataFrame object, which is a structure built on NumPy arrays that offers a variety of useful data manipulation functionality similar to what we’ve shown here, as well as much, much more.

:mrgreen:

NumPy – 31 – Big-O notation

Continuing from here; today a bit of general background, from here.

Big-O notation is a means of describing how the number of operations required for an algorithm scales as the input grows in size. To use it correctly is to dive deeply into the realm of computer science theory, and to carefully distinguish it from the related small-o notation, big-θ notation, big-Ω notation, and probably many mutant hybrids thereof. While these distinctions add precision to statements about algorithmic scaling, outside computer science theory exams and the remarks of pedantic blog commenters, you’ll rarely see such distinctions made in practice. Far more common in the data science world is a less rigid use of big-O notation: as a general (if imprecise) description of the scaling of an algorithm. With apologies to theorists and pedants, this is the interpretation we’ll use throughout this book.

Big-O notation, in this loose sense, tells you how much time your algorithm will take as you increase the amount of data. If you have an O[N] (read “order N”) algorithm that takes 1 second to operate on a list of length N=1,000, then you should expect it to take roughly 5 seconds for a list of length N=5,000. If you have an O[N²] (read “order N squared”) algorithm that takes 1 second for N=1,000, then you should expect it to take about 25 seconds for N=5,000.

For our purposes, the N will usually indicate some aspect of the size of the dataset (the number of points, the number of dimensions, etc.). When trying to analyze billions or trillions of samples, the difference between O[N] and O[N²] can be far from trivial!

Notice that the big-O notation by itself tells you nothing about the actual wall-clock time of a computation, but only about its scaling as you change N. Generally, for example, an O[N] algorithm is considered to have better scaling than an O[N²] algorithm, and for good reason. But for small datasets in particular, the algorithm with better scaling might not be faster. For example, in a given problem an O[N²] algorithm might take 0.01 seconds, while a “better” O[N] algorithm might take 1 second. Scale up N by a factor of 1,000, though, and the O[N] algorithm will win out.

Even this loose version of Big-O notation can be very useful when comparing the performance of algorithms, and we’ll use this notation throughout the book when talking about how algorithms scale.

:mrgreen:

NumPy – 30 – sorting arrays – 2

Continuing from here, copying from here.

Partial sorting: partitioning
Sometimes we’re not interested in sorting the entire array, but simply want to find the k smallest values in the array. NumPy provides this in the np.partition function. np.partition takes an array and a number K; the result is a new array with the smallest K values to the left of the partition, and the remaining values to the right, in arbitrary order:
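
For example:

import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)   # the three smallest values come first, in arbitrary order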

Note that the first three values in the resulting array are the three smallest in the array, and the remaining array positions contain the remaining values. Within the two partitions, the elements have arbitrary order.

Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array:
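
A sketch along the rows of a small random matrix:

rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
np.partition(X, 2, axis=1)   # the two smallest values of each row come first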

The result is an array where the first two slots in each row contain the smallest values from that row, with the remaining values filling the remaining slots.

Finally, just as there is a np.argsort that computes indices of the sort, there is a np.argpartition that computes indices of the partition. We’ll see this in action in the following section.

Example: k-nearest neighbors
Let’s quickly see how we might use this argsort function along multiple axes to find the nearest neighbors of each point in a set. We’ll start by creating a random set of 10 points on a two-dimensional plane. Using the standard convention, we’ll arrange these in a 10×2 array: X = np.random.rand(10, 2). To get an idea of how these points look, let’s quickly scatter plot them (knear.py):

import numpy as np
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # plot styling

X = np.random.rand(10, 2)              # 10 random points in the unit square
plt.scatter(X[:, 0], X[:, 1], s=100)   # quick scatter plot of the points
plt.savefig("np214.png")

Now we’ll compute the distance between each pair of points. Recall that the squared-distance between two points is the sum of the squared differences in each dimension; using the efficient broadcasting (Computation on Arrays: Broadcasting [here]) and aggregation (Aggregations: Min, Max, and Everything In Between [here]) routines provided by NumPy we can compute the matrix of square distances in a single line of code:
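
The one-liner:

dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)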

This operation has a lot packed into it, and it might be a bit confusing if you’re unfamiliar with NumPy’s broadcasting rules. When you come across code like this, it can be useful to break it down into its component steps:
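
Step by step:

differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]   # shape (10, 10, 2): coordinate differences for every pair
sq_differences = differences ** 2                          # square each coordinate difference
dist_sq = sq_differences.sum(-1)                           # sum over the coordinate axis: shape (10, 10)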

Just to double-check what we are doing, we should see that the diagonal of this matrix (i.e., the set of distances between each point and itself) is all zero:
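
That is:

dist_sq.diagonal()   # -> ten zeros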

It checks out! With the pairwise square distances computed, we can now use np.argsort to sort along each row. The leftmost columns will then give the indices of the nearest neighbors:
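
That is:

nearest = np.argsort(dist_sq, axis=1)
print(nearest)   # column 0 reads 0..9: each point is nearest to itself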

Notice that the first column gives the numbers 0 through 9 in order: this is due to the fact that each point’s closest neighbor is itself, as we would expect.

By using a full sort here, we’ve actually done more work than we need to in this case. If we’re simply interested in the nearest k neighbors, all we need is to partition each row so that the smallest k+1 squared distances come first, with larger distances filling the remaining positions of the array. We can do this with the np.argpartition function:
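
For K=2:

K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
# the first K+1 entries of each row hold the point itself and its
# K nearest neighbors, in no particular order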

In order to visualize this network of neighbors, let’s quickly plot the points along with lines representing the connections from each point to its two nearest neighbors:
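
Continuing knear.py, a sketch of the plot (plt.show() here; save to a file if you prefer):

plt.scatter(X[:, 0], X[:, 1], s=100)
K = 2
for i in range(X.shape[0]):
    for j in nearest_partition[i, :K + 1]:
        # draw a segment from X[i] to X[j]; zip pairs up the
        # x-coordinates and the y-coordinates of the two endpoints
        plt.plot(*zip(X[i], X[j]), color='black')
plt.show()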

Each point in the plot has lines drawn to its two nearest neighbors. At first glance, it might seem strange that some of the points have more than two lines coming out of them: this is due to the fact that if point A is one of the two nearest neighbors of point B, this does not necessarily imply that point B is one of the two nearest neighbors of point A.

Although the broadcasting and row-wise sorting of this approach might seem less straightforward than writing a loop, it turns out to be a very efficient way of operating on this data in Python. You might be tempted to do the same type of operation by manually looping through the data and sorting each set of neighbors individually, but this would almost certainly lead to a slower algorithm than the vectorized version we used. The beauty of this approach is that it’s written in a way that’s agnostic to the size of the input data: we could just as easily compute the neighbors among 100 or 1,000,000 points in any number of dimensions, and the code would look the same.

Finally, I’ll note that when doing very large nearest neighbor searches, there are tree-based and/or approximate algorithms that can scale as O[N log N] or better rather than the O[N²] of the brute-force algorithm. One example of this is the KD-Tree, implemented in Scikit-learn.

:mrgreen: