NumPy – 30 – sorting arrays – 2

I continue from here, copying from here.

Partial sorting: partitioning
Sometimes we’re not interested in sorting the entire array, but simply want to find the k smallest values in the array. NumPy provides this in the np.partition function. np.partition takes an array and a number K; the result is a new array with the smallest K values to the left of the partition, and the remaining values to the right, in arbitrary order:
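A minimal sketch of what this looks like (the array values and K = 3 are illustrative, not taken from the original post):

import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
print(np.partition(x, 3))
# the first three slots hold the three smallest values (1, 2, 3),
# the remaining slots hold the rest; within each group the order is arbitrary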

Note that the first three values in the resulting array are the three smallest in the array, and the remaining array positions contain the remaining values. Within the two partitions, the elements have arbitrary order.

Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array:
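For instance, something along these lines (the seed and the array shape are just an example):

import numpy as np

rand = np.random.RandomState(42)
M = rand.randint(0, 10, (4, 6))      # a 4x6 array of small integers
print(np.partition(M, 2, axis=1))
# in each row, the first two slots now hold that row's two smallest values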

The result is an array where the first two slots in each row contain the smallest values from that row, with the remaining values filling the remaining slots.

Finally, just as there is a np.argsort that computes indices of the sort, there is a np.argpartition that computes indices of the partition. We’ll see this in action in the following section.

Example: k-nearest neighbors
Let’s quickly see how we might use the argsort function along multiple axes to find the nearest neighbors of each point in a set. We’ll start by creating a random set of 10 points on a two-dimensional plane. Using the standard convention, we’ll arrange these in a 10×2 array: X = np.random.rand(10, 2). To get an idea of how these points look, let’s quickly scatter plot them (knear.py):

import numpy as np
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # plot styling

X = np.random.rand(10, 2)              # 10 random points in the plane
plt.scatter(X[:, 0], X[:, 1], s=100)   # scatter plot of the points
plt.savefig("np214.png")

Now we’ll compute the distance between each pair of points. Recall that the squared distance between two points is the sum of the squared differences in each dimension; using the efficient broadcasting (Computation on Arrays: Broadcasting [here]) and aggregation (Aggregations: Min, Max, and Everything In Between [here]) routines provided by NumPy, we can compute the matrix of squared distances in a single line of code:
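Continuing knear.py, one way to write that line is the following (dist_sq is just the name I use for the result):

dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)
print(dist_sq.shape)    # (10, 10): one squared distance for every pair of points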

This operation has a lot packed into it, and it might be a bit confusing if you’re unfamiliar with NumPy’s broadcasting rules. When you come across code like this, it can be useful to break it down into its component steps:
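A possible step-by-step breakdown, still inside knear.py:

# for each pair of points, compute the difference in each coordinate
differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
print(differences.shape)       # (10, 10, 2)

# square the coordinate differences
sq_differences = differences ** 2
print(sq_differences.shape)    # (10, 10, 2)

# sum over the coordinate axis to get the squared distances
dist_sq = sq_differences.sum(-1)
print(dist_sq.shape)           # (10, 10)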

Just to double-check what we are doing, we should see that the diagonal of this matrix (i.e., the set of distances between each point and itself) is all zero:
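For example:

print(dist_sq.diagonal())
# expected: an array of ten zeros, one per point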

It checks out! With the pairwise squared distances computed, we can now use np.argsort to sort along each row. The leftmost columns will then give the indices of the nearest neighbors:
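Something like:

nearest = np.argsort(dist_sq, axis=1)
print(nearest)
# each row lists point indices ordered from nearest to farthest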

Notice that the first column gives the numbers 0 through 9 in order: this is due to the fact that each point’s closest neighbor is itself, as we would expect.

By using a full sort here, we’ve actually done more work than we need to in this case. If we’re simply interested in the nearest k neighbors, all we need is to partition each row so that the smallest k+1 squared distances come first, with larger distances filling the remaining positions of the array. We can do this with the np.argpartition function:
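A sketch, with K = 2 neighbors as in the plot below:

K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
# in each row, the first K+1 entries are the indices of that point itself
# and of its K nearest neighbors, in no particular order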

In order to visualize this network of neighbors, let’s quickly plot the points along with lines representing the connections from each point to its two nearest neighbors:
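One possible way to draw it, continuing knear.py (the output filename is just a placeholder):

plt.scatter(X[:, 0], X[:, 1], s=100)

K = 2
for i in range(X.shape[0]):
    for j in nearest_partition[i, :K + 1]:
        # draw a line from X[i] to X[j]
        # (zip turns the pair of points into x- and y-coordinate sequences)
        plt.plot(*zip(X[j], X[i]), color='black')

plt.savefig("nearest_neighbors.png")   # placeholder filename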

Each point in the plot has lines drawn to its two nearest neighbors. At first glance, it might seem strange that some of the points have more than two lines coming out of them: this is due to the fact that if point A is one of the two nearest neighbors of point B, this does not necessarily imply that point B is one of the two nearest neighbors of point A.

Although the broadcasting and row-wise sorting of this approach might seem less straightforward than writing a loop, it turns out to be a very efficient way of operating on this data in Python. You might be tempted to do the same type of operation by manually looping through the data and sorting each set of neighbors individually, but this would almost certainly lead to a slower algorithm than the vectorized version we used. The beauty of this approach is that it’s written in a way that’s agnostic to the size of the input data: we could just as easily compute the neighbors among 100 or 1,000,000 points in any number of dimensions, and the code would look the same.

Finally, I’ll note that when doing very large nearest-neighbor searches, there are tree-based and/or approximate algorithms that can scale as O(N log N) or better, rather than the O(N²) of the brute-force algorithm. One example of this is the KD-Tree, implemented in Scikit-learn.

:mrgreen:
