Category Archives: SciPy

SciPy – 49 – multidimensional image processing – 1

Continuing from here, copying from here.

Image processing and analysis are generally seen as operations on two-dimensional arrays of values. There are, however, a number of fields where images of higher dimensionality must be analyzed. Good examples of these are medical imaging and biological imaging. numpy is suited very well for this type of application due to its inherent multidimensional nature. The scipy.ndimage package provides a number of general image processing and analysis functions that are designed to operate with arrays of arbitrary dimensionality. The package currently includes functions for linear and non-linear filtering, binary morphology, B-spline interpolation, and object measurements.

Properties shared by all functions
All functions share some common properties. Notably, all functions allow the specification of an output array with the output argument. With this argument you can specify an array that will be changed in place with the result of the operation. In this case the result is not returned. Usually, using the output argument is more efficient, since an existing array is used to store the result.

The type of arrays returned is dependent on the type of operation, but it is in most cases equal to the type of the input. If, however, the output argument is used, the type of the result is equal to the type of the specified output argument. If no output argument is given, it is still possible to specify what the type of the result should be. This is done by simply assigning the desired numpy type object to the output argument. For example:
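A minimal sketch of both uses of the output argument (the array values here are only illustrative):

```python
import numpy as np
from scipy import ndimage

a = np.arange(12).reshape(3, 4)

# Passing a numpy type object as output sets the dtype of the result
b = ndimage.correlate1d(a, [1, 1, 1], output=np.float64)

# Passing a preallocated array: the result is written into it in place
out = np.zeros(a.shape, dtype=np.float64)
ndimage.correlate1d(a, [1, 1, 1], output=out)
```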

Filter functions
The functions described in this section all perform some type of spatial filtering of the input array: the elements in the output are some function of the values in the neighborhood of the corresponding input element. We refer to this neighborhood of elements as the filter kernel, which is often rectangular in shape but may also have an arbitrary footprint. Many of the functions described below allow you to define the footprint of the kernel by passing a mask through the footprint parameter. For example, a cross-shaped kernel can be defined as follows:
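A sketch of what such a footprint might look like, applied here to a median filter (the input array is only illustrative):

```python
import numpy as np
from scipy import ndimage

# A cross-shaped footprint: only the flagged neighbors enter the kernel
footprint = np.array([[0, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0]])

a = np.arange(12.).reshape(3, 4)
filtered = ndimage.median_filter(a, footprint=footprint)
```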

Usually the origin of the kernel is at the center calculated by dividing the dimensions of the kernel shape by two. For instance, the origin of a one-dimensional kernel of length three is at the second element. Take for example the correlation of a one-dimensional array with a filter of length 3 consisting of ones:
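For instance, with a kernel of three ones (the input array is only illustrative):

```python
import numpy as np
from scipy import ndimage

a = np.array([0, 0, 0, 1, 0, 0, 0])
res = ndimage.correlate1d(a, [1, 1, 1])
# each output element is the sum of the three input elements around it:
# res == [0, 0, 1, 1, 1, 0, 0]
```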

Sometimes it is convenient to choose a different origin for the kernel. For this reason most functions support the origin parameter which gives the origin of the filter relative to its center. For example:
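Repeating the correlation above with a non-default origin:

```python
import numpy as np
from scipy import ndimage

a = np.array([0, 0, 0, 1, 0, 0, 0])
shifted = ndimage.correlate1d(a, [1, 1, 1], origin=-1)
# origin=-1 shifts the result towards the left:
# shifted == [0, 1, 1, 1, 0, 0, 0]
```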

The effect is a shift of the result towards the left. This feature will not be needed very often, but it may be useful especially for filters that have an even size. A good example is the calculation of backward and forward differences:

We could also have calculated the forward difference as follows:

However, using the origin parameter instead of a larger kernel is more efficient. For multidimensional kernels origin can be a number, in which case the origin is assumed to be equal along all axes, or a sequence giving the origin along each axis.
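The backward/forward difference example might be sketched like this (the step-function input is only illustrative):

```python
import numpy as np
from scipy import ndimage

a = np.array([0, 0, 1, 1, 1, 0, 0])

backward = ndimage.correlate1d(a, [-1, 1])            # backward difference
forward = ndimage.correlate1d(a, [-1, 1], origin=-1)  # forward difference
# the same forward difference with a larger, zero-padded kernel:
forward2 = ndimage.correlate1d(a, [0, -1, 1])
```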

Since the output elements are a function of elements in the neighborhood of the input elements, the borders of the array need to be dealt with appropriately by providing the values outside the borders. This is done by assuming that the arrays are extended beyond their boundaries according to certain boundary conditions. In the functions described below, the boundary conditions can be selected using the mode parameter, which must be a string with the name of the boundary condition. The following boundary conditions are currently supported:

  • "nearest" Use the value at the boundary [1 2 3]->[1 1 2 3 3]
  • "wrap" Periodically replicate the array [1 2 3]->[3 1 2 3 1]
  • "reflect" Reflect the array at the boundary [1 2 3]->[1 1 2 3 3]
  • "constant" Use a constant value, default is 0.0 [1 2 3]->[0 1 2 3 0]

The "constant" mode is special since it needs an additional parameter to specify the constant value that should be used.
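The modes above, sketched on the [1 2 3] example with a sum-of-three kernel (the cval keyword supplies the constant value):

```python
import numpy as np
from scipy import ndimage

a = np.array([1, 2, 3])

nearest = ndimage.correlate1d(a, [1, 1, 1], mode='nearest')  # [1 1 2 3 3]
wrapped = ndimage.correlate1d(a, [1, 1, 1], mode='wrap')     # [3 1 2 3 1]
padded = ndimage.correlate1d(a, [1, 1, 1], mode='constant', cval=0.0)
```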

Note: The easiest way to implement such boundary conditions would be to copy the data to a larger array and extend the data at the borders according to the boundary conditions. For large arrays and large filter kernels, this would be very memory consuming, and the functions described below therefore use a different approach that does not require allocating large temporary buffers.

Hmmm… I suspect this filter stuff is going to be long; break time 😊


SciPy – 48 – statistics – 5

Continuing from here, copying from here.

Comparing two samples
In the following, we are given two samples, which can come either from the same or from different distributions, and we want to test whether these samples have the same statistical properties.

sample means
Test with samples with identical means:

Test with samples with different means:
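Both tests might be sketched as follows (the seed and the sample parameters follow the scipy tutorial, so the runs are reproducible):

```python
import numpy as np
from scipy import stats

np.random.seed(12345678)

rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=500)   # identical mean
rvs3 = stats.norm.rvs(loc=8, scale=10, size=500)   # shifted mean

t_same, p_same = stats.ttest_ind(rvs1, rvs2)   # high p-value: cannot reject
t_diff, p_diff = stats.ttest_ind(rvs1, rvs3)   # tiny p-value: reject
```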

two-sample Kolmogorov-Smirnov test (ks_2samp)
For the example where both samples are drawn from the same distribution, we cannot reject the null hypothesis, since the p-value is high.

In the second example, with different location, i.e. means, we can reject the null hypothesis, since the p-value is below 1%.
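A sketch of the two KS comparisons, on the same seeded samples as above:

```python
import numpy as np
from scipy import stats

np.random.seed(12345678)
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs3 = stats.norm.rvs(loc=8, scale=10, size=500)

d_same, p_same = stats.ks_2samp(rvs1, rvs2)   # same distribution: high p-value
d_diff, p_diff = stats.ks_2samp(rvs1, rvs3)   # shifted location: p below 1%
```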

Kernel density estimation
A common task in statistics is to estimate the probability density function (PDF) of a random variable from a set of data samples. This task is called density estimation. The most well-known tool to do this is the histogram. A histogram is a useful tool for visualization (mainly because everyone understands it), but doesn’t use the available data very efficiently. Kernel density estimation (KDE) is a more efficient tool for the same task. The gaussian_kde estimator can be used to estimate the PDF of univariate as well as multivariate data. It works best if the data is unimodal.

univariate estimation
We start with a minimal amount of data in order to see how gaussian_kde works, and what the different options for bandwidth selection do. The data sampled from the PDF is shown as blue dashes at the bottom of the figure (this is called a rug plot):
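A minimal sketch of the two bandwidth rules on a handful of points (the five data values are illustrative):

```python
import numpy as np
from scipy import stats

x1 = np.array([-7, -5, 1, 4, 5], dtype=np.float64)

kde1 = stats.gaussian_kde(x1)                         # Scott's Rule (default)
kde2 = stats.gaussian_kde(x1, bw_method='silverman')  # Silverman's Rule

x_eval = np.linspace(-25, 25, 500)
y1 = kde1(x_eval)
# the estimate is a proper density: it integrates to ~1
area = y1.sum() * (x_eval[1] - x_eval[0])
```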

Hmmm… there are errors in the code 😡 the labels are missing. The black curve is Scott's, the red one Silverman's.

We see that there is very little difference between Scott’s Rule and Silverman’s Rule, and that the bandwidth selection with a limited amount of data is probably a bit too wide. We can define our own bandwidth function to get a less smoothed out result.

We see that if we set bandwidth to be very narrow, the obtained estimate for the probability density function (PDF) is simply the sum of Gaussians around each data point.

We now take a more realistic example, and look at the difference between the two available bandwidth selection rules. Those rules are known to work well for (close to) normal distributions, but even for unimodal distributions that are quite strongly non-normal they work reasonably well. As a non-normal distribution we take a Student’s T distribution with 5 degrees of freedom.

The code is too long; I collect it in a file

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x1 = np.random.normal(size=200)  # random data, normal distribution
xs = np.linspace(x1.min()-1, x1.max()+1, 200)

kde1 = stats.gaussian_kde(x1)
kde2 = stats.gaussian_kde(x1, bw_method='silverman')

fig = plt.figure(figsize=(8, 6))

ax1 = fig.add_subplot(211)
ax1.plot(x1, np.zeros(x1.shape), 'b+', ms=12)  # rug plot
ax1.plot(xs, kde1(xs), 'k-', label="Scott's Rule")
ax1.plot(xs, kde2(xs), 'b-', label="Silverman's Rule")
ax1.plot(xs, stats.norm.pdf(xs), 'r--', label="True PDF")

ax1.set_title("Normal (top) and Student's T$_{df=5}$ (bottom) distributions")

x2 = stats.t.rvs(5, size=200)  # random data, T distribution
xs = np.linspace(x2.min() - 1, x2.max() + 1, 200)

kde3 = stats.gaussian_kde(x2)
kde4 = stats.gaussian_kde(x2, bw_method='silverman')

ax2 = fig.add_subplot(212)
ax2.plot(x2, np.zeros(x2.shape), 'b+', ms=12)  # rug plot
ax2.plot(xs, kde3(xs), 'k-', label="Scott's Rule")
ax2.plot(xs, kde4(xs), 'b-', label="Silverman's Rule")
ax2.plot(xs, stats.t.pdf(xs, 5), 'r--', label="True PDF")

ax1.legend(loc=1)
ax2.legend(loc=1)
plt.show()



We now take a look at a bimodal distribution with one wider and one narrower Gaussian feature. We expect that this will be a more difficult density to approximate, due to the different bandwidths required to accurately resolve each feature.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from functools import partial

def my_kde_bandwidth(obj, fac=1./5):
    """We use Scott's Rule, multiplied by a constant factor."""
    return np.power(obj.n, -1./(obj.d+4)) * fac

loc1, scale1, size1 = (-2, 1, 175)
loc2, scale2, size2 = (2, 0.2, 50)
x2 = np.concatenate([np.random.normal(loc=loc1, scale=scale1, size=size1),
                     np.random.normal(loc=loc2, scale=scale2, size=size2)])

x_eval = np.linspace(x2.min() - 1, x2.max() + 1, 500)

kde = stats.gaussian_kde(x2)
kde2 = stats.gaussian_kde(x2, bw_method='silverman')
kde3 = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=0.2))
kde4 = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=0.5))

pdf = stats.norm.pdf
bimodal_pdf = (pdf(x_eval, loc=loc1, scale=scale1) * float(size1) / x2.size +
               pdf(x_eval, loc=loc2, scale=scale2) * float(size2) / x2.size)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)

ax.plot(x2, np.zeros(x2.shape), 'b+', ms=12)
ax.plot(x_eval, kde(x_eval), 'k-', label="Scott's Rule")
ax.plot(x_eval, kde2(x_eval), 'b-', label="Silverman's Rule")
ax.plot(x_eval, kde3(x_eval), 'g-', label="Scott * 0.2")
ax.plot(x_eval, kde4(x_eval), 'c-', label="Scott * 0.5")
ax.plot(x_eval, bimodal_pdf, 'r--', label="Actual PDF")

ax.set_xlim([x_eval.min(), x_eval.max()])
ax.legend(loc=2)
plt.show()


As expected, the KDE is not as close to the true PDF as we would like due to the different characteristic size of the two features of the bimodal distribution. By halving the default bandwidth (Scott * 0.5) we can do somewhat better, while using a factor 5 smaller bandwidth than the default doesn’t smooth enough. What we really need though in this case is a non-uniform (adaptive) bandwidth.

Multivariate estimation
With gaussian_kde we can perform multivariate as well as univariate estimation. We demonstrate the bivariate case. First we generate some random data with a model in which the two variates are correlated.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def measure(n):
    """Measurement model, return two coupled measurements."""
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2

m1, m2 = measure(2000)
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()

X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel.evaluate(positions).T, X.shape)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)

ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
          extent=[xmin, xmax, ymin, ymax])
ax.plot(m1, m2, 'k.', markersize=2)

ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
plt.show()


As already noted, the reference page is under construction and there are still bugs in the code. For the current post I resorted to Ralf Gommers, here.


SciPy – 47 – statistics – 4

Continuing from here, copying from here.

Analysing one sample
First, we create some random variables. We set a seed so that in each run we get identical results to look at. As an example we take a sample from the Student t distribution:

Here, we set the required shape parameter of the t distribution, which in statistics corresponds to the degrees of freedom, to 10. Using size=1000 means that our sample consists of 1000 independently drawn (pseudo) random numbers. Since we did not specify the keyword arguments loc and scale, those are set to their default values zero and one.

descriptive statistics
x is a numpy array, and we have direct access to all array methods, e.g.

How do some sample properties compare to their theoretical counterparts?

Note: stats.describe uses the unbiased estimator for the variance, while np.var is the biased estimator.

For our sample, the sample statistics differ by a small amount from their theoretical counterparts.
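The comparison might look like this (seed and sample parameters as in the scipy tutorial, so each run gives identical numbers):

```python
import numpy as np
from scipy import stats

np.random.seed(282629734)
x = stats.t.rvs(10, size=1000)   # df=10, loc=0, scale=1

m, v, s, k = stats.t.stats(10, moments='mvsk')     # theoretical counterparts
n, (smin, smax), sm, sv, ss, sk = stats.describe(x)

# stats.describe returns the unbiased variance, np.var the biased one
unbiased = np.var(x, ddof=1)
```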

t-test and KS-test
We can use the t-test to test whether the mean of our sample differs in a statistically significant way from the theoretical expectation.

The p-value is 0.7; this means that, with an alpha error of, for example, 10%, we cannot reject the hypothesis that the sample mean is equal to zero, the expectation of the standard t-distribution.

As an exercise, we can calculate our t-test also directly without using the provided function, which should give us the same answer, and so it does:
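The by-hand version might be sketched as follows (same seeded sample as above):

```python
import numpy as np
from scipy import stats

np.random.seed(282629734)
x = stats.t.rvs(10, size=1000)

t, p = stats.ttest_1samp(x, 0.0)

# the same statistic computed by hand: Student's t with n-1 d.o.f.
n = len(x)
tt = (x.mean() - 0.0) / np.sqrt(x.var(ddof=1) / n)
pp = 2.0 * stats.t.sf(np.abs(tt), n - 1)
```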

The Kolmogorov-Smirnov test can be used to test the hypothesis that the sample comes from the standard t-distribution.

Again the p-value is high enough that we cannot reject the hypothesis that the random sample really is distributed according to the t-distribution. In real applications, we don't know what the underlying distribution is. If we perform the Kolmogorov-Smirnov test of our sample against the standard normal distribution, then we also cannot reject the hypothesis that our sample was generated by the normal distribution, given that in this example the p-value is almost 40%.

However, the standard normal distribution has a variance of 1, while our sample has a variance of 1.29. If we standardize our sample and test it against the normal distribution, then the p-value is again large enough that we cannot reject the hypothesis that the sample came from the normal distribution.
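The three KS tests might be sketched like this (same seeded sample as above):

```python
import numpy as np
from scipy import stats

np.random.seed(282629734)
x = stats.t.rvs(10, size=1000)

d_t, p_t = stats.kstest(x, 't', (10,))   # against the t distribution, df=10
d_n, p_n = stats.kstest(x, 'norm')       # against the standard normal
# standardize the sample before testing against the normal distribution
d_s, p_s = stats.kstest((x - x.mean()) / x.std(), 'norm')
```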

Note: The Kolmogorov-Smirnov test assumes that we test against a distribution with given parameters. Since in the last case we estimated mean and variance, this assumption is violated, and the distribution of the test statistic on which the p-value is based is not correct.

OK, I confess: I'm a total noob 😊

tails of the distribution
Finally, we can check the upper tail of the distribution. We can use the percent point function ppf, which is the inverse of the cdf function, to obtain the critical values, or, more directly, we can use the inverse of the survival function
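Both routes to the critical values, sketched for the t distribution with 10 d.o.f.:

```python
import numpy as np
from scipy import stats

# critical values of the upper tail via ppf (inverse of the cdf) ...
crit_ppf = stats.t.ppf([0.90, 0.95, 0.99], 10)
# ... or, more directly, via the inverse survival function
crit_isf = stats.t.isf([0.10, 0.05, 0.01], 10)
```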

In all three cases, our sample has more weight in the top tail than the underlying distribution. We can briefly check a larger sample to see if we get a closer match. In this case the empirical frequency is quite close to the theoretical probability, but if we repeat this several times the fluctuations are still pretty large.

We can also compare it with the tail of the normal distribution, which has less weight in the tails:

The chisquare test can be used to test whether, for a finite number of bins, the observed frequencies differ significantly from the probabilities of the hypothesized distribution.

We see that the standard normal distribution is clearly rejected, while the standard t-distribution cannot be rejected. Since the variance of our sample differs from both standard distributions, we can again redo the test taking the estimate for scale and location into account.

The fit method of the distributions can be used to estimate the parameters of the distribution, and the test is repeated using probabilities of the estimated distribution.

Taking account of the estimated parameters, we can still reject the hypothesis that our sample came from a normal distribution (at the 5% level), but again, with a p-value of 0.95, we cannot reject the t distribution.

special tests for normal distributions
Since the normal distribution is the most common distribution in statistics, there are several additional functions available to test whether a sample could have been drawn from a normal distribution.

First we can test if skew and kurtosis of our sample differ significantly from those of a normal distribution:

These two tests are combined in the normality test.
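The three tests might be sketched as follows (same seeded t sample as before):

```python
import numpy as np
from scipy import stats

np.random.seed(282629734)
x = stats.t.rvs(10, size=1000)

z_skew, p_skew = stats.skewtest(x)
z_kurt, p_kurt = stats.kurtosistest(x)
z_norm, p_norm = stats.normaltest(x)   # combines the two tests above
```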

In all three tests the p-values are very low and we can reject the hypothesis that our sample has the skew and kurtosis of the normal distribution.

Since skew and kurtosis of our sample are based on central moments, we get exactly the same results if we test the standardized sample:

Because normality is rejected so strongly, we can check whether the normaltest gives reasonable results for other cases:

When testing for normality of a small sample of t-distributed observations and a large sample of normally distributed observations, in neither case can we reject the null hypothesis that the sample comes from a normal distribution. In the first case this is because the test is not powerful enough to distinguish a t- from a normally distributed random variable in a small sample.

Did I already mention that I'm a total noob 😊 at this stuff?


SciPy – 46 – statistics – 3

Continuing from here, copying from here.

Building specific distributions
The next examples show how to build your own distributions. Further examples show the usage of the distributions and some statistical tests.

Creating continuous distributions, i.e. subclassing rv_continuous
Making continuous distributions is fairly simple.

Interestingly, the pdf is now computed automatically:
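The deterministic distribution from the scipy tutorial can be sketched like this; only the cdf is specified, and the pdf falls out automatically (by numerical differentiation of the cdf):

```python
import numpy as np
from scipy import stats

class deterministic_gen(stats.rv_continuous):
    """A distribution concentrated at 0: the cdf jumps from 0 to 1."""
    def _cdf(self, x):
        return np.where(x < 0, 0.0, 1.0)
    def _stats(self):
        return 0.0, 0.0, 0.0, 0.0

deterministic = deterministic_gen(name="deterministic")

cdf_vals = deterministic.cdf(np.arange(-3, 3, 0.5))
# derived automatically, and slowly -- see the cautionary remarks
pdf_vals = deterministic.pdf(np.arange(-3, 3, 0.5))
```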

Be aware of the performance issues mentioned in Performance Issues and Cautionary Remarks [previous post]. The computation of unspecified common methods can become very slow, since only general methods are called which, by their very nature, cannot use any specific information about the distribution. Thus, as a cautionary example:

But this is not correct: the integral over this pdf should be 1. Let’s make the integration interval smaller:

This looks better. However, the problem originated from the fact that the pdf is not specified in the class definition of the deterministic distribution.

Subclassing rv_discrete
In the following we use stats.rv_discrete to generate a discrete distribution that has the probabilities of the truncated normal for the intervals centered around the integers.

General Info: From the docstring of rv_discrete, help(stats.rv_discrete),

“You can construct an arbitrary discrete rv where P{X=xk} = pk by passing to the rv_discrete initialization method (through the values= keyword) a tuple of sequences (xk, pk) which describes only those values of X(xk) that occur with nonzero probability (pk).”

Next to this, there are some further requirements for this approach to work:

  • The keyword name is required.
  • The support points of the distribution xk have to be integers.
  • The number of significant digits (decimals) needs to be specified.

In fact, if the last two requirements are not satisfied an exception may be raised or the resulting numbers may be incorrect.

An Example. Let’s do the work. First

And finally we can subclass rv_discrete:

Now that we have defined the distribution, we have access to all common methods of discrete distributions.
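The construction from the tutorial might be sketched as follows: integer support points, probabilities from cdf differences of the truncated normal, rounded to 7 decimals:

```python
import numpy as np
from scipy import stats

npoints = 20                        # number of integer support points - 1
npointsh = npoints // 2
nbound = 4                          # truncate the normal at +/- 4
normbound = (1 + 1 / float(npoints)) * nbound

grid = np.arange(-npointsh, npointsh + 2, 1)
gridlimitsnorm = (grid - 0.5) / npointsh * nbound
grid = grid[:-1]

probs = np.diff(stats.truncnorm.cdf(gridlimitsnorm, -normbound, normbound))

normdiscrete = stats.rv_discrete(
    values=(grid, np.round(probs, decimals=7)), name='normdiscrete')
```

Once defined, normdiscrete exposes the usual discrete methods: pmf, cdf, rvs, stats, and so on.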

Testing the Implementation. Let’s generate a random sample and compare observed frequencies with the probabilities.

Note: the code for the plots is not at the URLs indicated but at this one.

Next, we can test, whether our sample was generated by our normdiscrete distribution. This also verifies whether the random numbers are generated correctly.

The chisquare test requires that there are a minimum number of observations in each bin. We combine the tail bins into larger bins so that they contain enough observations.

The p-value in this case is high, so we can be quite confident that our random sample was actually generated by the distribution.


SciPy – 45 – statistics – 2

Continuing from here, copying from here.

Shifting and scaling
All continuous distributions take loc and scale as keyword parameters to adjust the location and scale of the distribution, e.g. for the standard normal distribution the location is the mean and the scale is the standard deviation.

In many cases the standardized distribution for a random variable X is obtained through the transformation (X - loc) / scale. The default values are loc = 0 and scale = 1.

Smart use of loc and scale can help modify the standard distributions in many ways. To illustrate the scaling further, the cdf of an exponentially distributed RV with mean 1/λ is given by

By applying the scaling rule above, it can be seen that by taking scale = 1./λ we get the proper scale.
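A quick sketch, taking λ = 0.5:

```python
import math
from scipy import stats

lam = 0.5
mean = stats.expon.mean(scale=1.0 / lam)          # mean 1/lambda = 2.0
cdf_at_2 = stats.expon.cdf(2.0, scale=1.0 / lam)  # F(x) = 1 - exp(-lambda*x)
```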

Note: Distributions that take shape parameters may require more than simple application of loc and/or scale to achieve the desired form. For example, the distribution of 2-D vector lengths given a constant vector of length R perturbed by independent N(0, σ2) deviations in each component is rice(R/σ, scale= σ). The first argument is a shape parameter that needs to be scaled along with x.

The uniform distribution is also interesting:
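With loc and scale, the uniform distribution lives on [loc, loc + scale]; here [1, 5]:

```python
import numpy as np
from scipy import stats

vals = stats.uniform.cdf([0, 1, 2, 3, 4, 5], loc=1, scale=4)
# -> [0., 0., 0.25, 0.5, 0.75, 1.]
```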

Finally, recall from the previous paragraph [post] that we are left with the problem of the meaning of norm.rvs(5). As it turns out, when calling a distribution like this, the first argument, i.e. the 5, gets passed to set the loc parameter. Let's see:

Thus, to explain the output of the example of the last section: norm.rvs(5) generates a single normally distributed random variate with mean loc=5, because of the default size=1.

We recommend that you set loc and scale parameters explicitly, by passing the values as keywords rather than as arguments. Repetition can be minimized when calling more than one method of a given RV by using the technique of Freezing a Distribution, as explained below.

Shape parameters
While a general continuous random variable can be shifted and scaled with the loc and scale parameters, some distributions require additional shape parameters. For instance, the gamma distribution, with density

requires the shape parameter a. Observe that setting λ can be obtained by setting the scale keyword to 1/λ.

Let’s check the number and name of the shape parameters of the gamma distribution. (We know from the above that this should be 1.)

Now we set the value of the shape variable to 1 to obtain the exponential distribution, so that we compare easily whether we get the results we expect.

Notice that we can also specify shape parameters as keywords:
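Putting the pieces together for the gamma distribution (a=1 gives the exponential, so with scale=2 we expect mean 2 and variance 4):

```python
from scipy import stats

numargs = stats.gamma.numargs   # number of shape parameters
shapes = stats.gamma.shapes     # their names

m, v = stats.gamma.stats(1, scale=2.0, moments='mv')
m_kw, v_kw = stats.gamma.stats(a=1, scale=2.0, moments='mv')  # shape as keyword
```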

Freezing a distribution
Passing the loc and scale keywords time and again can become quite bothersome. The concept of freezing a RV is used to solve such problems.

By using rv we no longer have to include the scale or the shape parameters anymore. Thus, distributions can be used in one of two ways, either by passing all distribution parameters to each method call (such as we did earlier) or by freezing the parameters for the instance of the distribution. Let us check this:

This is indeed what we should get.
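A sketch of a frozen RV, reusing the gamma example above:

```python
from scipy import stats

rv = stats.gamma(1, scale=2.0)   # parameters fixed once, for this instance
frozen_mean = rv.mean()          # no need to repeat shape or scale
frozen_std = rv.std()
```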

The basic methods pdf and so on satisfy the usual numpy broadcasting rules. For example, we can calculate the critical values for the upper tail of the t distribution for different probabilities and degrees of freedom.

If the array with probabilities, i.e. [0.1, 0.05, 0.01], and the array of degrees of freedom, i.e. [10, 11, 12], have the same array shape, then element-wise matching is used. As an example, we can obtain the 10% tail for 10 d.o.f., the 5% tail for 11 d.o.f. and the 1% tail for 12 d.o.f. by calling:
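Both broadcasting behaviors, sketched:

```python
import numpy as np
from scipy import stats

# a 2x3 table of critical values: rows = d.o.f. 10 and 11, cols = tails
table = stats.t.isf([0.1, 0.05, 0.01], [[10], [11]])

# equal shapes: element-wise matching, 10%@10, 5%@11, 1%@12 d.o.f.
diag = stats.t.isf([0.1, 0.05, 0.01], [10, 11, 12])
```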

Specific points for discrete distributions
Discrete distributions have mostly the same basic methods as continuous distributions. However, pdf is replaced by the probability mass function pmf, no estimation methods such as fit are available, and scale is not a valid keyword parameter. The location parameter, keyword loc, can still be used to shift the distribution.

The computation of the cdf requires some extra attention. In the case of continuous distributions the cumulative distribution function is, in most standard cases, strictly monotonically increasing in the bounds (a,b) and therefore has a unique inverse. The cdf of a discrete distribution, however, is a step function, hence the inverse cdf, i.e. the percent point function, requires a different definition:

ppf(q) = min{x : cdf(x) >= q, x integer}

For further info, see the docs here.

We can look at the hypergeometric distribution as an example

If we use the cdf at some integer points and then evaluate the ppf at those cdf values, we get the initial integers back, for example

If we use values that are not at the kinks of the cdf step function, we get the next higher integer back:
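Both behaviors, sketched with the tutorial's hypergeometric parameters:

```python
import numpy as np
from scipy import stats

M, n, N = 20, 7, 12       # population size, tagged items, draws
x = np.arange(4) * 2      # [0, 2, 4, 6]

prb = stats.hypergeom.cdf(x, M, n, N)
back = stats.hypergeom.ppf(prb, M, n, N)          # exactly at the kinks
above = stats.hypergeom.ppf(prb + 1e-8, M, n, N)  # just above: next integer
```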

Fitting distributions
The main additional methods of the not frozen distribution are related to the estimation of distribution parameters:

  • fit: maximum likelihood estimation of distribution parameters, including location and scale
  • fit_loc_scale: estimation of location and scale when shape parameters are given
  • nnlf: negative log likelihood function
  • expect: Calculate the expectation of a function against the pdf or pmf

Performance issues and cautionary remarks
The performance of the individual methods, in terms of speed, varies widely by distribution and method. The results of a method are obtained in one of two ways: either by explicit calculation, or by a generic algorithm that is independent of the specific distribution.

Explicit calculation, on the one hand, requires that the method is directly specified for the given distribution, either through analytic formulas or through special functions in scipy.special or numpy.random for rvs. These are usually relatively fast calculations.

The generic methods, on the other hand, are used if the distribution does not specify any explicit calculation. To define a distribution, only one of pdf or cdf is necessary; all other methods can be derived using numeric integration and root finding. However, these indirect methods can be very slow. As an example, rgh = stats.gausshyper.rvs(0.5, 2, 2, 2, size=100) creates random variables in a very indirect way and takes about 19 seconds for 100 random variables on my computer, while one million random variables from the standard normal or from the t distribution take just above one second.

Remaining issues
The distributions in scipy.stats have recently been corrected and improved and gained a considerable test suite; however, a few issues remain:

  • the distributions have been tested over some range of parameters, however in some corner ranges, a few incorrect results may remain.
  • the maximum likelihood estimation in fit does not work with default starting parameters for all distributions and the user needs to supply good starting parameters. Also, for some distribution using a maximum likelihood estimator might inherently not be the best choice.


SciPy – 44 – statistics – 1

Continuing from here, copying from here.

In this tutorial we discuss many, but certainly not all, features of scipy.stats. The intention here is to provide a user with a working knowledge of this package.

Note: This documentation is work in progress.

Random variables
There are two general distribution classes that have been implemented for encapsulating continuous random variables and discrete random variables. Over 80 continuous random variables (RVs) and 10 discrete random variables have been implemented using these classes. Besides this, new routines and distributions can easily be added by the end user. (If you create one, please contribute it.)

All of the statistics functions are located in the sub-package scipy.stats and a fairly complete listing of these functions can be obtained using info(stats). The list of the random variables available can also be obtained from the docstring for the stats sub-package.

In the discussion below we mostly focus on continuous RVs. Nearly all applies to discrete variables also, but we point out some differences here at Specific Points for Discrete Distributions.

In the code samples below we assume that the scipy.stats package is imported as

and in some cases we assume that individual objects are imported as

Getting help
First of all, all distributions are accompanied with help functions. To obtain just some basic information we print the relevant docstring: print(stats.norm.__doc__).

A never-ending listing, complete with examples.

To find the support, i.e., upper and lower bound of the distribution, call:

We can list all methods and properties of the distribution with dir(norm). As it turns out, some of the methods are private although they are not named as such (their names do not start with a leading underscore); for example, veccdf is only available for internal calculation (those methods will give warnings when one tries to use them, and will be removed at some point).

To obtain the real main methods, we list the methods of the frozen distribution. (We explain the meaning of a frozen distribution below).

Here too, a very long list.

Finally, we can obtain the list of available distribution through introspection:
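One way to do the introspection, as sketched in the scipy tutorial:

```python
from scipy import stats

dist_continu = [d for d in dir(stats)
                if isinstance(getattr(stats, d), stats.rv_continuous)]
dist_discrete = [d for d in dir(stats)
                 if isinstance(getattr(stats, d), stats.rv_discrete)]
```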

Common methods
The main public methods for continuous RVs are:

  • rvs: Random Variates
  • pdf: Probability Density Function
  • cdf: Cumulative Distribution Function
  • sf: Survival Function (1-CDF)
  • ppf: Percent Point Function (Inverse of CDF)
  • isf: Inverse Survival Function (Inverse of SF)
  • stats: Return mean, variance, (Fisher’s) skew, or (Fisher’s) kurtosis
  • moment: non-central moments of the distribution

Let’s take a normal RV as an example.

To compute the cdf at a number of points, we can pass a list or a numpy array.

Thus, the basic methods such as pdf, cdf, and so on are vectorized with np.vectorize.

Other generally useful methods are supported too:

To find the median of a distribution we can use the percent point function ppf, which is the inverse of the cdf:

To generate a sequence of random variates, use the size keyword argument:
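The basic calls above, sketched for the standard normal:

```python
import numpy as np
from scipy import stats

single = stats.norm.cdf(0)               # a single point: 0.5
several = stats.norm.cdf([-1.0, 0, 1])   # a list or array works: vectorized
median = stats.norm.ppf(0.5)             # ppf inverts the cdf: 0.0
sample = stats.norm.rvs(size=3)          # three random variates
```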

Note that drawing random numbers relies on generators from the numpy.random package. In the example above, the specific stream of random numbers is not reproducible across runs. To achieve reproducibility, you can explicitly seed the global random state:

Relying on a global state is not recommended though. A better way is to use the random_state parameter which accepts an instance of numpy.random.RandomState class, or an integer which is then used to seed an internal RandomState object:
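A sketch of the recommended approach: the same integer seed gives the same stream, without touching global state:

```python
import numpy as np
from scipy import stats

a = stats.norm.rvs(size=5, random_state=1234)
b = stats.norm.rvs(size=5, random_state=1234)   # identical to a
```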

Don’t think that norm.rvs(5) generates 5 variates:

Here, 5 with no keyword is being interpreted as the first possible keyword argument, loc, which is the first of a pair of keyword arguments taken by all continuous distributions. This brings us to the topic of the next subsection.


SciPy – 43 – spatial data structures and algorithms – 2

Continuing from here, copying from here.

Convex hulls
I translate “convex hulls” as gusci convessi, hoping for the best… 😊

Convex hull is the smallest convex object containing all points in a given point set.

These can be computed via the Qhull wrappers in scipy.spatial as follows:

The convex hull is represented as a set of N-1 dimensional simplices, which in 2-D means line segments. The storage scheme is exactly the same as for the simplices in the Delaunay triangulation discussed above.
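A minimal sketch (the 30 random points are only illustrative):

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.RandomState(0)
points = rng.rand(30, 2)       # 30 random points in the plane

hull = ConvexHull(points)
# hull.simplices: indices into points, one row per hull line segment (2-D)
# hull.vertices: indices of the hull corners
```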

We can illustrate the above result:

The same can be achieved with scipy.spatial.convex_hull_plot_2d.

Voronoi diagrams
A Voronoi diagram is a subdivision of the space into the nearest neighborhoods of a given set of points.

There are two ways to approach this object using scipy.spatial. First, one can use the KDTree to answer the question “which of the points is closest to this one”, and define the regions that way:

So the point (0.1, 0.1) belongs to region 0. In color:
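The KDTree approach, sketched on a small grid of points:

```python
import numpy as np
from scipy.spatial import KDTree

points = np.array([[0, 0], [0, 1], [0, 2],
                   [1, 0], [1, 1], [1, 2]])
tree = KDTree(points)
dist, region = tree.query([0.1, 0.1])   # nearest input point -> its region
```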

This does not, however, give the Voronoi diagram as a geometrical object.
The representation in terms of lines and points can be again obtained via the Qhull wrappers in scipy.spatial:

The Voronoi vertices denote the set of points forming the polygonal edges of the Voronoi regions. In this case, there are 9 different regions:

The negative value -1 again indicates a point at infinity. Indeed, only one of the regions, [3, 1, 0, 2], is bounded. Note here that, due to numerical precision issues similar to those in the Delaunay triangulation above, there may be fewer Voronoi regions than input points.

The ridges (lines in 2-D) separating the regions are described as a similar collection of simplices as the convex hull pieces:

These numbers indicate indices of the Voronoi vertices making up the line segments. -1 is again a point at infinity — only four of the 12 lines are bounded line segments, while the others extend to infinity.

The Voronoi ridges are perpendicular to lines drawn between the input points. Which two points each ridge corresponds to is also recorded:
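Continuing the 3×3-grid sketch, the two ridge attributes look like this:

```python
import numpy as np
from scipy.spatial import Voronoi

points = np.array([[0, 0], [0, 1], [0, 2],
                   [1, 0], [1, 1], [1, 2],
                   [2, 0], [2, 1], [2, 2]], dtype=float)
vor = Voronoi(points)

print(vor.ridge_vertices)  # pairs of Voronoi vertex indices; -1 = infinity
print(vor.ridge_points)    # the pair of input points each ridge separates
```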

This information, taken together, is enough to construct the full diagram.

We can plot it as follows. First the points and the Voronoi vertices:

Plotting the finite line segments goes as for the convex hull, but now we have to guard for the infinite edges:

The ridges extending to infinity require a bit more care:
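A sketch of the whole plot for the 3×3 grid, including the guard for infinite edges (the voronoi.png filename and the factor 100 for the "far point" are my own choices):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi

points = np.array([[0, 0], [0, 1], [0, 2],
                   [1, 0], [1, 1], [1, 2],
                   [2, 0], [2, 1], [2, 2]], dtype=float)
vor = Voronoi(points)

# the input points and the Voronoi vertices
plt.plot(points[:, 0], points[:, 1], '.')
plt.plot(vor.vertices[:, 0], vor.vertices[:, 1], '*')
plt.xlim(-1, 3)
plt.ylim(-1, 3)

# finite segments: both vertex indices are >= 0
for simplex in vor.ridge_vertices:
    simplex = np.asarray(simplex)
    if np.all(simplex >= 0):
        plt.plot(vor.vertices[simplex, 0], vor.vertices[simplex, 1], 'k-')

# infinite ridges: draw from the finite vertex outwards, perpendicular
# to the line joining the two input points that the ridge separates
center = points.mean(axis=0)
for pointidx, simplex in zip(vor.ridge_points, vor.ridge_vertices):
    simplex = np.asarray(simplex)
    if np.any(simplex < 0):
        i = simplex[simplex >= 0][0]          # the finite end of the ridge
        t = points[pointidx[1]] - points[pointidx[0]]
        t = t / np.linalg.norm(t)
        n = np.array([-t[1], t[0]])           # normal to the ridge
        midpoint = points[pointidx].mean(axis=0)
        far_point = vor.vertices[i] + np.sign(np.dot(midpoint - center, n)) * n * 100
        plt.plot([vor.vertices[i, 0], far_point[0]],
                 [vor.vertices[i, 1], far_point[1]], 'k--')

plt.savefig('voronoi.png')
```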

This plot can also be created using scipy.spatial.voronoi_plot_2d.


SciPy – 42 – spatial data structures and algorithms – 1

Continuing from here, copying from here.

scipy.spatial can compute triangulations, Voronoi diagrams, and convex hulls of a set of points, by leveraging the Qhull library.

Moreover, it contains KDTree implementations for nearest-neighbor point queries, and utilities for distance computations in various metrics.

Delaunay triangulation
The Delaunay triangulation is a subdivision of a set of points into a non-overlapping set of triangles, such that no point is inside the circumcircle of any triangle. In practice, such triangulations tend to avoid triangles with small angles.

Delaunay triangulation can be computed using scipy.spatial as follows:
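A sketch of the call, with four points whose triangulation has exactly two triangles:

```python
import numpy as np
from scipy.spatial import Delaunay

points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1]])
tri = Delaunay(points)
print(tri.simplices)   # two triangles, each a triple of indices into points
```

matplotlib's plt.triplot(points[:, 0], points[:, 1], tri.simplices) will draw the triangulation.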

We can visualize it:

The structure of the triangulation is encoded in the following way: the simplices attribute contains the indices of the points in the points array that make up the triangle. For instance:
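For instance, indexing points by a row of simplices recovers the triangle's coordinates:

```python
import numpy as np
from scipy.spatial import Delaunay

points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1]])
tri = Delaunay(points)

first = tri.simplices[0]    # indices of the first triangle's vertices
print(first)
print(points[first])        # the corresponding coordinates, shape (3, 2)
```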

Moreover, neighboring triangles can also be found out:

What this tells us is that this triangle has triangle #0 as a neighbor, but no other neighbors. Moreover, it tells us that neighbor 0 is opposite vertex 1 of the triangle:
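A sketch of the neighbors attribute for the same four points (which slot holds the -1 entries can vary between Qhull versions, so the code locates the neighbor by value rather than hard-coding a position):

```python
import numpy as np
from scipy.spatial import Delaunay

points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1]])
tri = Delaunay(points)

# neighbors[i, j] is the simplex opposite vertex j of simplex i; -1 = none
print(tri.neighbors[1])

# the vertex of triangle 1 that faces its only neighbor (triangle 0)
j = int(np.where(tri.neighbors[1] == 0)[0][0])
print(points[tri.simplices[1, j]])
```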

Indeed, from the figure we see that this is the case.

Qhull can also perform tessellations into simplices for higher-dimensional point sets (for instance, subdivision into tetrahedra in 3-D).

Coplanar points
It is important to note that not all points necessarily appear as vertices of the triangulation, due to numerical precision issues in forming the triangulation. Consider the above with a duplicated point:

Observe that point #4, which is a duplicate, does not occur as a vertex of the triangulation. That this happened is recorded:

This means that point 4 resides near triangle 0 and vertex 3, but is not included in the triangulation.
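A sketch with a duplicated corner of the unit square; the coplanar attribute records the omission:

```python
import numpy as np
from scipy.spatial import Delaunay

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 1]])  # point 4 duplicates point 3
tri = Delaunay(points)

print(np.unique(tri.simplices))  # point 4 never appears as a vertex
# columns: omitted point, nearest simplex, nearest vertex of that simplex
print(tri.coplanar)
```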

Note that such degeneracies can occur not only because of duplicated points, but also for more complicated geometrical reasons, even in point sets that at first sight seem well-behaved.

However, Qhull has the "QJ" option, which instructs it to perturb the input data randomly until degeneracies are resolved:

Two new triangles appeared. However, we see that they are degenerate and have zero area.
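The same point set with joggling enabled (I also pass "Pp" to silence Qhull's precision warnings):

```python
import numpy as np
from scipy.spatial import Delaunay

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 1]])
tri = Delaunay(points, qhull_options="QJ Pp")   # joggle the input

print(tri.simplices)   # every point, including #4, is now a vertex
print(tri.coplanar)    # empty: no point was left out
```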


SciPy – 41 – compressed sparse graph routines

Continuing from here, copying from here.

Example: the Word Ladders game
A Word Ladder is a word game invented by Lewis Carroll in which players find paths between words by switching one letter at a time. For example, one can link “ape” and “man” in the following way:

Note that each step involves changing just one letter of the word. This is just one possible path from “ape” to “man”, but is it the shortest possible path? If we desire to find the shortest word ladder path between two given words, the sparse graph submodule can help.

First we need a list of valid words. Many operating systems have such a list built in. For example, on Linux, a word list can often be found at one of the following locations:


Another easy source of words is the Scrabble word lists available at various sites around the internet (search with your favorite search engine). We’ll first create this list. The system word lists consist of a file with one word per line. The following should be modified to use the particular word list you have available.

We want to look at words of length 3, so let’s select just those words of the correct length. We’ll also eliminate words which start with an upper-case letter (proper nouns) or contain non-alphanumeric characters like apostrophes and hyphens. Finally, we’ll make sure everything is lower-case for comparison later:

Note: in Python 3 the object returned by map() has no len(), so I modified statement 7. I also end up with one extra word; entropy and English are both growing 😜

The proposed code doesn’t work for the subsequent statements either; a bit of Stack Overflow digging and here’s the tip.
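Since the system word list varies by machine, here is a Python 3 sketch of the same filtering with a tiny hard-coded stand-in list (substitute the readlines() of your own word file):

```python
# stand-in for open('/path/to/your/wordlist').readlines()
raw = ['ape\n', 'apt\n', "can't\n", 'Ely\n', 'man\n', 'mat\n',
       'oat\n', 'opt\n', 'zebra\n']

word_list = [w.strip() for w in raw]
word_list = [w for w in word_list if len(w) == 3]      # three-letter words only
word_list = [w for w in word_list if w[0].islower()]   # drop proper nouns
word_list = [w for w in word_list if w.isalpha()]      # drop apostrophes/hyphens
word_list = list(map(str.lower, word_list))            # list(...) is the Python 3 fix
print(len(word_list), word_list)
```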

Now we have a list of 587 valid three-letter words (the tutorial finds 586; the exact number may change depending on the particular word list used). Each of these words will become a node in our graph, and we will create edges connecting the nodes associated with each pair of words which differ by only one letter.

There are efficient ways to do this, and inefficient ways to do this. To do this as efficiently as possible, we’re going to use some sophisticated numpy array manipulation:

We have an array where each entry is three bytes. We’d like to find all pairs where exactly one byte is different. We’ll start by converting each word to a three-dimensional vector:

OK, the dimensions don’t match here: in Python 3, numpy stores str data as 4-byte unicode code points, so each three-letter word occupies 12 bytes, not 3.
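Casting to one-byte strings ('S3') restores the shape the tutorial expects; a sketch with the toy six-word list:

```python
import numpy as np

word_list = np.array(['ape', 'apt', 'man', 'mat', 'oat', 'opt'])
print(word_list.dtype, word_list.itemsize)   # <U3 12: four bytes per letter

# one byte per letter, then one row of three uint8 values per word
word_bytes = word_list.astype('S3').view('uint8').reshape(word_list.size, -1)
print(word_bytes.shape)                      # (6, 3)
```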

Now we’ll use the Hamming distance between each pair of words to determine which pairs are connected. The Hamming distance measures the fraction of entries between two vectors which differ: any two words with a Hamming distance equal to 1/N, where N is the number of letters, are connected in the word ladder:

When comparing the distances, we don’t use an equality because this can be unstable for floating point values. The inequality produces the desired result as long as no two entries of the word list are identical. Now that our graph is set up, we’ll use a shortest path search to find the path between any two words in the graph:
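A self-contained sketch with the toy six-word list (the real run uses the full dictionary; note the threshold divides by the number of letters, 3, not by the unicode itemsize):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import pdist, squareform

word_list = np.array(['ape', 'apt', 'man', 'mat', 'oat', 'opt'])
word_bytes = word_list.astype('S3').view('uint8').reshape(word_list.size, -1)

# hamming distance = fraction of differing letters; the 1.5/N threshold keeps
# exactly the one-letter pairs without testing floats for equality
hamming_dist = pdist(word_bytes, metric='hamming')
graph = csr_matrix(squareform(hamming_dist < 1.5 / word_bytes.shape[1]))
print(graph.nnz // 2)   # 5 one-letter edges in this toy list
```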

We need to check that these match, because if the words are not in the list that will not be the case. Now all we need is to find the shortest path between these two indices in the graph. We’ll use Dijkstra’s algorithm, because it allows us to find the path for just one node:

So we see that the shortest path between ‘ape’ and ‘man’ contains only five steps. We can use the predecessors returned by the algorithm to reconstruct this path:
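Sticking with the toy list, a sketch of the search plus the predecessor walk-back (in this small list the ape-to-man ladder also happens to take five steps):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import pdist, squareform

word_list = np.array(['ape', 'apt', 'man', 'mat', 'oat', 'opt'])  # already sorted
word_bytes = word_list.astype('S3').view('uint8').reshape(word_list.size, -1)
graph = csr_matrix(squareform(pdist(word_bytes, 'hamming') < 0.5))

i1 = np.searchsorted(word_list, 'ape')
i2 = np.searchsorted(word_list, 'man')
distances, predecessors = dijkstra(graph, indices=i1, return_predecessors=True)
print(distances[i2])          # 5.0: five single-letter steps

# follow the predecessors back from 'man' to 'ape'
path = []
i = i2
while i != i1:
    path.append(str(word_list[i]))
    i = predecessors[i]
path.append(str(word_list[i1]))
print(path[::-1])             # ['ape', 'apt', 'opt', 'oat', 'mat', 'man']
```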

This is three fewer links than our initial example: the path from ape to man is only five steps.

Using other tools in the module, we can answer other questions. For example, are there three-letter words which are not linked in a word ladder? This is a question of connected components in the graph:

In this particular sample of three-letter words, there are 16 connected components (the tutorial finds 15): that is, 16 distinct sets of words with no paths between the sets. How many words are in each of these sets? We can learn this from the list of components:
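A sketch of the same question on the toy list, with one word added that has no one-letter neighbours:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

# the toy ladder chain plus an isolated word
word_list = np.array(['ape', 'apt', 'man', 'mat', 'oat', 'opt', 'pyx'])
word_bytes = word_list.astype('S3').view('uint8').reshape(word_list.size, -1)
graph = csr_matrix(squareform(pdist(word_bytes, 'hamming') < 0.5))

n_components, labels = connected_components(graph, directed=False)
print(n_components)                                            # 2
print([int((labels == i).sum()) for i in range(n_components)]) # [6, 1]
```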

There is one large connected set and 15 smaller ones (the tutorial finds 14). Let’s look at the words in the smaller ones:

These are all the three-letter words which do not connect to others via a word ladder.

We might also be curious about which words are maximally separated. Which two words take the most links to connect? We can determine this by computing the matrix of all shortest paths. Note that by convention, the distance between two non-connected points is reported to be infinity, so we’ll need to remove these before finding the maximum:

So there is at least one pair of words which takes 13 steps to get from one to the other! Let’s determine which these are:
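The same computation on the toy list, including the zip trick for displaying the maximally separated pairs:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import pdist, squareform

word_list = np.array(['ape', 'apt', 'man', 'mat', 'oat', 'opt'])
word_bytes = word_list.astype('S3').view('uint8').reshape(word_list.size, -1)
graph = csr_matrix(squareform(pdist(word_bytes, 'hamming') < 0.5))

# all-pairs shortest paths; drop the infinities before taking the maximum
distances, predecessors = dijkstra(graph, return_predecessors=True)
max_distance = np.max(distances[~np.isinf(distances)])
print(max_distance)          # 5.0: 'ape' and 'man' are the far ends here

# zip() is an iterator in Python 3, so wrap it in list() to display it
i1, i2 = np.nonzero(distances == max_distance)
print(list(zip(word_list[i1], word_list[i2])))
```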

Hmm… how do you display the zip object? In Python 3, zip() returns an iterator, so wrap it in list() to see its contents.

We see that there are two pairs of words which are maximally separated from each other: ‘imp’ and ‘ump’ on one hand, and ‘ohm’ and ‘ohs’ on the other hand. We can find the connecting list in the same way as above:

This gives us the path we desired to see.

Word ladders are just one potential application of scipy’s fast graph algorithms for sparse matrices. Graph theory makes appearances in many areas of mathematics, data analysis, and machine learning. The sparse graph tools are flexible enough to handle many of these situations.


SciPy – 40 – eigenvalues with ARPACK

Continuo da qui, copio qui.

ARPACK is a Fortran package which provides routines for quickly finding a few eigenvalues/eigenvectors of large sparse matrices. In order to find these solutions, it requires only left-multiplication by the matrix in question. This operation is performed through a reverse-communication interface. The result of this structure is that ARPACK is able to find eigenvalues and eigenvectors of any linear function mapping a vector to a vector.

All of the functionality provided in ARPACK is contained within the two high-level interfaces scipy.sparse.linalg.eigs and scipy.sparse.linalg.eigsh. eigs provides interfaces to find the eigenvalues/vectors of real or complex nonsymmetric square matrices, while eigsh provides interfaces for real-symmetric or complex-hermitian matrices.

Basic functionality
ARPACK can solve either standard eigenvalue problems of the form

A x = λ x

or general eigenvalue problems of the form

A x = λ M x

The power of ARPACK is that it can compute only a specified subset of eigenvalue/eigenvector pairs. This is accomplished through the keyword which. The following values of which are available:

  • which = 'LM': Eigenvalues with largest magnitude (eigs, eigsh), that is, largest eigenvalues in the euclidean norm of complex numbers.
  • which = 'SM': Eigenvalues with smallest magnitude (eigs, eigsh), that is, smallest eigenvalues in the euclidean norm of complex numbers.
  • which = 'LR': Eigenvalues with largest real part (eigs)
  • which = 'SR': Eigenvalues with smallest real part (eigs)
  • which = 'LI': Eigenvalues with largest imaginary part (eigs)
  • which = 'SI': Eigenvalues with smallest imaginary part (eigs)
  • which = 'LA': Eigenvalues with largest algebraic value (eigsh), that is, largest eigenvalues inclusive of any negative sign.
  • which = 'SA': Eigenvalues with smallest algebraic value (eigsh), that is, smallest eigenvalues inclusive of any negative sign.
  • which = 'BE': Eigenvalues from both ends of the spectrum (eigsh)

Note that ARPACK is generally better at finding extremal eigenvalues: that is, eigenvalues with large magnitudes. In particular, using which = 'SM' may lead to slow execution time and/or anomalous results. A better approach is to use shift-invert mode.

Shift-invert mode
Shift-invert mode relies on the following observation. For the generalized eigenvalue problem

A x = λ M x

it can be shown that

(A − σ M)⁻¹ M x = ν x, where ν = 1/(λ − σ)

Imagine you’d like to find the smallest and largest eigenvalues and the corresponding eigenvectors for a large matrix. ARPACK can handle many forms of input: dense matrices such as numpy.ndarray instances, sparse matrices such as scipy.sparse.csr_matrix, or a general linear operator derived from scipy.sparse.linalg.LinearOperator. For this example, for simplicity, we’ll construct a symmetric, positive-definite matrix.

We now have a symmetric matrix X with which to test the routines. First compute a standard eigenvalue decomposition using eigh:

As the dimension of X grows, this routine becomes very slow. Especially if only a few eigenvectors and eigenvalues are needed, ARPACK can be a better option. First let’s compute the largest eigenvalues (which = 'LM') of X and compare them to the known results:
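A sketch of the whole setup (matrix size and seed are arbitrary): build the test matrix, decompose it densely with eigh, then ask ARPACK for the three largest eigenvalues.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.linalg import eigsh

np.random.seed(0)                    # arbitrary seed for reproducibility
X = np.random.random((100, 100)) - 0.5
X = np.dot(X, X.T)                   # symmetric, positive-definite

evals_all, evecs_all = eigh(X)       # full dense decomposition
evals_large, evecs_large = eigsh(X, 3, which='LM')

print(evals_all[-3:])                # three largest, dense solver
print(evals_large)                   # ARPACK agrees
print(np.dot(evecs_large.T, evecs_large))  # ~identity: orthogonal eigenvectors
```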

The results are as expected. ARPACK recovers the desired eigenvalues, and they match the previously known results. Furthermore, the eigenvectors are orthogonal, as we’d expect. Now let’s attempt to solve for the eigenvalues with smallest magnitude:

Hmm… It didn’t give me the expected error, which should have been:

>>> evals_small, evecs_small = eigsh(X, 3, which='SM')
Traceback (most recent call last):       # may vary (convergence)
ARPACK error -1: No convergence (1001 iterations, 0/3 eigenvectors converged)

Oops. We see that as mentioned above, ARPACK is not quite as adept at finding small eigenvalues. There are a few ways this problem can be addressed. We could increase the tolerance (tol) to lead to faster convergence:

This works, but we lose the precision in the results. Another option is to increase the maximum number of iterations (maxiter) from 1000 to 5000:
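A sketch of both workarounds on the same test matrix (convergence behaviour can vary with the SciPy version, as the author noticed above):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.linalg import eigsh

np.random.seed(0)
X = np.random.random((100, 100)) - 0.5
X = np.dot(X, X.T)
evals_all = eigh(X, eigvals_only=True)

# loose tolerance converges, but only to a few digits
evals_small, _ = eigsh(X, 3, which='SM', tol=1E-3)
print(np.abs(evals_small - evals_all[:3]))   # small but visible errors

# raising maxiter recovers full precision, at a much higher cost
evals_small, _ = eigsh(X, 3, which='SM', maxiter=5000)
print(np.abs(evals_small - evals_all[:3]))
```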

We get the results we’d hoped for, but the computation time is much longer. Fortunately, ARPACK contains a mode that allows quick determination of non-external eigenvalues: shift-invert mode. As mentioned above, this mode involves transforming the eigenvalue problem to an equivalent problem with different eigenvalues. In this case, we hope to find eigenvalues near zero, so we’ll choose sigma = 0. The transformed eigenvalues will then satisfy ν = 1/(λ−σ) = 1/λ, so our small eigenvalues λ become large eigenvalues ν.
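The shift-invert call, sketched on the same test matrix:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.linalg import eigsh

np.random.seed(0)
X = np.random.random((100, 100)) - 0.5
X = np.dot(X, X.T)
evals_all = eigh(X, eigvals_only=True)

# sigma=0 turns the smallest lambdas into the largest nus,
# which ARPACK finds quickly
evals_small, evecs_small = eigsh(X, 3, sigma=0, which='LM')
print(evals_small)                  # matches evals_all[:3]
```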

We get the results we were hoping for, with much less computational time. Note that the transformation from ν→λ takes place entirely in the background. The user need not worry about the details.

The shift-invert mode provides more than just a fast way to obtain a few small eigenvalues. Say you desire to find internal eigenvalues and eigenvectors, e.g. those nearest to λ=1. Simply set sigma = 1 and ARPACK takes care of the rest:
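A sketch of the internal-eigenvalue case, comparing against the dense solver (the sort is needed because the two solvers may return the values in different orders):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.linalg import eigsh

np.random.seed(0)
X = np.random.random((100, 100)) - 0.5
X = np.dot(X, X.T)
evals_all = eigh(X, eigvals_only=True)

# the three eigenvalues nearest lambda = 1
evals_mid, evecs_mid = eigsh(X, 3, sigma=1, which='LM')

i_sorted = np.argsort(np.abs(evals_all - 1))[:3]   # same three, dense solver
print(np.sort(evals_mid))
print(np.sort(evals_all[i_sorted]))
```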

The eigenvalues come out in a different order, but they’re all there. Note that the shift-invert mode requires the internal solution of a matrix inverse. This is taken care of automatically by eigsh and eigs, but the operation can also be specified by the user. See the docstring of scipy.sparse.linalg.eigsh and scipy.sparse.linalg.eigs for details.

For anyone who wants to know more, there is the ARPACK site.