JavaScript 54 – regular expressions – 6

Continuing from here, copying here.

Avidità (greed)
Greed, Marijn calls it: and who am I not to translate it?

It isn’t hard to use replace to write a function that removes all comments from a piece of JavaScript code. Here is a first attempt (file re24.js):

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*\*\//g, "");
}

console.log(stripComments("1 + /* 2 */3"));
console.log(stripComments("x = 10;// ten!"));
console.log(stripComments("1 /* a */+/* b */ 1"));

The part before the or operator simply matches two slash characters followed by any number of non-newline characters. The part for multiline comments is more involved. We use [^] (any character that is not in the empty set of characters) as a way to match any character. We cannot just use a dot here because block comments can continue on a new line, and dots do not match the newline character.

But the output of the previous example appears to have gone wrong. Why?

The [^]* part of the expression, as I described in the section on backtracking, will first match as much as it can. If that causes the next part of the pattern to fail, the matcher moves back one character and tries again from there. In the example, the matcher first tries to match the whole rest of the string and then moves back from there. It will find an occurrence of */ after going back four characters and match that. This is not what we wanted—the intention was to match a single comment, not to go all the way to the end of the code and find the end of the last block comment.

Because of this behavior, we say the repetition operators (+, *, ?, and {}) are greedy, meaning they match as much as they can and backtrack from there. If you put a question mark after them (+?, *?, ??, {}?), they become nongreedy and start by matching as little as possible, matching more only when the remaining pattern does not fit the smaller match.

And that is exactly what we want in this case. By having the star match the smallest stretch of characters that brings us to a */, we consume one block comment and nothing more (re25.js).

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*?\*\//g, "");
}

console.log(stripComments("1 /* a */+/* b */ 1"));

A lot of bugs in regular expression programs can be traced to unintentionally using a greedy operator where a nongreedy one would work better. When using a repetition operator, consider the nongreedy variant first.

:mrgreen:

NumPy – 93 – introduction to Scikit-Learn – 2

Continuing from here, copying here.

Scikit-Learn's Estimator API
The Scikit-Learn API is designed with the following guiding principles in mind, as outlined in the Scikit-Learn API paper:

  • Consistency: All objects share a common interface drawn from a limited set of methods, with consistent documentation.
  • Inspection: All specified parameter values are exposed as public attributes.
  • Limited object hierarchy: Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.
  • Composition: Many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.
  • Sensible defaults: When models require user-specified parameters, the library defines an appropriate default value.

In practice, these principles make Scikit-Learn very easy to use, once the basic principles are understood. Every machine learning algorithm in Scikit-Learn is implemented via the Estimator API, which provides a consistent interface for a wide range of machine learning applications.

API basics
Most commonly, the steps in using the Scikit-Learn estimator API are as follows (we will step through a handful of detailed examples in the sections that follow).

  • Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
  • Choose model hyperparameters by instantiating this class with desired values.
  • Arrange data into a features matrix and target vector following the discussion above.
  • Fit the model to your data by calling the fit() method of the model instance.
  • Apply the Model to new data:
    For supervised learning, often we predict labels for unknown data using the predict() method.
    For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.

We will now step through several simple examples of applying supervised and unsupervised learning methods.

Supervised example: simple linear regression
As an example of this process, let’s consider a simple linear regression—that is, the common case of fitting a line to (x,y) data. We will use the following simple data for our regression example:
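
The data snippet isn't reproduced in this post; a minimal sketch consistent with what follows (a noisy line with slope 2 and intercept -1; the seed, the number of points and the x range are my own choices) could be:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(42)    # the seed is an assumption
x = 10 * rng.rand(50)              # 50 points in [0, 10), also an assumption
y = 2 * x - 1 + rng.randn(50)      # slope 2, intercept -1, plus noise
plt.scatter(x, y);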

With this data in place, we can use the recipe outlined earlier. Let’s walk through the process:

1. choose a class of model
In Scikit-Learn, every class of model is represented by a Python class. So, for example, if we would like to compute a simple linear regression model, we can import the linear regression class:
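
The import isn't shown here as text; it would simply be:

from sklearn.linear_model import LinearRegression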

Note that other more general linear regression models exist as well; you can read more about them in the sklearn.linear_model module documentation.

2. choose model hyperparameters
An important point is that a class of model is not the same as an instance of a model.

Once we have decided on our model class, there are still some options open to us. Depending on the model class we are working with, we might need to answer one or more questions like the following:

  • Would we like to fit for the offset (i.e., y-intercept)?
  • Would we like the model to be normalized?
  • Would we like to preprocess our features to add model flexibility?
  • What degree of regularization would we like to use in our model?
  • How many model components would we like to use?

These are examples of the important choices that must be made once the model class is selected. These choices are often represented as hyperparameters, or parameters that must be set before the model is fit to data. In Scikit-Learn, hyperparameters are chosen by passing values at model instantiation. We will explore how you can quantitatively motivate the choice of hyperparameters in Hyperparameters and Model Validation [coming soon].

For our linear regression example, we can instantiate the LinearRegression class and specify that we would like to fit the intercept using the fit_intercept hyperparameter:
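
The instantiation isn't reproduced here either; continuing my sketch above it would look like this:

model = LinearRegression(fit_intercept=True)   # only stores the hyperparameter, nothing is computed yet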

Keep in mind that when the model is instantiated, the only action is the storing of these hyperparameter values. In particular, we have not yet applied the model to any data: the Scikit-Learn API makes very clear the distinction between choice of model and application of model to data.

3. arrange the data into a features matrix and a target vector
Previously [previous post] we detailed the Scikit-Learn data representation, which requires a two-dimensional features matrix and a one-dimensional target array. Here our target variable y is already in the correct form (a length-n_samples array), but we need to massage the data x to make it a matrix of size [n_samples, n_features]. In this case, this amounts to a simple reshaping of the one-dimensional array:
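
The snippet isn't shown here; with the toy data sketched above the reshaping would be:

X = x[:, np.newaxis]   # shape (50, 1), i.e. [n_samples, n_features]
print(X.shape)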

4. fit the model to the data
Now it is time to apply our model to data. This can be done with the fit() method of the model:
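
Continuing the sketch:

model.fit(X, y)   # learns coef_ and intercept_ from X and y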

This fit() command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example in this linear model, we have the following:
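
The snippet isn't reproduced in the post; for LinearRegression the learned parameters are coef_ and intercept_ (standard Scikit-Learn attribute names):

print(model.coef_)        # slope, should come out close to 2
print(model.intercept_)   # intercept, should come out close to -1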

These two parameters represent the slope and intercept of the simple linear fit to the data. Comparing to the data definition, we see that they are very close to the input slope of 2 and intercept of -1.

One question that frequently comes up regards the uncertainty in such internal model parameters. In general, Scikit-Learn does not provide tools to draw conclusions from internal model parameters themselves: interpreting model parameters is much more a statistical modeling question than a machine learning question. Machine learning rather focuses on what the model predicts. If you would like to dive into the meaning of fit parameters within the model, other tools are available, including the Statsmodels Python package.

5. predict labels for unknown data
Once the model is trained, the main task of supervised machine learning is to evaluate it based on what it says about new data that was not part of the training set. In Scikit-Learn, this can be done using the predict() method. For the sake of this example, our “new data” will be a grid of x values, and we will ask what y values the model predicts:
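
As a sketch, the grid of new x values could be something like:

xfit = np.linspace(-1, 11)   # the exact range is an assumption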

As before, we need to coerce these x values into a [n_samples, n_features] features matrix, after which we can feed it to the model:
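
Continuing the sketch:

Xfit = xfit[:, np.newaxis]   # coerce into [n_samples, n_features] form
yfit = model.predict(Xfit)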

Finally, let’s visualize the results by plotting first the raw data, and then this model fit:
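
Still as a sketch:

plt.scatter(x, y)       # the raw data
plt.plot(xfit, yfit);   # the model fit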

Typically the efficacy of the model is evaluated by comparing its results to some known baseline, as we will see in the next example.

A break, but then we'll continue 😊

:mrgreen:

SICP – chapter 2 – Sequences as conventional interfaces – 42 – exercises

Continuing from here, copying here.

Exercise 2.33: Fill in the missing expressions to complete the following definitions of some basic list-manipulation operations as accumulations:

(define (map p sequence)
  (accumulate (lambda (x y) ⟨??⟩) 
              nil sequence))

(define (append seq1 seq2)
  (accumulate cons ⟨??⟩ ⟨??⟩))

(define (length sequence)
  (accumulate ⟨??⟩ 0 sequence))

accumulate was defined in the previous post, like this

(define (accumulate op initial sequence)
  (if (null? sequence)
      initial
      (op (car sequence)
          (accumulate op 
                      initial 
                      (cdr sequence)))))

and it can be used like this
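
The run isn't reproduced here as text; the book's usual examples give the idea:

(accumulate + 0 (list 1 2 3 4 5))       ; => 15
(accumulate * 1 (list 1 2 3 4 5))       ; => 120
(accumulate cons nil (list 1 2 3 4 5))  ; => (1 2 3 4 5)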

so it's exactly what we need

map
it takes a procedure to apply to every element of the list passed as the second parameter, as in the third of the accumulate examples above.
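
So, my fill-in (the same as Bill's):

(define (map p sequence)
  (accumulate (lambda (x y) (cons (p x) y))
              nil sequence))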

append
just as you'd immediately guess, the cons of the two sequences is enough. Actually no, there's a quirk (as Bill explains well): the order of the sequences has to be swapped.
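
So, my fill-in:

(define (append seq1 seq2)
  (accumulate cons seq2 seq1))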

length
it looked easy, and yet (never trust the profs) accumulate operates on the elements of the sequence. I gave up –I confess– and copied Bill: How do we get it to just return the length of the sequence? We can do that by simply giving it an operation that will ignore each element of the sequence, and just increment the accumulated value.
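
Which, as a fill-in, becomes:

(define (length sequence)
  (accumulate (lambda (x y) (+ 1 y)) 0 sequence))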

My reference nerds: Bill the Lizard, sicp-ex and Drewiki.
length is the same for everyone; who knows, maybe they're entangled? 😉

A note to self: with Racket version 6.9 in interactive mode the xrepl library is loaded automatically, so the alias I was using becomes a bit redundant, but habit…

:mrgreen:

Seen on the Web – 289

While waiting for the new vouchers –completely different from the old ones 😜, waaay sexier– here's what I've seen on the Web 🤖


JSON Path
#:programming, codice, snippet
::: The Ubuntu Incident

Stealing Windows Credentials Using Google Chrome
#:sicurezza, spionaggio, virus
::: Slashdot

EU Passes ‘Content Portability’ Rules Banning Geofencing
#:Web, Internet
::: Slashdot

New SMB Worm Uses Seven NSA Hacking Tools. WannaCry Used Just Two
#:sicurezza, spionaggio, virus
::: Slashdot

Il co-fondatore di Twitter: “Se Trump ha vinto grazie al nostro social, mi dispiace e mi scuso”
#:social media
::: la Stampa ::: manteblog

Windows 7, not XP, was the reason last week’s WCry worm spread so widely
#:sicurezza, spionaggio, virus
::: dangoodin001

65 years ago today IBM entered the computer business w/the 701
#:storia
::: MIT_CSAIL

Gli azionisti di Twitter mettono al voto il progetto di trasformare la società in una cooperativa
#:economia
::: mazzettam

Garry Kasparov on how much computer programs have to learn
#:artificial intelligence
::: wallingf

Rivelate le policy di moderazione di Facebook. Ci sarà parecchio da discutere
#:social media
::: mazzettam

Vint Cerf Reflects On The Last 60 Years
#:protagonisti
::: Slashdot

Did China Hack The CIA In A Massive Intelligence Breach From 2010 To 2012?
#:sicurezza, spionaggio, virus
::: Slashdot

Why The US Government Open Sources Its Code
#:free open source software
::: Slashdot

Julian Assange Still Faces Legal Jeopardy In Three Countries
#:sicurezza, spionaggio, virus
::: Slashdot

Google in, Google out
Big G has conquered the world
#:ditte
::: TechCrunch

A great article on Recursion’s work
#:artificial intelligence
::: RecursionChris

Non è vero che mancano competenze digitali in Italia, il problema è delle aziende
#:economia
::: aldoceccarelli

Report: Ford’s CEO will be replaced with the head of its autonomous vehicle subsidiary
#:innovazioni, futuro
::: TechCrunch

When universities sell patents to trolls, everyone loses
#:copyright e brevetti
::: EFF

Linux utils that you might not know
#:tools, componenti software
::: Donearm

Scoop: whoever’s spamming the @FCC w fake anti-#netneutrality comments is using their API which means agency knows who it is but won’t say
#:Web, Internet
::: evan_greer

Pensieri sul nuovo ddl intercettazioni: WannaCry
#:sicurezza, spionaggio, virus #:politica
::: CBlengio

A Basic language inspired LISP dialect!? Heresy! Heresy!
#:lisp(s)
::: kuzrob

As was probably inevitable, I missed one viz library in my #PyCon2017 talk: a ggplot-inspired matplotlib wrapper
#:linguaggi di programmazione
::: jakevdp

Tech adoption skyrockets among older adults
#:Web, Internet
::: ScottHaber_

Il DDL @AndreaOrlandosp prevede l’uso dei trojan nelle indagini penali. Una delle cose più inquietanti mai viste
#:sicurezza, spionaggio, virus
::: gditom

The Supreme Court Is Cracking Down on Patent Trolls
#:copyright e brevetti
::: Slashdot

Pittsburgh Is Falling Out of Love With Uber’s Self-Driving Cars
#:innovazioni, futuro
::: Slashdot

Ethereum Could Be Worth More Than Bitcoin Very Soon
#:Web, Internet #:economia
::: Slashdot

Researchers find computer code that Volkswagen used to cheat emissions tests
#:frodi
::: thorstenholz

There was a manual and a tutorial. From Dennis Ritchie’s home page
#:storia
::: landley

An introduction to Libral, a systems management library for Linux
#:programming, codice, snippet
::: lucaciavatta

I Premi Turing: John Warner Backus
#:storia
::: Mr. Palomar

Vectr Graphics App Lands in Ubuntu Software Store
#:tools, componenti software
::: dcavedon

Counting exactly the number of distinct elements: sorted arrays vs. hash sets?
#:programming, codice, snippet
::: Daniel Lemire

The Hackett rewrite is now merged to master, and it’s available to try for the especially adventurous
#:lisp(s)
::: lexi_lambda

New post: “Stack Overflow: Helping One Million Developers Exit Vim”
I don't know how true that is; I'd be tempted to just :q! right away; or is it better –more prudent– to wq 😯
#:tip, suggerimenti
::: drob ::: lunaryorn

#prolog a few days ago Alain Colmerauer passed away
#:protagonisti
::: RainerJoswig

Wikimedia Is Clear To Sue the NSA Over Its Use of Warrantless Surveillance Tools
#:sicurezza, spionaggio, virus
::: Slashdot

China Censored Google’s AlphaGo Match Against World’s Best Go Player
#:artificial intelligence #:sicurezza, spionaggio, virus
::: Slashdot

DEFCON Conference To Target Voting Machines
#:sicurezza, spionaggio, virus
::: Slashdot

Want a Raspberry Pi-powered PC? This $50 case turns the Pi into a desktop
I'd almost-almost give in to the temptation 😎 😜
#:hardware
::: dcavedon

Blogging, now and ever
#:Web, Internet
::: Grab the Blaster

Apple Wants To Turn Community College Students Into App Developers
#:ditte
::: Slashdot

The Trump Administration Wants To Be Able To Track and Hack Your Drone
#:sicurezza, spionaggio, virus
::: Slashdot

Gianfranco Bo

Windows Switch To Git Almost Complete: 8,500 Commits and 1,760 Builds Each Day
#:programming, codice, snippet
::: Slashdot ::: nicolaiarocci

Robot Police Officer Goes On Duty In Dubai
#:innovazioni, futuro
::: Slashdot

Manchester Attack Could Lead To Internet Crackdown
#:sicurezza, spionaggio, virus
::: Slashdot

QtCreator 4.3 released
::: meetingcpp

Writing modern JavaScript code
#:linguaggi di programmazione
::: ThePracticalDev

Sketchpad (1962), the first graphic CAD software, one of the most influential programs ever written
#:storia
::: computertales

Windows 10 S Won’t Let Users Run Linux Distros
#:sistemi operativi
::: dcavedon

Chrome Won
#:Web, Internet
::: Donearm

Read Mark Zuckerberg’s full commencement address at Harvard
the next president, maybe not just of the USA
#:protagonisti
::: fabiochiusi

Today we are proud and very excited to announce the stable release of Devuan 1.0.0 Jessie LTS
#:sistemi operativi
::: DevuanOrg

My tech stack if I had to build an app today
#:Web, Internet
::: ThePracticalDev

Typescript Unit Test for Web Applications
#:linguaggi di programmazione #:Web, Internet
::: thek3nger

For Modern Astronomers, It’s Learn to Code or Get Left Behind
a nice intro to programming with lots of stuff inside
#:programming, codice, snippet
::: iva_momcheva

La Microsoft dipendenza colpisce la UE?
#:sistemi operativi #:ditte
::: dcavedon

Science and Technology links
#:elenco links
::: Daniel Lemire

A decentralized web would give power back to the people online
#:Web, Internet
::: TechCrunch

A New Amiga Arrives On the Scene — the A-EON Amiga X5000
#:sistemi operativi #:hardware
::: Slashdot

JavaScript 53 – regular expressions – 5

Continuing from here, copying here.

The replace method
String values have a replace method, which can be used to replace part of the string with another string (file re19.js).

console.log("papa".replace("p", "m"));

The first argument can also be a regular expression, in which case the first match of the regular expression is replaced. When a g option (for global) is added to the regular expression, all matches in the string will be replaced, not just the first (re20.js).

console.log("Borobudur".replace(/[ou]/, "a"));
console.log("Borobudur".replace(/[ou]/g, "a"));

It would have been sensible if the choice between replacing one match or all matches was made through an additional argument to replace or by providing a different method, replaceAll. But for some unfortunate reason, the choice relies on a property of the regular expression instead.

Regular expressions have a long history and have been part of programming languages (almost) from the very beginning. That, in my opinion, is what justifies options like g.

The real power of using regular expressions with replace comes from the fact that we can refer back to matched groups in the replacement string. For example, say we have a big string containing the names of people, one name per line, in the format Lastname, Firstname. If we want to swap these names and remove the comma to get a simple Firstname Lastname format, we can use the following code (re21.js):

console.log(
  "Hopper, Grace\nMcCarthy, John\nRitchie, Dennis"
    .replace(/([\w ]+), ([\w ]+)/g, "$2 $1"));
// → Grace Hopper
//   John McCarthy
//   Dennis Ritchie

The $1 and $2 in the replacement string refer to the parenthesized groups in the pattern. $1 is replaced by the text that matched against the first group, $2 by the second, and so on, up to $9. The whole match can be referred to with $&.

It is also possible to pass a function, rather than a string, as the second argument to replace. For each replacement, the function will be called with the matched groups (as well as the whole match) as arguments, and its return value will be inserted into the new string.

Here’s a simple example (re22.js):

var s = "the cia and fbi";
console.log(s.replace(/\b(fbi|cia)\b/g, function(str) {
  return str.toUpperCase();
}));
// → the CIA and FBI

And here’s a more interesting one (re23.js):

var stock = "1 lemon, 2 cabbages, and 101 eggs";
function minusOne(match, amount, unit) {
  amount = Number(amount) - 1;
  if (amount == 1) // only one left, remove the 's'
    unit = unit.slice(0, unit.length - 1);
  else if (amount == 0)
    amount = "no";
  return amount + " " + unit;
}
console.log(stock.replace(/(\d+) (\w+)/g, minusOne));
// → no lemon, 1 cabbage, and 100 eggs

This takes a string, finds all occurrences of a number followed by an alphanumeric word, and returns a string wherein every such occurrence is decremented by one.

The (\d+) group ends up as the amount argument to the function, and the (\w+) group gets bound to unit. The function converts amount to a number—which always works, since it matched \d+ —and makes some adjustments in case there is only one or zero left.

A break 😯 it's still going to be long.

:mrgreen:

NumPy – 92 – introduction to Scikit-Learn – 1

Continuing from here, copying here.

There are several Python libraries which provide solid implementations of a range of machine learning algorithms. One of the best known is Scikit-Learn, a package that provides efficient versions of a large number of common algorithms. Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation. A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.

This section provides an overview of the Scikit-Learn API; a solid understanding of these API elements will form the foundation for understanding the deeper practical discussion of machine learning algorithms and approaches in the following  chapters [posts].

We will start by covering data representation in Scikit-Learn, followed by covering the Estimator API, and finally go through a more interesting example of using these tools for exploring a set of images of hand-written digits.

Data representation in Scikit-Learn
Machine learning is about creating models from data: for that reason, we’ll start by discussing how data can be represented in order to be understood by the computer. The best way to think about data within Scikit-Learn is in terms of tables of data.

data as tables
A basic table is a two-dimensional grid of data, in which the rows represent individual elements of the dataset, and the columns represent quantities related to each of these elements. For example, consider the Iris dataset, famously analyzed by Ronald Fisher in 1936. We can download this dataset in the form of a Pandas DataFrame using the seaborn library:
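
The call isn't reproduced in the post as text; it is simply this (the same call appears again in the code collected at the end of the post):

import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()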


Here each row of the data refers to a single observed flower, and the number of rows is the total number of flowers in the dataset. In general, we will refer to the rows of the matrix as samples, and the number of rows as n_samples.

Likewise, each column of the data refers to a particular quantitative piece of information that describes each sample. In general, we will refer to the columns of the matrix as features, and the number of columns as n_features.

the features matrix
This table layout makes clear that the information can be thought of as a two-dimensional numerical array or matrix, which we will call the features matrix. By convention, this features matrix is often stored in a variable named X. The features matrix is assumed to be two-dimensional, with shape [n_samples, n_features], and is most often contained in a NumPy array or a Pandas DataFrame, though some Scikit-Learn models also accept SciPy sparse matrices.

The samples (i.e., rows) always refer to the individual objects described by the dataset. For example, the sample might be a flower, a person, a document, an image, a sound file, a video, an astronomical object, or anything else you can describe with a set of quantitative measurements.

The features (i.e., columns) always refer to the distinct observations that describe each sample in a quantitative manner. Features are generally real-valued, but may be Boolean or discrete-valued in some cases.

the target array
In addition to the feature matrix X, we also generally work with a label or target array, which by convention we will usually call y. The target array is usually one dimensional, with length n_samples, and is generally contained in a NumPy array or Pandas Series. The target array may have continuous numerical values, or discrete classes/labels. While some Scikit-Learn estimators do handle multiple target values in the form of a two-dimensional, [n_samples, n_targets] target array, we will primarily be working with the common case of a one-dimensional target array.

Often one point of confusion is how the target array differs from the other features columns. The distinguishing feature of the target array is that it is usually the quantity we want to predict from the data: in statistical terms, it is the dependent variable. For example, in the preceding data we may wish to construct a model that can predict the species of flower based on the other measurements; in this case, the species column would be considered the target array.

With this target array in mind, we can use Seaborn (see Visualization With Seaborn [here]) to conveniently visualize the data:

For use in Scikit-Learn, we will extract the features matrix and target array from the DataFrame, which we can do using some of the Pandas DataFrame operations discussed in Chapter 3 [in the posts starting from this one]:
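
The extraction isn't shown here as text; a sketch of it, using the iris DataFrame loaded above:

X_iris = iris.drop('species', axis=1)
y_iris = iris['species']
print(X_iris.shape)   # should be (150, 4)
print(y_iris.shape)   # should be (150,)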

To summarize, the expected layout of features and target values is visualized in the following diagram:

import seaborn as sns; sns.set()
iris = sns.load_dataset('iris')
sns.pairplot(iris, hue='species', size=1.5);

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(6, 4))
ax = fig.add_axes([0, 0, 1, 1])
ax.axis('off')
ax.axis('equal')

# Draw features matrix
ax.vlines(range(6), ymin=0, ymax=9, lw=1)
ax.hlines(range(10), xmin=0, xmax=5, lw=1)
font_prop = dict(size=12, family='monospace')
ax.text(-1, -1, "Feature Matrix ($X$)", size=14)
ax.text(0.1, -0.3, r'n_features $\longrightarrow$', **font_prop)
ax.text(-0.1, 0.1, r'$\longleftarrow$ n_samples', rotation=90,
        va='top', ha='right', **font_prop)

# Draw labels vector
ax.vlines(range(8, 10), ymin=0, ymax=9, lw=1)
ax.hlines(range(10), xmin=8, xmax=9, lw=1)
ax.text(7, -1, "Target Vector ($y$)", size=14)
ax.text(7.9, 0.1, r'$\longleftarrow$ n_samples', rotation=90,
        va='top', ha='right', **font_prop)

ax.set_ylim(10, -2)

fig.savefig('np883.png')

With this data properly formatted, we can move on to consider the estimator API of Scikit-Learn, in the next post 😁

:mrgreen:

JavaScript 52 – regular expressions – 4

Continuing from here, copying here.

Word and string boundaries
Unfortunately, findDate [previous post] will also happily extract the nonsensical date 00-1-3000 from the string "100-1-30000". A match may happen anywhere in the string, so in this case, it’ll just start at the second character and end at the second-to-last character.

If we want to enforce that the match must span the whole string, we can add the markers ^ and $. The caret matches the start of the input string, while the dollar sign matches the end. So, /^\d+$/ matches a string consisting entirely of one or more digits, /^!/ matches any string that starts with an exclamation mark, and /x^/ does not match any string (there cannot be an x before the start of the string).
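
A quick check of those anchors (my own example, not from the book):

console.log(/^\d+$/.test("12345"));
// → true
console.log(/^\d+$/.test("12 345"));
// → false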

If, on the other hand, we just want to make sure the date starts and ends on a word boundary, we can use the marker \b. A word boundary can be the start or end of the string or any point in the string that has a word character (as in \w) on one side and a nonword character on the other (file re17.js).

console.log(/cat/.test("concatenate"));
// → true
console.log(/\bcat\b/.test("concatenate"));
// → false

Note that a boundary marker doesn’t represent an actual character. It just enforces that the regular expression matches only when a certain condition holds at the place where it appears in the pattern.

Choice patterns
Say we want to know whether a piece of text contains not only a number but a number followed by one of the words pig, cow, or chicken, or any of their plural forms.

We could write three regular expressions and test them in turn, but there is a nicer way. The pipe character (|) denotes a choice between the pattern to its left and the pattern to its right. So I can say this (re18.js):

var animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
console.log(animalCount.test("15 pigs"));
console.log(animalCount.test("15 pigchickens"));

everybody knows that pigchickens don't exist!

Parentheses can be used to limit the part of the pattern that the pipe operator applies to, and you can put multiple such operators next to each other to express a choice between more than two patterns.

My problem: this way the post stays short, but the next topic would make it decidedly too long. So, a break 😜

:mrgreen:

NumPy – 91 – what machine learning is – 3

Continuing from here, copying here.

clustering: finding labels from the data
The classification and regression illustrations we just looked at are examples of supervised learning algorithms, in which we are trying to build a model that will predict labels for new data. Unsupervised learning involves models that describe data without reference to any known labels.

One common case of unsupervised learning is “clustering,” in which data is automatically assigned to some number of discrete groups. For example, we might have some two-dimensional data like that shown in the following figure:

import matplotlib.pyplot as plt

# common plot formatting
def format_plot(ax, title):
    ax.xaxis.set_major_formatter(plt.NullFormatter())
    ax.yaxis.set_major_formatter(plt.NullFormatter())
    ax.set_xlabel('feature 1', color='gray')
    ax.set_ylabel('feature 2', color='gray')
    ax.set_title(title, color='gray')

from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import KMeans

# create 50 separable points
X, y = make_blobs(n_samples=100, centers=4,
                  random_state=42, cluster_std=1.5)

# Fit the K Means model
model = KMeans(4, random_state=0)
y = model.fit_predict(X)


# plot the input data
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X[:, 0], X[:, 1], s=50, color='gray')

# format the plot
format_plot(ax, 'Input Data')

fig.savefig('np875.png')

By eye, it is clear that each of these points is part of a distinct group. Given this input, a clustering model will use the intrinsic structure of the data to determine which points are related. Using the very fast and intuitive k-means algorithm (see In Depth: K-Means Clustering [coming soon]), we find the clusters shown in the following figure:

import matplotlib.pyplot as plt

# common plot formatting
def format_plot(ax, title):
    ax.xaxis.set_major_formatter(plt.NullFormatter())
    ax.yaxis.set_major_formatter(plt.NullFormatter())
    ax.set_xlabel('feature 1', color='gray')
    ax.set_ylabel('feature 2', color='gray')
    ax.set_title(title, color='gray')

from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import KMeans

# create 50 separable points
X, y = make_blobs(n_samples=100, centers=4,
                  random_state=42, cluster_std=1.5)

# Fit the K Means model
model = KMeans(4, random_state=0)
y = model.fit_predict(X)

# plot the data with cluster labels
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X[:, 0], X[:, 1], s=50, c=y, cmap='viridis')

# format the plot
format_plot(ax, 'Learned Cluster Labels')

fig.savefig('np876.png')

k-means fits a model consisting of k cluster centers; the optimal centers are assumed to be those that minimize the distance of each point from its assigned center. Again, this might seem like a trivial exercise in two dimensions, but as our data becomes larger and more complex, such clustering algorithms can be employed to extract useful information from the dataset.

We will discuss the k-means algorithm in more depth in In Depth: K-Means Clustering [coming soon]. Other important clustering algorithms include Gaussian mixture models (see In Depth: Gaussian Mixture Models [coming soon]) and spectral clustering (see Scikit-Learn’s clustering documentation).

Dimensionality reduction: inferring the structure of the data
Dimensionality reduction is another example of an unsupervised algorithm, in which labels or other information are inferred from the structure of the dataset itself. Dimensionality reduction is a bit more abstract than the examples we looked at before, but generally it seeks to pull out some low-dimensional representation of data that in some way preserves relevant qualities of the full dataset. Different dimensionality reduction routines measure these relevant qualities in different ways, as we will see in In-Depth: Manifold Learning [coming soon].

As an example of this, consider the data shown in the following figure:

import matplotlib.pyplot as plt

from sklearn.datasets import make_swiss_roll

# common plot formatting
def format_plot(ax, title):
    ax.xaxis.set_major_formatter(plt.NullFormatter())
    ax.yaxis.set_major_formatter(plt.NullFormatter())
    ax.set_xlabel('feature 1', color='gray')
    ax.set_ylabel('feature 2', color='gray')
    ax.set_title(title, color='gray')

# make data
X, y = make_swiss_roll(200, noise=0.5, random_state=42)
X = X[:, [0, 2]]

# visualize data
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], color='gray', s=30)

# format the plot
format_plot(ax, 'Input Data')

fig.savefig('np877.png')

Visually, it is clear that there is some structure in this data: it is drawn from a one-dimensional line that is arranged in a spiral within this two-dimensional space. In a sense, you could say that this data is “intrinsically” only one dimensional, though this one-dimensional data is embedded in higher-dimensional space. A suitable dimensionality reduction model in this case would be sensitive to this nonlinear embedded structure, and be able to pull out this lower-dimensionality representation.

The following figure shows a visualization of the results of the Isomap algorithm, a manifold learning algorithm that does exactly this:

import matplotlib.pyplot as plt

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# common plot formatting
def format_plot(ax, title):
    ax.xaxis.set_major_formatter(plt.NullFormatter())
    ax.yaxis.set_major_formatter(plt.NullFormatter())
    ax.set_xlabel('feature 1', color='gray')
    ax.set_ylabel('feature 2', color='gray')
    ax.set_title(title, color='gray')

# make data
X, y = make_swiss_roll(200, noise=0.5, random_state=42)
X = X[:, [0, 2]]

model = Isomap(n_neighbors=8, n_components=1)
y_fit = model.fit_transform(X).ravel()

# visualize data
fig, ax = plt.subplots()
pts = ax.scatter(X[:, 0], X[:, 1], c=y_fit, cmap='viridis', s=30)
cb = fig.colorbar(pts, ax=ax)

# format the plot
format_plot(ax, 'Learned Latent Parameter')
cb.set_ticks([])
cb.set_label('Latent Variable', color='gray')

fig.savefig('np878.png')

Notice that the colors (which represent the extracted one-dimensional latent variable) change uniformly along the spiral, which indicates that the algorithm did in fact detect the structure we saw by eye. As with the previous examples, the power of dimensionality reduction algorithms becomes clearer in higher-dimensional cases. For example, we might wish to visualize important relationships within a dataset that has 100 or 1,000 features. Visualizing 1,000-dimensional data is a challenge, and one way we can make this more manageable is to use a dimensionality reduction technique to reduce the data to two or three dimensions.

Some important dimensionality reduction algorithms that we will discuss are principal component analysis (see In Depth: Principal Component Analysis [coming soon]) and various manifold learning algorithms, including Isomap and locally linear embedding (see In-Depth: Manifold Learning [coming soon]).

:mrgreen:

SICP – chapter 2 – Sequence operations – 41

Continuing from here, copying here.

The key to organizing programs so as to more clearly reflect the signal-flow structure is to concentrate on the “signals” that flow from one stage in the process to the next. If we represent these signals as lists, then we can use list operations to implement the processing at each of the stages. For instance, we can implement the mapping stages of the signal-flow diagrams using the map procedure from 2.2.1 [here]:

Filtering a sequence to select only those elements that satisfy a given predicate is accomplished by

(define (filter predicate sequence)
  (cond ((null? sequence) nil)
        ((predicate (car sequence))
         (cons (car sequence)
               (filter predicate 
                       (cdr sequence))))
        (else  (filter predicate 
                       (cdr sequence)))))

For example (the run isn't reproduced here as text; the book's example is):
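
(filter odd? (list 1 2 3 4 5))
; => (1 3 5)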

I used Racket's built-in filter, but don't tell anyone, eh!

Accumulations can be implemented by

(define (accumulate op initial sequence)
  (if (null? sequence)
      initial
      (op (car sequence)
          (accumulate op 
                      initial 
                      (cdr sequence)))))
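
In the book the definition is followed by the usual examples (not reproduced here as text):

(accumulate + 0 (list 1 2 3 4 5))       ; => 15
(accumulate * 1 (list 1 2 3 4 5))       ; => 120
(accumulate cons nil (list 1 2 3 4 5))  ; => (1 2 3 4 5)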

The usual nil; with Racket I should remember to use null.

All that remains to implement signal-flow diagrams is to enumerate the sequence of elements to be processed. For even-fibs, we need to generate the sequence of integers in a given range, which we can do as follows:

(define (enumerate-interval low high)
  (if (> low high)
      nil
      (cons low 
            (enumerate-interval 
             (+ low 1) 
             high))))

To enumerate the leaves of a tree, we can use (Note: This is, in fact, precisely the fringe procedure from Exercise 2.28 [here]. Here we’ve renamed it to emphasize that it is part of a family of general sequence-manipulation procedures.):

(define (enumerate-tree tree)
  (cond ((null? tree) nil)
        ((not (pair? tree)) (list tree))
        (else (append 
               (enumerate-tree (car tree))
               (enumerate-tree (cdr tree))))))

Now we can reformulate sum-odd-squares and even-fibs [previous post] as in the signal-flow diagrams. For sum-odd-squares, we enumerate the sequence of leaves of the tree, filter this to keep only the odd numbers in the sequence, square each element, and sum the results:

(define (sum-odd-squares tree)
  (accumulate 
   +
   0
   (map square
        (filter odd?
                (enumerate-tree tree)))))

For even-fibs, we enumerate the integers from 0 to n, generate the Fibonacci number for each of these integers, filter the resulting sequence to keep only the even elements, and accumulate the results into a list:

(define (even-fibs n)
  (accumulate 
   cons
   nil
   (filter even?
           (map fib
                (enumerate-interval 0 n)))))

The value of expressing programs as sequence operations is that this helps us make program designs that are modular, that is, designs that are constructed by combining relatively independent pieces. We can encourage modular design by providing a library of standard components together with a conventional interface for connecting the components in flexible ways.

Modular construction is a powerful strategy for controlling complexity in engineering design. In real signal-processing applications, for example, designers regularly build systems by cascading elements selected from standardized families of filters and transducers. Similarly, sequence operations provide a library of standard program elements that we can mix and match. For instance, we can reuse pieces from the sum-odd-squares and even-fibs procedures in a program that constructs a list of the squares of the first n + 1 Fibonacci numbers:

(define (list-fib-squares n)
  (accumulate 
   cons
   nil
   (map square
        (map fib
             (enumerate-interval 0 n)))))

We can rearrange the pieces and use them in computing the product of the squares of the odd integers in a sequence:

(define 
  (product-of-squares-of-odd-elements
   sequence)
  (accumulate 
   *
   1
   (map square (filter odd? sequence))))

We can also formulate conventional data-processing applications in terms of sequence operations. Suppose we have a sequence of personnel records and we want to find the salary of the highest-paid programmer. Assume that we have a selector salary that returns the salary of a record, and a predicate programmer? that tests if a record is for a programmer. Then we can write

(define 
  (salary-of-highest-paid-programmer
   records)
  (accumulate 
   max
   0
   (map salary
        (filter programmer? records))))

These examples give just a hint of the vast range of operations that can be expressed as sequence operations.

Note: Richard Waters (1979) developed a program that automatically analyzes traditional Fortran programs, viewing them in terms of maps, filters, and accumulations. He found that fully 90 percent of the code in the Fortran Scientific Subroutine Package fits neatly into this paradigm. One of the reasons for the success of Lisp as a programming language is that lists provide a standard medium for expressing ordered collections so that they can be manipulated using higher-order operations. The programming language APL owes much of its power and appeal to a similar choice. In APL all data are represented as arrays, and there is a universal and convenient set of generic operators for all sorts of array operations.

Sequences, implemented here as lists, serve as a conventional interface that permits us to combine processing modules. Additionally, when we uniformly represent structures as sequences, we have localized the data-structure dependencies in our programs to a small number of sequence operations. By changing these, we can experiment with alternative representations of sequences, while leaving the overall design of our programs intact. We will exploit this capability in 3.5 [coming soon], when we generalize the sequence-processing paradigm to admit infinite sequences.

:mrgreen:

NumPy – 90 – what machine learning is – 2

Continuing from here, copying here.

regression: predicting continuous labels
In contrast with the discrete labels of a classification algorithm, we will next look at a simple regression task in which the labels are continuous quantities.

Consider the data shown in the following figure, which consists of a set of points each with a continuous label:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

# common plot formatting
def format_plot(ax, title):
    ax.xaxis.set_major_formatter(plt.NullFormatter())
    ax.yaxis.set_major_formatter(plt.NullFormatter())
    ax.set_xlabel('feature 1', color='gray')
    ax.set_ylabel('feature 2', color='gray')
    ax.set_title(title, color='gray')

# Create some data for the regression
rng = np.random.RandomState(1)

X = rng.randn(200, 2)
y = np.dot(X, [-2, 1]) + 0.1 * rng.randn(X.shape[0])

# fit the regression model
model = LinearRegression()
model.fit(X, y)

# create some new points to predict
X2 = rng.randn(100, 2)

# predict the labels
y2 = model.predict(X2)

# plot data points
fig, ax = plt.subplots()
points = ax.scatter(X[:, 0], X[:, 1], c=y, s=50,
                    cmap='viridis')

# format plot
format_plot(ax, 'Input Data')
ax.axis([-4, 4, -3, 3])

fig.savefig('np871.png')

As with the classification example, we have two-dimensional data: that is, there are two features describing each data point. The color of each point represents the continuous label for that point.

There are a number of possible regression models we might use for this type of data, but here we will use a simple linear regression to predict the points. This simple linear regression model assumes that if we treat the label as a third spatial dimension, we can fit a plane to the data. This is a higher-level generalization of the well-known problem of fitting a line to data with two coordinates.

We can visualize this setup as shown in the following figure:

import numpy as np
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d.art3d import Line3DCollection

# common plot formatting
def format_plot(ax, title):
    ax.xaxis.set_major_formatter(plt.NullFormatter())
    ax.yaxis.set_major_formatter(plt.NullFormatter())
    ax.set_xlabel('feature 1', color='gray')
    ax.set_ylabel('feature 2', color='gray')
    ax.set_title(title, color='gray')

# Create some data for the regression
rng = np.random.RandomState(1)

X = rng.randn(200, 2)
y = np.dot(X, [-2, 1]) + 0.1 * rng.randn(X.shape[0])

points = np.hstack([X, y[:, None]]).reshape(-1, 1, 3)
segments = np.hstack([points, points])
segments[:, 0, 2] = -8

# plot points in 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], y, c=y, s=35,
           cmap='viridis')
ax.add_collection3d(Line3DCollection(segments, colors='gray', alpha=0.2))
ax.scatter(X[:, 0], X[:, 1], -8 + np.zeros(X.shape[0]), c=y, s=10,
           cmap='viridis')

# format plot
ax.patch.set_facecolor('white')
ax.view_init(elev=20, azim=-70)
ax.set_zlim3d(-8, 8)
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.yaxis.set_major_formatter(plt.NullFormatter())
ax.zaxis.set_major_formatter(plt.NullFormatter())
ax.set(xlabel='feature 1', ylabel='feature 2', zlabel='label')

# Hide axes (is there a better way?)
ax.w_xaxis.line.set_visible(False)
ax.w_yaxis.line.set_visible(False)
ax.w_zaxis.line.set_visible(False)
for tick in ax.w_xaxis.get_ticklines():
    tick.set_visible(False)
for tick in ax.w_yaxis.get_ticklines():
    tick.set_visible(False)
for tick in ax.w_zaxis.get_ticklines():
    tick.set_visible(False)

fig.savefig('np872.png')

Notice that the feature 1–feature 2 plane here is the same as in the two-dimensional plot from before; in this case, however, we have represented the labels by both color and three-dimensional axis position. From this view, it seems reasonable that fitting a plane through this three-dimensional data would allow us to predict the expected label for any set of input parameters. Returning to the two-dimensional projection, when we fit such a plane we get the result shown in the following figure:

import numpy as np
import matplotlib.pyplot as plt

from matplotlib.collections import LineCollection
from sklearn.linear_model import LinearRegression

# common plot formatting
def format_plot(ax, title):
    ax.xaxis.set_major_formatter(plt.NullFormatter())
    ax.yaxis.set_major_formatter(plt.NullFormatter())
    ax.set_xlabel('feature 1', color='gray')
    ax.set_ylabel('feature 2', color='gray')
    ax.set_title(title, color='gray')

# Create some data for the regression
rng = np.random.RandomState(1)

X = rng.randn(200, 2)
y = np.dot(X, [-2, 1]) + 0.1 * rng.randn(X.shape[0])

# fit the regression model
model = LinearRegression()
model.fit(X, y)

points = np.hstack([X, y[:, None]]).reshape(-1, 1, 3)
segments = np.hstack([points, points])
segments[:, 0, 2] = -8

# plot data points
fig, ax = plt.subplots()
pts = ax.scatter(X[:, 0], X[:, 1], c=y, s=50,
                 cmap='viridis', zorder=2)

# compute and plot model color mesh
xx, yy = np.meshgrid(np.linspace(-4, 4),
                     np.linspace(-3, 3))
Xfit = np.vstack([xx.ravel(), yy.ravel()]).T
yfit = model.predict(Xfit)
zz = yfit.reshape(xx.shape)
ax.pcolorfast([-4, 4], [-3, 3], zz, alpha=0.5,
              cmap='viridis', norm=pts.norm, zorder=1)

# format plot
format_plot(ax, 'Input Data with Linear Fit')
ax.axis([-4, 4, -3, 3])

fig.savefig('np873.png')

This plane of fit gives us what we need to predict labels for new points. Visually, we find the results shown in the following figure:

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets.samples_generator import make_blobs
from sklearn.svm import SVC
from matplotlib.collections import LineCollection

# common plot formatting
def format_plot(ax, title):
    ax.xaxis.set_major_formatter(plt.NullFormatter())
    ax.set_xlabel('feature 1', color='gray')
    ax.set_ylabel('feature 2', color='gray')
    ax.set_title(title, color='gray')

# Create some data for the regression
rng = np.random.RandomState(1)

X = rng.randn(200, 2)
y = np.dot(X, [-2, 1]) + 0.1 * rng.randn(X.shape[0])

# fit the regression model
model = LinearRegression()
model.fit(X, y)

# create some new points to predict
X2 = rng.randn(100, 2)

# predict the labels
y2 = model.predict(X2)

# plot data points
fig, ax = plt.subplots()
pts = ax.scatter(X[:, 0], X[:, 1], c=y, s=50,
                 cmap='viridis', zorder=2)
# compute and plot model color mesh
xx, yy = np.meshgrid(np.linspace(-4, 4),
                     np.linspace(-3, 3))
Xfit = np.vstack([xx.ravel(), yy.ravel()]).T
yfit = model.predict(Xfit)
zz = yfit.reshape(xx.shape)
ax.pcolorfast([-4, 4], [-3, 3], zz, alpha=0.5,
              cmap='viridis', norm=pts.norm, zorder=1)
# format plot
format_plot(ax, 'Input Data with Linear Fit')
ax.axis([-4, 4, -3, 3])

# plot the model fit
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

ax[0].scatter(X2[:, 0], X2[:, 1], c='gray', s=50)
ax[0].axis([-4, 4, -3, 3])

ax[1].scatter(X2[:, 0], X2[:, 1], c=y2, s=50,
              cmap='viridis', norm=pts.norm)
ax[1].axis([-4, 4, -3, 3])

# format plots
format_plot(ax[0], 'Unknown Data')
format_plot(ax[1], 'Predicted Labels')

fig.savefig('np874.png')

As with the classification example, this may seem rather trivial in a low number of dimensions. But the power of these methods is that they can be straightforwardly applied and evaluated in the case of data with many, many features.

For example, this is similar to the task of computing the distance to galaxies observed through a telescope—in this case, we might use the following features and labels:

  • feature 1, feature 2, etc. → brightness of each galaxy at one of several wave lengths or colors
  • label → distance or redshift of the galaxy

The distances for a small number of these galaxies might be determined through an independent set of (typically more expensive) observations. Distances to remaining galaxies could then be estimated using a suitable regression model, without the need to employ the more expensive observation across the entire set. In astronomy circles, this is known as the “photometric redshift” problem.

Some important regression algorithms that we will discuss are linear regression (see [the upcoming posts]).

:mrgreen: