Seen on the Web – 284

As every Sunday, here is what I have seen on the Web.

Why C++ (is fun to me): template metaprogramming
#:linguaggi di programmazione
::: ThePracticalDev

The future of ad blocking
#:Web, Internet
::: Donearm

How to use Twitter Lite as a Desktop Twitter Client
#:social media
::: lucaciavatta

Techniques for Efficiently Learning Programming Languages
#:programming, codice, snippet
::: nonrecursive

The Future of Deep Learning
#:programming, codice, snippet
::: ThePracticalDev

Delegating to others the power to tell truth from fiction is a risky business
#:bufale
::: RobinGood

The 10 most popular programming languages since 2002
#:linguaggi di programmazione
::: MIT_CSAIL

Universe, a mobile-only website builder, lets you create pages in ‘under a minute’
#:tools, componenti software #:dispositivi mobili
::: valeriopel

Data Scientists and Software Engineering
yes, but not everyone knows it
#:programming, codice, snippet
::: ThePracticalDev

this “statistics for hackers” slide deck by @jakevdp is a really great intro to stats for programmers
#:manuali, how to
highly recommended; with Julia and Jake 😁
::: b0rk

Hollywood Is Losing the Battle Against Online Trolls
#:Web, Internet
::: Slashdot

How Instant Articles turned into another Facebook bait-and-switch
#:ditte
::: fabiochiusi ::: fabiochiusi

AI Can Predict Heart Attacks More Accurately Than Doctors
#:artificial intelligence
::: Slashdot

Why Twitter Is Failing to Grow
#:social media
::: Donearm

Grok the GIL: How to write fast and thread-safe Python
#:linguaggi di programmazione
::: lucaciavatta

Netflix Nears 100 Million Subscribers
#:media
::: Slashdot

Cybersecurity: 92% of Italians believe they are immune to hackers
#:sicurezza, spionaggio, virus
::: fabiochiusi

Kudos to Steve Ballmer’s wife for convincing him to do this
#:protagonisti
unlike in his previous life, at M$
::: gknauth

Training and re-training
Bill Gates Is Wrong: The Solution to AI Taking Jobs Is Training, Not Taxes
#:innovazioni, futuro
::: trekkinglemon

How @timberners_lee, the web’s inventor, is making sure it doesn’t become another vector for inequality
#:protagonisti #:Web, Internet
::: TheEconomist

Source? Which studies?
the old media will try anything
#:bufale
::: fabiochiusi ::: epicariello

A comparison matrix between Python data validation tools has been started by @funkyfuture
#:linguaggi di programmazione
::: nicolaiarocci

Now you can do `raco pkg install furtle` to use the library and draw stuffs like this with #racketlang
#:lisp(s)
::: sourav_datta

stuff still my favorite
#:linguaggi di programmazione
::: WebReflection

Ubuntu Is Switching to Wayland by Default
#:sistemi operativi
::: dcavedon

The new Nim website is now live! Check it out and tell us what you think
time permitting, this would deserve a closer look
#:linguaggi di programmazione
::: nim_lang

StarCraft Is Now Free, Nearly 20 Years After Its Release
#:games
::: Slashdot

The History of Computer RPGs (450 page preview)
#:storia #:games
::: Donearm

Facebook’s Perfect, Impossible Chatbot – Facebook is quietly trying to develop the most useful virtual assistant
#:artificial intelligence #:social media
::: jstorres

Meet The Man Who Makes Facebook’s Machines Think
#:artificial intelligence #:social media
::: fabiochiusi

Take heart, Italian arch-enemies of online anonymity, you have a new friend: Vladimir Putin
#:sicurezza, spionaggio, virus
::: fabiochiusi

Pirate Bay Founder Launches Anonymous Domain Registration Service
#:sicurezza, spionaggio, virus
::: Slashdot ::: timberners_lee

What to Expect From The $89 Pinebook Laptop
#:hardware
::: dcavedon

This is not the performance you were looking for
#:programming, codice, snippet
::: yminsky

some good ‘statistics for programmers’ resources
#:manuali, how to
::: b0rk

Best practice with retries with requests
#:programming, codice, snippet
::: Peter Bengtsson

A court controversially ruled that a mobile phone caused a tumour
#:dispositivi mobili
::: SAI

Which programming languages are used most late at night?
#:linguaggi di programmazione
::: MIT_CSAIL

Apple Forces Recyclers To Shred All iPhones and MacBooks
#:ditte
::: Slashdot

Canada Rules To Uphold Net Neutrality
#:Web, Internet
::: Slashdot

System76 is going to be designing and manufacturing their Linux-based hardware in-house
#:hardware
::: Linux News Site

The US Charging Julian Assange Could Put Press Freedom on Trial
#:sicurezza, spionaggio, virus
::: fabiochiusi ::: fabiochiusi ::: Snowden

To The Point Font
The *free* Point was initially created by handwritten pen and was then finalized into vector format for font usage
#:tools, componenti software
::: 1001 Fonts

NYU grad student goes undercover in Chinese iPhone factory and it ain’t pretty
#:economia
::: doctorow

Teenage Hackers Motivated By Morality Not Money, Study Finds
#:programming, codice, snippet
::: Slashdot

Apple Hires Top Google Satellite Executives For New Hardware Team
#:ditte
::: Slashdot

EFF Says Google Chromebooks Are Still Spying On Students
#:sicurezza, spionaggio, virus
::: Slashdot

The Guardian pulls out of Facebook’s Instant Articles and Apple News – Digiday
#:media
::: fabiochiusi

Is the Silicon Valley Dynasty Coming to an End?
#:economia
::: fabiochiusi

The cornerstones of democracy on the internet
#:Web, Internet
::: toholdaquill

BREAKING: Human increasingly replace robots
#:innovazioni, futuro
::: AntonioCasilli

Translated: making intermediaries liable for the content posted by users
#:censura
::: fabiochiusi

Do mobile phones (#cellulari) cause cancer?
adesso lo tardellazio 😜
#:innovazioni, futuro
::: RadioProzac

systemd-free Devuan Linux hits version 1.0.0
#:sistemi operativi
::: NoticiaLinux

WikiLeaks Releases New CIA Secret: Tapping Microphones On Some Samsung TVs
#:sicurezza, spionaggio, virus
::: Slashdot

JavaScript 29 – higher-order functions – 5

Continuing from here, copying from here.

More exercises today…

Mother-child age difference
Using the example data set from this chapter, compute the average age difference between mothers and children (the age of the mother when the child is born). You can use the average function defined earlier in this chapter.

Note that not all the mothers mentioned in the data are themselves present in the array. The byName object, which makes it easy to find a person’s object from their name, might be useful here.

function average(array) {
  function plus(a, b) { return a + b; }
  return array.reduce(plus) / array.length;
}

var byName = {};
ancestry.forEach(function(person) {
  byName[person.name] = person;
});

The problem lies in loading the data from the JSON file provided by Marijn and extracting what the computation needs. It can be done but it is not simple; alternatively, one can go here.
This is Marijn’s code.

function average(array) {  
  function plus(a, b) { 
    return a + b; 
  }
  return array.reduce(plus) / array.length;
}

var byName = {};
ancestry.forEach(function(person) {
  byName[person.name] = person;
});

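// Keep only the people whose mother is present in the data set,
// then compute the mother's age at the birth of each of them.
var differences = ancestry.filter(function(person) {
  return byName[person.mother] != null;
}).map(function(person) {
  return person.born - byName[person.mother].born;
});

console.log(average(differences));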

To run it with NodeJS the usual adjustments are needed, using the files available here.

I am more and more convinced that exercises of this kind are too specific, beyond the scope of the series, which is an introduction to JavaScript 😡

:mrgreen:

NumPy – 62 – working with time series – 3

Continuing from here, copying from here.

Frequencies and offsets
Fundamental to these Pandas time series tools is the concept of a frequency or date offset. Just as we saw the D (day) and H (hour) codes above [previous post], we can use such codes to specify any desired frequency spacing. The following table summarizes the main codes available:

Code Description
D    Calendar day
W    Weekly
M    Month end
Q    Quarter end
A    Year end
H    Hours
T    Minutes
S    Seconds
L    Milliseconds
U    Microseconds
N    Nanoseconds
B    Business day
BM   Business month end
BQ   Business quarter end
BA   Business year end
BH   Business hours

The monthly, quarterly, and annual frequencies are all marked at the end of the specified period. By adding an S suffix to any of these, they instead will be marked at the beginning:

Code Description
MS   Month start   
QS   Quarter start
AS   Year start   
BMS  Business month start
BQS  Business quarter start
BAS  Business year start

Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix:

Q-JAN, BQ-FEB, QS-MAR, BQS-APR, etc.

A-JAN, BA-FEB, AS-MAR, BAS-APR, etc.

In the same way, the split-point of the weekly frequency can be modified by adding a three-letter weekday code:

W-SUN, W-MON, W-TUE, W-WED, etc.

On top of this, codes can be combined with numbers to specify other frequencies. For example, for a frequency of 2 hours 30 minutes, we can combine the hour (H) and minute (T) codes as follows:
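The code that followed this sentence did not survive in the post. As a minimal sketch, assuming a recent pandas: the string form would combine the H and T codes from the table, but since recent pandas releases have renamed some of the letter codes, the sketch passes the step as a Timedelta, which pd.timedelta_range() also accepts as freq.

```python
import pandas as pd

# A step of 2 hours 30 minutes, written as a Timedelta rather than
# as a string alias (alias spellings vary between pandas versions).
step = pd.Timedelta(hours=2, minutes=30)

# Nine durations starting at zero and spaced by `step`.
durations = pd.timedelta_range(start=pd.Timedelta(0), periods=9, freq=step)
print(durations)
```

The last element is eight steps after the start, that is 20 hours.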

All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the pd.tseries.offsets module. For example, we can create a business day offset directly as follows:
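Here too the example is missing from the post; a possible sketch, using the BDay class from pandas.tseries.offsets. The start date is just an illustrative choice (a Wednesday), picked so that a weekend falls inside the range.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Five consecutive business days starting on Wednesday 2015-07-01;
# Saturday the 4th and Sunday the 5th are skipped.
business_days = pd.date_range("2015-07-01", periods=5, freq=BDay())
print(business_days)
```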

For more discussion of the use of frequencies and offsets, see the “DateOffset” section of the Pandas documentation.

:mrgreen:

JavaScript 28 – higher-order functions – 4

Continuing from here, copying from here.

I start with the exercises for the chapter.

Flattening
Use the reduce method in combination with the concat method to “flatten” an array of arrays into a single array that has all the elements of the input arrays.

var arrays = [[1, 2, 3], [4, 5], [6]];

My attempt, here is flat.js:

var arrays = [[1, 2, 3], [4, 5], [6]];

function flat(arr) {
  // Start from an empty array so that an empty input also works.
  return arr.reduce(function(a, b) {
    return a.concat(b);
  }, []);
}

console.log(flat(arrays));

:mrgreen:

NumPy – 61 – working with time series – 2

Continuing from here, copying from here.

Pandas Series indexed by time
Where the Pandas time series tools really become useful is when you begin to index data by timestamps. For example, we can construct a Series object that has time indexed data:

Now that we have this data in a Series, we can make use of any of the Series indexing patterns we discussed in previous sections, passing values that can be coerced into dates:

There are additional special date-only indexing operations, such as passing a year to obtain a slice of all data from that year:
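The three steps described so far (building a time-indexed Series, slicing it with date-like strings, selecting a whole year) lost their code in the post. A small sketch, with made-up dates and values chosen only for illustration:

```python
import pandas as pd

# Hypothetical data: four values indexed by timestamps.
index = pd.DatetimeIndex(["2014-07-04", "2014-08-04",
                          "2015-07-04", "2015-08-04"])
data = pd.Series([0, 1, 2, 3], index=index)

# Slice with strings that pandas coerces into dates (both ends included).
window = data.loc["2014-07-04":"2015-07-04"]

# Passing just a year selects every entry from that year.
year_2015 = data.loc["2015"]

print(window)
print(year_2015)
```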

Later, we will see additional examples of the convenience of dates-as-indices. But first, a closer look at the available time series data structures.

Pandas time series data structures
This section will introduce the fundamental Pandas data structures for working with time series data:

  • For time stamps, Pandas provides the Timestamp type. As mentioned before, it is essentially a replacement for Python’s native datetime, but is based on the more efficient numpy.datetime64 data type. The associated Index structure is DatetimeIndex.
  • For time Periods, Pandas provides the Period type. This encodes a fixed-frequency interval based on numpy.datetime64. The associated index structure is PeriodIndex.
  • For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a more efficient replacement for Python’s native datetime.timedelta type, and is based on numpy.timedelta64. The associated index structure is TimedeltaIndex.

The most fundamental of these date/time objects are the Timestamp and DatetimeIndex objects. While these class objects can be invoked directly, it is more common to use the pd.to_datetime() function, which can parse a wide variety of formats. Passing a single date to pd.to_datetime() yields a Timestamp; passing a series of dates by default yields a DatetimeIndex:
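A minimal sketch of the two cases (the dates are arbitrary; ISO-formatted strings are used on purpose, since how mixed formats in one list are handled depends on the pandas version):

```python
import pandas as pd

# A single date gives a Timestamp ...
single = pd.to_datetime("2015-07-04")

# ... while a list of dates gives a DatetimeIndex.
several = pd.to_datetime(["2015-07-03", "2015-07-04", "2015-07-06"])

print(type(single).__name__)
print(type(several).__name__)
```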

Note: is it just me, or did Jake not mention that he imported datetime?

Any DatetimeIndex can be converted to a PeriodIndex with the to_period() function with the addition of a frequency code; here we’ll use ‘D’ to indicate daily frequency:

A TimedeltaIndex is created, for example, when a date is subtracted from another:
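Both conversions can be sketched on the same small, arbitrary set of dates:

```python
import pandas as pd

dates = pd.to_datetime(["2015-07-03", "2015-07-04", "2015-07-06"])

# DatetimeIndex -> PeriodIndex with daily frequency.
periods = dates.to_period("D")

# Subtracting the first date from all of them gives a TimedeltaIndex.
deltas = dates - dates[0]

print(periods)
print(deltas)
```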

Regular sequences: pd.date_range()
To make the creation of regular date sequences more convenient, Pandas offers a few functions for this purpose: pd.date_range() for timestamps, pd.period_range() for periods, and pd.timedelta_range() for time deltas. We’ve seen that Python’s range() and NumPy’s np.arange() turn a startpoint, endpoint, and optional stepsize into a sequence. Similarly, pd.date_range() accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates. By default, the frequency is one day:

Alternatively, the date range can be specified not with a start and endpoint, but with a startpoint and a number of periods:

The spacing can be modified by altering the freq argument, which defaults to D. For example, here we will construct a range of hourly timestamps:
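The three variants described above, as a sketch with arbitrary dates. For the hourly case an Hour offset object is passed instead of the letter code, an assumption made here because the spelling of that code differs between pandas versions.

```python
import pandas as pd

# Start and end date, default daily frequency.
by_end = pd.date_range("2015-07-03", "2015-07-10")

# Start date and number of periods: the same eight days.
by_count = pd.date_range("2015-07-03", periods=8)

# Hourly spacing, given as an offset object.
hourly = pd.date_range("2015-07-03", periods=8, freq=pd.offsets.Hour())

print(by_end)
print(hourly)
```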

To create regular sequences of Period or Timedelta values, the very similar pd.period_range() and pd.timedelta_range() functions are useful. Here are some monthly periods:

And a sequence of durations increasing by an hour:
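A sketch of both functions; the starting month and the counts are illustrative, and the hourly step is again passed as an offset object rather than as a letter code:

```python
import pandas as pd

# Eight monthly periods starting from July 2015.
months = pd.period_range("2015-07", periods=8, freq="M")

# Ten durations, each one hour longer than the previous.
hours = pd.timedelta_range(start=pd.Timedelta(0), periods=10,
                           freq=pd.offsets.Hour())

print(months)
print(hours)
```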

All of these require an understanding of Pandas frequency codes, which we’ll summarize in the next section.

:mrgreen:

SICP – ch. 2 – Hierarchical structures – 31 – exercises

Continuing from here, copying from here.

Exercise 2.25: Give combinations of cars and cdrs that will pick 7 from each of the following lists:

(1 3 (5 7) 9)
((7))
(1 (2 (3 (4 (5 (6 7))))))

Hmm… disorienting 😯 that is, I can do it (somewhat by trial and error for the third one) but I am not sure it is what I am being asked; anyway, here it is
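The Scheme session itself is not in the post. Just to show the idea, here is a sketch in Python, with Python lists standing in for the Scheme lists and two hypothetical helpers, car and cdr, that mimic the Scheme primitives; it is not the original solution.

```python
def car(lst):
    """First element of a list, like Scheme's car."""
    return lst[0]

def cdr(lst):
    """Everything after the first element, like Scheme's cdr."""
    return lst[1:]

a = [1, 3, [5, 7], 9]
b = [[7]]
c = [1, [2, [3, [4, [5, [6, 7]]]]]]

pick_a = car(cdr(car(cdr(cdr(a)))))
pick_b = car(car(b))

# For the third list the same step, car of cdr, is applied six times.
pick_c = c
for _ in range(6):
    pick_c = car(cdr(pick_c))

print(pick_a, pick_b, pick_c)
```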

Now I rush to check my reference nerds.

Bill the Lizard, sicp-ex and Drewiki: OK, apart from the fact that I made the task easier by quoting the given lists globally. My problem is that I apply things that, for SICP, are still in the future; I am too old 😡

:mrgreen:

NumPy – 60 – working with time series – 1

Continuing from here, copying from here.

Pandas was developed in the context of financial modeling, so as you might expect, it contains a fairly extensive set of tools for working with dates, times, and time-indexed data. Date and time data comes in a few flavors, which we will discuss here:

Time stamps reference particular moments in time (e.g., July 4th, 2015 at 7:00am).

Time intervals and periods reference a length of time between a particular beginning and end point; for example, the year 2015. Periods usually reference a special case of time intervals in which each interval is of uniform length and does not overlap (e.g., 24 hour-long periods comprising days).

Time deltas or durations reference an exact length of time (e.g., a duration of 22.56 seconds).

In this section, we will introduce how to work with each of these types of date/time data in Pandas. This short section is by no means a complete guide to the time series tools available in Python or Pandas, but instead is intended as a broad overview of how you as a user should approach working with time series. We will start with a brief discussion of tools for dealing with dates and times in Python, before moving more specifically to a discussion of the tools provided by Pandas. After listing some resources that go into more depth, we will review some short examples of working with time series data in Pandas.

Dates and times in Python
The Python world has a number of available representations of dates, times, deltas, and timespans. While the time series tools provided by Pandas tend to be the most useful for data science applications, it is helpful to see their relationship to other packages used in Python.

Native Python dates and times: datetime and dateutil
Python’s basic objects for working with dates and times reside in the built-in datetime module. Along with the third-party dateutil module, you can use it to quickly perform a host of useful functionalities on dates and times. For example, you can manually build a date using the datetime type:

Or, using the dateutil module, you can parse dates from a variety of string formats:

Once you have a datetime object, you can do things like printing the day of the week:
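The snippets are missing here; a minimal sketch with the standard library only (the dateutil parsing step is left out because dateutil is a third-party package), using the date mentioned earlier in the text:

```python
from datetime import datetime

# Build a date by hand.
date = datetime(year=2015, month=7, day=4)

# "%A" is the format code for the full weekday name.
weekday = date.strftime("%A")
print(weekday)
```

With the default locale this prints Saturday.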

In the final line, we’ve used one of the standard string format codes for printing dates ("%A"), which you can read about in the strftime section of Python’s datetime documentation. Documentation of other useful date utilities can be found in dateutil’s online documentation. A related package to be aware of is pytz, which contains tools for working with the most migraine-inducing piece of time series data: time zones.

The power of datetime and dateutil lies in their flexibility and easy syntax: you can use these objects and their built-in methods to easily perform nearly any operation you might be interested in. Where they break down is when you wish to work with large arrays of dates and times: just as lists of Python numerical variables are suboptimal compared to NumPy-style typed numerical arrays, lists of Python datetime objects are suboptimal compared to typed arrays of encoded dates.

Typed arrays of times: NumPy’s datetime64
The weaknesses of Python’s datetime format inspired the NumPy team to add a set of native time series data types to NumPy. The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly. The datetime64 requires a very specific input format:

Once we have this date formatted, however, we can quickly do vectorized operations on it:
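A sketch of both steps, the ISO-style input and the vectorized operation, with the same date used elsewhere in the text:

```python
import numpy as np

# datetime64 wants an ISO-style string; the unit (days) is inferred.
date = np.array("2015-07-04", dtype=np.datetime64)

# Adding an integer array adds that many days to the date.
following = date + np.arange(12)
print(following)
```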

Because of the uniform type in NumPy datetime64 arrays, this type of operation can be accomplished much more quickly than if we were working directly with Python’s datetime objects, especially as arrays get large (we introduced this type of vectorization in Computation on NumPy Arrays: Universal Functions) [here].

One detail of the datetime64 and timedelta64 objects is that they are built on a fundamental time unit. Because the datetime64 object is limited to 64-bit precision, the range of encodable times is 2^64 times this fundamental unit. In other words, datetime64 imposes a trade-off between time resolution and maximum time span.

For example, if you want a time resolution of one nanosecond, you only have enough information to encode a range of 2^64 nanoseconds, or just under 600 years. NumPy will infer the desired unit from the input; for example, here is a day-based datetime:

Here is a minute-based datetime:

Notice that, in the NumPy version used by the book, the time zone is automatically set to the local time on the computer executing the code (since NumPy 1.11, datetime64 values are timezone-naive). You can force any desired fundamental unit using one of many format codes; for example, here we’ll force a nanosecond-based time:
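The three cases (day-based, minute-based, forced nanoseconds) can be sketched like this; the timestamps are arbitrary:

```python
import numpy as np

day = np.datetime64("2015-07-04")                      # unit inferred: D
minute = np.datetime64("2015-07-04 12:00")             # unit inferred: m
nano = np.datetime64("2015-07-04 12:59:59.50", "ns")   # unit forced: ns

print(day.dtype, minute.dtype, nano.dtype)
```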

The following table, drawn from the NumPy datetime64 documentation, lists the available format codes along with the relative and absolute timespans that they can encode:

Code Meaning     Time span (relative) Time span (absolute)
Y    Year        ± 9.2e18 years       [9.2e18 BC, 9.2e18 AD]
M    Month       ± 7.6e17 years       [7.6e17 BC, 7.6e17 AD]
W    Week        ± 1.7e17 years       [1.7e17 BC, 1.7e17 AD]
D    Day         ± 2.5e16 years       [2.5e16 BC, 2.5e16 AD]
h    Hour        ± 1.0e15 years       [1.0e15 BC, 1.0e15 AD]
m    Minute      ± 1.7e13 years       [1.7e13 BC, 1.7e13 AD]
s    Second      ± 2.9e11 years       [2.9e11 BC, 2.9e11 AD]
ms   Millisecond ± 2.9e8 years        [2.9e8 BC, 2.9e8 AD]
us   Microsecond ± 2.9e5 years        [290301 BC, 294241 AD]
ns   Nanosecond  ± 292 years          [1678 AD, 2262 AD]
ps   Picosecond  ± 106 days           [1969 AD, 1970 AD]
fs   Femtosecond ± 2.6 hours          [1969 AD, 1970 AD]
as   Attosecond  ± 9.2 seconds        [1969 AD, 1970 AD]

For the types of data we see in the real world, a useful default is datetime64[ns], as it can encode a useful range of modern dates with a suitably fine precision.

Finally, we will note that while the datetime64 data type addresses some of the deficiencies of the built-in Python datetime type, it lacks many of the convenient methods and functions provided by datetime and especially dateutil. More information can be found in NumPy’s datetime64 documentation.

Dates and times in Pandas: the best of both worlds
Pandas builds upon all the tools just discussed to provide a Timestamp object, which combines the ease-of-use of datetime and dateutil with the efficient storage and vectorized interface of numpy.datetime64. From a group of these Timestamp objects, Pandas can construct a DatetimeIndex that can be used to index data in a Series or DataFrame; we’ll see many examples of this below.

For example, we can use Pandas tools to repeat the demonstration from above. We can parse a flexibly formatted string date, and use format codes to output the day of the week:

Additionally, we can do NumPy-style vectorized operations directly on this same object:
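Both demonstrations, parsing plus weekday and the vectorized sum, in one sketch; an ISO-formatted string is used here to stay on the safe side with the parser:

```python
import numpy as np
import pandas as pd

# Parse a date string into a Timestamp and print its weekday.
date = pd.to_datetime("2015-07-04")
weekday = date.strftime("%A")

# NumPy-style vectorized operation: add 0..11 days to the same object.
following = date + pd.to_timedelta(np.arange(12), unit="D")

print(weekday)
print(following)
```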

In the next section, we will take a closer look at manipulating time series data with the tools provided by Pandas.

:mrgreen:

cit. & loll – 40

Thursday? Here it is 🍇

Shoutout to the Amazon box thieves who opened my box
::: h0h0h0

Without requirements or design, programming is the art of adding bugs to an empty text file
::: CodeWisdom

Attending a conference? Please carefully study this flow-chart and act accordingly
OK, but does it really need reminding?
::: jakevdp

The Final Fact-check
::: GoComics

This is all your fault
::: UniStudios

I, for one, also have to hide the fact that I don’t know German
::: pberndro

What really bugs me about all this “Windows is vulnerable”
::: hashbreaker

Another fun optical illusion in a few lines of matplotlib
coming soon on this very blog 😜
::: jakevdp

Simple doesn’t mean stupid
::: CodeWisdom

Everything that happens online every 60 seconds
::: gretascl

Daughter: Dad, you know Binary Search Trees?
::: Google+

Judgment Day will come, designers… It will come…
::: madbob

Yeah, in fact, Haskell is a nice guy!
::: freeuniverser

Good code will be easily replaced, when necessary
::: ziobrando

Why not rebrand as a mere platform connecting diners & cooks
::: FrankPasquale

The problem with connecting everyone on the planet is that a lot of people are assholes
::: fabiochiusi

OMG. Schrödinger’s Linux/Unix directory
::: nixcraft

7!+1=71^2
there is much more, but this is remarkable on its own 😎
::: wallingf

60 years ago today researchers ran the first FORTRAN program
ahem… yesterday, or anyway April 19
::: MIT_CSAIL

Programming is one of nature’s ways
::: wallingf

Relativity
::: nikitonsky

JavaScript 27 – higher-order functions – 3

Continuing from here, copying from here.

JSON
Higher-order functions that somehow apply a function to the elements of an array are widely used in JavaScript. The forEach method is the most primitive such function. There are a number of other variants available as methods on arrays. To familiarize ourselves with them, let’s play around with another data set.

A few years ago, someone crawled through a lot of archives and put together a book on the history of my family name (Haverbeke—meaning Oatbrook). I opened it hoping to find knights, pirates, and alchemists … but the book turns out to be mostly full of Flemish farmers. For my amusement, I extracted the information on my direct ancestors and put it into a computer-readable format.

The file I created looks something like this:

[
  {"name": "Emma de Milliano", "sex": "f",
   "born": 1876, "died": 1956,
   "father": "Petrus de Milliano",
   "mother": "Sophia van Damme"},
  {"name": "Carolus Haverbeke", "sex": "m",
   "born": 1832, "died": 1905,
   "father": "Carel Haverbeke",
   "mother": "Maria van Brussel"},
   ... and so on
]

This format is called JSON (pronounced “Jason”), which stands for JavaScript Object Notation. It is widely used as a data storage and communication format on the Web.

JSON is similar to JavaScript’s way of writing arrays and objects, with a few restrictions. All property names have to be surrounded by double quotes, and only simple data expressions are allowed—no function calls, variables, or anything that involves actual computation. Comments are not allowed in JSON.

JavaScript gives us functions, JSON.stringify and JSON.parse, that convert data to and from this format. The first takes a JavaScript value and returns a JSON-encoded string. The second takes such a string and converts it to the value it encodes (file JS0.js).

var string = JSON.stringify({name: "X", born: 1980});
console.log(string);
console.log(JSON.parse(string).born);

The variable ANCESTRY_FILE, available in the sandbox for this chapter and in a downloadable file on the website, contains the content of my JSON file as a string. Let’s decode it and see how many people it contains (JS1.js).

var ANCESTRY_FILE = require('./ancestry.js');
// this will be explained later

var ancestry = JSON.parse(ANCESTRY_FILE);
console.log(ancestry.length);

Note: I modified the code; I had to install require.js from here.

Filtering an array
To find the people in the ancestry data set who were young in 1924, the following function might be helpful. It filters out the elements in an array that don’t pass a test (anc0.js).

var ANCESTRY_FILE = require('./ancestry.js');
var ancestry = JSON.parse(ANCESTRY_FILE);

function filter(array, test) {
  var passed = [];
  for (var i = 0; i < array.length; i++) {
    if (test(array[i])) passed.push(array[i]); 
  } 
  return passed; 
} 
console.log(filter(ancestry, function(person) { 
  return person.born > 1900 && person.born < 1925;
}));

This uses the argument named test, a function value, to fill in a “gap” in the computation. The test function is called for each element, and its return value determines whether an element is included in the returned array.

Three people in the file were alive and young in 1924: my grandfather, grandmother, and great-aunt.

Note how the filter function, rather than deleting elements from the existing array, builds up a new array with only the elements that pass the test. This function is pure. It does not modify the array it is given.

Like forEach, filter is also a standard method on arrays. The example defined the function only in order to show what it does internally. From now on, we’ll use it like this instead:

function reduceAncestors(person, f, defaultValue) {
  function valueFor(person) {
    if (person == null)
      return defaultValue;
    else
      return f(person, valueFor(byName[person.mother]),
                       valueFor(byName[person.father]));
  }
  return valueFor(person);
}

The inner function (valueFor) handles a single person. Through the magic of recursion, it can simply call itself to handle the father and the mother of this person. The results, along with the person object itself, are passed to f, which returns the actual value for this person.

We can then use this to compute the amount of DNA my grandfather shared with Pauwels van Haverbeke and divide that by four.

function sharedDNA(person, fromMother, fromFather) {
  if (person.name == "Pauwels van Haverbeke")
    return 1;
  else
    return (fromMother + fromFather) / 2;
}
var ph = byName["Philibert Haverbeke"];
console.log(reduceAncestors(ph, sharedDNA, 0) / 4);

Note: Marijn has spread the functions he needs over many files, loading them with functions not yet covered and without using NodeJS. Running them in node is therefore needlessly complex 👿

The person with the name Pauwels van Haverbeke obviously shared 100 percent of his DNA with Pauwels van Haverbeke (there are no people who share names in the data set), so the function returns 1 for him. All other people share the average of the amounts that their parents share.

So, statistically speaking, I share about 0.05 percent of my DNA with this 16th-century person. It should be noted that this is only a statistical approximation, not an exact amount. It is a rather small number, but given how much genetic material we carry (about 3 billion base pairs), there’s still probably some aspect in the biological machine that is me that originates with Pauwels.

We could also have computed this number without relying on reduceAncestors. But separating the general approach (condensing a family tree) from the specific case (computing shared DNA) can improve the clarity of the code and allows us to reuse the abstract part of the program for other cases. For example, the following code finds the percentage of a person’s known ancestors who lived past 70 (by lineage, so people may be counted multiple times):

function countAncestors(person, test) {
  function combine(current, fromMother, fromFather) {
    var thisOneCounts = current != person && test(current);
    return fromMother + fromFather + (thisOneCounts ? 1 : 0);
  }
  return reduceAncestors(person, combine, 0);
}
function longLivingPercentage(person) {
  var all = countAncestors(person, function(person) {
    return true;
  });
  var longLiving = countAncestors(person, function(person) {
    return (person.died - person.born) >= 70;
  });
  return longLiving / all;
}
console.log(longLivingPercentage(byName["Emile Haverbeke"]));

Such numbers are not to be taken too seriously, given that our data set contains a rather arbitrary collection of people. But the code illustrates the fact that reduceAncestors gives us a useful piece of vocabulary for working with the family tree data structure.

Binding
The bind method, which all functions have, creates a new function that will call the original function but with some of the arguments already fixed.

The following code shows an example of bind in use. It defines a function isInSet that tells us whether a person is in a given set of strings. To call filter in order to collect those person objects whose names are in a specific set, we can either write a function expression that makes a call to isInSet with our set as its first argument or partially apply the isInSet function.

var theSet = ["Carel Haverbeke", "Maria van Brussel",
              "Donald Duck"];
function isInSet(set, person) {
  return set.indexOf(person.name) > -1;
}

console.log(ancestry.filter(function(person) {
  return isInSet(theSet, person);
}));
// → [{name: "Maria van Brussel", …},
//    {name: "Carel Haverbeke", …}]
console.log(ancestry.filter(isInSet.bind(null, theSet)));

the “...” can be expanded, obtaining

The call to bind returns a function that will call isInSet with theSet as first argument, followed by any remaining arguments given to the bound function.

The first argument, where the example passes null, is used for method calls, similar to the first argument to apply. I’ll describe this in more detail in the next chapter.

Marijn (rockz! 🚀) and I have different ideas about how examples should be written 😡

:mrgreen:

NumPy – 59 – vectorized string operations – 2

Continuing from here, copying from here.

Example: a recipe database
These vectorized string operations become most useful in the process of cleaning up messy, real-world data. Here I’ll walk through an example of that, using an open recipe database compiled from various sources on the Web. Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.

The scripts used to compile this can be found [on GitHub], and the link to the current version of the database is found there as well.

As of Spring 2016, this database is about 30 MB, and can be downloaded and unzipped with these commands:

OOPS! Empty; by googling I found it here.

The database is in JSON format, so we will try pd.read_json to read it:

Oops! We get a ValueError mentioning that there is “trailing data.” Searching for the text of this error on the Internet, it seems that it’s due to using a file in which each line is itself a valid JSON, but the full file is not. Let’s check if this interpretation is true:

Yes, apparently each line is a valid JSON, so we’ll need to string them together. One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with pd.read_json:
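The idea can be tried without the real file. In this sketch a short, made-up string stands in for the recipe database (names and fields are hypothetical) and only the standard json module is used, not pd.read_json:

```python
import json

# Hypothetical stand-in for the recipe file: one JSON object per line.
raw = ('{"name": "Pancakes", "ingredients": "flour, eggs"}\n'
       '{"name": "Salsa", "ingredients": "tomato, onion"}\n')

# The file as a whole is not valid JSON ...
try:
    json.loads(raw)
    whole_file_is_valid = True
except json.JSONDecodeError:
    whole_file_is_valid = False

# ... but each line is, so join the lines into one JSON array.
lines = [line.strip() for line in raw.splitlines() if line.strip()]
joined = "[" + ",".join(lines) + "]"
records = json.loads(joined)

print(whole_file_is_valid, len(records))
```

The joined string is what would then be handed to the reader in place of the original file.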

We see there are nearly 200,000 recipes, and 17 columns. Let’s take a look at one row to see what we have:

There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web. In particular, the ingredient list is in string format; we’re going to have to carefully extract the information we’re interested in. Let’s start by taking a closer look at the ingredients:

The ingredient lists average 250 characters long, with a minimum of 0 and a maximum of nearly 10,000 characters!

Just out of curiosity, let’s see which recipe has the longest ingredient list:

That certainly looks like an involved recipe.

We can do other aggregate explorations; for example, let’s see how many of the recipes are for breakfast food:

Or how many of the recipes list cinnamon as an ingredient:

This is the type of essential data exploration that is possible with Pandas string tools. It is data munging like this that Python really excels at.

A simple recipe recommender
Let’s go a bit further, and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients. While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row. So we will cheat a bit: we’ll start with a list of common ingredients, and simply search to see whether they are in each recipe’s ingredient list. For simplicity, let’s just stick with herbs and spices for the time being:

We can then build a Boolean DataFrame consisting of True and False values, indicating whether this ingredient appears in the list:

Now, as an example, let’s say we’d like to find a recipe that uses parsley, paprika, and tarragon. We can compute this very quickly using the query() method of DataFrames, discussed in High-Performance Pandas: eval() and query() [coming soon]:
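Since the code is missing, here is a sketch of the whole recommender on a tiny, invented table (three recipes and a short spice list, both hypothetical); the real database is of course much larger and messier.

```python
import pandas as pd

# Hypothetical miniature recipe table.
recipes = pd.DataFrame({
    "name": ["Herb chicken", "Plain rice", "Paprika stew"],
    "ingredients": [
        "chicken, Parsley, paprika, tarragon, salt",
        "rice, water, salt",
        "beef, paprika, pepper",
    ],
})
spice_list = ["salt", "pepper", "parsley", "paprika", "tarragon"]

# One Boolean column per spice: does the ingredient string mention it?
spice_df = pd.DataFrame({
    spice: recipes["ingredients"].str.contains(spice, case=False)
    for spice in spice_list
})

# Keep the rows where all three spices appear, then look up the names.
selection = spice_df.query("parsley & paprika & tarragon")
matches = recipes.loc[selection.index, "name"]
print(matches)
```

The substring search is deliberately naive: it would also match a spice name inside a longer word, which is part of the cheating mentioned above.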

We find only 10 recipes with this combination; let’s use the index returned by this selection to discover the names of the recipes that have this combination:

Now that we have narrowed down our recipe selection by a factor of almost 20,000, we are in a position to make a more informed decision about what we’d like to cook for dinner.

Going further with recipes
Hopefully this example has given you a bit of a flavor (ba-dum!) for the types of data cleaning operations that are efficiently enabled by Pandas string methods. Of course, building a very robust recipe recommendation system would require a lot more work! Extracting full ingredient lists from each recipe would be an important piece of the task; unfortunately, the wide variety of formats used makes this a relatively time-consuming process. This points to the truism that in data science, cleaning and munging of real-world data often comprises the majority of the work, and Pandas provides the tools that can help you do this efficiently.

:mrgreen: