NumPy – 58 – vectorized string operations – 1

Continuing from here, copying from here.

One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data. In this section, we’ll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.

Introducing Pandas string operations
We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:
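
A minimal sketch of that kind of vectorized arithmetic; the array `x` and its values are illustrative sample data, not taken from this post:

```python
import numpy as np

# Illustrative sample data (an assumption, any numeric array works)
x = np.array([2, 3, 5, 7, 11, 13])

# One expression is applied to every element, no explicit loop needed
doubled = x * 2
print(doubled)
```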

This vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done. For arrays of strings, NumPy does not provide such simple access, and thus you’re stuck using a more verbose loop syntax:
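
The loop syntax in question might look like the following sketch; the list `data` is made-up sample input with deliberately inconsistent capitalization:

```python
# Illustrative sample data (an assumption)
data = ['peter', 'Paul', 'MARY', 'gUIDO']

# A plain list comprehension calls the built-in str method on each element
capitalized = [s.capitalize() for s in data]
print(capitalized)
```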

This is perhaps sufficient to work with some data, but it will break if there are any missing values. For example:
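
A sketch of the failure, assuming the same made-up list with one missing value represented as `None`; the `try`/`except` is only there so the snippet runs to the end:

```python
# Illustrative sample data with one missing entry (an assumption)
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

try:
    [s.capitalize() for s in data]
    failed = False
except AttributeError as err:
    # None has no capitalize() method, so the whole loop aborts
    failed = True
    print(err)
```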

And just think, my grandfather was from None (TO) 😜

Pandas includes features to address both this need for vectorized string operations and for correctly handling missing data via the str attribute of Pandas Series and Index objects containing strings. So, for example, suppose we create a Pandas Series with this data:
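
A sketch of such a Series, reusing made-up sample names; the variable name `names` is an assumption:

```python
import pandas as pd

# Illustrative sample data with one missing entry (an assumption)
names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])
print(names)
```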

We can now call a single method that will capitalize all the entries, while skipping over any missing values:
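
For instance, assuming a Series `names` of made-up sample values with one missing entry, the call could look like this:

```python
import pandas as pd

# Illustrative sample data with one missing entry (an assumption)
names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# The str accessor applies capitalize() element-wise;
# the missing value stays missing instead of raising an error
capitalized = names.str.capitalize()
print(capitalized)
```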

Using tab completion on this str attribute will list all the vectorized string methods available to Pandas.

Tables of Pandas string methods
If you have a good understanding of string manipulation in Python, most of Pandas string syntax is intuitive enough that it’s probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties. The examples in this section use the following series of names:
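
The series itself did not survive in this copy of the post. A stand-in could be built as follows; both the variable name `monte` and the six names are assumptions chosen for illustration:

```python
import pandas as pd

# Illustrative series of "first last" names (an assumption)
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
print(monte)
```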

Methods similar to Python string methods
Nearly all of Python’s built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods that mirror Python string methods:

len()     lower()      translate()  islower()
ljust()   upper()      startswith() isupper()
rjust()   find()       endswith()   isnumeric()
center()  rfind()      isalnum()    isdecimal()
zfill()   index()      isalpha()    split()
strip()   rindex()     isdigit()    rsplit()
rstrip()  capitalize() isspace()    partition()
lstrip()  swapcase()   istitle()    rpartition()

Notice that these have various return values. Some, like lower(), return a series of strings:
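
A short sketch, assuming an illustrative series of names called `monte`:

```python
import pandas as pd

# Illustrative series of names (an assumption)
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# lower() returns a new Series of strings
lowered = monte.str.lower()
print(lowered)
```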

But some others return numbers:
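
For example, with the same kind of illustrative series (`monte` is an assumed name), len() gives one integer per element:

```python
import pandas as pd

# Illustrative series of names (an assumption)
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# len() returns the number of characters in each string
lengths = monte.str.len()
print(lengths)
```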

Or Boolean values:
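
A sketch using startswith() on an illustrative series (`monte` and the prefix 'T' are assumptions):

```python
import pandas as pd

# Illustrative series of names (an assumption)
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# startswith() returns one Boolean per element
starts_with_t = monte.str.startswith('T')
print(starts_with_t)
```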

Still others return lists or other compound values for each element:
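
For instance, split() on an illustrative series (`monte` is an assumed name) yields a list of words for each element:

```python
import pandas as pd

# Illustrative series of names (an assumption)
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# With no arguments, split() breaks each string on whitespace
parts = monte.str.split()
print(parts)
```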

We’ll see further manipulations of this kind of series-of-lists object as we continue our discussion.

Methods using regular expressions
In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python’s built-in re module:

Method     Description
match()    Call re.match() on each element, returning a boolean.
extract()  Call re.search() on each element, returning the groups of the first match as strings.
findall()  Call re.findall() on each element.
replace()  Replace occurrences of pattern with some other string.
contains() Call re.search() on each element, returning a boolean.
count()    Count occurrences of pattern.
split()    Equivalent to str.split(), but accepts regexps.
rsplit()   Equivalent to str.rsplit(), but accepts regexps.

With these, you can do a wide range of interesting operations. For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:
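
One way to sketch this, assuming an illustrative series `monte` and a pattern that only covers unaccented ASCII letters:

```python
import pandas as pd

# Illustrative series of names (an assumption)
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# The capture group grabs the first run of ASCII letters in each element;
# expand=False returns a Series rather than a one-column DataFrame
first_names = monte.str.extract('([A-Za-z]+)', expand=False)
print(first_names)
```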

Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (^) and end-of-string ($) regular expression characters:
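
A possible sketch with findall(), again on an illustrative series `monte`. Here "consonant" is approximated as "anything that is not an uppercase vowel at the start, nor a lowercase vowel at the end", which is a simplifying assumption:

```python
import pandas as pd

# Illustrative series of names (an assumption)
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# ^[^AEIOU]  first character is not an uppercase vowel
# [^aeiou]$  last character is not a lowercase vowel
matches = monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
print(matches)
```

Elements that do not match get an empty list.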

The ability to concisely apply regular expressions across Series or DataFrame entries opens up many possibilities for analysis and cleaning of data.

Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations:

Method          Description
get()           Index each element
slice()         Slice each element
slice_replace() Replace slice in each element with passed value
cat()           Concatenate strings
repeat()        Repeat values
normalize()     Return Unicode form of string
pad()           Add whitespace to left, right, or both sides of strings
wrap()          Split long strings into lines with length less than a given width
join()          Join strings in each element of the Series with passed separator
get_dummies()   Extract dummy variables as a DataFrame

Vectorized item access and slicing
The get() and slice() operations, in particular, enable vectorized element access from each array. For example, we can get a slice of the first three characters of each string using str.slice(0, 3). Note that this behavior is also available through Python’s normal indexing syntax; for example, df.str.slice(0, 3) is equivalent to df.str[0:3]:
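
A small sketch of both spellings; the series `monte` is illustrative and stands in for the `df` of the text:

```python
import pandas as pd

# Illustrative series of names (an assumption)
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Indexing syntax and the explicit slice() method give the same result
first_three = monte.str[0:3]
same_thing = monte.str.slice(0, 3)
print(first_three)
```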

Indexing via df.str.get(i) and df.str[i] is likewise equivalent.
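
For instance, taking the first character of each element of an illustrative series `monte` (an assumed name standing in for `df`):

```python
import pandas as pd

# Illustrative series of names (an assumption)
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# get(i) and [i] both pick the character at position i of each string
initials_get = monte.str.get(0)
initials_idx = monte.str[0]
print(initials_get)
```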

These get() and slice() methods also let you access elements of arrays returned by split(). For example, to extract the last name of each entry, we can combine split() and get():
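
A sketch of that combination, assuming an illustrative series `monte` whose entries are all "first last" pairs:

```python
import pandas as pd

# Illustrative series of names (an assumption)
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# split() produces a list per element; get(-1) then takes the last item
last_names = monte.str.split().str.get(-1)
print(last_names)
```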

Indicator variables
Another method that requires a bit of extra explanation is the get_dummies() method. This is useful when your data has a column containing some sort of coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A="born in America", B="born in the United Kingdom", C="likes cheese", D="likes spam":
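
Such a dataset might be sketched like this; the DataFrame name `full_monte`, the column names, and the code assignments are all made up for illustration, with codes joined by a `|` separator:

```python
import pandas as pd

# Illustrative data (an assumption): each row carries several codes
# from the A/B/C/D scheme, joined by '|'
full_monte = pd.DataFrame({
    'name': ['Graham Chapman', 'John Cleese', 'Terry Gilliam',
             'Eric Idle', 'Terry Jones', 'Michael Palin'],
    'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'],
})
print(full_monte)
```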

The get_dummies() routine lets you quickly split out these indicator variables into a DataFrame:
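
A sketch on a made-up column of codes joined by `|` (the series `info` and its values are assumptions):

```python
import pandas as pd

# Illustrative coded column (an assumption)
info = pd.Series(['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'])

# Each distinct code becomes a column; a row is flagged
# wherever its string contains that code
dummies = info.str.get_dummies('|')
print(dummies)
```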

With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.

We won’t dive further into these methods here, but I encourage you to read through “Working with Text Data” in the Pandas online documentation, or to refer to the resources listed in Further Resources [coming soon].

