Profile picture

Pandas String Handling (.str Attribute)

Last updated: May 22nd, 20192019-05-22Project preview

rmotr


 Pandas String handling (.str attribute)

Cleaning text values can be incredibly hard. Invalid text values involves, 99% of the time, mistyping, which is completely unpredictable and doesn't follow any pattern. Thankfully, it's not so common these days, where data-entry tasks have been replaced by machines. Still, let's explore the most common cases:

purple-divider

Hands on!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

green-divider

Splitting Columns

The result of a survey is loaded and this is what you get:

In [2]:
df = pd.DataFrame({
    'Data': [
        '1987_M_US _1',
        '1990?_M_UK_1',
        '1992_F_US_2',
        '1970?_M_   IT_1',
        '1985_F_I  T_2'
]})
In [3]:
df
Out[3]:
Data
0 1987_M_US _1
1 1990?_M_UK_1
2 1992_F_US_2
3 1970?_M_ IT_1
4 1985_F_I T_2

You know that the single columns represent the values "year, Sex, Country and number of children", but it's all been grouped in the same column and separated by an underscore. Pandas has a convenient method named split that we can use in these situations:

In [4]:
df['Data'].str.split('_')
Out[4]:
0       [1987, M, US , 1]
1       [1990?, M, UK, 1]
2        [1992, F, US, 2]
3    [1970?, M,    IT, 1]
4      [1985, F, I  T, 2]
Name: Data, dtype: object
In [5]:
df['Data'].str.split('_', expand=True)
Out[5]:
0 1 2 3
0 1987 M US 1
1 1990? M UK 1
2 1992 F US 2
3 1970? M IT 1
4 1985 F I T 2
In [6]:
df = df['Data'].str.split('_', expand=True)
In [7]:
df.columns = ['Year', 'Sex', 'Country', 'No Children']

You can also check which columns contain a given value with the contains method:

In [8]:
df
Out[8]:
Year Sex Country No Children
0 1987 M US 1
1 1990? M UK 1
2 1992 F US 2
3 1970? M IT 1
4 1985 F I T 2
In [9]:
df['Year'].str.contains('\?')
Out[9]:
0    False
1     True
2    False
3     True
4    False
Name: Year, dtype: bool

contains takes a regex/pattern as first value, so we need to escape the ? symbol as it has a special meaning for these patterns. Regular letters don't need escaping:

In [10]:
df['Country'].str.contains('U')
Out[10]:
0     True
1     True
2     True
3    False
4    False
Name: Country, dtype: bool

Removing blank spaces (like in 'US ' or 'I T' can be achieved with strip (lstrip and rstrip also exist) or just replace:

In [11]:
df['Country'].str.strip()
Out[11]:
0      US
1      UK
2      US
3      IT
4    I  T
Name: Country, dtype: object
In [12]:
df['Country'].str.replace(' ', '')
Out[12]:
0    US
1    UK
2    US
3    IT
4    IT
Name: Country, dtype: object

As we said, replace and contains take regex patterns, which can make it easier to replace values in bulk:

In [13]:
df['Year'].str.replace(r'(?P<year>\d{4})\?', lambda m: m.group('year'))
Out[13]:
0    1987
1    1990
2    1992
3    1970
4    1985
Name: Year, dtype: object

But, be warned:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

As you can see, all these string/text-related operations are applied over the str attribute of the series. That's because they have a special place in Series handling and you can read more about it here.

green-divider

Real example

Let's take a again to the Google Play Store Apps dataset.

We already fixed some invalid column types on previous lectures, but what about Installs and Price columns?

Both columns are string (object) type, because they have some special characters, but they really should be int and float types respectively.

In [14]:
apps = pd.read_csv('data/googleplaystore.csv', usecols=['App', 'Installs', 'Price'])

apps.head()
Out[14]:
App Installs Price
0 Photo Editor & Candy Camera & Grid & ScrapBook 10,000+ $1.99
1 Coloring book moana 500,000+ $1.99
2 U Launcher Lite – FREE Live Cool Themes, Hide ... 5,000,000+ $1.99
3 Sketch - Draw & Paint 50,000,000+ $1.99
4 Pixel Draw - Number Art Coloring Book 100,000+ $1.99
In [15]:
apps.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 3 columns):
App         1025 non-null object
Installs    1025 non-null object
Price       1023 non-null object
dtypes: object(3)
memory usage: 24.1+ KB

green-divider

Fixing invalid types

The first thing we'll do is removing missing values:

In [16]:
apps.dropna(inplace=True)
In [17]:
apps.head()
Out[17]:
App Installs Price
0 Photo Editor & Candy Camera & Grid & ScrapBook 10,000+ $1.99
1 Coloring book moana 500,000+ $1.99
2 U Launcher Lite – FREE Live Cool Themes, Hide ... 5,000,000+ $1.99
3 Sketch - Draw & Paint 50,000,000+ $1.99
4 Pixel Draw - Number Art Coloring Book 100,000+ $1.99

Now we'll need to do the following replacements within the string values:

  • Installs column: remove the + and , signs, then cast to int;
  • Price column: remove the $ and X signs, then cast to float;

 Installs

In [18]:
apps['Installs'].unique()
Out[18]:
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+'], dtype=object)
In [19]:
apps['Installs'] = apps['Installs'].str.replace('+', '')
apps['Installs'] = apps['Installs'].str.replace(',', '')
In [20]:
apps['Installs'] = apps['Installs'].astype('int')
In [21]:
apps['Installs'].unique()
Out[21]:
array([     10000,     500000,    5000000,   50000000,     100000,
            50000,    1000000,   10000000,       5000,  100000000,
       1000000000,       1000,  500000000,         50,        100,
              500,         10,          1,          5])
In [22]:
fig, ax = plt.subplots(figsize=(16, 6))

sns.distplot(apps['Installs'])
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f76db214358>

 Price

In [23]:
apps['Price'].unique()
Out[23]:
array(['$1.99 ', '$0 ', 'X', '$0.99 '], dtype=object)
In [24]:
apps['Price'] = apps['Price'].str.replace('$', '')
apps['Price'] = apps['Price'].str.replace('X', '0')
In [25]:
apps['Price'] = apps['Price'].astype('float')
In [26]:
apps['Price'].unique()
Out[26]:
array([1.99, 0.  , 0.99])
In [27]:
fig, ax = plt.subplots(figsize=(16, 6))

sns.distplot(apps['Price'])
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f76db539550>

 Checking types

In [28]:
apps.dtypes
Out[28]:
App          object
Installs      int64
Price       float64
dtype: object
In [29]:
apps.head()
Out[29]:
App Installs Price
0 Photo Editor & Candy Camera & Grid & ScrapBook 10000 1.99
1 Coloring book moana 500000 1.99
2 U Launcher Lite – FREE Live Cool Themes, Hide ... 5000000 1.99
3 Sketch - Draw & Paint 50000000 1.99
4 Pixel Draw - Number Art Coloring Book 100000 1.99

Now every column seems to have the correct type.

purple-divider

Notebooks AI
Notebooks AI Profile20060