Profile picture

Pandas String Handling

Last updated: November 25th, 20192019-11-25Project preview

rmotr


 Pandas String handling (.str attribute)

Cleaning text values can be incredibly hard. Invalid text values involves, 99% of the time, mistyping, which is completely unpredictable and doesn't follow any pattern. Thankfully, it's not so common these days, where data-entry tasks have been replaced by machines. Still, let's explore the most common cases:

  • .str.split()
  • .str.contains()
  • .str.replace()
  • .str.strip()
  • .str.lower()
  • .str.upper()

purple-divider

Hands on!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

green-divider

Splitting Columns using .str.split()

The result of a survey is loaded and this is what you get:

In [2]:
df = pd.DataFrame({
    'Data': [
        '1987_M_US _1',
        '1990?_M_UK_1',
        '1992_F_US_2',
        '1970?_M_   IT_1',
        '1985_F_I  T_2'
]})
In [3]:
df
Out[3]:
Data
0 1987_M_US _1
1 1990?_M_UK_1
2 1992_F_US_2
3 1970?_M_ IT_1
4 1985_F_I T_2

You know that the single columns represent the values "year, Sex, Country and number of children", but it's all been grouped in the same column and separated by an underscore. Pandas has a convenient method named split that we can use in these situations:

In [4]:
df['Data'].str.split('_')
Out[4]:
0       [1987, M, US , 1]
1       [1990?, M, UK, 1]
2        [1992, F, US, 2]
3    [1970?, M,    IT, 1]
4      [1985, F, I  T, 2]
Name: Data, dtype: object
In [5]:
df['Data'].str.split('_', expand=True)
Out[5]:
0 1 2 3
0 1987 M US 1
1 1990? M UK 1
2 1992 F US 2
3 1970? M IT 1
4 1985 F I T 2
In [6]:
df = df['Data'].str.split('_', expand=True)

df
Out[6]:
0 1 2 3
0 1987 M US 1
1 1990? M UK 1
2 1992 F US 2
3 1970? M IT 1
4 1985 F I T 2
In [7]:
df.columns = ['Year', 'Sex', 'Country', 'No Children']

df
Out[7]:
Year Sex Country No Children
0 1987 M US 1
1 1990? M UK 1
2 1992 F US 2
3 1970? M IT 1
4 1985 F I T 2

green-divider

Checking existance with .str.contains()

You can also check which columns contain a given value with the contains method:

In [8]:
df
Out[8]:
Year Sex Country No Children
0 1987 M US 1
1 1990? M UK 1
2 1992 F US 2
3 1970? M IT 1
4 1985 F I T 2
In [9]:
df['Year'].str.contains('\?')
Out[9]:
0    False
1     True
2    False
3     True
4    False
Name: Year, dtype: bool

contains takes a regex/pattern as first value, so we need to escape the ? symbol as it has a special meaning for these patterns. Regular letters don't need escaping:

In [10]:
df['Country'].str.contains('U')
Out[10]:
0     True
1     True
2     True
3    False
4    False
Name: Country, dtype: bool

How many records from IT do we have?

In [11]:
df['Country'].str.contains('IT').sum()
Out[11]:
1

green-divider

Replacing values using .str.replace()

In [12]:
df
Out[12]:
Year Sex Country No Children
0 1987 M US 1
1 1990? M UK 1
2 1992 F US 2
3 1970? M IT 1
4 1985 F I T 2

In the last row of the Country column there is a whitespace character, let's replace it with a empty character:

In [13]:
df['Country'].str.replace(' ', '')
Out[13]:
0    US
1    UK
2    US
3    IT
4    IT
Name: Country, dtype: object

As we said, replace and contains take regex patterns, which can make it easier to replace values in bulk:

In [14]:
df['Year'].str.replace(r'(?P<year>\d{4})\?', lambda m: m.group('year'))
Out[14]:
0    1987
1    1990
2    1992
3    1970
4    1985
Name: Year, dtype: object
In [ ]:
 

But, be warned:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

green-divider

Removing whitespace characters with .str.strip()

Removing blank spaces (like in 'US ' or ' UK') can also be achieved with strip (lstrip and rstrip also exist):

In [15]:
df['Country']
Out[15]:
0      US 
1       UK
2       US
3       IT
4     I  T
Name: Country, dtype: object
In [16]:
df['Country'].str.strip()
Out[16]:
0      US
1      UK
2      US
3      IT
4    I  T
Name: Country, dtype: object

Note that whitespace characters were removed from the start and end of each value, but middle whitespace characters were kept.

green-divider

Lowercasing strings using .str.lower()

We could want to have all characters in lower case, to do that we can use the .str.lower() method:

In [17]:
df['Sex']
Out[17]:
0    M
1    M
2    F
3    M
4    F
Name: Sex, dtype: object
In [18]:
df['Sex'].str.lower()
Out[18]:
0    m
1    m
2    f
3    m
4    f
Name: Sex, dtype: object
In [19]:
df['Country'] = df['Country'].str.replace(' ', '').str.lower()
In [20]:
df['Country']
Out[20]:
0    us
1    uk
2    us
3    it
4    it
Name: Country, dtype: object

green-divider

Uppercasing strings using .str.upper()

In [21]:
df['Country'] = df['Country'].str.upper()
In [22]:
df['Country']
Out[22]:
0    US
1    UK
2    US
3    IT
4    IT
Name: Country, dtype: object

As you can see, all these string/text-related operations are applied over the str attribute of the series. That's because they have a special place in Series handling and you can read more about it here.

green-divider

String handling + Fixing invalid types

Let's again take a look at the Google Play Store Apps dataset.

We already fixed some invalid column types on previous lectures, but what about Installs and Price columns?

Both columns are string (object) type, because they have some special characters, but they really should be int and float types respectively.

In [23]:
apps = pd.read_csv('googleplaystore.csv', usecols=['App', 'Installs', 'Price'])

apps.head()
Out[23]:
App Installs Price
0 Photo Editor & Candy Camera & Grid & ScrapBook 10,000+ $1.99
1 Coloring book moana 500,000+ $1.99
2 U Launcher Lite – FREE Live Cool Themes, Hide ... 5,000,000+ $1.99
3 Sketch - Draw & Paint 50,000,000+ $1.99
4 Pixel Draw - Number Art Coloring Book 100,000+ $1.99
In [24]:
apps.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 3 columns):
App         1025 non-null object
Installs    1025 non-null object
Price       1023 non-null object
dtypes: object(3)
memory usage: 24.1+ KB

The first thing we'll do is removing missing values:

In [25]:
apps.dropna(inplace=True)
In [26]:
apps.head()
Out[26]:
App Installs Price
0 Photo Editor & Candy Camera & Grid & ScrapBook 10,000+ $1.99
1 Coloring book moana 500,000+ $1.99
2 U Launcher Lite – FREE Live Cool Themes, Hide ... 5,000,000+ $1.99
3 Sketch - Draw & Paint 50,000,000+ $1.99
4 Pixel Draw - Number Art Coloring Book 100,000+ $1.99

Now we'll need to do the following replacements within the string values:

  • Installs column: remove the + and , signs, then cast to int;
  • Price column: remove the $ and X signs, then cast to float;

 Installs

In [27]:
apps['Installs'].unique()
Out[27]:
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+'], dtype=object)
In [28]:
apps['Installs'] = apps['Installs'].str.replace('+', '')
apps['Installs'] = apps['Installs'].str.replace(',', '')
In [29]:
apps['Installs'] = apps['Installs'].astype('int')
In [30]:
apps['Installs'].unique()
Out[30]:
array([     10000,     500000,    5000000,   50000000,     100000,
            50000,    1000000,   10000000,       5000,  100000000,
       1000000000,       1000,  500000000,         50,        100,
              500,         10,          1,          5])
In [31]:
fig, ax = plt.subplots(figsize=(14, 6))

sns.distplot(apps['Installs'])
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f24fb78bf98>

 Price

In [32]:
apps['Price'].unique()
Out[32]:
array(['$1.99 ', '$0 ', 'X', '$0.99 '], dtype=object)
In [33]:
apps['Price'] = apps['Price'].str.replace('$', '')
apps['Price'] = apps['Price'].str.replace('X', '0')
In [34]:
apps['Price'] = apps['Price'].astype('float')
In [35]:
apps['Price'].unique()
Out[35]:
array([1.99, 0.  , 0.99])
In [36]:
fig, ax = plt.subplots(figsize=(14, 6))

sns.distplot(apps['Price'])
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f24f94b6940>

 Checking types

In [37]:
apps.dtypes
Out[37]:
App          object
Installs      int64
Price       float64
dtype: object
In [38]:
apps.head()
Out[38]:
App Installs Price
0 Photo Editor & Candy Camera & Grid & ScrapBook 10000 1.99
1 Coloring book moana 500000 1.99
2 U Launcher Lite – FREE Live Cool Themes, Hide ... 5000000 1.99
3 Sketch - Draw & Paint 50000000 1.99
4 Pixel Draw - Number Art Coloring Book 100000 1.99

Now every column seems to have the correct type.

purple-divider

Notebooks AI
Notebooks AI Profile20060