
Dealing With Invalid Types

Last updated: June 11th, 2019

rmotr



So far we have learned how to get rid of missing and duplicated values. Now we'll check our data and identify any values with invalid types.


Hands on!

In [1]:
import numpy as np
import pandas as pd


Casting a pandas object

Before continuing, let's see how we can cast a pandas object to a specified dtype using the astype method.

The astype method can also convert any suitable existing column to the categorical type.

In [2]:
string = pd.Series(['1', '2', '3', '4', '5'])

string
Out[2]:
0    1
1    2
2    3
3    4
4    5
dtype: object
In [3]:
numbers = string.astype('int')

numbers
Out[3]:
0    1
1    2
2    3
3    4
4    5
dtype: int64
In [4]:
numbers.dtype
Out[4]:
dtype('int64')

We can also change it back to string:

In [5]:
string = numbers.astype('str')

string
Out[5]:
0    1
1    2
2    3
3    4
4    5
dtype: object
In [6]:
string.dtype
Out[6]:
dtype('O')
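
The categorical conversion mentioned above can be sketched in the same way. This is a minimal example on a hypothetical sizes Series (not part of the original notebook), showing the resulting dtype, the distinct categories and the integer codes pandas stores for each row.

```python
import pandas as pd

# Hypothetical labels with repeated values (not from the original notebook).
sizes = pd.Series(['small', 'large', 'small', 'medium', 'large'])

# Cast the text column to the categorical dtype.
sizes_cat = sizes.astype('category')

print(sizes_cat.dtype)                 # category
print(list(sizes_cat.cat.categories))  # distinct labels, sorted alphabetically
print(list(sizes_cat.cat.codes))       # integer code of each row's label
```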


Top-level conversions

Sometimes it is necessary to convert data to a numeric type. For that, pandas provides a top-level function: pd.to_numeric().

Suppose we have the following data:

In [7]:
mixed_data = pd.Series([np.nan, 10, -20, 'Hello World'])

mixed_data
Out[7]:
0            NaN
1             10
2            -20
3    Hello World
dtype: object

This Series mixes null values, positive numbers, negative numbers and strings. Because 'Hello World' cannot be parsed as a number, pd.to_numeric() raises an error by default:

In [8]:
all_numeric = pd.to_numeric(mixed_data)

all_numeric
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "Hello World"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-8-c8a3671f2180> in <module>
----> 1 all_numeric = pd.to_numeric(mixed_data)
      2 
      3 all_numeric

/usr/local/lib/python3.6/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    131             coerce_numeric = False if errors in ('ignore', 'raise') else True
    132             values = lib.maybe_convert_numeric(values, set(),
--> 133                                                coerce_numeric=coerce_numeric)
    134 
    135     except Exception:

pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "Hello World" at position 3

Let's see how we can coerce all the values to a numeric type.

First we call pd.to_numeric() with the errors='coerce' parameter. This parses the data to a numeric type, and any value that can't be converted is coerced into NaN. The result is float64 because NaN is a floating-point value.

In [9]:
all_numeric = pd.to_numeric(mixed_data, errors='coerce')

all_numeric
Out[9]:
0     NaN
1    10.0
2   -20.0
3     NaN
dtype: float64
In [10]:
all_numeric.dtype
Out[10]:
dtype('float64')
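
Coercion hides which values were invalid, so it can help to locate them before moving on. The following is a minimal sketch, assuming the same mixed_data as above: a value counts as invalid if it was present in the original Series but became NaN after coercion. The names invalid_mask and invalid_values are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

mixed_data = pd.Series([np.nan, 10, -20, 'Hello World'])
all_numeric = pd.to_numeric(mixed_data, errors='coerce')

# Present before coercion, NaN after it: the value could not be parsed.
invalid_mask = mixed_data.notna() & all_numeric.isna()

# Keep only the offending entries, with their original index labels.
invalid_values = mixed_data[invalid_mask]
print(invalid_values)
```

The original NaN at position 0 is not flagged, because it was already missing before the conversion.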

Then we can also make all the data positive:

In [11]:
all_positive = abs(all_numeric)

all_positive
Out[11]:
0     NaN
1    10.0
2    20.0
3     NaN
dtype: float64

Finally, fill NaN values with a 0:

In [12]:
all_filled = all_positive.fillna(0)

all_filled
Out[12]:
0     0.0
1    10.0
2    20.0
3     0.0
dtype: float64
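
The three steps above can also be chained into a single expression. This is only a sketch of an equivalent formulation; the variable name cleaned is not from the original notebook.

```python
import numpy as np
import pandas as pd

mixed_data = pd.Series([np.nan, 10, -20, 'Hello World'])

cleaned = (
    pd.to_numeric(mixed_data, errors='coerce')  # unparseable values become NaN
      .abs()                                    # same result as abs(series)
      .fillna(0)                                # replace NaN with 0
)
print(cleaned)
```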


Identifying invalid types

Now let's go a step further and load the Google Play Store Apps dataset:

In [13]:
apps = pd.read_csv('data/googleplaystore.csv')

apps.head()
Out[13]:
   App                                                Category        Rating  Reviews   Installs     Price  Content Rating  Genres                     Last Updated
0  Photo Editor & Candy Camera & Grid & ScrapBook     ART_AND_DESIGN  4.1     159.0     10,000+      $1.99  Everyone        Art & Design               January 7, 2018
1  Coloring book moana                                ART_AND_DESIGN  3.9     967.0     500,000+     $1.99  Everyone        Art & Design;Pretend Play  January 15, 2018
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN  4.7     87510.0   5,000,000+   $1.99  Everyone        Art & Design               August 1, 2018
3  Sketch - Draw & Paint                              ART_AND_DESIGN  4.5     215644.0  50,000,000+  $1.99  Teen            Art & Design               June 8, 2018
4  Pixel Draw - Number Art Coloring Book              ART_AND_DESIGN  4.3     967.0     100,000+     $1.99  Everyone        Art & Design;Creativity    June 20, 2018
In [14]:
apps.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 9 columns):
App               1025 non-null object
Category          1025 non-null object
Rating            971 non-null float64
Reviews           1021 non-null float64
Installs          1025 non-null object
Price             1023 non-null object
Content Rating    1025 non-null object
Genres            1025 non-null object
Last Updated      1025 non-null object
dtypes: float64(2), object(7)
memory usage: 72.1+ KB
In [15]:
apps.get_dtype_counts()
Out[15]:
float64    2
object     7
dtype: int64
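
If get_dtype_counts() is not available in your pandas version (it was deprecated in pandas 0.25 and later removed), a similar summary can be built from the dtypes attribute. This is a minimal sketch on a small hypothetical DataFrame standing in for apps; the row order of the result may differ from the output shown above.

```python
import pandas as pd

# Hypothetical stand-in for the apps DataFrame (values are illustrative).
df = pd.DataFrame({
    'App': ['a', 'b'],
    'Rating': [4.1, 3.9],
    'Reviews': [159.0, 967.0],
})

# Convert each column's dtype to its string name, then count occurrences.
dtype_counts = df.dtypes.astype(str).value_counts()
print(dtype_counts)
```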

We see some wrong types:

  • The Category, Content Rating and Genres columns are object type, while they should be category.
  • The Reviews column is float64 type, while it should be int.
  • The Last Updated column is object type, while it should be datetime.
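
Checks like the ones in the list above can be partly automated. The sketch below runs on a hypothetical sample DataFrame and uses two heuristics that are assumptions of this example, not rules from the original notebook: text columns with few distinct values are flagged as category candidates (the 0.5 threshold is arbitrary), and float columns holding only whole numbers are flagged as int candidates.

```python
import pandas as pd

# Hypothetical sample mimicking a few columns of the dataset.
sample = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Category': ['ART_AND_DESIGN', 'ART_AND_DESIGN', 'GAME', 'GAME'],
    'Reviews': [159.0, 967.0, 87510.0, 215644.0],
    'Rating': [4.1, 3.9, 4.7, 4.5],
})

# Non-numeric columns with a low share of distinct values: category candidates.
text_cols = sample.select_dtypes(exclude='number')
unique_ratio = text_cols.nunique() / len(sample)
category_candidates = unique_ratio[unique_ratio <= 0.5].index.tolist()

# Float columns whose non-missing values are all whole numbers: int candidates.
float_cols = sample.select_dtypes(include='float64')
int_candidates = [
    col for col in float_cols
    if (float_cols[col].dropna() % 1 == 0).all()
]

print(category_candidates)
print(int_candidates)
```

These heuristics only suggest columns to inspect; the final decision still depends on what each column means.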


Fixing invalid types

The first thing we'll do is remove missing values, since a column that contains NaN cannot be cast to int:

In [16]:
apps.dropna(inplace=True)
In [17]:
apps.head()
Out[17]:
   App                                                Category        Rating  Reviews   Installs     Price  Content Rating  Genres                     Last Updated
0  Photo Editor & Candy Camera & Grid & ScrapBook     ART_AND_DESIGN  4.1     159.0     10,000+      $1.99  Everyone        Art & Design               January 7, 2018
1  Coloring book moana                                ART_AND_DESIGN  3.9     967.0     500,000+     $1.99  Everyone        Art & Design;Pretend Play  January 15, 2018
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN  4.7     87510.0   5,000,000+   $1.99  Everyone        Art & Design               August 1, 2018
3  Sketch - Draw & Paint                              ART_AND_DESIGN  4.5     215644.0  50,000,000+  $1.99  Teen            Art & Design               June 8, 2018
4  Pixel Draw - Number Art Coloring Book              ART_AND_DESIGN  4.3     967.0     100,000+     $1.99  Everyone        Art & Design;Creativity    June 20, 2018

We'll fix the Reviews column by casting it to the int type:

In [18]:
apps['Reviews'] = apps['Reviews'].astype('int')
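
The cast above works because the missing values were dropped first. The sketch below, on a hypothetical reviews Series, shows what happens when a NaN is still present, and one possible alternative: the nullable 'Int64' dtype (capital "I"), which keeps the row and stores the gap as a missing value. Whether that alternative suits this dataset is an assumption, not something the original notebook covers.

```python
import numpy as np
import pandas as pd

# Hypothetical review counts with one missing value.
reviews = pd.Series([159.0, 967.0, np.nan])

try:
    reviews.astype('int')   # NaN has no plain integer representation
    cast_failed = False
except ValueError:
    cast_failed = True

# Option 1 (used in this lesson): drop missing values, then cast.
as_int = reviews.dropna().astype('int')

# Option 2: keep every row by using the nullable integer dtype.
as_nullable = reviews.astype('Int64')

print(cast_failed)
print(as_int.tolist())
print(as_nullable.dtype)
```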

Now we'll fix the Category, Content Rating and Genres columns by casting them to the category type.

To cast these three columns at the same time, we'll pass a dictionary to the astype method:

In [19]:
apps = apps.astype({
    'Category': 'category',
    'Content Rating': 'category',
    'Genres': 'category'
})
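
One practical reason to prefer category for columns like these is memory: each distinct label is stored once and the rows hold small integer codes. The following is a rough sketch on a hypothetical genres Series with three labels repeated many times; the exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times.
genres = pd.Series(['Art & Design', 'Tools', 'Education'] * 1000)

# deep=True includes the memory used by the strings themselves.
text_bytes = genres.memory_usage(deep=True)
category_bytes = genres.astype('category').memory_usage(deep=True)

print(text_bytes, category_bytes)
```

The saving shrinks as the number of distinct values approaches the number of rows, so a column such as App would gain little from this conversion.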

Finally, let's make a top-level conversion of the Last Updated column to datetime:

In [20]:
apps['Last Updated'] = pd.to_datetime(apps['Last Updated'])
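
pd.to_datetime() also accepts an explicit format and the same errors='coerce' option we used with pd.to_numeric(). This is a minimal sketch on hypothetical strings in the "Month day, year" style of the Last Updated column, assuming English month names; the format string '%B %d, %Y' is our reading of that style, not something stated in the dataset.

```python
import pandas as pd

# Hypothetical values in the same style as Last Updated, plus one bad entry.
raw_dates = pd.Series(['January 7, 2018', 'August 1, 2018', 'not a date'])

# %B = full month name, %d = day of month, %Y = four-digit year.
# Entries that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format='%B %d, %Y', errors='coerce')

print(parsed)
```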

Check that every type was changed correctly:

In [21]:
apps.dtypes
Out[21]:
App                       object
Category                category
Rating                   float64
Reviews                    int64
Installs                  object
Price                     object
Content Rating          category
Genres                  category
Last Updated      datetime64[ns]
dtype: object
In [22]:
apps.get_dtype_counts()
Out[22]:
category          3
datetime64[ns]    1
float64           1
int64             1
object            3
dtype: int64
In [23]:
apps.head()
Out[23]:
   App                                                Category        Rating  Reviews  Installs     Price  Content Rating  Genres                     Last Updated
0  Photo Editor & Candy Camera & Grid & ScrapBook     ART_AND_DESIGN  4.1     159      10,000+      $1.99  Everyone        Art & Design               2018-01-07
1  Coloring book moana                                ART_AND_DESIGN  3.9     967      500,000+     $1.99  Everyone        Art & Design;Pretend Play  2018-01-15
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN  4.7     87510    5,000,000+   $1.99  Everyone        Art & Design               2018-08-01
3  Sketch - Draw & Paint                              ART_AND_DESIGN  4.5     215644   50,000,000+  $1.99  Teen            Art & Design               2018-06-08
4  Pixel Draw - Number Art Coloring Book              ART_AND_DESIGN  4.3     967      100,000+     $1.99  Everyone        Art & Design;Creativity    2018-06-20

Now every column we flagged has the intended type.

