
2.2 - Handling Missing Data With Pandas

Last updated: May 17th, 2019




pandas borrows all the selection capabilities from numpy and adds a number of convenient methods to handle missing values. Let's look at them one at a time:


Hands on!

In [ ]:
import numpy as np
import pandas as pd

Pandas utility functions

Similarly to numpy, pandas has a few utility functions to detect null values:

In [ ]:
pd.isnull(np.nan)
In [ ]:
pd.isnull(None)
In [ ]:
pd.isna(np.nan)
In [ ]:
pd.isna(None)

The opposite ones also exist:

In [ ]:
pd.notnull(None)
In [ ]:
pd.notna(None)
In [ ]:
pd.notnull(np.nan)
In [ ]:
pd.notnull(3)

These functions also work with Series and DataFrames:

In [ ]:
pd.isnull(pd.Series([1, np.nan, 7]))
In [ ]:
pd.notnull(pd.Series([1, np.nan, 7]))

They work the same way for DataFrames:

In [ ]:
A = pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
})
In [ ]:
pd.isnull(A)


Pandas Operations with Missing Values

pandas manages missing values more gracefully than numpy. nans no longer behave as "viruses": operations like count, sum, and mean just ignore them:

In [ ]:
pd.Series([1, 2, np.nan]).count()
In [ ]:
pd.Series([1, 2, np.nan]).sum()
In [ ]:
pd.Series([2, 2, np.nan]).mean()
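
If you ever need the numpy-style behavior back, these aggregations accept a skipna parameter; setting it to False makes the nan "contagious" again (a quick sketch):

In [ ]:
pd.Series([1, 2, np.nan]).sum(skipna=False)  # the nan propagates, so the result is nan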

Filtering missing data

As we saw with numpy, we can combine boolean selection with pd.notnull to filter out the nan values:

In [ ]:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
In [ ]:
pd.notnull(s)
In [ ]:
pd.notnull(s).sum()  # number of non-null values (.count() would just return the length of the mask)
In [ ]:
s[pd.notnull(s)]

But notnull and isnull are also methods of Series and DataFrames, so we can use them that way:

In [ ]:
s.isnull()
In [ ]:
s.notnull()
In [ ]:
s[s.notnull()]
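
Since the result of notnull is just a boolean Series, it can be combined with any other condition using the & operator (a small sketch; note the parentheses around each condition):

In [ ]:
s[s.notnull() & (s > 2)]  # not null AND greater than 2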


Identifying null values in DataFrames

DataFrames are, after all, collections of Series, so we can use the same methods we've seen so far (like isna or notna):

In [ ]:
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})
In [ ]:
df
In [ ]:
df.isna()
In [ ]:
df.notna()

The info method is also useful for a quick summary of the missing values in each column:

In [ ]:
df.info()

As you can see, Column A has "2 non-null float64" values, and we have 4 total items (RangeIndex: 4 entries). That means there are 2 null values.
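
If you only want those non-null counts without the rest of the summary, the count method returns them directly (a quick sketch):

In [ ]:
df.count()  # non-null values per column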

Another useful check is "summing" the result of isna to get a count of null values:

In [ ]:
df.isnull().sum()

We can also get the "proportion" of null values by dividing the number of null values per column by the total number of rows in the DataFrame:

In [ ]:
df.isnull().sum() / df.shape[0]
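
Since True counts as 1 and False as 0, taking the mean of the boolean mask is an equivalent shortcut (a small sketch):

In [ ]:
df.isnull().mean()  # same proportions as the division above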

We'll revisit this at the end of this lecture (see the summary of common patterns).


Dropping null values

Boolean selection + notnull() seems a little verbose and repetitive, and as we said before, any repetitive task probably has a better, more DRY alternative. In this case, we can use the dropna method:

In [ ]:
s.dropna()
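
Note that dropna returns a new Series rather than modifying s in place; to keep the result you'd assign it to a name (s_clean here is just an illustrative variable):

In [ ]:
s_clean = s.dropna()  # the original s still contains the nans
s_clean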

Dropping null values on DataFrames

You saw how simple it is to drop NAs from a Series. But with DataFrames, there are a few more things to consider, because you can't drop single values: you can only drop entire rows or columns. Let's start with a sample DataFrame:

In [ ]:
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})
In [ ]:
df
In [ ]:
df.isnull()
In [ ]:
df.isnull().sum()

The default dropna behavior will drop all the rows in which any null value is present:

In [ ]:
df.dropna()

In this case we're dropping rows: any row containing a null value is dropped from the DataFrame. You can also use the axis parameter to drop columns containing null values instead:

In [ ]:
df.dropna(axis='columns')  # axis=1 also works

So far, any row or column containing at least one null value has been dropped, which can be too extreme depending on the case. You can control this behavior with the how parameter, which can be either 'any' or 'all':

In [ ]:
df2 = pd.DataFrame({
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100]
})
In [ ]:
df2
In [ ]:
df2.dropna(how='all')
In [ ]:
df2.dropna(how='any')  # default behavior

You can also use the thresh parameter to indicate a threshold (a minimum number) of non-null values for the row/column to be kept:

In [ ]:
df
In [ ]:
df.dropna(thresh=3)
In [ ]:
df.dropna(thresh=3, axis='columns')
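
dropna also accepts a subset parameter to restrict which columns are inspected for nulls when dropping rows (a quick sketch, reusing df):

In [ ]:
df.dropna(subset=['Column A'])  # only nulls in 'Column A' cause a row to be dropped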


Filling null values

Sometimes, instead of dropping the null values, we might need to replace them with some other value. This depends heavily on your context and the dataset you're currently working with. Sometimes a nan can be replaced with a 0, sometimes with the mean of the sample, and other times with the closest value. Again, it depends on the context. We'll show you the different methods and mechanisms, and you can then apply them to your own problem.

In [ ]:
s

Filling nulls with an arbitrary value

In [ ]:
s.fillna(0)
In [ ]:
s.fillna(s.mean())
In [ ]:
s
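
As the last cell shows, fillna also returns a new Series and leaves s untouched; assign the result to keep it (s_filled is just an illustrative name):

In [ ]:
s_filled = s.fillna(s.mean())
s_filled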

Filling nulls with contiguous (close) values

The method argument fills null values with nearby values: ffill propagates the previous valid value forward, and bfill propagates the next valid value backward:

In [ ]:
s.fillna(method='ffill')
In [ ]:
s.fillna(method='bfill')

This can still leave null values at the extremes of the Series/DataFrame:

In [ ]:
pd.Series([np.nan, 3, np.nan, 9]).fillna(method='ffill')
In [ ]:
pd.Series([1, np.nan, 3, np.nan, np.nan]).fillna(method='bfill')
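
A common workaround is to chain both methods, so values left at one extreme get filled from the other direction (a small sketch):

In [ ]:
pd.Series([np.nan, 3, np.nan, 9]).fillna(method='ffill').fillna(method='bfill')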

Filling null values on DataFrames

The fillna method also works similarly on DataFrames. The main differences are that you can specify the axis (as usual, rows or columns) along which to fill the values (especially for methods like ffill), and that you have more control over the values passed:

In [ ]:
df
In [ ]:
df.fillna({'Column A': 0, 'Column B': 99, 'Column C': df['Column C'].mean()})
In [ ]:
df.fillna(method='ffill', axis=0)
In [ ]:
df.fillna(method='ffill', axis=1)
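
fillna also takes a limit parameter to cap how many consecutive nulls get filled in a single gap (a quick sketch, reusing df):

In [ ]:
df.fillna(method='ffill', limit=1)  # fill at most one null per consecutive run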


Summary: common patterns to find and count null values

There are a couple of patterns we use to check whether there are any null values in our Series/DataFrame (the answer is True/False), and to count how many null values we have. This is very similar to what we did with numpy.

In [ ]:
s
1) Check if there's at least one null value:
In [ ]:
s.isna().any()
2) Check if ALL are null values:
In [ ]:
s.isna().all()
In [ ]:
pd.Series([np.nan, None, np.nan]).isna().all()
3) Check how many null values we have:
In [ ]:
s.isna().sum()
4) Check how many *NOT* null values we have:
In [ ]:
s.notna().sum()
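
The same patterns extend to DataFrames; the first reduction works per column, and applying it twice collapses everything to a single answer (a small sketch, reusing df):

In [ ]:
df.isna().any()        # per column: is there at least one null?
In [ ]:
df.isna().any().any()  # single True/False for the whole DataFrame
In [ ]:
df.isna().sum().sum()  # total number of nulls in the DataFrame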

