Handling Missing Data with Pandas
pandas borrows all the selection capabilities from numpy and adds a number of convenient methods to handle missing values. Let's look at them one at a time:
Hands on!
import numpy as np
import pandas as pd
Pandas utility functions
Similarly to numpy, pandas also has a few utility functions to identify and detect null values:
pd.isnull(np.nan) # True
pd.isnull(None) # True
pd.isna(np.nan) # True
pd.isna(None) # True
The opposite ones also exist:
pd.notnull(None) # False
pd.notna(None) # False
pd.notnull(np.nan) # False
pd.notnull(3) # True
These functions also work with Series and DataFrames:
pd.isnull(pd.Series([1, np.nan, 7]))
pd.notnull(pd.Series([1, np.nan, 7]))
For DataFrames, they work in the same way:
A = pd.DataFrame({
'Column A': [1, np.nan, 7],
'Column B': [np.nan, 2, 3],
'Column C': [np.nan, 2, np.nan]
})
pd.isnull(A)
Pandas Operations with Missing Values
Pandas manages missing values more gracefully than numpy. nans will no longer behave as "viruses"; operations will just ignore them completely:
pd.Series([1, 2, np.nan]).count() # 2 (the nan is not counted)
pd.Series([1, 2, np.nan]).sum() # 3.0 (the nan is skipped)
pd.Series([2, 2, np.nan]).mean() # 2.0 (mean of the non-null values only)
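For contrast, here's a quick sketch of the same sum in plain numpy, where the nan still "infects" the result unless you reach for the nan-aware variants:
np.array([1, 2, np.nan]).sum() # nan (the nan propagates through the operation)
np.nansum(np.array([1, 2, np.nan])) # 3.0 (numpy needs an explicit nan-aware function)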
Filtering missing data
As we saw with numpy, we could combine boolean selection + pd.isnull to filter out those nans and null values:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
pd.notnull(s)
pd.notnull(s).count() # 6: this counts the elements of the boolean mask itself, not the non-null values of s
s[pd.notnull(s)]
But both notnull and isnull are also methods of Series and DataFrames, so we could use them that way:
s.isnull()
s.notnull()
s[s.notnull()]
Identifying Null values in DataFrames
DataFrames are, after all, a collection of Series, so we can use the same methods we've seen so far (like isna, or notna):
df = pd.DataFrame({
'Column A': [1, np.nan, 30, np.nan],
'Column B': [2, 8, 31, np.nan],
'Column C': [np.nan, 9, 32, 100],
'Column D': [5, 8, 34, 110],
})
df
df.isna()
df.notna()
The info method is also useful for a quick summary of the missing values in each column:
df.info()
As you can see, Column A has "2 non-null float64" values, and we have 4 total items (RangeIndex: 4 entries). That means that there are 2 null values.
Another useful check is "summing" the result of isna (or its alias isnull) to get a count of null values per column:
df.isnull().sum()
We can also get the proportion of null values by dividing the number of null values per column by the total number of rows in the DataFrame:
df.isnull().sum() / df.shape[0]
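Since the mean of a boolean mask is just the fraction of True values, the same proportions can be obtained in a single step:
df.isnull().mean() # proportion of null values per column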
We'll revisit this at the end of this lecture (see the Patterns summary).
Dropping null values
Boolean selection + notnull() seems a little bit verbose and repetitive. And as we said before: any repetitive task will probably have a better, more DRY way. In this case, we can use the dropna method:
s.dropna()
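Note that dropna didn't modify s in place; like most pandas methods, it returns a new object:
s # still contains the nulls; dropna returned a new Series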
Dropping null values on DataFrames
You saw how simple it is to drop nans with a Series. But with DataFrames, there will be a few more things to consider, because you can't drop single values; you can only drop entire columns or rows. Let's start with a sample DataFrame:
df = pd.DataFrame({
'Column A': [1, np.nan, 30, np.nan],
'Column B': [2, 8, 31, np.nan],
'Column C': [np.nan, 9, 32, 100],
'Column D': [5, 8, 34, 110],
})
df
df.isnull()
df.isnull().sum()
The default dropna behavior will drop all the rows in which any null value is present:
df.dropna()
In this case we're dropping rows: rows containing null values are dropped from the DataFrame. You can also use the axis parameter to drop columns containing null values instead:
df.dropna(axis='columns') # axis=1 also works
In this case, any row or column that contains at least one null value will be dropped, which can be too extreme, depending on the case. You can control this behavior with the how parameter, which can be either 'any' or 'all':
df2 = pd.DataFrame({
'Column A': [1, np.nan, 30],
'Column B': [2, np.nan, 31],
'Column C': [np.nan, np.nan, 100]
})
df2
df2.dropna(how='all')
df2.dropna(how='any') # default behavior
You can also use the thresh parameter to indicate a threshold (a minimum number) of non-null values for the row/column to be kept:
df
df.dropna(thresh=3)
df.dropna(thresh=3, axis='columns')
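A related option worth knowing (not shown above) is the subset parameter, which restricts the null check to specific columns:
df.dropna(subset=['Column A']) # drop only the rows where 'Column A' is null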
Filling null values
Sometimes, instead of dropping the null values, we might need to replace them with some other value. This highly depends on your context and the dataset you're currently working with. Sometimes a nan can be replaced with a 0, sometimes it can be replaced with the mean of the sample, and other times you can take the closest value. Again, it depends on the context. We'll show you the different methods and mechanisms, and you can then apply them to your own problem.
s
Filling nulls with an arbitrary value
s.fillna(0) # every null replaced with 0
s.fillna(s.mean()) # every null replaced with the mean of the non-null values (2.5 here)
s
Filling nulls with contiguous (close) values
The method argument is used to fill null values with other values close to that null one: 'ffill' propagates the last valid value forward, and 'bfill' uses the next valid value going backwards. (In recent pandas versions, fillna(method=...) is deprecated in favor of the equivalent ffill() and bfill() methods.)
s.fillna(method='ffill') # forward-fill: each null takes the previous valid value
s.fillna(method='bfill') # backward-fill: each null takes the next valid value
This can still leave null values at the extremes of the Series/DataFrame:
pd.Series([np.nan, 3, np.nan, 9]).fillna(method='ffill')
pd.Series([1, np.nan, 3, np.nan, np.nan]).fillna(method='bfill')
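A common workaround, sketched here, is to chain both directions so that leading and trailing nulls are covered too:
pd.Series([np.nan, 3, np.nan, 9]).ffill().bfill() # forward-fill first, then back-fill the remaining leading nan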
Filling null values on DataFrames
The fillna method also works on DataFrames, and it works similarly. The main differences are that you can specify the axis (as usual, rows or columns) to use to fill the values (especially for method-based filling), and that you have more control over the values passed:
df
df.fillna({'Column A': 0, 'Column B': 99, 'Column C': df['Column C'].mean()})
df.fillna(method='ffill', axis=0) # fill down each column
df.fillna(method='ffill', axis=1) # fill across each row
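Another handy variant: when you pass a Series to fillna, each column is filled with the value matching its name. Passing df.mean(), for instance, fills every numeric column's nulls with that column's own mean:
df.fillna(df.mean()) # per-column mean imputation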
Summary: common patterns to find and count null values
There are a couple of patterns we use to check if there are any null values in our DataFrame/Series (the answer is True/False), and also to check how many null values we have. This is very similar to what we did with numpy.
s
1) Check if there's at least one null value:
s.isna().any() # True (s contains nulls)
2) Check if ALL are null values:
s.isna().all() # False (not every value in s is null)
pd.Series([np.nan, None, np.nan]).isna().all() # True
3) Check how many null values we have:
s.isna().sum() # 2
4) Check how many *NOT* null values we have:
s.notna().sum() # 4
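The same patterns work per column on DataFrames; here's a minimal sketch using the df from above:
df.isna().any() # per-column True/False
df.isna().any().any() # True if there's any null anywhere in the DataFrame
df.isna().sum().sum() # total count of nulls in the whole DataFrame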