 Missing Values in Python and Numpy

Last updated: May 22nd, 2019

In this section, we will look at how missing values (also referred to as NA or NaN) are represented in Python and numpy. Hands on!

In :
import numpy as np
import pandas as pd

Falsy values in Python

As we saw in the previous lecture, what counts as a missing value depends on the origin of the data and the context in which it was generated.

These concepts are related to the values that Python will consider "Falsy":

In :
falsy_values = (0, 0.0, False, None, '', [], {}, ())


For Python, all the values above are considered "falsy":

In :
[bool(x) for x in falsy_values]

Out:
[False, False, False, False, False, False, False, False]
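
Keep in mind that "falsy" and "missing" are related but not the same thing. As a minimal sketch (the `values` and `truthiness` names are only illustrative), `np.nan` is truthy even though it marks a missing number, while `0` is falsy even though it is usually valid data:

```python
import numpy as np

# 0 is falsy but is normally real data; np.nan is truthy but marks missing data.
values = {"zero": 0, "nan": np.nan, "none": None}

# Evaluate the truthiness of each value.
truthiness = {name: bool(value) for name, value in values.items()}
print(truthiness)
```

So a plain `if value:` check is not a reliable way to detect missing numeric data.
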
In :
any(falsy_values)

Out:
False

The NaN object in numpy

Numpy has a special "nullable" value for numbers: np.nan, which stands for "Not a Number". It is a floating-point value.

In :
np.nan

Out:
nan

The np.nan value is kind of a virus. Everything that it touches becomes np.nan:

In :
3 + np.nan

Out:
nan
In :
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In :
a.sum()

Out:
nan
In :
a.mean()

Out:
nan

This is better than regular None values, which in the previous examples would have raised an exception:

In :
3 + None

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-8e4e7b6bbb3a> in <module>
----> 1 3 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

For a numeric array (one created with a float dtype), the None value is replaced by np.nan:

In :
a = np.array([1, 2, 3, np.nan, None, 4], dtype=float)

a

Out:
array([ 1.,  2.,  3., nan, nan,  4.])
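
The conversion of None depends on the array's dtype. The sketch below (with illustrative names `as_object` and `as_float`) compares what happens with and without an explicit float dtype:

```python
import numpy as np

# Without an explicit dtype, None forces a generic object array.
as_object = np.array([1, 2, None])

# With dtype=float, None is converted to nan.
as_float = np.array([1, 2, None], dtype=float)

print(as_object.dtype, as_float.dtype)
print(as_float)
```

An object array keeps the None as-is, so numeric operations on it can still raise a TypeError.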

As we said, np.nan is like a virus. If an array contains any nan value, aggregations over it return nan as well:

In :
a.mean()

Out:
nan
In :
a.sum()

Out:
nan

The inf object in numpy

Numpy also supports an "infinite" value, np.inf:

In :
np.inf

Out:
inf

It also behaves like a virus:

In :
3 + np.inf

Out:
inf
In :
np.inf / 3

Out:
inf
In :
np.inf / np.inf

Out:
nan
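
A few more cases may help to see when infinity stays infinite and when it collapses into nan. This is a small sketch using plain float arithmetic; the `results` dictionary is just an illustrative container:

```python
import math

import numpy as np

results = {
    "inf + 1": np.inf + 1,        # still inf
    "negative": -np.inf,          # negative infinity also exists
    "inf - inf": np.inf - np.inf, # undefined, so nan
    "inf * 0": np.inf * 0,        # undefined, so nan
}

for expression, value in results.items():
    print(expression, "->", value, "| is nan:", math.isnan(value))
```
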
In :
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)

b

Out:
array([ 1.,  2.,  3., inf, nan,  4.])
In :
b.sum()

Out:
nan

Checking for nan or inf

There are two functions, np.isnan and np.isinf, that perform these checks:

In :
np.isnan(np.nan)

Out:
True
In :
np.isinf(np.inf)

Out:
True

And the joint operation can be performed with np.isfinite, which returns True only for values that are neither nan nor infinite.

In :
np.isfinite(np.nan), np.isfinite(np.inf)

Out:
(False, False)
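
A dedicated function is needed because nan never compares equal to anything, including itself. A short sketch of the pitfall (the `value` name is illustrative):

```python
import numpy as np

value = np.nan

# An equality test cannot detect nan: the comparison is always False.
equal_to_nan = value == np.nan

# np.isnan is the reliable check.
detected = np.isnan(value)

print(equal_to_nan, detected)
```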

np.isnan and np.isinf also take arrays as inputs, and return boolean arrays as results:

In :
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))

Out:
array([False, False, False,  True, False, False])
In :
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))

Out:
array([False, False, False, False,  True, False])
In :
np.isfinite(np.array([1, 2, 3, np.nan, np.inf, 4]))

Out:
array([ True,  True,  True, False, False,  True])
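
Because these functions return boolean arrays, they can also be used to count problematic values. The following is a sketch with illustrative variable names, relying on the fact that True counts as 1 when summed:

```python
import numpy as np

data = np.array([1, 2, 3, np.nan, np.inf, 4])

# Summing a boolean mask counts its True entries.
n_nan = int(np.isnan(data).sum())
n_inf = int(np.isinf(data).sum())
n_finite = int(np.isfinite(data).sum())

# .any() answers "is there at least one nan?"
has_missing = bool(np.isnan(data).any())

print(n_nan, n_inf, n_finite, has_missing)
```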

Note: It's not so common to find infinite values. From now on, we'll keep working with only np.nan.

Filtering them out

Whenever you're trying to perform an operation with a Numpy array and you know there might be missing values, you'll need to filter them out before proceeding, to avoid nan propagation. We'll use a combination of the previous np.isnan + boolean arrays for this purpose:

In :
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In :
a[~np.isnan(a)]

Out:
array([1., 2., 3., 4.])

Since a contains no infinite values, this is equivalent to:

In :
a[np.isfinite(a)]

Out:
array([1., 2., 3., 4.])
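
The two filters only agree when the array has no infinite values. As a sketch with a hypothetical array `c` that contains both inf and nan:

```python
import numpy as np

c = np.array([1, 2, np.inf, np.nan, 4])

# ~np.isnan removes nan only, so inf is kept.
without_nan = c[~np.isnan(c)]

# np.isfinite removes both nan and inf.
only_finite = c[np.isfinite(c)]

print(without_nan)
print(only_finite)
```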

And with that result, all the operations can now be performed:

In :
a[np.isfinite(a)].sum()

Out:
10.0
In :
a[np.isfinite(a)].mean()

Out:
2.5 
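
As an alternative to filtering by hand, numpy provides nan-aware aggregations such as np.nansum and np.nanmean, which skip nan values. A minimal sketch on the same array (the `total` and `average` names are illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# nan entries are ignored instead of propagating.
total = np.nansum(a)
average = np.nanmean(a)

print(total, average)
```

Note that these functions ignore nan only; an inf value would still propagate into the result.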