Missing Values in Python and Numpy

Last updated: May 22nd, 2019

rmotr



In this section, we will look at what missing values (also referred to as NA or NaN) look like in Python and NumPy.

Hands on!

In [1]:
import numpy as np
import pandas as pd

Falsy values in Python

As we saw in the previous lecture, what counts as a missing value depends on the origin of the data and the context in which it was generated.

This idea is related to the values that Python considers "falsy":

In [2]:
falsy_values = (0, 0.0, False, None, '', [], {}, ())

For Python, all the values above are considered "falsy":

In [3]:
[bool(x) for x in falsy_values]
Out[3]:
[False, False, False, False, False, False, False, False]
In [4]:
any(falsy_values)
Out[4]:
False
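
Falsy is not the same as missing, though. As a minimal sketch (the `is_missing` helper is hypothetical, not part of Python or NumPy), the snippet below shows that `0` and `''` are falsy even though they may be perfectly valid data, while `np.nan` is actually truthy. An explicit check is therefore usually safer than relying on truthiness:

```python
import math

import numpy as np

values = [0, '', None, np.nan]

# Truthiness alone is misleading: 0 and '' are falsy, np.nan is truthy.
truthiness = [bool(v) for v in values]


def is_missing(value):
    """Hypothetical helper: treat only None and NaN as missing."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


missing = [is_missing(v) for v in values]

print(truthiness)  # [False, False, False, True]
print(missing)     # [False, False, True, True]
```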

The NaN object in NumPy

NumPy has a special value for representing missing numbers: np.nan. NaN stands for "Not a Number".

In [5]:
np.nan
Out[5]:
nan
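
Two properties of np.nan are worth keeping in mind, and the short check below illustrates them. It is an ordinary Python float, and it never compares equal to anything, including itself, so a dedicated function such as np.isnan is needed to detect it:

```python
import numpy as np

nan_type = type(np.nan)             # np.nan is a plain Python float
equal_to_itself = np.nan == np.nan  # equality with nan is always False
detected = bool(np.isnan(np.nan))   # the reliable way to test for nan

print(nan_type, equal_to_itself, detected)
```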

The np.nan value behaves like a virus: any arithmetic operation that involves it results in np.nan:

In [6]:
3 + np.nan
Out[6]:
nan
In [7]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])
In [8]:
a.sum()
Out[8]:
nan
In [9]:
a.mean()
Out[9]:
nan

This is better than regular None values, which in the previous examples would have raised an exception:

In [10]:
3 + None
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-8e4e7b6bbb3a> in <module>
----> 1 3 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

For a numeric array created with a float dtype, a None value is replaced by np.nan:

In [11]:
a = np.array([1, 2, 3, np.nan, None, 4], dtype=float)

a
Out[11]:
array([ 1.,  2.,  3., nan, nan,  4.])
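
To see why the float dtype matters, here is a small sketch (the variable names are illustrative). Without an explicit dtype, a list containing None produces an object array, and arithmetic on it fails the same way plain Python does. With dtype=float, None is stored as nan:

```python
import numpy as np

# No dtype given: None forces a generic object array.
obj_arr = np.array([1, 2, None])
print(obj_arr.dtype)  # object

try:
    obj_arr.sum()
    sum_failed = False
except TypeError:
    # Same kind of error as `3 + None` in plain Python.
    sum_failed = True

# Explicit float dtype: None is converted to nan.
float_arr = np.array([1, 2, None], dtype=float)
print(float_arr.dtype, np.isnan(float_arr[2]))
```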

As we said, np.nan is like a virus. If an array contains any nan value, aggregations such as mean and sum return nan:

In [12]:
a.mean()
Out[12]:
nan
In [13]:
a.sum()
Out[13]:
nan

The inf object in NumPy

NumPy also has a value that represents infinity, np.inf:

In [14]:
np.inf
Out[14]:
inf

It also propagates through arithmetic, and some operations on it, such as np.inf / np.inf, produce nan:

In [15]:
3 + np.inf
Out[15]:
inf
In [16]:
np.inf / 3
Out[16]:
inf
In [17]:
np.inf / np.inf
Out[17]:
nan
In [18]:
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)

b
Out[18]:
array([ 1.,  2.,  3., inf, nan,  4.])
In [19]:
b.sum()
Out[19]:
nan
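
As a quick illustration of where these values typically come from, the sketch below divides by zero inside np.errstate, which silences the runtime warnings NumPy would otherwise emit. Positive and negative numbers divided by zero give inf and -inf, while 0 / 0 gives nan:

```python
import numpy as np

numerators = np.array([1.0, -1.0, 0.0])

# Suppress the divide-by-zero and invalid-value warnings for this block only.
with np.errstate(divide='ignore', invalid='ignore'):
    result = numerators / 0.0

print(result)  # elements are inf, -inf and nan, in that order
```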

Checking for nan or inf

Two functions, np.isnan and np.isinf, perform these checks:

In [20]:
np.isnan(np.nan)
Out[20]:
True
In [21]:
np.isinf(np.inf)
Out[21]:
True

Both checks can be combined with np.isfinite, which returns True only for values that are neither nan nor inf:

In [22]:
np.isfinite(np.nan), np.isfinite(np.inf)
Out[22]:
(False, False)

np.isnan and np.isinf also take arrays as inputs, and return boolean arrays as results:

In [23]:
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))
Out[23]:
array([False, False, False,  True, False, False])
In [24]:
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))
Out[24]:
array([False, False, False, False,  True, False])
In [25]:
np.isfinite(np.array([1, 2, 3, np.nan, np.inf, 4]))
Out[25]:
array([ True,  True,  True, False, False,  True])
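
Because the results are boolean arrays, they combine naturally with .any() and .sum(). A brief sketch (the variable names are illustrative) for checking whether an array has missing values and counting them:

```python
import numpy as np

data = np.array([1, 2, 3, np.nan, np.inf, 4])

has_nan = np.isnan(data).any()    # True if at least one nan is present
nan_count = np.isnan(data).sum()  # True counts as 1, so this counts nans
non_finite_count = (~np.isfinite(data)).sum()  # nan and inf together

print(has_nan, nan_count, non_finite_count)  # True 1 2
```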

Note: infinite values are not that common. From now on, we'll work only with np.nan.

Filtering them out

Whenever you perform an operation on a NumPy array that might contain missing values, you'll need to filter them out first to avoid nan propagation. We'll combine np.isnan with boolean indexing for this purpose:

In [26]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])
In [27]:
a[~np.isnan(a)]
Out[27]:
array([1., 2., 3., 4.])

Since this array has no infinite values, filtering with np.isfinite gives the same result:

In [28]:
a[np.isfinite(a)]
Out[28]:
array([1., 2., 3., 4.])
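
To illustrate how the two masks can differ, here is a short sketch with a hypothetical array `c` that also holds np.inf:

```python
import numpy as np

c = np.array([1, 2, np.nan, np.inf])

without_nan = c[~np.isnan(c)]    # drops nan only; inf is kept
finite_only = c[np.isfinite(c)]  # drops both nan and inf

print(without_nan)
print(finite_only)
```

Which mask to use depends on whether infinite values should count as valid data in your context.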

With the missing values filtered out, the operations can now be performed:

In [29]:
a[np.isfinite(a)].sum()
Out[29]:
10.0
In [30]:
a[np.isfinite(a)].mean()
Out[30]:
2.5
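
As an alternative to filtering by hand, NumPy also provides nan-aware aggregations such as np.nansum and np.nanmean, which skip nan values. A minimal sketch on the same array:

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

total = np.nansum(a)     # 10.0
average = np.nanmean(a)  # 2.5

print(total, average)
```

Note that these functions ignore only nan. Infinite values still take part in the calculation.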
