# Missing Values in Python and Numpy

Last updated: May 22nd, 2019

In this section, we will look at what missing values (also referred to as NA or NaN) look like in Python and numpy.

## Hands on!

In [1]:
import numpy as np
import pandas as pd


## Falsy values in Python

As we saw in the previous lecture, what counts as a missing value depends on the origin of the data and the context in which it was generated.

A related concept is the set of values that Python considers "falsy":

In [2]:
falsy_values = (0, 0.0, False, None, '', [], {}, ())


Each of the values above evaluates to False in a boolean context:

In [3]:
[bool(x) for x in falsy_values]

Out[3]:
[False, False, False, False, False, False, False, False]
In [4]:
any(falsy_values)

Out[4]:
False
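
Falsy is not the same as missing, though. As a minimal sketch (the variable names here are illustrative, not part of the lecture), note that values such as 0 or '' are falsy yet are often perfectly valid data, while np.nan, the value numpy uses for missing numbers, is actually truthy:

```python
import numpy as np

# Falsy values that are usually perfectly valid data
valid_but_falsy = [0, 0.0, '', False]
print([bool(x) for x in valid_but_falsy])

# np.nan is a non-zero float, so Python treats it as truthy
nan_is_truthy = bool(np.nan)
print(nan_is_truthy)

# To test for None specifically, use an identity check, not truthiness
values = [0, None, '', 3]
none_mask = [x is None for x in values]
print(none_mask)
```

So a plain truthiness test is not a reliable way to detect missing data.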

## The NaN object in numpy

Numpy has a special floating-point value to represent missing numbers: np.nan, short for "Not a Number".

In [5]:
np.nan

Out[5]:
nan

The np.nan value behaves like a virus: any arithmetic operation that involves it also results in np.nan:

In [6]:
3 + np.nan

Out[6]:
nan
In [7]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [8]:
a.sum()

Out[8]:
nan
In [9]:
a.mean()

Out[9]:
nan

This is better than regular None values, which in the previous examples would have raised an exception:

In [10]:
3 + None

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-8e4e7b6bbb3a> in <module>
----> 1 3 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In a numeric array with a float dtype, None is converted to np.nan:

In [11]:
a = np.array([1, 2, 3, np.nan, None, 4], dtype='float')

a

Out[11]:
array([ 1.,  2.,  3., nan, nan,  4.])
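
The conversion depends on the dtype. As a small sketch (the array names are hypothetical), compare what numpy does with None when no dtype is given and when a float dtype is requested:

```python
import numpy as np

# Without an explicit dtype, None cannot be stored as a number,
# so numpy falls back to a generic object array
obj_arr = np.array([1, 2, None])
print(obj_arr.dtype)      # object

# With a float dtype, None is converted to np.nan
float_arr = np.array([1, 2, None], dtype=float)
print(float_arr.dtype)    # float64
print(np.isnan(float_arr[2]))
```

Object arrays hold plain Python objects, so arithmetic on them hits the same TypeError that `3 + None` raises.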

As we said, np.nan is like a virus. If an array contains any nan value, aggregations such as mean or sum also return nan:

In [12]:
a.mean()

Out[12]:
nan
In [13]:
a.sum()

Out[13]:
nan

## The inf object in numpy

Numpy also has a value to represent infinity, np.inf:

In [14]:
np.inf

Out[14]:
inf

It also behaves like a virus, and some operations on it produce nan:

In [15]:
3 + np.inf

Out[15]:
inf
In [16]:
np.inf / 3

Out[16]:
inf
In [17]:
np.inf / np.inf

Out[17]:
nan
In [18]:
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)

b

Out[18]:
array([ 1.,  2.,  3., inf, nan,  4.])
In [19]:
b.sum()

Out[19]:
nan
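
A few more cases may help build intuition. The sketch below (variable names are illustrative) shows that infinity is signed, and that operations without a well-defined result turn inf into nan:

```python
import numpy as np

# Infinity is signed and lies beyond every finite float
print(-np.inf < -1e308 < 1e308 < np.inf)

# Operations with no well-defined result produce nan
inf_minus_inf = np.inf - np.inf
inf_times_zero = np.inf * 0
print(inf_minus_inf, inf_times_zero)
```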

## Checking for nan or inf

Two functions, np.isnan and np.isinf, perform these checks:

In [20]:
np.isnan(np.nan)

Out[20]:
True
In [21]:
np.isinf(np.inf)

Out[21]:
True
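
A dedicated function is needed because nan never compares equal to anything, not even to itself. A minimal sketch of the pitfall (the names `eq_mask` and `nan_mask` are just for illustration):

```python
import numpy as np

# nan is not equal to anything, including itself
print(np.nan == np.nan)

a = np.array([1, 2, np.nan])
eq_mask = (a == np.nan)     # all False, so it never finds the nan
nan_mask = np.isnan(a)      # the reliable check
print(eq_mask, nan_mask)
```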

Both checks can be combined with np.isfinite, which returns True only for values that are neither nan nor infinite:

In [22]:
np.isfinite(np.nan), np.isfinite(np.inf)

Out[22]:
(False, False)

np.isnan and np.isinf also take arrays as inputs, and return boolean arrays as results:

In [23]:
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))

Out[23]:
array([False, False, False,  True, False, False])
In [24]:
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))

Out[24]:
array([False, False, False, False,  True, False])
In [25]:
np.isfinite(np.array([1, 2, 3, np.nan, np.inf, 4]))

Out[25]:
array([ True,  True,  True, False, False,  True])
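
Because these functions return boolean arrays, the result can be summarized directly. As a short sketch (variable names are illustrative), this checks whether an array has missing values, how many, and where:

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])
mask = np.isnan(a)

has_missing = mask.any()             # is there at least one nan?
n_missing = mask.sum()               # True counts as 1, so this counts the nans
missing_idx = np.flatnonzero(mask)   # positions of the nans

print(has_missing, n_missing, missing_idx)
```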

Note: Infinite values are not that common in practice. From now on, we'll work only with np.nan.

## Filtering them out

Whenever you perform an operation on a Numpy array that might contain missing values, you'll need to filter them out first to avoid nan propagation. We'll combine np.isnan with boolean array indexing for this purpose:

In [26]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [27]:
a[~np.isnan(a)]

Out[27]:
array([1., 2., 3., 4.])

For this array, which contains no infinite values, that is equivalent to:

In [28]:
a[np.isfinite(a)]

Out[28]:
array([1., 2., 3., 4.])

With that result, all the operations can now be performed:

In [29]:
a[np.isfinite(a)].sum()

Out[29]:
10.0
In [30]:
a[np.isfinite(a)].mean()

Out[30]:
2.5
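
As an alternative to filtering by hand, numpy also provides nan-aware aggregations such as np.nansum and np.nanmean. The sketch below uses them on the same array, and then shows, with a hypothetical array `b`, how the two filters differ once an infinite value is present:

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# nan-aware aggregations skip the missing values
total = np.nansum(a)
avg = np.nanmean(a)
print(total, avg)   # 10.0 2.5

# With an infinite value, the two filters no longer agree:
# ~np.isnan keeps inf, np.isfinite drops it
b = np.array([1, 2, np.inf, np.nan])
kept_by_isnan = b[~np.isnan(b)]
kept_by_isfinite = b[np.isfinite(b)]
print(kept_by_isnan, kept_by_isfinite)
```

Which filter to use depends on whether infinite values should count as valid data in your context.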