Profile picture

Conditional Selection and Filtering on Pandas Series

Last updated: October 31st, 20192019-10-31Project preview

rmotr


Conditional Selection & Filtering on Pandas Series

In conditional selection (also known as boolean selection), we will select subsets of data based on the actual values of the data in the Series by using a boolean vector to filter the data.

purple-divider

Hands on!

In [1]:
import pandas as pd
import numpy as np

green-divider

The first thing we'll do is create again the Series from our previous lecture:

In [2]:
data_dic = {
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}

g7_pop = pd.Series(data_dic,
                   name='G7 Population in millions')
In [3]:
g7_pop
Out[3]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

Summary of selection (from previous lesson):

In [4]:
g7_pop['France']
Out[4]:
63.951
In [5]:
g7_pop.loc['France']
Out[5]:
63.951
In [6]:
g7_pop.iloc[0]
Out[6]:
35.467

green-divider

Conditional selection ( boolean arrays)

The same boolean array techniques we saw applied to numpy arrays can be used for Pandas Series.

On previous lecture we saw that we can index our Series using a list of boolean values:

In [7]:
g7_pop[[False, True,  True, True,  False, False,  False]]
Out[7]:
France     63.951
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64

More documented:

In [8]:
g7_pop[[
    False, # CA
    True,  # Fr
    True,  # GE
    True,  # IT
    False, # JA
    False, # UK
    False  #US
]]
Out[8]:
France     63.951
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64

Now we'll go a step further and use a real condition to generate these list of boolean values:

In [9]:
condition = g7_pop > 70

condition
Out[9]:
Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool
In [10]:
g7_pop[condition]
Out[10]:
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64
In [11]:
g7_pop.loc[g7_pop > 70]
Out[11]:
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64
In [12]:
g7_pop.mean()
Out[12]:
107.30257142857144
In [13]:
g7_pop[g7_pop > g7_pop.mean()]
Out[13]:
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64
In [14]:
g7_pop.loc[g7_pop > g7_pop.mean()]
Out[14]:
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64
In [15]:
g7_pop.loc[g7_pop > g7_pop.mean()].size
Out[15]:
2

 Operators

 or

In [16]:
g7_pop[(g7_pop > 70) | (g7_pop < 40)]
Out[16]:
Canada            35.467
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

and

In [17]:
g7_pop[(g7_pop > 80) & (g7_pop < 200)]
Out[17]:
Germany     80.940
Japan      127.061
Name: G7 Population in millions, dtype: float64

not

In [18]:
g7_pop.loc[~(g7_pop > 80)]
Out[18]:
Canada            35.467
France            63.951
Italy             60.665
United Kingdom    64.511
Name: G7 Population in millions, dtype: float64
In [19]:
g7_pop.loc[g7_pop > 80]
Out[19]:
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64
In [20]:
g7_pop[g7_pop > g7_pop.mean()]
Out[20]:
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64
In [21]:
g7_pop.std()
Out[21]:
97.24996987121581
In [22]:
g7_pop[(g7_pop > g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)]
Out[22]:
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

Indexing with isin

Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have values you want:

In [23]:
g7_pop
Out[23]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
In [24]:
g7_pop[g7_pop.isin([80, 80.940, 60.451, 35.467])]
Out[24]:
Canada     35.467
Germany    80.940
Name: G7 Population in millions, dtype: float64
In [25]:
g7_pop[g7_pop.index.isin(['Canada', 'Italy'])]
Out[25]:
Canada    35.467
Italy     60.665
Name: G7 Population in millions, dtype: float64

green-divider

Modifying series using conditional selection

In [26]:
g7_pop[g7_pop < 70] = 99.99

g7_pop
Out[26]:
Canada             99.990
France             99.990
Germany            80.940
Italy              99.990
Japan             127.061
United Kingdom     99.990
United States     318.523
Name: G7 Population in millions, dtype: float64

Also we can combine +=, -=, *= operations while modifying values.

Lets remove 5 million from countries with population >100M:

In [27]:
g7_pop[g7_pop > 100] += 5

g7_pop
Out[27]:
Canada             99.990
France             99.990
Germany            80.940
Italy              99.990
Japan             132.061
United Kingdom     99.990
United States     323.523
Name: G7 Population in millions, dtype: float64

purple-divider

Notebooks AI
Notebooks AI Profile20060