
Conditional Selection on Pandas Series

Last updated: June 3rd, 2019

rmotr



In conditional selection (also known as boolean selection), we select a subset of a Series based on the values it holds, using a boolean vector to filter the data.


Hands on!

In [ ]:
import pandas as pd
import numpy as np


The first thing we'll do is recreate the Series from our previous lecture:

In [ ]:
data_dic = {
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}

g7_pop = pd.Series(data_dic,
                   name='G7 Population in millions')
In [ ]:
g7_pop

Summary of selection (from previous lesson):

In [ ]:
g7_pop['France']
In [ ]:
g7_pop.loc['France']
In [ ]:
g7_pop.iloc[0]


Conditional selection (boolean arrays)

The same boolean array techniques we saw applied to NumPy arrays can be used with Pandas Series.

In the previous lecture we saw that we can index our Series using a list of boolean values:

In [ ]:
g7_pop[[False, True, True, True, False, False, False]]

The same selection, with each position labeled:

In [ ]:
g7_pop[[
    False,  # Canada
    True,   # France
    True,   # Germany
    True,   # Italy
    False,  # Japan
    False,  # United Kingdom
    False   # United States
]]

Now we'll go a step further and use a real condition to generate this list of boolean values:

In [ ]:
condition = g7_pop > 70

condition
In [ ]:
g7_pop[condition]
In [ ]:
g7_pop.loc[g7_pop > 70]
In [ ]:
g7_pop.mean()
In [ ]:
g7_pop[g7_pop > g7_pop.mean()]
In [ ]:
g7_pop.loc[g7_pop > g7_pop.mean()]
In [ ]:
g7_pop.loc[g7_pop > g7_pop.mean()].size
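To make explicit what the condition object is, here is a minimal sketch that rebuilds the same Series and inspects the mask. The names `mask`, `n_matches` and `selected` are illustrative and not part of the lesson.

```python
import pandas as pd

g7_pop = pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523,
}, name='G7 Population in millions')

# Comparing a Series with a scalar yields a boolean Series with the same index
mask = g7_pop > 70

# True counts as 1, so summing the mask counts the matching elements
n_matches = int(mask.sum())

# Indexing with the mask keeps only the labels where the mask is True
selected = g7_pop[mask]
```

Counting with `mask.sum()` gives the same number as taking `.size` of the filtered Series, without building the filtered Series first.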

Operators

or

In [ ]:
g7_pop[(g7_pop > 70) | (g7_pop < 40)]

and

In [ ]:
g7_pop[(g7_pop > 80) & (g7_pop < 200)]

not

In [ ]:
g7_pop.loc[~(g7_pop > 80)]
In [ ]:
g7_pop.loc[g7_pop > 80]
In [ ]:
g7_pop[g7_pop > g7_pop.mean()]
In [ ]:
g7_pop.std()
In [ ]:
# Values more than half a standard deviation away from the mean, in either direction
g7_pop[(g7_pop < g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)]
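A short sketch of why the conditions above are wrapped in parentheses and combined with `|`, `&` and `~` rather than Python's `or`, `and` and `not`. The variable names (`either`, `both`, `negated`, `keyword_and_works`) are illustrative only.

```python
import pandas as pd

g7_pop = pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523,
}, name='G7 Population in millions')

# Each comparison is parenthesized because | and & bind more tightly
# than > and <, so the comparisons must be evaluated first
either = g7_pop[(g7_pop > 70) | (g7_pop < 40)]
both = g7_pop[(g7_pop > 80) & (g7_pop < 200)]
negated = g7_pop[~(g7_pop > 80)]

# The keywords `and` / `or` need a single True/False value; a Series
# has many, so pandas raises ValueError instead of guessing
try:
    (g7_pop > 80) and (g7_pop < 200)
    keyword_and_works = True
except ValueError:
    keyword_and_works = False
```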

Indexing with isin

Consider the isin() method of Series, which returns a boolean vector that is True wherever a Series element matches one of the values in the passed list. This lets you select the elements whose values (or, via the index, whose labels) are in a set you care about:

In [ ]:
g7_pop
In [ ]:
g7_pop[g7_pop.isin([80, 80.940, 60.451, 35.467])]
In [ ]:
g7_pop[g7_pop.index.isin(['Canada', 'Italy'])]
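As a hedged illustration of how `isin` matches, the sketch below looks at the mask itself and also inverts it with `~`. The names `value_mask`, `matched` and `others` are made up for this example.

```python
import pandas as pd

g7_pop = pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523,
}, name='G7 Population in millions')

# isin matches exact values only: 80 and 60.451 appear nowhere in the
# Series, so they simply contribute no matches
value_mask = g7_pop.isin([80, 80.940, 60.451, 35.467])
matched = g7_pop[value_mask]

# ~ inverts the mask: every country except Canada and Italy
others = g7_pop[~g7_pop.index.isin(['Canada', 'Italy'])]
```

Values in the list that are not present are ignored rather than raising an error, so `isin` is convenient when the list of candidates comes from elsewhere.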


Modifying series using conditional selection

In [ ]:
g7_pop[g7_pop < 70] = 99.99

g7_pop

We can also combine the +=, -= and *= operators with conditional selection when modifying values.

Let's add 5 million to the countries with a population above 100 million:

In [ ]:
g7_pop[g7_pop > 100] += 5

g7_pop
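Because these assignments change the Series in place, one option is to work on a copy. The sketch below assumes that is what you want; `adjusted` is a hypothetical name, and the use of `.loc` here is a stylistic choice rather than something the lesson prescribes.

```python
import pandas as pd

g7_pop = pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523,
}, name='G7 Population in millions')

# Work on a copy so the original Series stays untouched
adjusted = g7_pop.copy()

# .loc with a boolean mask assigns only to the matching labels
adjusted.loc[adjusted < 70] = 99.99

# Augmented assignment applies to the masked subset only; the mask is
# computed after the previous step, so the 99.99 values are not included
adjusted.loc[adjusted > 100] += 5
```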

