Profile picture

Conditional Selection on Pandas DataFrames

Last updated: July 2nd, 20192019-07-02Project preview

rmotr


Conditional selection on Pandas DataFrames

In conditional selection (also known as boolean selection), we will select subsets of data based on the actual values of the data in DataFrame by using a boolean array to filter the data.

purple-divider

Hands on!

In [ ]:
import numpy as np
import pandas as pd

green-divider

The first thing we'll do is create again the DataFrame from our previous lecture:

In [ ]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America']
})

df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']

df.index = ['Canada', 'France', 'Germany', 'Italy',
            'Japan', 'United Kingdom', 'United States']
In [ ]:
df

green-divider

Conditional selection (boolean arrays)

We saw conditional selection applied to Series and it'll work in the same way for DataFrames. After all, a DataFrame is a collection of Series:

In [ ]:
df

The other types of selection we saw before were:

In [ ]:
df.loc[['Canada', 'Japan']]
In [ ]:
df.loc[['Canada', 'Japan'], ['Population', 'GDP']]

As we saw with Series and numpy arrays, boolean selection also works:

In [ ]:
rows_condition = [False, True, True, True, False, False, False]
cols_condition = [False, False, True, True, False]

df.loc[rows_condition, cols_condition]

Now we'll go a step further and use a real condition to generate these list of boolean values:

In [ ]:
condition = df['Population'] > 70

condition
In [ ]:
df.loc[condition]
In [ ]:
df.loc[df['Population'] > 70]

The boolean matching is done at Index level, so you can filter by any row, as long as it contains the right indexes. Column selection still works as expected:

In [ ]:
df.loc[df['Population'] > 70, 'Population']
In [ ]:
df.loc[df['Population'] > 70, ['Population', 'GDP']]

Let's get just the countries with higher population than the mean value:

In [ ]:
df['Population'].mean()
In [ ]:
df.loc[df['Population'] > df['Population'].mean(), :]

Operators

or

In [ ]:
df.loc[(df['GDP'] < 3_000_000) | (df['Continent'] == 'America')]

 and

In [ ]:
df
In [ ]:
df.loc[(df['Population'] > 80) & (df['Continent'] == 'Europe')]
In [ ]:
df.loc[(df['Population'] > 80) & (df['Continent'] == 'Europe'), ['Population', 'GDP']]

 not

In [ ]:
df.loc[~(df['Population'] > 80)]
In [ ]:
df.loc[df['Population'] > 80]
In [ ]:
lower_bound = df['Population'] > df['Population'].mean() - df['Population'].std() / 2
upper_bound = df['Population'] > df['Population'].mean() + df['Population'].std() / 2

df.loc[lower_bound | upper_bound, ['Population', 'GDP']]

Indexing with isin

Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have values you want:

In [ ]:
df
In [ ]:
df['Continent'].isin(['America', 'Asia'])
In [ ]:
condition = df['Continent'].isin(['America', 'Asia'])

df.loc[condition, ['Population', 'Continent']]
In [ ]:
df.index.isin(['Canada', 'Italy'])
In [ ]:
condition = df.index.isin(['Canada', 'Italy'])

df.loc[condition, ['Population', 'Continent']]

green-divider

Modifying DataFrame using conditional selection

In [ ]:
df.loc[df['Population'] < 70, 'GDP'] = 70
df
In [ ]:
#df.loc[df['Population'] < 70, :] = np.nan
df.loc[df['Population'] < 70] = np.nan

df

purple-divider

Notebooks AI
Notebooks AI Profile20060