Profile picture

Conditional Selection on Pandas DataFames

Last updated: May 25th, 20192019-05-25Project preview

rmotr


Conditional selection on Pandas DataFrames

In conditional selection (also known as boolean selection), we will select subsets of data based on the actual values of the data in the Series by using a boolean vector to filter the data.

purple-divider

Hands on!

In [1]:
import numpy as np
import pandas as pd

green-divider

The first thing we'll do is create again the DataFrame from our previous lecture:

In [2]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America']
})

df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']

df.index = ['Canada', 'France', 'Germany', 'Italy',
            'Japan', 'United Kingdom', 'United States']
In [3]:
df
Out[3]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America

green-divider

Conditional selection (boolean arrays)

We saw conditional selection applied to Series and it'll work in the same way for DataFrames. After all, a DataFrame is a collection of Series:

In [4]:
df
Out[4]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America

The other types of selection we saw before were:

In [5]:
df.loc[['Canada', 'Japan']]
Out[5]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
Japan 127.061 4602367 377930 0.891 Asia
In [6]:
df.loc[['Canada', 'Japan'], ['Population', 'GDP']]
Out[6]:
Population GDP
Canada 35.467 1785387
Japan 127.061 4602367

As we saw with Series and numpy arrays, boolean selection also works:

In [7]:
df
Out[7]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
In [8]:
rows_condition = [False, True, True, True, False, False, False]
cols_condition = [False, False, True, True, False]

df.loc[rows_condition, cols_condition]
Out[8]:
Surface Area HDI
France 640679 0.888
Germany 357114 0.916
Italy 301336 0.873
In [9]:
df.loc[rows_condition]
Out[9]:
Population GDP Surface Area HDI Continent
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe

Now we'll go a step further and use a real condition to generate these list of boolean values:

In [10]:
df['Population'] * 1_000_000
Out[10]:
Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: Population, dtype: float64
In [11]:
condition = df['Population'] > 70

condition
Out[11]:
Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: Population, dtype: bool
In [12]:
df.loc[condition]
Out[12]:
Population GDP Surface Area HDI Continent
Germany 80.940 3874437 357114 0.916 Europe
Japan 127.061 4602367 377930 0.891 Asia
United States 318.523 17348075 9525067 0.915 America
In [13]:
df.loc[df['Population'] > 70]
Out[13]:
Population GDP Surface Area HDI Continent
Germany 80.940 3874437 357114 0.916 Europe
Japan 127.061 4602367 377930 0.891 Asia
United States 318.523 17348075 9525067 0.915 America

The boolean matching is done at Index level, so you can filter by any row, as long as it contains the right indexes. Column selection still works as expected:

In [14]:
df.loc[df['Population'] > 70, 'Population']
Out[14]:
Germany           80.940
Japan            127.061
United States    318.523
Name: Population, dtype: float64
In [15]:
df.loc[df['Population'] > 70, ['Population', 'GDP']]
Out[15]:
Population GDP
Germany 80.940 3874437
Japan 127.061 4602367
United States 318.523 17348075

Let's get just the countries with higher population than the mean value:

In [16]:
df['Population'].mean()
Out[16]:
107.30257142857144
In [17]:
df.loc[df['Population'] > df['Population'].mean(), :]
Out[17]:
Population GDP Surface Area HDI Continent
Japan 127.061 4602367 377930 0.891 Asia
United States 318.523 17348075 9525067 0.915 America

Operators

or

In [18]:
df.loc[(df['GDP'] < 3_000_000) | (df['Continent'] == 'America')]
Out[18]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Italy 60.665 2167744 301336 0.873 Europe
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America

 and

In [19]:
df
Out[19]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
In [20]:
df.loc[(df['Population'] > 80) & (df['Continent'] == 'Europe')]
Out[20]:
Population GDP Surface Area HDI Continent
Germany 80.94 3874437 357114 0.916 Europe
In [21]:
df.loc[(df['Population'] > 80) & (df['Continent'] == 'Europe'), ['Population', 'GDP']]
Out[21]:
Population GDP
Germany 80.94 3874437

 not

In [22]:
df.loc[~(df['Population'] > 80)]
Out[22]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Italy 60.665 2167744 301336 0.873 Europe
United Kingdom 64.511 2950039 242495 0.907 Europe
In [23]:
df.loc[df['Population'] <= 80]
Out[23]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Italy 60.665 2167744 301336 0.873 Europe
United Kingdom 64.511 2950039 242495 0.907 Europe
In [24]:
df.loc[df['Population'] > 80]
Out[24]:
Population GDP Surface Area HDI Continent
Germany 80.940 3874437 357114 0.916 Europe
Japan 127.061 4602367 377930 0.891 Asia
United States 318.523 17348075 9525067 0.915 America
In [25]:
lower_bound = df['Population'] > df['Population'].mean() - df['Population'].std() / 2
upper_bound = df['Population'] > df['Population'].mean() + df['Population'].std() / 2

df.loc[lower_bound | upper_bound, ['Population', 'GDP']]
Out[25]:
Population GDP
France 63.951 2833687
Germany 80.940 3874437
Italy 60.665 2167744
Japan 127.061 4602367
United Kingdom 64.511 2950039
United States 318.523 17348075

Indexing with isin

Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have values you want:

In [26]:
df
Out[26]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
In [27]:
df['Continent'].isin(['America', 'Asia'])
Out[27]:
Canada             True
France            False
Germany           False
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: Continent, dtype: bool
In [28]:
(df['Continent'] == 'America') | (df['Continent'] == 'Asia')
Out[28]:
Canada             True
France            False
Germany           False
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: Continent, dtype: bool
In [29]:
condition = df['Continent'].isin(['America', 'Asia'])

df.loc[condition, ['Population', 'Continent']]
Out[29]:
Population Continent
Canada 35.467 America
Japan 127.061 Asia
United States 318.523 America
In [30]:
df.index.isin(['Canada', 'Italy'])
Out[30]:
array([ True, False, False,  True, False, False, False])
In [31]:
condition = df.index.isin(['Canada', 'Italy'])

df.loc[condition, ['Population', 'Continent']]
Out[31]:
Population Continent
Canada 35.467 America
Italy 60.665 Europe

green-divider

Modifying DataFrame using conditional selection

In [32]:
df
Out[32]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
In [33]:
df.loc[df['Population'] < 70]
Out[33]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Italy 60.665 2167744 301336 0.873 Europe
United Kingdom 64.511 2950039 242495 0.907 Europe
In [34]:
df.loc[df['Population'] < 70, 'GDP'] = 70
df
Out[34]:
Population GDP Surface Area HDI Continent
Canada 35.467 70 9984670 0.913 America
France 63.951 70 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 70 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 70 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
In [35]:
#df.loc[df['Population'] < 70, :] = np.nan
df.loc[df['Population'] < 70] = np.nan

df
Out[35]:
Population GDP Surface Area HDI Continent
Canada NaN NaN NaN NaN NaN
France NaN NaN NaN NaN NaN
Germany 80.940 3874437.0 357114.0 0.916 Europe
Italy NaN NaN NaN NaN NaN
Japan 127.061 4602367.0 377930.0 0.891 Asia
United Kingdom NaN NaN NaN NaN NaN
United States 318.523 17348075.0 9525067.0 0.915 America

purple-divider

Notebooks AI
Notebooks AI Profile20060