Profile picture

Pandas DataFrame Query Method

Last updated: November 29th, 20192019-11-29Project preview

rmotr


Pragmatic filtering with DataFrame.query()

We'll explore another way of filtering data using the query method of DataFrames.

IMPORTANT: The query method is only functional on Pandas >= 0.25. Check out your Pandas version:

In [1]:
import pandas as pd
import numpy as np

The query method is available since Pandas 0.25, make sure your version is >= 0.25.0:

In [2]:
pd.__version__
Out[2]:
'0.25.3'

purple-divider

Hands on!

In [3]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America']
})

df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']

df.index = ['Canada', 'France', 'Germany', 'Italy',
            'Japan', 'United Kingdom', 'United States']
In [4]:
df
Out[4]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America

The query() method

DataFrame.query allows us to write queries and filters with a more "human readable" approach than regular boolean arrays. It's still necessary (and important) to understand Boolean Arrays, but query will make some filters easier to perform.

In its simplest form, query receives just one parameter: the expression to query for:

In [5]:
df.query('Population > 70')
Out[5]:
Population GDP Surface Area HDI Continent
Germany 80.940 3874437 357114 0.916 Europe
Japan 127.061 4602367 377930 0.891 Asia
United States 318.523 17348075 9525067 0.915 America
In [6]:
df.loc[df['Population'] > 70]
Out[6]:
Population GDP Surface Area HDI Continent
Germany 80.940 3874437 357114 0.916 Europe
Japan 127.061 4602367 377930 0.891 Asia
United States 318.523 17348075 9525067 0.915 America

We can access column names directly within the expression, and literals (numbers, strings, etc) are still literals; just make sure to surround strings with the correct set of quotes:

In [7]:
df.query('Continent == "Asia"')
Out[7]:
Population GDP Surface Area HDI Continent
Japan 127.061 4602367 377930 0.891 Asia
In [9]:
df.loc[df['Continent'] == 'Asia']
Out[9]:
Population GDP Surface Area HDI Continent
Japan 127.061 4602367 377930 0.891 Asia

Accessing your local environment

The query() method can also access variables defined in the local scope; just prepend them with the @ symbol:

In [10]:
max_population = 70
In [11]:
df.query('Population > @max_population')
Out[11]:
Population GDP Surface Area HDI Continent
Germany 80.940 3874437 357114 0.916 Europe
Japan 127.061 4602367 377930 0.891 Asia
United States 318.523 17348075 9525067 0.915 America
In [12]:
desired_continent = 'Asia'
In [13]:
df.query('Continent == @desired_continent')
Out[13]:
Population GDP Surface Area HDI Continent
Japan 127.061 4602367 377930 0.891 Asia

Columns names that contain white spaces

If a column name contain white spaces, just surround it with backticks:

`Column Name`
In [14]:
df
Out[14]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
In [15]:
df.query('`Surface Area` >= 500_000')
Out[15]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
United States 318.523 17348075 9525067 0.915 America

Boolean operators

If a column name contain white spaces, just surround it with backticks:

`Column Name`
In [ ]:
df.query('`Surface Area` >= 500_000 & Continent == "Europe"')

Parentheses can be optionally applied for readability purposes (or to reduce ambiguity):

In [ ]:
df.query('(`Surface Area` >= 500_000) & (Continent == "Europe")')

And there's full support for and, or and not operators:

In [ ]:
df.query('(`Surface Area` >= 500_000) & not (Continent == "Europe")')

purple-divider

Notebooks AI
Notebooks AI Profile20060