Profile picture

Pandas DataFrames - Selection and Indexing

Last updated: October 31st, 20192019-10-31Project preview

rmotr


Pandas DataFrames - Selection and Indexing

purple-divider

Hands on!

In [1]:
import numpy as np
import pandas as pd

green-divider

The first thing we'll do is create again the DataFrame from our previous lecture:

In [2]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America']
})

df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']

df.index = ['Canada', 'France', 'Germany', 'Italy',
            'Japan', 'United Kingdom', 'United States']
In [3]:
df
Out[3]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America

green-divider

Direct indexation and selection

Quick overview:

Command Behaviour
df[col_key] Selection of column by name if it's labeled, else selects by position
df[[col_key_1, ..., col_key_N]] Selection of multiple columns by name if it's labeled, else selects by position
df.loc[row_key_1:row_key_2] Selection of multiple rows by name if it's labeled, else selects by position
df.loc[condition] Conditional selection (boolean selection)

 Indexing by column

Individual columns in the DataFrame can be selected with regular indexing. Each column is represented as a Series.

In [4]:
df['Population']
Out[4]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64
In [5]:
df.Population
Out[5]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64

Note that the index of the returned Series is the same as the DataFrame one. And its name is the name of the column. If you're working on a notebook and want to see a more DataFrame-like format you can use the to_frame method:

In [6]:
df['Population'].to_frame()
Out[6]:
Population
Canada 35.467
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 318.523

Multiple columns can also be specified:

In [7]:
df[['Population', 'GDP']]
Out[7]:
Population GDP
Canada 35.467 1785387
France 63.951 2833687
Germany 80.940 3874437
Italy 60.665 2167744
Japan 127.061 4602367
United Kingdom 64.511 2950039
United States 318.523 17348075

green-divider

Indexing with loc method, using labeled index keys

loc method selects matching the given key(s). Quick overview:

Command Behaviour
df.loc[key] Select by row index
df.loc[key_1:key_2] Select by row index
df.loc[[key_1, ..., key_N]] Select by row index
df.loc[sel1, sel2] Select by row index (sel1) y column index (sel2). Selectors: position, slice, sequence, or condition
df.loc[condition] Select by row index

Row level selection works better with loc and iloc which are recommended over regular "direct slicing" (df[:]).

In [8]:
df
Out[8]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
In [9]:
df.loc['France']
Out[9]:
Population       63.951
GDP             2833687
Surface Area     640679
HDI               0.888
Continent        Europe
Name: France, dtype: object
In [10]:
df.loc['France'].to_frame()
Out[10]:
France
Population 63.951
GDP 2833687
Surface Area 640679
HDI 0.888
Continent Europe
In [11]:
df.loc['France'].to_frame().T
Out[11]:
Population GDP Surface Area HDI Continent
France 63.951 2833687 640679 0.888 Europe
In [12]:
df.loc['Germany']
Out[12]:
Population        80.94
GDP             3874437
Surface Area     357114
HDI               0.916
Continent        Europe
Name: Germany, dtype: object
In [13]:
df.loc['France':'Germany']
Out[13]:
Population GDP Surface Area HDI Continent
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
In [14]:
df.loc[['France', 'Italy']]
Out[14]:
Population GDP Surface Area HDI Continent
France 63.951 2833687 640679 0.888 Europe
Italy 60.665 2167744 301336 0.873 Europe

 Indexing by position (row index) and column

As a second "argument", you can pass the column(s) you'd like to select:

In [15]:
df.loc['France': 'Italy', 'Population']
Out[15]:
France     63.951
Germany    80.940
Italy      60.665
Name: Population, dtype: float64
In [16]:
df.loc['France': 'Italy', ['Population', 'GDP']]
Out[16]:
Population GDP
France 63.951 2833687
Germany 80.940 3874437
Italy 60.665 2167744
In [17]:
#df['Population']
df.loc[:, 'Population']
Out[17]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64
In [18]:
#df[['Population', 'GDP']]
df.loc[:, ['Population', 'GDP']]
Out[18]:
Population GDP
Canada 35.467 1785387
France 63.951 2833687
Germany 80.940 3874437
Italy 60.665 2167744
Japan 127.061 4602367
United Kingdom 64.511 2950039
United States 318.523 17348075

green-divider

Indexing by iloc method

iloc method works with the sequential (numeric) "position" of the index. Quick overview:

Command Behaviour
df.iloc[num_val] Select by row position
df.iloc[num_val1:num_val2] Select by row position
df.iloc[sel1, sel2] Select by row position (sel1) y column position (sel2). Selectors: position, slice or sequence
df.iloc[[num_val1,...,num_valn]] Select by row position
In [19]:
df
Out[19]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
In [20]:
# Canada
df.iloc[0]
Out[20]:
Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object
In [21]:
# United States
df.iloc[-1]
Out[21]:
Population       318.523
GDP             17348075
Surface Area     9525067
HDI                0.915
Continent        America
Name: United States, dtype: object
In [22]:
df.iloc[[0, 1, -1]]
Out[22]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
United States 318.523 17348075 9525067 0.915 America
In [23]:
df.iloc[1:3]
Out[23]:
Population GDP Surface Area HDI Continent
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe

 Indexing by position (row index) and column

In [24]:
df.iloc[1:3, 3]
Out[24]:
France     0.888
Germany    0.916
Name: HDI, dtype: float64
In [25]:
df.iloc[1:3, [0, 3]]
Out[25]:
Population HDI
France 63.951 0.888
Germany 80.940 0.916
In [26]:
df.iloc[1:3, 1:3]
Out[26]:
GDP Surface Area
France 2833687 640679
Germany 3874437 357114

RECOMMENDED: Always use loc and iloc to reduce ambiguity, specially with DataFrames with numeric indexes.

green-divider

Intro to Conditional selection (Boolean selection)

The Conditional selection we saw on Series is also applicable on DataFrames.

We can make a selection within our DataFrame using a list of boolean values:

In [27]:
df
Out[27]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America
In [28]:
df.loc[[
    False, # CA
    True,  # Fr
    True,  # GE
    True,  # IT
    False, # JA
    False, # UK
    False  #US
]]
Out[28]:
Population GDP Surface Area HDI Continent
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
In [29]:
rows_condition = [False, True, True, True, False, False, False]
cols_condition = [False, False, True, True, False]

df.loc[rows_condition, cols_condition]
Out[29]:
Surface Area HDI
France 640679 0.888
Germany 357114 0.916
Italy 301336 0.873

On upcoming lectures we'll see how to use more complex conditional selections.

purple-divider

Notebooks AI
Notebooks AI Profile20060