Profile picture

4.1 - Pandas Advanced Indexing and Pivot Tables

Last updated: November 25th, 20192019-11-25Project preview

rmotr


Advanced Indexing and Pivot tables

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

Examples of MultiIndexes

purple-divider

Hands on!

In [ ]:
import numpy as np
import pandas as pd

green-divider

MultiIndexed Series

We'll start by considering how we might represent two-dimensional data within a one-dimensional pandas Series.

Let's create a Series with Max Temperatures from 2012 to 2017 of different cities:

In [ ]:
temps = pd.Series([
    30.7, 29.6, 29.8, 30.8, 31.3, 29.2,
    30.9, 31.4, 29.2, 29.9, 30.3, 30.7,
    21.3, 21.7, 22.4, 21.2, 20.3, 21.8
], index=[
    ['NYC', 'NYC', 'NYC', 'NYC', 'NYC', 'NYC',
     'TKY', 'TKY', 'TKY', 'TKY', 'TKY', 'TKY',
     'SF', 'SF', 'SF', 'SF', 'SF', 'SF'],
    [2012, 2013, 2014, 2015, 2016, 2017,
     2012, 2013, 2014, 2015, 2016, 2017,
     2012, 2013, 2014, 2015, 2016, 2017]
], name='Max Temperatures')
In [ ]:
temps.to_frame()

green-divider

 Selection and Slicing on MultiIndex Series

In [ ]:
temps['SF'].to_frame()
In [ ]:
temps['SF', 2016]
In [ ]:
temps['SF'].loc[2012:2015].to_frame()

Traditional slicing withing the multi-indexed series is possible, only if the index is sorted:

In [ ]:
temps.sort_index(inplace=True)

temps.to_frame()
In [ ]:
temps.loc['NYC':'SF', 2012:2015].to_frame()

The index of our series is of type MultiIndex:

In [ ]:
temps.index

We can create these indexes with several methods, for example:

In [ ]:
index = pd.MultiIndex.from_product([['NYC', 'TKY', 'SF'], [2012, 2013, 2014, 2015, 2016, 2017]])
In [ ]:
temps2 = pd.Series([
    30.7, 29.6, 29.8, 30.8, 31.3, 29.2,
    30.9, 31.4, 29.2, 29.9, 30.3, 30.7,
    21.3, 21.7, 22.4, 21.2, 20.3, 21.8
], index=index, name='Max Temperatures')
In [ ]:
temps2.to_frame()
In [ ]:
temps2.sort_index(inplace=True)
In [ ]:
temps.equals(temps2)

Each "index" in a MultiIndex is called a "level", and each level can have a name:

In [ ]:
temps.index.names = ['City', 'Year']
In [ ]:
temps.to_frame()

If you pay attention, the same data could have been structured as a DataFrame. It turns out, that both structures are related and you can go from one to the other with the unstack and stack methods:

In [ ]:
temps.unstack()
In [ ]:
temps.unstack(level=0)

The stack() methods constructs the MultiIndexed Series again:

In [ ]:
temps.unstack().stack().to_frame()
In [ ]:
temps.unstack(level=0).stack().to_frame()

green-divider

MultiIndexed DataFrames

It's also perfectly valid to construct a MultiIndexed DataFrame:

In [ ]:
s = """
-3.9
-3.6
-4.5
-5
-5.8
-4.2
0.7
1.1
0.2
-0.5
-1.2
0.8
5.8
6.6
4.9
6.4
5.5
7.2"""
In [ ]:
[float(v) for v in s.split('\n') if v]

Let's begin creating a Dataframe with City name as index:

In [ ]:
df = pd.DataFrame({
    'Max Temperatures': [
        30.7, 29.6, 29.8, 30.8, 31.3, 29.2,
        30.9, 31.4, 29.2, 29.9, 30.3, 30.7,
        21.3, 21.7, 22.4, 21.2, 20.3, 21.8
    ],
    'Min Temperatures': [
        -3.9, -3.6, -4.5, -5.0, -5.8, -4.2,
        0.7, 1.1, 0.2, -0.5, -1.2, 0.8,
        5.8, 6.6, 4.9, 6.4, 5.5, 7.2
    ],
    'Year': [
        2012, 2013, 2014, 2015, 2016, 2017,
        2012, 2013, 2014, 2015, 2016, 2017,
        2012, 2013, 2014, 2015, 2016, 2017
    ]
}, index=[
    'NYC', 'NYC', 'NYC', 'NYC', 'NYC', 'NYC',
    'TKY', 'TKY', 'TKY', 'TKY', 'TKY', 'TKY',
    'SF', 'SF', 'SF', 'SF', 'SF', 'SF'
])
In [ ]:
df.head()

Remove index

The first thing we'll do it remove the City index, and keep our DataFrame without indexes:

In [ ]:
df.reset_index(inplace=True)
In [ ]:
df.columns = ['City', 'Max Temperatures', 'Min Temperatures', 'Year']
In [ ]:
df.head()

green-divider

 Create a MultiIndex

Use City and Year to make a compound index.

With that hierarchical index we'll be able to make a total (with tuples) or partial (with one index element) indexation of our data.

In [ ]:
df.set_index(['City', 'Year'], inplace=True)
In [ ]:
df.head(10)

green-divider

 Selection and Slicing on MultiIndex Dataframes

In [ ]:
df.loc['NYC']
In [ ]:
df.loc[('NYC', 2012)].to_frame()
In [ ]:
df.loc[('NYC', 'Max Temperatures')].to_frame()
In [ ]:
df.loc[('NYC', (2012, 2015)), 'Max Temperatures'].to_frame()

More complex slicing needs to be performed with the IndexSlice constructor from pandas or the slice builtin function.

In [ ]:
df.sort_index(inplace=True)
In [ ]:
df.loc[('NYC', slice(2012, 2015)), 'Max Temperatures'].to_frame()
In [ ]:
df.loc[pd.IndexSlice['NYC', 2012: 2015], 'Max Temperatures'].to_frame()
In [ ]:
df.loc[(slice('NYC', 'SF'), slice(2012, 2015)), 'Max Temperatures'].to_frame()
In [ ]:
df.loc[pd.IndexSlice['NYC': 'SF', 2012: 2015], 'Max Temperatures'].to_frame()

Also, we can move last row index to column index using unstack() function.

In [ ]:
df = df.unstack()

df

On the other side, stack() function will move last column index to row index:

In [ ]:
df = df.stack()

df

green-divider

MultiColumns DataFrames

DataFrames can also have multiple columns:

In [ ]:
multicolumn = pd.DataFrame(
    np.random.randn(4, 4),
    index=[['NYC', 'NYC', 'SF', 'SF'], [2016, 2017, 2016, 2017]],
    columns=[
        ['Temperatures', 'Temperatures', 'Humidity', 'Humidity'],
        ['Max Temperature', 'Min Temperature', 'Max Humidity', 'Min Humidity']
    ]
)

multicolumn
In [ ]:
multicolumn.loc['NYC']
In [ ]:
multicolumn.loc['NYC', 2016].to_frame()
In [ ]:
multicolumn.loc[('NYC', 2016), ('Temperatures',)].to_frame()
In [ ]:
multicolumn.loc[('NYC', 2016), ('Temperatures', 'Max Temperature')]

It's not recommended to have so many nested Indexes and Columns.

green-divider

Data Aggregations on Multi-Indexes

We can use common aggregation functions such as sum(), mean() and max() on hierarchically indexed data:

In [ ]:
df
In [ ]:
df.index
In [ ]:
df.mean(level='City')

Another way to use aggregation functions is by using Pivot tables.

green-divider

Pivot tables

A pivot table is a table of statistics that summarizes the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.

In [ ]:
df_no_index = df.reset_index()

df_no_index.head()

Create a pivot table with just Continent, Country and Population from our data:

In [ ]:
pivot = df_no_index.pivot('City', 'Year', 'Max Temperatures')

pivot

green-divider

Creating a pivot table using indexes

We can achieve the same table using indexes and unstacking to move the last row index to column index:

In [ ]:
df_with_index = df_no_index.set_index(['City', 'Year'])

df_with_index
In [ ]:
pivot = df_with_index.unstack()

pivot

Finally, we keep only Max Temperatures column:

In [ ]:
pivot['Max Temperatures']

green-divider

Pivot table with aggregated values

Finally, another pivot tables use case is using an aggregation function to the values, so we can get aggregated results.

In this case we'll aggregate Max Temperatures and Min Temperatures to get the mean values per city:

In [ ]:
df_no_index.head()
In [ ]:
results = pd.pivot_table(df_no_index,
                         values=['Max Temperatures', 'Min Temperatures'],
                         index=['City'],
                         aggfunc=np.mean)

results.columns = ['Mean Max Temperature', 'Mean Min Temperature']

results

purple-divider

Notebooks AI
Notebooks AI Profile20060