Profile picture

4.4 - Grouping and Aggregation

Last updated: April 3rd, 20192019-04-03Project preview

rmotr


Grouping and Aggregation

An essential piece of analysis of large data is efficient summarizaion: computing aggregations like sum(), mean(), min(), max(), in which a single number gives insight into the nature of a potentially large dataset.

purple-divider

Hands on!

In [ ]:
import pandas as pd
import numpy as np

green-divider

NBA players data

We'll use a small NBA dataset which gives information about its players.

In [ ]:
players = pd.DataFrame({
    'salary': [
        33285709, 31269231, 34682550, 25000000, 17826150,
        29512900, 28530608, 26243760, 18868625, 2500000
    ],
    'season_start': [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017],
    'season_end': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018],
    'team': ['CLE', 'DEN', 'GSW', 'GSW', 'GSW', 'LAC', 'OKC', 'OKC', 'SAS', 'SAS'],
    'Pos': ['SF', 'PF', 'PG', 'PF', 'SG', 'PF', 'PG', 'SF', 'SF', 'SG'],
    'Age': [32.0, 31.0, 28.0, 28.0, 26.0, 27.0, 28.0, 32.0, 25.0, 39.0]
}, index=[
    'LeBron James', 'Paul Millsap', 'Stephen Curry', 'Kevin Durant',
    'Klay Thompson', 'Blake Griffin', 'Russell Westbrook',
    'Carmelo Anthony', 'Kawhi Leonard', 'Manu Ginobili'
])
In [ ]:
players.sort_index(inplace=True)
In [ ]:
players
In [ ]:
players.info()

green-divider

Understanding Grouping

We can manually simulate the Split > Apply > Combine operation.

1. Split (split players by positions)

This step involves breaking up and grouping a DataFrame depending on the value of the specified key.

In [ ]:
players['Pos'].value_counts()

Position = 'PF'

In [ ]:
players[players['Pos'] == 'PF']

Position = 'SF'

In [ ]:
players[players['Pos'] == 'SF']

Position = 'PG'

In [ ]:
players[players['Pos'] == 'PG']

2. Apply (apply a function to each group)

This step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.

For example, get the max salary per group (per position):

In [ ]:
top_PF_salary = players[players['Pos'] == 'PF'].sort_values(by='salary', ascending=False)

top_PF_salary = top_PF_salary.head(1)[['Pos', 'salary']]

top_PF_salary
In [ ]:
top_SF_salary = players[players['Pos'] == 'SF'].sort_values(by='salary', ascending=False)

top_SF_salary = top_SF_salary.head(1)[['Pos', 'salary']]

top_SF_salary
In [ ]:
top_PG_salary = players[players['Pos'] == 'PG'].sort_values(by='salary', ascending=False)

top_PG_salary = top_PG_salary.head(1)[['Pos', 'salary']]

top_PG_salary

3. Combine

This step merges the results of these operations into an output array.

In [ ]:
pd.concat([top_PF_salary, top_SF_salary, top_PG_salary])

green-divider

The GroupBy object

We can compute the most basic split-apply-combine operation with the groupby() method of DataFrames, passing the name of the desired key column:

In [ ]:
players['salary'].groupby(players['Pos'])

green-divider

Aggregation

The GroupBy object does no actual computation until the aggregation is applied to it. To produce a result, we can apply an aggregate to that object which will perform the appropiate apply/combine steps to produce the desired result.

This is usually referred to as an aggregation, you have a set of values (for example, all the salaries of the players in position PG) and you want to "aggregate" them or "reduce" them to a single value that has meaning for you. In this case, the aggregation is the max function. For each group of players (split by position), we want to get the max salary:

In [ ]:
players['salary'].groupby(players['Pos']).max().to_frame()

The max function in this case, is the applied function, we could apply any other different function, for example, mean to get the average salary per position:

In [ ]:
players['salary'].groupby(players['Pos']).mean().to_frame()
In [ ]:
players['salary'].groupby(players['Pos']).describe()

We could verify it works manually:

In [ ]:
players[players['Pos'] == 'PG'].sort_values(by='salary', ascending=False)
In [ ]:
players.loc[players['Pos'] == 'PG', 'salary'].mean()

The result of the groupby operation is a Series, indexed by the group and containing the aggregation result as the values. What happens if you want to group by multiple parameters?

In [ ]:
players['salary'].groupby([players['team'], players['Pos']]).max().to_frame()

The result is also a Series, but with a hierarchical index.

In the case of grouping by multiple features, order matters. Changing the order of the keys to use to group will alter the resulting Series:

In [ ]:
players['salary'].groupby([players['Pos'], players['team']]).max().to_frame()

Although the values will need to match in the common keys. For example, the max salary of "Team GSW, position PF" is the same as max salary of "position PF, team GSW".

green-divider

Custom aggregations

Most common aggregation functions are already provided (min, max, mean, etc.). But you can also use other custom functions with the aggregate method:

In [ ]:
players.groupby([players['team']]).aggregate(np.sum)

As you can see, aggregate is running the function in every column, even in those that don't make sense (like season_end or season_start). We can, of course, select the columns to use beforehand:

In [ ]:
players[['salary', 'Age']].groupby([players['team']]).aggregate(np.max)

Or...

In [ ]:
players.groupby([players['team']])[['salary', 'Age']].aggregate(np.max)

aggregate and multiple parameters

Grouping by multiple parameters also works with aggregate:

In [ ]:
players[['salary', 'Age']].groupby([players['team'], players['Pos']]).aggregate(np.max)

 Multiple functions

aggregate can also take multiple functions to apply to the groups:

In [ ]:
def range_max_min(values):
    return values.max() - values.min()
In [ ]:
players[['salary', 'Age']].groupby([players['team']]).aggregate([np.max, np.min, range_max_min])

 Dict of operations

You can also pass to aggregate a dictionary containing different operations to perform, based on the column:

In [ ]:
players[['salary', 'Age']].groupby([players['team']]).aggregate({
    'salary': np.max,
    'Age': np.mean
})

green-divider

The apply() method

apply is also a method available to run custom operations on your groups. It's similar to aggregate, so we usually recommend the later one over apply:

In [ ]:
players['salary'].groupby([players['team']]).apply(np.max).to_frame()
In [ ]:
players['salary'].groupby([players['team']]).apply(lambda x: x.max() - x.min()).to_frame()

green-divider

More examples using the whole data

Now we'll use the complete dataset and try more grouping and aggregations

In [ ]:
df = pd.read_csv('data/nba_2017_cleaned.csv', index_col=1, dtype={
    'Tm': 'category',
    'Pos': 'category'
})
In [ ]:
df.head()
In [ ]:
df.info()
In [ ]:
df['FG'].groupby(df['Pos']).mean()
In [ ]:
df.groupby('Pos').mean()
In [ ]:
df['FG'].groupby(df['Pos']).size().to_frame()
In [ ]:
df.groupby('Pos').size().to_frame()

We can also pass custom functions:

In [ ]:
df['FG'].groupby(df['Pos']).apply(np.mean).to_frame()

purple-divider

Notebooks AI
Notebooks AI Profile20060