# Grouping and Aggregation¶

An essential piece of analysis of large data is efficient summarizaion: computing aggregations like `sum()`

, `mean()`

, `min()`

, `max()`

, in which a single number gives insight into the nature of a potentially large dataset.

## Hands on!¶

```
import pandas as pd
import numpy as np
```

### NBA players data¶

We'll use a small NBA dataset which gives information about its players.

```
players = pd.DataFrame({
'salary': [
33285709, 31269231, 34682550, 25000000, 17826150,
29512900, 28530608, 26243760, 18868625, 2500000
],
'season_start': [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017],
'season_end': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018],
'team': ['CLE', 'DEN', 'GSW', 'GSW', 'GSW', 'LAC', 'OKC', 'OKC', 'SAS', 'SAS'],
'Pos': ['SF', 'PF', 'PG', 'PF', 'SG', 'PF', 'PG', 'SF', 'SF', 'SG'],
'Age': [32.0, 31.0, 28.0, 28.0, 26.0, 27.0, 28.0, 32.0, 25.0, 39.0]
}, index=[
'LeBron James', 'Paul Millsap', 'Stephen Curry', 'Kevin Durant',
'Klay Thompson', 'Blake Griffin', 'Russell Westbrook',
'Carmelo Anthony', 'Kawhi Leonard', 'Manu Ginobili'
])
```

```
players.sort_index(inplace=True)
```

```
players
```

```
players.info()
```

### Understanding Grouping¶

We can manually simulate the **Split** > **Apply** > **Combine** operation.

#### 1. Split (split players by positions)¶

This step involves breaking up and grouping a DataFrame depending on the value of the specified key.

```
players['Pos'].value_counts()
```

**Position = 'PF'**

```
players[players['Pos'] == 'PF']
```

**Position = 'SF'**

```
players[players['Pos'] == 'SF']
```

**Position = 'PG'**

```
players[players['Pos'] == 'PG']
```

#### 2. Apply (apply a function to each group)¶

This step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.

For example, get the `max`

salary per group (per position):

```
top_PF_salary = players[players['Pos'] == 'PF'].sort_values(by='salary', ascending=False)
top_PF_salary = top_PF_salary.head(1)[['Pos', 'salary']]
top_PF_salary
```

```
top_SF_salary = players[players['Pos'] == 'SF'].sort_values(by='salary', ascending=False)
top_SF_salary = top_SF_salary.head(1)[['Pos', 'salary']]
top_SF_salary
```

```
top_PG_salary = players[players['Pos'] == 'PG'].sort_values(by='salary', ascending=False)
top_PG_salary = top_PG_salary.head(1)[['Pos', 'salary']]
top_PG_salary
```

#### 3. Combine¶

This step merges the results of these operations into an output array.

```
pd.concat([top_PF_salary, top_SF_salary, top_PG_salary])
```

### The GroupBy object¶

We can compute the most basic split-apply-combine operation with the `groupby()`

method of `DataFrames`

, passing the name of the desired key column:

```
players['salary'].groupby(players['Pos'])
```

### Aggregation¶

The GroupBy object does no actual computation until the aggregation is applied to it. To produce a result, we can apply an aggregate to that object which will perform the appropiate apply/combine steps to produce the desired result.

This is usually referred to as **an aggregation**, you have a set of values (for example, all the salaries of the players in position `PG`

) and you want to "aggregate" them or "reduce" them to a single value that has meaning for you. In this case, the aggregation is the `max`

function. For each *group of players* (split by position), we want to get the `max`

salary:

```
players['salary'].groupby(players['Pos']).max().to_frame()
```

The `max`

function in this case, is the applied function, we could apply any other different function, for example, `mean`

to get the average salary per position:

```
players['salary'].groupby(players['Pos']).mean().to_frame()
```

```
players['salary'].groupby(players['Pos']).describe()
```

We could verify it works manually:

```
players[players['Pos'] == 'PG'].sort_values(by='salary', ascending=False)
```

```
players.loc[players['Pos'] == 'PG', 'salary'].mean()
```

The result of the `groupby`

operation is a `Series`

, indexed by the group and containing the aggregation result as the values. What happens if you want to group by multiple parameters?

```
players['salary'].groupby([players['team'], players['Pos']]).max().to_frame()
```

The result is also a `Series`

, but with a hierarchical index.

In the case of grouping by multiple features, **order matters**. Changing the order of the keys to use to group will alter the resulting Series:

```
players['salary'].groupby([players['Pos'], players['team']]).max().to_frame()
```

Although the values will need to match in the common keys. For example, the max salary of "Team `GSW`

, position `PF`

" is the same as max salary of "position `PF`

, team `GSW`

".

### Custom aggregations¶

Most common aggregation functions are already provided (`min`

, `max`

, `mean`

, etc.). But you can also use other custom functions with the `aggregate`

method:

```
players.groupby([players['team']]).aggregate(np.sum)
```

As you can see, `aggregate`

is running the function in every column, even in those that don't make sense (like `season_end`

or `season_start`

). We can, of course, select the columns to use beforehand:

```
players[['salary', 'Age']].groupby([players['team']]).aggregate(np.max)
```

Or...

```
players.groupby([players['team']])[['salary', 'Age']].aggregate(np.max)
```

`aggregate`

and multiple parameters¶

Grouping by multiple parameters also works with `aggregate`

:

```
players[['salary', 'Age']].groupby([players['team'], players['Pos']]).aggregate(np.max)
```

#### Multiple functions¶

`aggregate`

can also take multiple functions to apply to the groups:

```
def range_max_min(values):
return values.max() - values.min()
```

```
players[['salary', 'Age']].groupby([players['team']]).aggregate([np.max, np.min, range_max_min])
```

#### Dict of operations¶

You can also pass to `aggregate`

a dictionary containing different operations to perform, based on the column:

```
players[['salary', 'Age']].groupby([players['team']]).aggregate({
'salary': np.max,
'Age': np.mean
})
```

### The `apply()`

method¶

`apply`

is also a method available to run custom operations on your groups. It's similar to `aggregate`

, so we usually recommend the later one over `apply`

:

```
players['salary'].groupby([players['team']]).apply(np.max).to_frame()
```

```
players['salary'].groupby([players['team']]).apply(lambda x: x.max() - x.min()).to_frame()
```

### More examples using the whole data¶

Now we'll use the complete dataset and try more grouping and aggregations

```
df = pd.read_csv('data/nba_2017_cleaned.csv', index_col=1, dtype={
'Tm': 'category',
'Pos': 'category'
})
```

```
df.head()
```

```
df.info()
```

```
df['FG'].groupby(df['Pos']).mean()
```

```
df.groupby('Pos').mean()
```

```
df['FG'].groupby(df['Pos']).size().to_frame()
```

```
df.groupby('Pos').size().to_frame()
```

We can also pass custom functions:

```
df['FG'].groupby(df['Pos']).apply(np.mean).to_frame()
```