Profile picture

4.6 - Creating Groups From Data Using Transform

Last updated: April 3rd, 20192019-04-03Project preview

rmotr


Creating Groups from Data using Transform

If our data is continuous, we can also create groups with a few different mechanisms:

purple-divider

Hands on!

In [ ]:
import pandas as pd
import numpy as np

green-divider

 NBA players data

We'll use a small NBA dataset which gives information about its players.

In [ ]:
df = pd.read_csv('data/nba_small_demo.csv', index_col=0)

df

green-divider

qcut & cut

We've already seen qcut and cut, but as a reminder:

  • qcut: Is used to discretize a given variable into equal-size buckets based on rand or based on sample quantiles. So, when you ask for quantiles with qcut, the bins will be chosen so that you have the same number of values in each bin.
  • cut: Using cut bins will be evenly spaced according to the values themselves and not the frequency of those values.
In [ ]:
pd.qcut(df['salary'], 3, labels=['Low Salary', 'Mid Range', 'High Salary']).to_frame()
In [ ]:
df['Salary Range'] = pd.qcut(df['salary'], 3, labels=['Low Salary', 'Mid Range', 'High Salary'])
In [ ]:
df

green-divider

More flexibility with apply

Running a custom function over each value can give you a lot more flexibility when you need to create your groups. For example, let's divide our players in "older than 30 y/o or not":

In [ ]:
def older_than_30(x):
    if x['Age'] >= 30:
        return 1
    else:
        return 0
In [ ]:
df.apply(older_than_30, axis=1).to_frame()
In [ ]:
df['Older than 30'] = df.apply(older_than_30, axis=1)
df
In [ ]:
df.sort_values('Pos')
In [ ]:
df['salary'].groupby([df['Pos'], df['Older than 30']]).max().to_frame()

green-divider

Transform

Sometimes you need to combine "group-wise" operations with "element-wise" operations. Usually, when you need to compare an individual with some property of the group it belongs to. For example, let's analyze each players salary with respect to their Position. For example, how much a player makes compared to the highest paid player in his position. Let's start with all the positions and the max value of each one of them:

In [ ]:
df['salary'].groupby(df['Pos']).max().to_frame().sort_index()
In [ ]:
df.sort_values('Pos')

For example, the highest salary in the position PF is $31,269,231 ("Paul Millsap"). Let's subtract the salary of the rest of the players in the same position:

In [ ]:
df.loc[df['Pos'] == 'PF', 'salary'] - 31269231

Again, we're comparing a single individual with its own group. In this case, the highest salary in that individual's position. But in this case, we've hardcoded the max salary in the PF position. How can we do it dynamically? The answer is transform:

In [ ]:
 df['salary'].groupby(df['Pos']).transform('max').to_frame()

Transform takes a function to apply to each group, but it "broadcasts" the result to ALL the individuals of that group. Instead of reducing the results we get a result of the same size as the original data. This makes combining the data back super simple.

Maybe it'll look cleaner if we add it as another column in the original DataFrame:

In [ ]:
df['Max salary by Position'] = df['salary'].groupby(df['Pos']).transform('max')
In [ ]:
df.sort_values('Pos')

As you can see, all the players in position PF have the same value under Max salary by Position, the max salary found in that position, which is, as we previously saw: $31,269,231 ("Paul Millsap").

So now, we could just subtract each individual's salary to the max of their group:

In [ ]:
df['salary'] - df['Max salary by Position']

We didn't need to store 'Max salary by Position' in the DataFrame to achieve these results. We could have done it all in one line:

In [ ]:
df['salary'] - df['salary'].groupby(df['Pos']).transform('max')
In [ ]:
def my_f(arg):
    assert False, arg
In [ ]:
df['Max salary by Position'] = df['salary'].groupby(df['Pos']).transform(my_f)
In [ ]:
df

purple-divider

Notebooks AI
Notebooks AI Profile20060