
3.3 - Category Encoding and Dummy Variables

Last updated: April 3rd, 2019

rmotr


Category Encoding and Dummy Variables

In many practical Data Science activities, the data set will contain categorical variables. These variables are typically stored as text values that represent various traits. Examples include gender (“Male”, “Female”, “Other”) and size (“Small”, “Medium”, “Large”).

Regardless of what the value is used for, the challenge is determining how to use this data in the analysis. Many machine learning algorithms support categorical values without further manipulation, but many more do not. The analyst therefore has to figure out how to turn these text attributes into numerical values for further processing.
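As a minimal sketch of the problem, assuming a toy size column (the data and the `size_to_number` mapping are made up for illustration and are not part of the dataset used below), one simple way to turn text labels into numbers is an explicit mapping:

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset used below.
sizes = pd.Series(['Small', 'Large', 'Medium', 'Small'])

# Hand-written mapping that preserves the natural order of the labels.
size_to_number = {'Small': 0, 'Medium': 1, 'Large': 2}

sizes_encoded = sizes.map(size_to_number)
print(sizes_encoded.tolist())  # [0, 2, 1, 0]
```

A hand-written mapping like this is easy to read but has to be maintained manually; the categorical tools shown in the rest of this lesson do the same bookkeeping for us.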


Hands on!

In [ ]:
import numpy as np
import pandas as pd

We'll work with a dataset from Rent the Runway, a platform that allows women to rent clothes for various occasions. It contains self-reported clothing information, product categories, catalog sizes, customers’ measurements, and more.

In [ ]:
runway = pd.read_json('data/runway.json')

runway.reset_index(drop=True, inplace=True)

runway.head()


Encoding Categories

Storing a categorical column as integer codes is more space efficient than storing the labels themselves, and we can translate between labels and codes in both directions. Taking the fit column of our dataset as the example, its values can be ordered as follows:

In [ ]:
fit_order = ['small', 'fit', 'large']
In [ ]:
from pandas.api.types import CategoricalDtype

cat_dtype = CategoricalDtype(categories=fit_order,
                             ordered=True)

cat_dtype
In [ ]:
runway['fit'] = runway['fit'].astype(cat_dtype)
In [ ]:
fit_values = runway['fit'].values
fit_values
In [ ]:
fit_categories = runway['fit'].cat.categories
fit_categories
In [ ]:
runway['fit_encoded'] = runway['fit'].cat.codes

runway[['fit', 'fit_encoded']][20:25]
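The space-efficiency point made at the start of this section can be checked on a toy Series. This is only a sketch with made-up data; the exact byte counts depend on the pandas version and platform, so the code prints the comparison rather than assuming specific numbers:

```python
import pandas as pd

# Hypothetical data: three labels repeated many times.
labels = pd.Series(['small', 'fit', 'large'] * 10_000)
as_category = labels.astype('category')

# deep=True also counts the memory used by the strings themselves.
text_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(text_bytes, category_bytes)
print(category_bytes < text_bytes)
```

The categorical version stores each distinct label once and keeps one small integer code per row, which is where the saving comes from.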

We can recover the original labels from the encoded values using take:

In [ ]:
fit_categories.take(runway['fit_encoded'])
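To make the round trip easier to follow, here is a small sketch on a toy column (the `toy_fit` data is hypothetical and contains no missing values):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

fit_order = ['small', 'fit', 'large']
toy_fit = pd.Series(['fit', 'large', 'small', 'fit']).astype(
    CategoricalDtype(categories=fit_order, ordered=True))

# Each code is the position of the label inside fit_order.
codes = toy_fit.cat.codes

# take() looks up every position in the categories index.
decoded = toy_fit.cat.categories.take(codes.to_numpy())

print(codes.tolist())    # [1, 2, 0, 1]
print(decoded.tolist())  # ['fit', 'large', 'small', 'fit']
```

Note that pandas stores missing values as the code -1, so this positional lookup is only a faithful round trip when the column has no missing values.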

To create a Categorical object combining codes and labels you can use the from_codes class method:

In [ ]:
pd.Series(
    pd.Categorical.from_codes(runway['fit_encoded'],
                              fit_categories,
                              ordered=True))
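A short sketch of how from_codes treats missing values, using made-up codes rather than the runway data: a code of -1 is rebuilt as a missing value instead of being looked up as a label.

```python
import pandas as pd

categories = ['small', 'fit', 'large']
codes = [1, -1, 0, 2]  # -1 is the code pandas uses for a missing value

rebuilt = pd.Categorical.from_codes(codes,
                                    categories=categories,
                                    ordered=True)

print(rebuilt.isna().tolist())  # [False, True, False, False]
print(rebuilt.ordered)          # True
```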


Dummy Variables (One-hot encoding)

Categorical data can also be "expanded" into what are called "dummy variables", a technique also known as one-hot encoding.

This works by creating one new column for each distinct value of the categorical column. Each row gets a 1 in the column that matches its value and a 0 in all the others.

Let's see an example:

In [ ]:
df = pd.DataFrame({
    'Name': ['John', 'Robert', 'Jane', 'Mary', 'Rose'],
    'Sex': pd.Series(['M', 'M', 'F', 'F', 'F'],
                     dtype='category'),
})
In [ ]:
df
In [ ]:
pd.get_dummies(df['Sex'])
In [ ]:
pd.concat([df, pd.get_dummies(df['Sex'])], axis=1)
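The sketch below shows what get_dummies computes by rebuilding the same table by hand. It passes dtype=int explicitly because, as far as we know, recent pandas versions (2.0 and later) return True/False columns by default rather than 0/1; the `by_hand` name is just for illustration.

```python
import pandas as pd

sex = pd.Series(['M', 'M', 'F', 'F', 'F'], dtype='category')

# dtype=int forces 0/1 output regardless of the pandas default.
dummies = pd.get_dummies(sex, dtype=int)

# The same table built by hand: one comparison per distinct value.
by_hand = pd.DataFrame({value: (sex == value).astype(int)
                        for value in sex.cat.categories})

print(dummies)
print((dummies.to_numpy() == by_hand.to_numpy()).all())
```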

Going back to our clothing DataFrame, we can apply one-hot encoding to the rented for column:

In [ ]:
runway['rented for'].head().to_frame()

We'll also add a prefix to our new dummy variables:

In [ ]:
rented_one_hot = pd.get_dummies(runway['rented for'],
                                prefix='rented_for')

rented_one_hot.head()
In [ ]:
runway = pd.concat([runway, rented_one_hot], axis=1)

runway.head()

Finally, remove the original rented for column:

In [ ]:
runway.drop(['rented for'],
            axis='columns',
            inplace=True)

runway.head()
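As an alternative sketch, get_dummies can also take the whole DataFrame plus a columns argument, which encodes the listed columns and drops the originals in one call. The `toy` DataFrame and its `rating` column below are hypothetical stand-ins for the runway data:

```python
import pandas as pd

# Hypothetical stand-in for the runway DataFrame.
toy = pd.DataFrame({
    'rented for': ['party', 'wedding', 'party'],
    'rating': [10, 8, 10],
})

# Encode 'rented for' and drop the original column in a single step.
encoded = pd.get_dummies(toy,
                         columns=['rented for'],
                         prefix='rented_for',
                         dtype=int)

print(encoded.columns.tolist())
# ['rating', 'rented_for_party', 'rented_for_wedding']
```

This gives the same result as the concat and drop steps above, with the untouched columns placed first.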

