# Category Encoding and Dummy Variables

In many practical Data Science activities, the data set will contain categorical variables. These variables are typically stored as text values which represent various traits. Some examples include gender (“Male”, “Female”, “Other”), size (“Small”, “Medium”, “Large”), etc.

Regardless of what the value is used for, the challenge is determining how to use this data in the analysis. Some machine learning algorithms support categorical values without further manipulation, but many more do not. The analyst therefore has to figure out how to turn these text attributes into numerical values for further processing.

## Hands on!

```
import numpy as np
import pandas as pd
```

We'll work with a dataset from Rent the Runway, a platform that allows women to rent clothes for various occasions. It contains self-reported clothing information, product categories, catalog sizes, customers’ measurements, and more.

```
runway = pd.read_json('data/runway.json')
runway.reset_index(drop=True, inplace=True)
runway.head()
```

### Encoding Categories

Storing categories as integer codes is more space-efficient than storing their actual values, and we can translate back and forth between codes and labels. Taking the self-reported `fit` feedback in our dataset as an example, we could have the following values:

```
fit_order = ['small', 'fit', 'large']
```

```
from pandas.api.types import CategoricalDtype
cat_dtype = CategoricalDtype(categories=fit_order,
                             ordered=True)
cat_dtype
```

```
runway['fit'] = runway['fit'].astype(cat_dtype)
```
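Because the dtype was declared with `ordered=True`, pandas can sort and compare the labels in the declared order rather than alphabetically. The following is a minimal sketch of that behavior; the `sample` values are hypothetical stand-ins for `runway['fit']`, not rows from the dataset.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

fit_order = ['small', 'fit', 'large']
cat_dtype = CategoricalDtype(categories=fit_order, ordered=True)

# Hypothetical sample values standing in for runway['fit'].
sample = pd.Series(['large', 'fit', 'small', 'fit'], dtype=cat_dtype)

# Sorting follows the declared order (small < fit < large), not the alphabet.
sorted_labels = sample.sort_values().tolist()

# Comparisons against a category label use the same order.
below_large = (sample < 'large').tolist()

# min/max are only defined because the categorical is ordered.
smallest, largest = sample.min(), sample.max()

print(sorted_labels)
print(below_large)
print(smallest, largest)
```

Without `ordered=True`, the comparison and the `min`/`max` calls would raise a `TypeError`.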

```
fit_values = runway['fit'].values
fit_values
```

```
fit_categories = runway['fit'].cat.categories
fit_categories
```

```
# .cat.codes returns the integer codes as a Series aligned with runway's index
runway['fit_encoded'] = runway['fit'].cat.codes
runway[['fit', 'fit_encoded']][20:25]
```
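The space saving mentioned at the start of this section comes from these small integer codes: each row stores one small integer instead of a full Python string. The sketch below illustrates this on synthetic data (100,000 randomly drawn labels, an assumption made purely for illustration); the exact byte counts depend on the pandas version and platform.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the `fit` column: 100,000 labels drawn from three values.
rng = np.random.default_rng(0)
labels = pd.Series(rng.choice(['small', 'fit', 'large'], size=100_000))

as_object = labels.astype(object)        # one Python string per row
as_category = labels.astype('category')  # one small integer code per row

# deep=True counts the string payloads, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
print(f"codes dtype: {as_category.cat.codes.dtype}")
```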

We can recover the original labels from the encoded values using `take`:

```
fit_categories.take(runway['fit_encoded'])
```

To create a `Categorical` object combining codes and labels, you can use the `from_codes` class method:

```
pd.Series(
    pd.Categorical.from_codes(runway['fit_encoded'],
                              fit_categories,
                              ordered=True))
```
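One detail worth keeping in mind when round-tripping is how missing values are handled: pandas encodes them with the code `-1`, and `from_codes` turns `-1` back into a missing value. The sketch below shows this on a hypothetical four-element sample (not taken from the dataset).

```python
import pandas as pd

fit_order = ['small', 'fit', 'large']

# Hypothetical values with one missing entry.
sample = pd.Series(pd.Categorical(['fit', None, 'large', 'small'],
                                  categories=fit_order, ordered=True))

# Missing values are encoded as -1; the others are positions in fit_order.
codes = sample.cat.codes

# from_codes maps -1 back to a missing value.
restored = pd.Series(pd.Categorical.from_codes(codes,
                                               categories=fit_order,
                                               ordered=True))

print(codes.tolist())
print(restored)
```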

### Dummy Variables *(One-hot encoding)*

Categorical data can also be "expanded" into what are called "Dummy Variables", a technique also known as **One-hot encoding**.

This works by creating a new column in the `DataFrame` for each possible value of the variable. Each row gets a `1` in the column that matches its value and a `0` in all the others.

Let's see an example:

```
df = pd.DataFrame({
    'Name': ['John', 'Robert', 'Jane', 'Mary', 'Rose'],
    'Sex': pd.Series(['M', 'M', 'F', 'F', 'F'],
                     dtype='category'),
})
```

```
df
```

```
pd.get_dummies(df['Sex'])
```

```
pd.concat([df, pd.get_dummies(df['Sex'])], axis=1)
```
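To make the mechanics concrete, the following sketch builds the same indicator columns by hand, one comparison per distinct value, and checks them against `get_dummies`. It is only an illustration of the idea, not how pandas implements it; the name `manual` is introduced here for the example.

```python
import pandas as pd

sex = pd.Series(['M', 'M', 'F', 'F', 'F'], name='Sex')

# One indicator column per distinct value, in sorted order:
# 1 where the row matches that value, 0 everywhere else.
manual = pd.DataFrame({value: (sex == value).astype(int)
                       for value in sorted(sex.unique())})

# get_dummies may return booleans or small integers depending on the
# pandas version, so cast to int before comparing.
builtin = pd.get_dummies(sex).astype(int)

print(manual)
print(manual.to_numpy().tolist() == builtin.to_numpy().tolist())
```

Every row contains exactly one `1`, which is where the name "one-hot" comes from.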

Going back to our clothing dataframe, we can apply one-hot encoding to the `rented for` column:

```
runway['rented for'].head().to_frame()
```

We'll also add a `prefix` to our new dummy variables:

```
rented_one_hot = pd.get_dummies(runway['rented for'],
                                prefix='rented_for')
rented_one_hot.head()
```

```
runway = pd.concat([runway, rented_one_hot], axis=1)
runway.head()
```

Finally, remove the old `rented for` column:

```
runway.drop(['rented for'],
            axis='columns',
            inplace=True)
runway.head()
```
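The encode, concatenate and drop steps can also be collapsed into a single call by passing the whole dataframe to `get_dummies` together with the `columns` argument. The sketch below uses a hypothetical three-row miniature of the dataframe (`mini`, with made-up values) rather than the real data.

```python
import pandas as pd

# Hypothetical miniature of the runway dataframe; values are made up.
mini = pd.DataFrame({
    'fit': ['fit', 'small', 'large'],
    'rented for': ['wedding', 'party', 'wedding'],
})

# Encodes only the listed column, drops the original,
# and appends the prefixed dummy columns at the end.
encoded = pd.get_dummies(mini, columns=['rented for'], prefix='rented_for')

print(encoded.columns.tolist())
```

Columns not listed in `columns`, such as `fit` here, are passed through unchanged.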