# 3.1 - Intro to Categorical

Last updated: April 3rd, 2019

# Intro to Categorical

Categorical data is a special data type: a field that can take only a limited number of distinct values. Examples include Sex (M, F) and Football Player Positions (GK, DF, MF, FW).

Sometimes categorical data has an order associated with it ("Please rate our service: Bad, Good, Excellent"). Categories are important to statistical analysis, but arithmetic is not defined on them (you can't multiply categories, for example). Some categorical fields accept "empty values" (represented with np.nan) and others don't.
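
As a minimal sketch of the two points above (missing values and the lack of arithmetic), the following uses a small made-up `ratings` Series; the data and variable names are illustrative and not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; one answer is missing.
ratings = pd.Series(['Good', np.nan, 'Bad', 'Excellent'], dtype='category')

# The missing entry is not turned into a category; it stays as NaN.
n_missing = int(ratings.isna().sum())
categories = list(ratings.cat.categories)

# Arithmetic is not defined for categories, so pandas raises a TypeError.
try:
    ratings * 2
    arithmetic_allowed = True
except TypeError:
    arithmetic_allowed = False

print(n_missing, categories, arithmetic_allowed)
```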

To save memory and speed up computations, categories are "coded": each distinct value is stored once, and every entry holds a small integer code that points to it. For example, Sex M, F can be represented as 0, 1 internally.
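
A small sketch of this coding, using made-up values. Note that when the categories are inferred from the data, pandas sorts them, so in this example F gets code 0 and M gets code 1.

```python
import pandas as pd

# Hypothetical values; the categories are inferred (and sorted) by pandas.
sex = pd.Categorical(['M', 'F', 'F', 'M'])

print(list(sex.categories))  # ['F', 'M']
print(list(sex.codes))       # [1, 0, 0, 1]
```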

## Hands on!

In [ ]:
import numpy as np
import pandas as pd


We'll work with a dataset that contains self-reported clothing information, product categories, catalog sizes, customers’ measurements (etc.) from Rent the Runway, a unique platform that allows women to rent clothes for various occasions.

In [ ]:
runway = pd.read_json('data/runway.json')

runway.reset_index(drop=True, inplace=True)



### Categoricals

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (its categories).

Common examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.

Categorical data can be stored in a Series or in a DataFrame column (which is also a Series), and is useful in the following cases:

• A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
• The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
• As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
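
The second case above (logical versus lexical order) can be sketched as follows. The `numbers` Series is made up for illustration, and `CategoricalDtype` with `ordered=True` is one way to specify the order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical string data whose alphabetical order is not its logical order.
numbers = pd.Series(['three', 'one', 'two', 'one'])

# Plain strings sort alphabetically: 'three' comes before 'two'.
lexical = numbers.sort_values().tolist()

# An ordered categorical dtype makes sorting and min/max follow the given order.
order = CategoricalDtype(categories=['one', 'two', 'three'], ordered=True)
as_cat = numbers.astype(order)
logical = as_cat.sort_values().tolist()

print(lexical)                     # ['one', 'one', 'three', 'two']
print(logical)                     # ['one', 'one', 'two', 'three']
print(as_cat.min(), as_cat.max())  # one three
```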

### Creating a category

To create a categorical Series from scratch, pass dtype='category':

In [ ]:
s = pd.Series(['M', 'F', 'F', 'M', 'F', 'M', 'M'], dtype='category')

s
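
If the allowed values are known in advance, one option is to pass them explicitly through `pd.Categorical`. In this sketch (with made-up data), any value outside the given categories becomes NaN.

```python
import pandas as pd

# Hypothetical raw data containing one value ('X') outside the allowed set.
raw = ['M', 'F', 'X', 'M']

# Explicit categories keep the given order; unknown values become NaN.
s_fixed = pd.Series(pd.Categorical(raw, categories=['M', 'F']))

print(list(s_fixed.cat.categories))  # ['M', 'F']
print(s_fixed.cat.codes.tolist())    # [0, 1, -1, 0]  (-1 marks a missing value)
```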


Notice that every value in the fit feedback column of our dataset belongs to one of three classes: small, fit and large.

You can cast an existing column to the categorical dtype using astype():

In [ ]:
runway['fit'].unique()

In [ ]:
runway['fit'] = runway['fit'].astype('category')

runway['fit'].unique()
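
For readers without the dataset at hand, the same conversion can be sketched on a tiny stand-in DataFrame. The `reviews` frame below is hypothetical and only mimics the `fit` column.

```python
import pandas as pd

# Hypothetical stand-in for the runway dataset, with only a 'fit' column.
reviews = pd.DataFrame({'fit': ['fit', 'small', 'large', 'fit', 'fit', 'small']})

# Cast the string column to the categorical dtype.
reviews['fit'] = reviews['fit'].astype('category')

print(reviews['fit'].dtype)                 # category
print(list(reviews['fit'].cat.categories))  # ['fit', 'large', 'small']
```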


In upcoming lessons we'll see how to order those categories; don't worry about that for now.

### Showing category information

We can access the underlying values of our categorical column:

In [ ]:
runway['fit'].values


Each of those values belongs to one of the following categories:

In [ ]:
runway['fit'].values.categories


Internally, the category dtype encodes each value as an integer code:

In [ ]:
runway['fit'].values.codes

In [ ]:
runway['fit'].value_counts()

In [ ]:
runway['fit'].unique()
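
The same information is also available through the `.cat` accessor of a categorical Series. The sketch below uses a small made-up `fit` Series and shows that the original values can be rebuilt from the categories and the codes.

```python
import pandas as pd

# Hypothetical data mimicking the 'fit' column.
fit = pd.Series(['fit', 'small', 'large', 'fit'], dtype='category')

categories = fit.cat.categories  # the distinct labels, stored once
codes = fit.cat.codes            # one small integer per row

# Looking up each code in the categories gives back the original values.
rebuilt = categories.take(codes.to_numpy())

print(list(categories))  # ['fit', 'large', 'small']
print(codes.tolist())    # [0, 2, 1, 0]
print(list(rebuilt))     # ['fit', 'small', 'large', 'fit']
```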


### Plotting category information

We can make a basic categorical plot using the default pandas bar plot:

In [ ]:
runway['fit'].value_counts().plot(kind='bar')


We can also use seaborn to make a countplot of the same categorical data:

In [ ]:
import seaborn as sns

sns.countplot(x='fit', data=runway)


### Comparing Memory Usage

It's easy to see the efficiency of categorical types. We'll create two Series: s_cat (containing a Categorical type) and s_obj (containing strings, stored as objects). Both will hold the same 1000 randomly generated values:

In [ ]:
values = np.random.randint(5, size=1000)

In [ ]:
labels = pd.Series([
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
])

In [ ]:
s_cat = pd.Series(
    pd.Categorical.from_codes(
        values, labels, ordered=True))

In [ ]:
s_obj = labels.take(values)

In [ ]:
s_cat.value_counts()


The total space taken by our s_obj series:

In [ ]:
s_obj.nbytes


The total space taken by our s_cat series:

In [ ]:
s_cat.nbytes


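s_cat is about 7 times smaller in bytes (values stored). For an object Series, nbytes counts only the references to the strings, not the strings themselves, so the real saving is larger. Total memory usage is smaller too: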

In [ ]:
s_cat.memory_usage(index=False)

In [ ]:
s_obj.memory_usage(index=False)
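
To also count the memory taken by the strings themselves, `memory_usage` accepts `deep=True`. The following is a self-contained sketch of that comparison; it rebuilds similar Series with a seeded random generator (an assumption made here for reproducibility, not part of the cells above), so the exact byte counts depend on the platform and pandas version.

```python
import numpy as np
import pandas as pd

labels = [
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied',
]

# Seeded generator so the sketch is reproducible (illustrative choice).
rng = np.random.default_rng(0)
codes = rng.integers(len(labels), size=1000)

s_cat = pd.Series(pd.Categorical.from_codes(codes, categories=labels))
s_obj = s_cat.astype(object)  # the same values stored as Python strings

# deep=True also measures the Python string objects, not just the references.
cat_bytes = s_cat.memory_usage(index=False, deep=True)
obj_bytes = s_obj.memory_usage(index=False, deep=True)

print(cat_bytes, obj_bytes)
```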