
3.1 - Intro to Categorical

Last updated: November 25th, 2019

rmotr


Intro to Categorical

Categorical data is a special data type: a field that can take only a limited number of distinct values. Examples are Sex (M, F) and Football Player Positions (GK, DF, MF, FW).

Sometimes categorical data has an associated order ("Please rate our service: Bad, Good, Excellent"). Categories are important to statistical analysis, but they don't support arithmetic (you can't multiply categories, for example). Some categorical fields accept "empty values" (represented with np.nan), while others don't allow them.
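As a rough illustration of the last two points, the sketch below builds a small categorical Series with one missing entry and then attempts a multiplication. The ratings data is made up for this example and is not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; np.nan stands for a missing response.
ratings = pd.Series(['Good', 'Bad', np.nan, 'Excellent'], dtype='category')

# The missing entry is not turned into a category of its own.
categories = list(ratings.cat.categories)
n_missing = int(ratings.isna().sum())
print(categories, n_missing)

# Arithmetic on a categorical Series is rejected with a TypeError.
try:
    ratings * 2
    arithmetic_allowed = True
except TypeError:
    arithmetic_allowed = False
print(arithmetic_allowed)
```

Note that the categories are inferred from the non-missing values only, so np.nan never shows up in the category list.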

To save memory space and speed up computations, categories are "coded": each value is stored as a small integer. For example, the Sex values M and F can be represented internally as 0 and 1.
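A minimal sketch of this encoding, using pd.Categorical on a toy list of values (the variable names are made up for this example). One detail worth knowing: when pandas infers the categories it sorts them, so F gets code 0 and M gets code 1 unless the categories are listed explicitly.

```python
import pandas as pd

# Categories are inferred and sorted: 'F' -> 0, 'M' -> 1.
sex = pd.Categorical(['M', 'F', 'F', 'M'])
print(list(sex.categories))
print(list(sex.codes))

# Listing the categories explicitly fixes the mapping: 'M' -> 0, 'F' -> 1.
sex_mf = pd.Categorical(['M', 'F', 'F', 'M'], categories=['M', 'F'])
print(list(sex_mf.codes))
```

Each distinct label is stored once in categories, and every row only holds its integer code.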


Hands on!

In [ ]:
import numpy as np
import pandas as pd

We'll work with a dataset that contains self-reported clothing information, product categories, catalog sizes, customers’ measurements (etc.) from Rent the Runway, a unique platform that allows women to rent clothes for various occasions.

In [ ]:
runway = pd.read_json('data/runway.json')

runway.reset_index(drop=True, inplace=True)

runway.head()


Categoricals

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (its categories).

Common examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.

Categorical data is stored in a Series or in a DataFrame column (which is itself a Series), and is useful in the following cases:

  • A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
  • The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
  • As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
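The difference between lexical and logical order can be sketched with the “one”, “two”, “three” example from the list above. The variable names here are made up for illustration, and pd.CategoricalDtype is just one way to declare the order.

```python
import pandas as pd

words = pd.Series(['three', 'one', 'two', 'one'])

# Plain strings sort alphabetically (lexical order).
lexical = words.sort_values().tolist()
print(lexical)

# An ordered categorical sorts by the order of its categories (logical order).
number_dtype = pd.CategoricalDtype(categories=['one', 'two', 'three'], ordered=True)
as_cat = words.astype(number_dtype)
logical = as_cat.sort_values().tolist()
print(logical)

# min and max follow the same logical order.
smallest, largest = as_cat.min(), as_cat.max()
print(smallest, largest)
```

With plain strings, “three” sorts before “two”; with the ordered categorical it comes last.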


Creating a category

To create a categorical Series from scratch, pass dtype='category':

In [ ]:
s = pd.Series(['M', 'F', 'F', 'M', 'F', 'M', 'M'], dtype='category')

s

Notice that every value in the fit feedback column of our dataset belongs to one of three classes: small, fit and large.

You can cast an existing column to the specified categorical dtype using astype():

In [ ]:
runway['fit'].unique()
In [ ]:
runway['fit'] = runway['fit'].astype('category')

runway['fit'].unique()

In upcoming lessons we'll see how to order those categories; don't worry about that for now.


Showing category information

We can also access the underlying values of our categorical column:

In [ ]:
runway['fit'].values

Those values are drawn from the following categories:

In [ ]:
runway['fit'].values.categories

Internally, the category dtype encodes each value as an integer code:

In [ ]:
runway['fit'].values.codes
In [ ]:
runway['fit'].value_counts()
In [ ]:
runway['fit'].unique()
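The same information is also reachable through the Series .cat accessor, without going through .values. The sketch below uses a small made-up Series (fit_demo) as a stand-in for runway['fit'], since the dataset file is not assumed to be available here.

```python
import pandas as pd

# Hypothetical stand-in for runway['fit'].
fit_demo = pd.Series(['fit', 'small', 'fit', 'large', 'fit'], dtype='category')

# The distinct labels, sorted because they were inferred from the data.
categories = fit_demo.cat.categories.tolist()
print(categories)

# One integer code per row, pointing into the categories above.
codes = fit_demo.cat.codes.tolist()
print(codes)

# Looking each code up in the categories recovers the original labels.
decoded = fit_demo.cat.categories.take(fit_demo.cat.codes.to_numpy()).tolist()
print(decoded)
```

Decoding the codes against the categories gives back exactly the labels the Series started with.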


Plotting category information

We can make a basic categorical plot using the default pandas bar plot:

In [ ]:
runway['fit'].value_counts().plot(kind='bar')

We can also use seaborn to draw a countplot of the same categorical data:

In [ ]:
import seaborn as sns

sns.countplot(x='fit', data=runway)


Comparing Memory Usage

It's easy to see the efficiency of categorical types. We'll create two Series: s_cat (with a categorical dtype) and s_obj (containing strings, stored as objects). Both will hold the same 1000 randomly generated values:

In [ ]:
values = np.random.randint(5, size=1000)
In [ ]:
labels = pd.Series([
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
])
In [ ]:
s_cat = pd.Series(
    pd.Categorical.from_codes(
        values, labels, ordered=True))
In [ ]:
s_obj = labels.take(values)
In [ ]:
s_cat.value_counts()

The total space taken by our s_obj series:

In [ ]:
s_obj.nbytes

The total space taken by our s_cat series:

In [ ]:
s_cat.nbytes

s_cat is roughly 7 times smaller in bytes (values stored). Its total memory usage is smaller too:

In [ ]:
s_cat.memory_usage(index=False)
In [ ]:
s_obj.memory_usage(index=False)
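The numbers above only count the fixed-size values each Series stores; for an object Series that means references to the strings, not the strings themselves. Passing deep=True to memory_usage also counts the string contents, which usually makes the gap larger. A minimal, self-contained sketch (it rebuilds comparable data with a seeded generator rather than reusing the variables above, and the exact byte counts depend on the pandas version and platform):

```python
import numpy as np
import pandas as pd

labels = [
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied',
]

# 1000 random positions into the labels list (seeded for repeatability).
rng = np.random.default_rng(0)
positions = rng.integers(len(labels), size=1000)

# The same 1000 values, once as strings and once as a categorical.
s_obj_demo = pd.Series(np.array(labels, dtype=object)[positions])
s_cat_demo = s_obj_demo.astype('category')

# deep=True includes the memory taken by the strings themselves.
obj_bytes = s_obj_demo.memory_usage(index=False, deep=True)
cat_bytes = s_cat_demo.memory_usage(index=False, deep=True)
print(obj_bytes, cat_bytes)
```

The categorical stores each label once plus one small integer per row, so its footprint grows much more slowly with the number of rows.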

