Intro to Categorical
Categorical data is a special data type: a field that can take only a limited number of distinct values. Examples include Sex (M, F) and football player positions.
Sometimes categorical data has an associated order (for example, the answers to "Please rate our service", which run up to "Excellent"). Categories are important to statistical analysis, but they can't be operated on numerically (you can't multiply categories, for example). Sometimes categories can accept "empty values" (represented with np.nan) and sometimes that's not allowed.
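To save memory and speed up computations, categories are "coded": each distinct value is stored once, and every observation holds only a small integer code that points to it. For example, the Sex values M and F can each be represented by an integer code.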
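As a minimal sketch of that coding, the snippet below builds a categorical from a few made-up Sex values (the sample data is an assumption for illustration, not taken from the dataset) and looks at the two pieces pandas stores: the distinct categories and one integer code per observation. A missing value gets the code -1.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; np.nan marks a missing observation.
sex = pd.Categorical(['M', 'F', 'F', np.nan, 'M'])

# The distinct values, stored once (pandas sorts inferred categories).
categories = list(sex.categories)

# One small integer per observation, indexing into the categories.
codes = sex.codes.tolist()

print(categories)  # ['F', 'M']
print(codes)       # [1, 0, 0, -1, 1]
```

Note that the codes follow the sorted category order, so F maps to 0 and M to 1 here.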
import numpy as np
import pandas as pd
We'll work with a dataset that contains self-reported clothing information, product categories, catalog sizes, customers’ measurements (etc.) from Rent the Runway, a unique platform that allows women to rent clothes for various occasions.
runway = pd.read_json('data/runway.json')
runway.reset_index(drop=True, inplace=True)
runway.head()
Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (its categories).
Common examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.
Categoricals can be used in DataFrame columns (and in Series) and are useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
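The second case above can be sketched with the “one”, “two”, “three” example. The sample Series and the name number_dtype are made up for illustration; ordering categories is covered properly in later lessons, so treat this as a preview rather than the recommended workflow.

```python
import pandas as pd

numbers = pd.Series(['three', 'one', 'two', 'one'])

# Plain strings sort lexically, so 'three' lands before 'two'.
lexical = numbers.sort_values().tolist()

# Hypothetical ordered dtype: categories listed from lowest to highest.
number_dtype = pd.CategoricalDtype(['one', 'two', 'three'], ordered=True)
ordered = numbers.astype(number_dtype)

# Sorting and min/max now follow the declared logical order.
logical = ordered.sort_values().tolist()
smallest, largest = ordered.min(), ordered.max()

print(lexical)            # ['one', 'one', 'three', 'two']
print(logical)            # ['one', 'one', 'two', 'three']
print(smallest, largest)  # one three
```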
s = pd.Series(['M', 'F', 'F', 'M', 'F', 'M', 'M'], dtype='category')
s
Notice that each value in the fit feedback column of our dataset belongs to one of three classes.
You can cast an existing column to the categorical dtype using astype('category'):
runway['fit'] = runway['fit'].astype('category')
runway['fit'].unique()
In upcoming lessons we'll see how to order those categories; don't worry about that for now.
Given those categories, the category dtype internally encodes each value as an integer code.
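To inspect that encoding on a Series, the .cat accessor exposes both the categories and the per-row codes. The sketch below uses a small stand-in Series rather than the real runway['fit'] column, and the three labels in it are assumed for illustration only.

```python
import pandas as pd

# Stand-in values for illustration; the real column comes from runway.json.
fit = pd.Series(['fit', 'small', 'fit', 'large', 'fit'], dtype='category')

# Mapping from integer code to category label.
code_to_label = dict(enumerate(fit.cat.categories))

# The integer code stored for each row.
fit_codes = fit.cat.codes.tolist()

print(code_to_label)  # {0: 'fit', 1: 'large', 2: 'small'}
print(fit_codes)      # [0, 2, 0, 1, 0]
```

The same two attributes should work on runway['fit'] once it has been cast with astype('category').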
We can also use seaborn to make a countplot of that categorical data:
import seaborn as sns
sns.countplot(x='fit', data=runway)
Comparing Memory Usage
It's easy to see the efficiency of categorical types. We'll create two Series: s_cat (containing a Categorical type) and s_obj (containing strings, or objects). Both will hold the same 1000 randomly generated values:
values = np.random.randint(5, size=1000)
labels = pd.Series([
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
])
s_cat = pd.Series(
    pd.Categorical.from_codes(values, labels, ordered=True))
s_obj = labels.take(values)
The total space taken by s_cat is roughly 7 times smaller in bytes (values stored) than the space taken by s_obj. Total memory usage is smaller too.
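One way to check both measurements is sketched below. It rebuilds the two Series in a self-contained form (a seeded generator and a plain list of labels are assumptions made here so the snippet runs on its own), then compares nbytes, which counts the stored values, with memory_usage(deep=True), which also counts the index and the string contents. Exact numbers depend on the pandas version, so none are shown.

```python
import numpy as np
import pandas as pd

# Seeded generator so reruns match; the seed itself is arbitrary.
rng = np.random.default_rng(0)
values = rng.integers(5, size=1000)

labels = [
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied',
]

# Same 1000 observations, stored as a categorical and as plain strings.
s_cat = pd.Series(pd.Categorical.from_codes(values, labels, ordered=True))
s_obj = pd.Series(np.array(labels, dtype=object)[values])

# Bytes used by the stored values only.
cat_bytes, obj_bytes = s_cat.nbytes, s_obj.nbytes

# Full footprint, including the index and the strings themselves.
cat_total = s_cat.memory_usage(deep=True)
obj_total = s_obj.memory_usage(deep=True)

print(f'nbytes:       categorical={cat_bytes}, strings={obj_bytes}')
print(f'memory_usage: categorical={cat_total}, strings={obj_total}')
```

The gap comes from the categorical keeping each label once plus one small integer per row, while the string Series keeps a full entry per row.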