Intro to Categorical¶
Categorical data is a special data type: a field that can take only a limited number of distinct values. For example, Sex (M, F) or Football Player Positions (GK, DF, MF, FW).
Sometimes categorical data has an associated order ("Please rate our service: Bad, Good, Excellent"). Categories are important to statistical analysis, but they can't be operated on numerically (you can't multiply categories, for example). Sometimes categories can accept "empty values" (represented with np.nan) and sometimes that's not allowed.
To save memory space and speed up computations, categories are "coded". For example, Sex M, F can be represented as 0, 1 internally.
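As a minimal sketch of this coding (the Series name `sex` and its values are only illustrative), pandas exposes the categories and their integer codes through the `.cat` accessor. When the categories are inferred from the data, pandas sorts them, so in this sketch `F` gets code 0 and `M` gets code 1, and a missing value gets the special code -1:

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the dataset used below.
sex = pd.Series(['M', 'F', 'F', np.nan, 'M'], dtype='category')

categories = list(sex.cat.categories)  # inferred categories are sorted
codes = sex.cat.codes.tolist()         # one small integer per value; -1 marks a missing value

print(categories)
print(codes)
```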
Hands on!¶
import numpy as np
import pandas as pd
We'll work with a dataset that contains self-reported clothing information, product categories, catalog sizes, customers’ measurements (etc.) from Rent the Runway, a unique platform that allows women to rent clothes for various occasions.
runway = pd.read_json('data/runway.json')
runway.reset_index(drop=True, inplace=True)
runway.head()
Categoricals¶
Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (its categories).
Common examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.
Categoricals are stored in a Series or in DataFrame columns (which are also Series) and are useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
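As a brief preview of the second case, the sketch below sorts the same made-up values twice: once as plain strings and once as an ordered categorical. The variable names and the use of `pd.CategoricalDtype` here are illustrative choices, not part of the dataset workflow that follows.

```python
import pandas as pd

# Made-up values whose alphabetical order differs from their logical order.
words = pd.Series(['two', 'three', 'one', 'two'])

# Plain strings sort alphabetically.
lexical = words.sort_values().tolist()

# An ordered categorical sorts by the order in which the categories are declared.
logical_dtype = pd.CategoricalDtype(categories=['one', 'two', 'three'], ordered=True)
as_cat = words.astype(logical_dtype)
logical = as_cat.sort_values().tolist()

# min/max follow the declared order as well.
smallest, largest = as_cat.min(), as_cat.max()

print(lexical)
print(logical)
print(smallest, largest)
```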
s = pd.Series(['M', 'F', 'F', 'M', 'F', 'M', 'M'], dtype='category')
s
Notice that every value in the fit feedback column of our dataset belongs to one of three classes: small, fit and large.
You can cast an existing column to a categorical dtype using astype():
runway['fit'].unique()
runway['fit'] = runway['fit'].astype('category')
runway['fit'].unique()
In upcoming lessons we'll see how to order those categories; don't worry about that for now.
runway['fit'].values
Every value belongs to one of the following categories:
runway['fit'].values.categories
The category dtype internally encodes each value as an integer code:
runway['fit'].values.codes
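To make the relationship between categories and codes concrete, here is a small sketch on a made-up column (the `feedback` values are hypothetical and are not read from the dataset). Each code is the position of a value within the categories, so looking the codes up in the categories rebuilds the original labels:

```python
import pandas as pd

# Hypothetical feedback values standing in for a column like runway['fit'].
feedback = pd.Series(['fit', 'small', 'fit', 'large', 'fit']).astype('category')

categories = feedback.cat.categories   # the distinct labels, sorted
codes = feedback.cat.codes.to_numpy()  # position of each value within categories

# Looking up each code in the categories recovers the original labels.
decoded = categories.take(codes).tolist()

print(list(categories))
print(codes.tolist())
print(decoded)
```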
runway['fit'].value_counts()
runway['fit'].unique()
runway['fit'].value_counts().plot(kind='bar')
We can also use seaborn to make a countplot with that categorical data:
import seaborn as sns
sns.countplot(x='fit', data=runway)
Comparing Memory Usage¶
It's easy to see the efficiency of Categorical types. We'll create two Series: s_cat (containing a Categorical type) and s_obj (containing strings, stored as objects). Both will hold the same 1000 randomly generated values:
values = np.random.randint(5, size=1000)
labels = pd.Series([
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
])
s_cat = pd.Series(
    pd.Categorical.from_codes(
        values, labels, ordered=True))
s_obj = labels.take(values)
s_cat.value_counts()
The total space taken by our s_obj series:
s_obj.nbytes
The total space taken by our s_cat series:
s_cat.nbytes
s_cat is about 7 times smaller in bytes (values stored). Its total memory usage, excluding the index, is smaller too:
s_cat.memory_usage(index=False)
s_obj.memory_usage(index=False)
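For an object Series, these figures count only the references to the strings, not the strings themselves. As a rough sketch, passing `deep=True` to `memory_usage` also counts the string contents, which widens the gap. The data is rebuilt here under hypothetical names (`demo_cat`, `demo_obj`) with a fixed seed so the sketch is repeatable; exact byte counts depend on the pandas version and platform.

```python
import numpy as np
import pandas as pd

labels = ['Very dissatisfied', 'Somewhat dissatisfied',
          'Neither satisfied nor dissatisfied',
          'Somewhat satisfied', 'Very satisfied']

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
codes = rng.integers(5, size=1000)

# Same values stored two ways: as a categorical and as Python string objects.
demo_cat = pd.Series(pd.Categorical.from_codes(codes, categories=labels, ordered=True))
demo_obj = pd.Series(np.array(labels, dtype=object)[codes], dtype=object)

shallow_obj = demo_obj.memory_usage(index=False)          # references only
deep_obj = demo_obj.memory_usage(index=False, deep=True)  # references plus string contents
deep_cat = demo_cat.memory_usage(index=False, deep=True)  # integer codes plus the five labels

print(shallow_obj, deep_obj, deep_cat)
```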