
3.2 - Categorical Ordering and CategoricalDtype

Last updated: April 3rd, 2019

rmotr


Categorical Ordering and CategoricalDtype

We'll see how to create categories with an inherent order.


Hands on!

In [ ]:
import numpy as np
import pandas as pd


Categorical Ordering

Categories can be created with an inherent order. Let's first create two Series with some sample response possibilities (ratings) and some user responses:

In [ ]:
ratings = pd.Series([
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
])
In [ ]:
responses = pd.Series([
    'Very satisfied',
    'Neither satisfied nor dissatisfied',
    'Very satisfied',
    'Somewhat satisfied',
    'Very dissatisfied',
    'Neither satisfied nor dissatisfied',
])

There are several ways to convert the responses to a category dtype with an inherent order:

 1) Create a pandas.Categorical object, then call as_ordered()

In [ ]:
service_ratings = pd.Categorical(responses,
                                 categories=ratings)

service_ratings
In [ ]:
service_ratings.as_ordered()

 2) Create a pandas.Categorical object, with ordered=True

In [ ]:
service_ratings = pd.Categorical(responses,
                                 categories=ratings,
                                 ordered=True)

service_ratings
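To make the effect of the ordered flag concrete, here is a minimal sketch. The variable names (answers, unordered, ordered) are made up for illustration. It suggests that as_ordered() returns a new object rather than changing the original, and that order-dependent operations such as min() and max() are only available once the categorical is ordered.

```python
import pandas as pd

ratings = [
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied',
]
# Hypothetical sample answers, only for this sketch
answers = ['Very satisfied', 'Neither satisfied nor dissatisfied', 'Very dissatisfied']

unordered = pd.Categorical(answers, categories=ratings)
ordered = unordered.as_ordered()  # returns a new object; `unordered` is unchanged

# With an order defined, min() and max() follow the category order
worst = ordered.min()
best = ordered.max()

# Without an order, the same operation is rejected with a TypeError
try:
    unordered.min()
    unordered_min_works = True
except TypeError:
    unordered_min_works = False

print(unordered.ordered, ordered.ordered)
print(worst, '|', best)
print(unordered_min_works)
```

Passing ordered=True to the constructor, as in option 2, gives the same result as calling as_ordered() afterwards.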

We can also explore our category:

In [ ]:
service_ratings.categories
In [ ]:
service_ratings.codes
In [ ]:
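# get_values() was deprecated and later removed from pandas;
# np.asarray returns the underlying values as a NumPy array
np.asarray(service_ratings)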
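The relationship between categories and codes can be sketched as follows: each code is the position of the value inside categories, so indexing categories with codes rebuilds the original values. The letters used here are placeholder data, not part of the survey example.

```python
import pandas as pd

# Placeholder data: the category order is deliberately not alphabetical
cat = pd.Categorical(['b', 'a', 'c', 'b'],
                     categories=['c', 'b', 'a'],
                     ordered=True)

codes = cat.codes.tolist()          # position of each value in `categories`
rebuilt = cat.categories[cat.codes] # look the codes up again

print(codes)
print(list(rebuilt))
```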


The cat accessor object

A common approach is to construct a Series from the pandas.Categorical object and then use the cat accessor to reference the categorical info:

In [ ]:
s = pd.Series(service_ratings)
In [ ]:
s.cat.codes
In [ ]:
s.sort_values()
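As a small illustration of why sorting a categorical Series is useful, the sketch below compares it with sorting plain strings. The sizes list and the sample values are hypothetical.

```python
import pandas as pd

sizes = ['small', 'medium', 'large']  # hypothetical category order

plain = pd.Series(['large', 'small', 'medium'])
as_cat = pd.Series(pd.Categorical(plain, categories=sizes, ordered=True))

# Plain strings sort alphabetically
alphabetical = plain.sort_values().tolist()

# Categorical values sort by their position in `categories`
by_category = as_cat.sort_values().tolist()

print(alphabetical)
print(by_category)
```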


CategoricalDtype object

You can also create a Series from the raw values and describe the categories with a CategoricalDtype object:

In [ ]:
ratings = [
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
]
In [ ]:
from pandas.api.types import CategoricalDtype

cat_dtype = CategoricalDtype(categories=ratings,
                             ordered=True)

cat_dtype

Now assign that custom dtype to our Series:

In [ ]:
responses = [
    'Very satisfied',
    'Neither satisfied nor dissatisfied',
    'Very satisfied',
    'Somewhat satisfied',
    'Very dissatisfied',
    'Neither satisfied nor dissatisfied',
]
In [ ]:
s = pd.Series(responses, dtype=cat_dtype)

s

Another way to achieve the same structure is by doing:

In [ ]:
s = pd.Series(responses)

s = s.astype(cat_dtype)

s

We can also explore our category:

In [ ]:
s.sort_values()
In [ ]:
s.cat.codes
In [ ]:
s.cat.categories
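Because the dtype is ordered, comparisons against a category label should also work, which makes filtering straightforward. The following is a minimal sketch with made-up sample answers; the names survey_dtype, answers and satisfied are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

survey_dtype = CategoricalDtype(
    categories=[
        'Very dissatisfied',
        'Somewhat dissatisfied',
        'Neither satisfied nor dissatisfied',
        'Somewhat satisfied',
        'Very satisfied',
    ],
    ordered=True,
)

# Hypothetical sample answers
answers = pd.Series(
    ['Very satisfied',
     'Neither satisfied nor dissatisfied',
     'Somewhat satisfied',
     'Very dissatisfied'],
    dtype=survey_dtype,
)

# Keep only the answers at or above 'Somewhat satisfied' in the category order
mask = answers >= 'Somewhat satisfied'
satisfied = answers[mask]

print(mask.tolist())
print(satisfied.tolist())
```

The scalar on the right-hand side has to be one of the declared categories; comparing against an unknown label raises an error.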


Categories in DataFrames

Categorical data in DataFrames behaves in the same way. After all, each column is a Series.

We'll work with a dataset from Rent the Runway, a platform that allows women to rent clothes for various occasions. It contains self-reported clothing information, product categories, catalog sizes, customers’ measurements and more.

In [ ]:
runway = pd.read_json('data/runway.json')

runway.reset_index(drop=True, inplace=True)

runway.head()

The body type, category, fit and rented for columns are clearly categorical.

Let's convert those columns to the category dtype.

In [ ]:
runway['body type'] = runway['body type'].astype('category')
runway['category'] = runway['category'].astype('category')
runway['fit'] = runway['fit'].astype('category')
runway['rented for'] = runway['rented for'].astype('category')
In [ ]:
runway['body type'].values
In [ ]:
runway['category'].values
In [ ]:
runway['fit'].values
In [ ]:
runway['rented for'].values
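When astype('category') is used without further information, pandas infers the categories from the data. A minimal sketch on a hypothetical three-value column (not the real runway data) shows that the inferred categories come out sorted alphabetically and unordered:

```python
import pandas as pd

# Hypothetical stand-in for the 'fit' column
df = pd.DataFrame({'fit': ['small', 'fit', 'large', 'fit']})
df['fit'] = df['fit'].astype('category')

inferred = list(df['fit'].cat.categories)  # sorted unique values
is_ordered = df['fit'].cat.ordered         # no order is assumed

print(inferred)
print(is_ordered)
```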

The categories of the fit column are not in a meaningful order, so let's order them the right way:

In [ ]:
runway['fit'] = pd.Categorical(runway['fit'],
                               categories=['small', 'fit', 'large'],
                               ordered=True)
In [ ]:
runway['fit'].values
In [ ]:
runway['fit'].values.categories
In [ ]:
runway['fit'].value_counts()
In [ ]:
runway['fit'].values.codes
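One thing to watch for when passing an explicit categories list: values that are not in the list are not rejected but silently become missing, with code -1. The sketch below uses made-up values, including a deliberately misspelled 'Large':

```python
import pandas as pd

# 'Large' (capitalised) is intentionally not one of the declared categories
values = ['small', 'fit', 'Large', 'large']

cat = pd.Categorical(values,
                     categories=['small', 'fit', 'large'],
                     ordered=True)

codes = cat.codes.tolist()          # -1 marks a value outside the categories
n_missing = int(cat.isna().sum())   # such values are treated as missing

print(codes)
print(n_missing)
```

Counting missing values before and after the conversion is a cheap way to spot labels that were left out of the list.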

The same ordering can be set in a DataFrame with a CategoricalDtype. Here we first reset the column to the object dtype:

In [ ]:
runway['fit'] = runway['fit'].astype('object')

And then set the type again:

In [ ]:
fit_cat_dtype = CategoricalDtype(['small', 'fit', 'large'], ordered=True)

runway['fit'] = runway['fit'].astype(fit_cat_dtype)
In [ ]:
runway['fit'].cat.ordered
In [ ]:
runway.sort_values('fit')
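A small sketch on a hypothetical DataFrame (not the runway dataset) suggests that, at least in recent pandas versions, astype with a CategoricalDtype can also be applied directly to a column that is already categorical, and that sort_values then follows the declared order. The column names and values are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data; 'item' is only there to show whole rows being reordered
df = pd.DataFrame({
    'item': ['dress', 'gown', 'jacket'],
    'fit': ['large', 'small', 'fit'],
})

# Start from an inferred (alphabetical, unordered) category dtype
df['fit'] = df['fit'].astype('category')

# Re-cast with an explicit, ordered dtype
fit_dtype = CategoricalDtype(['small', 'fit', 'large'], ordered=True)
df['fit'] = df['fit'].astype(fit_dtype)

sorted_df = df.sort_values('fit')

print(df['fit'].cat.ordered)
print(sorted_df['fit'].tolist())
print(sorted_df['item'].tolist())
```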
