3.5 - Creating Categories With Cutting and Binning

Last updated: July 10th, 2019

rmotr


Creating categories with Cutting and Binning

Sometimes we need to divide a field with a continuous range of data into discrete categories. For example, you might group users' ages into 0-14, 15-35, 36-60, and over 60.

Although this does not directly use grouping constructs, it is worth explaining how continuous data is discretized. Discretization slices continuous data into a set of "bins", where each bin represents a range of the continuous sample. Each item is then placed into the appropriate bin, hence the term "binning".

Discretization in pandas is performed using the pd.cut() and pd.qcut() functions.

Hands on!

In [1]:
import pandas as pd
import numpy as np

Define our sample data

In [2]:
# 16 random ages in [0, 99) plus three values that sit exactly on the bin edges used below
ages = np.append(np.random.randint(0, 99, size=16), [14, 35, 60])
In [3]:
ages
Out[3]:
array([41, 20, 80, 24, 88, 35,  0, 73, 22, 57, 38, 29, 47, 72, 81,  6, 14,
       35, 60])
In [4]:
bins = [0, 14, 35, 60, 100]

cut

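When using cut, bins are defined by the values themselves (the bin edges), not by how frequently those values occur. If you pass an integer, the edges are evenly spaced over the range of the data; if you pass a list of edges, as we do here, those edges are used as given.

This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. It supports binning into a given number of equal-width bins, or into a pre-specified array of bin edges.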
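As a minimal sketch of the integer form, the snippet below asks cut for four equal-width bins. It assumes a fixed copy of the sample ages, stored under the illustrative name sample_ages so it does not clash with the notebook's ages variable.

```python
import numpy as np
import pandas as pd

# Fixed copy of the sample ages, so the result is reproducible.
sample_ages = np.array([41, 20, 80, 24, 88, 35, 0, 73, 22, 57,
                        38, 29, 47, 72, 81, 6, 14, 35, 60])

# An integer asks cut for that many equal-width bins spanning min..max.
# retbins=True additionally returns the edges that were computed.
equal_width, edges = pd.cut(sample_ages, 4, retbins=True)

print(edges)                       # 5 edges delimiting 4 bins
print(equal_width.value_counts())  # counts per bin are not forced to be equal
```

The bins have the same width, but they do not hold the same number of ages, which is the key difference from qcut.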

In [5]:
categories = pd.cut(ages, bins)
In [6]:
categories
Out[6]:
[(35, 60], (14, 35], (60, 100], (14, 35], (60, 100], ..., (60, 100], (0, 14], (0, 14], (14, 35], (35, 60]]
Length: 19
Categories (4, interval[int64]): [(0, 14] < (14, 35] < (35, 60] < (60, 100]]
In [7]:
categories.codes
Out[7]:
array([ 2,  1,  3,  1,  3,  1, -1,  3,  1,  2,  2,  1,  2,  3,  3,  0,  0,
        1,  2], dtype=int8)
In [8]:
categories.categories
Out[8]:
IntervalIndex([(0, 14], (14, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')
In [9]:
categories.value_counts()
Out[9]:
(0, 14]      2
(14, 35]     6
(35, 60]     5
(60, 100]    5
dtype: int64
In [10]:
np.sort(ages)
Out[10]:
array([ 0,  6, 14, 20, 22, 24, 29, 35, 35, 38, 41, 47, 57, 60, 72, 73, 80,
       81, 88])

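Mathematically speaking, the intervals are closed on the right: each bin includes its right edge and excludes its left edge. Age 14 therefore falls in the first category, (0, 14], while age 0 falls in no bin at all and is reported as NaN (code -1 in the output above), which is why the counts add up to 18 rather than 19. You can change which side is inclusive with the right parameter. By default, right is True, which makes the right edge the inclusive one.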

In [11]:
lefty_cats = pd.cut(ages, bins, right=False)
In [12]:
lefty_cats.value_counts()
Out[12]:
[0, 14)      2
[14, 35)     5
[35, 60)     6
[60, 100)    6
dtype: int64
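If you want to keep right-closed intervals but still capture a value equal to the lowest edge, cut also accepts include_lowest=True. The sketch below uses a handful of illustrative ages (sample_ages is a made-up array chosen to sit on or near the edges) to compare both behaviours.

```python
import numpy as np
import pandas as pd

# Hypothetical ages sitting on or near the bin edges.
sample_ages = np.array([0, 6, 14, 35, 60])
bins = [0, 14, 35, 60, 100]

default_cut = pd.cut(sample_ages, bins)                       # (0, 14], (14, 35], ...
with_lowest = pd.cut(sample_ages, bins, include_lowest=True)  # first bin also takes 0

# A code of -1 marks a value that fell outside every bin (displayed as NaN).
print(default_cut.codes)   # first code is -1: age 0 is not binned
print(with_lowest.codes)   # first code is 0: age 0 joins the first bin
```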

You can also pass labels to give better names to your bins:

In [13]:
categories = pd.cut(ages, bins, labels=['Age 0-14', 'Age 15-35', 'Age 36-60', '+60'])
In [14]:
categories
Out[14]:
[Age 36-60, Age 15-35, +60, Age 15-35, +60, ..., +60, Age 0-14, Age 0-14, Age 15-35, Age 36-60]
Length: 19
Categories (4, object): [Age 0-14 < Age 15-35 < Age 36-60 < +60]
In [15]:
categories.value_counts()
Out[15]:
Age 0-14     2
Age 15-35    6
Age 36-60    5
+60          5
dtype: int64
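In practice the labelled categories usually end up as a new column in a DataFrame. The sketch below assumes a small, made-up users table (the names and ages are invented for illustration) and reuses the same edges and labels. Note that labels needs exactly one entry per bin, that is, one fewer than the number of edges.

```python
import pandas as pd

# Hypothetical user table; names and ages are made up for illustration.
users = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev", "Eli"],
    "age": [9, 14, 27, 52, 71],
})

bins = [0, 14, 35, 60, 100]
labels = ["Age 0-14", "Age 15-35", "Age 36-60", "+60"]  # len(bins) - 1 labels

# cut accepts a Series and returns a categorical Series aligned to it.
users["age_group"] = pd.cut(users["age"], bins, labels=labels)

print(users)
# sort_index() orders the counts by bin rather than by frequency.
print(users["age_group"].value_counts().sort_index())
```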

qcut

But what happens if you don't know where the bin edges should be, and you just need to split the data into groups of similar size?

qcut discretizes a variable into bins that hold roughly the same number of observations, using quantiles of the data's distribution. So, when you ask qcut for four quantiles, the edges are chosen so that each bin contains about a quarter of the values.

In [16]:
pd.qcut(ages, 4).value_counts()
Out[16]:
(-0.001, 23.0]    5
(23.0, 38.0]      5
(38.0, 66.0]      4
(66.0, 88.0]      5
dtype: int64

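In this case, qcut has chosen the bin edges for us, based on the distribution of the data. The counts are only approximately equal (5, 5, 4, 5) because 19 values cannot be split evenly into four bins.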
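To see where those edges come from, the following sketch asks qcut to return them with retbins=True and compares them with the quartiles computed by np.quantile. It assumes a fixed copy of the sample ages (named sample_ages here for illustration). It also shows that qcut accepts a list of quantiles when the groups should not be equal in size.

```python
import numpy as np
import pandas as pd

sample_ages = np.array([41, 20, 80, 24, 88, 35, 0, 73, 22, 57,
                        38, 29, 47, 72, 81, 6, 14, 35, 60])

# retbins=True also returns the edges qcut derived from the quartiles.
quartiles, edges = pd.qcut(sample_ages, 4, retbins=True)
print(edges)

# The quartiles computed directly, for comparison with the edges above.
print(np.quantile(sample_ages, [0, 0.25, 0.5, 0.75, 1]))

# A list of quantiles gives uneven groups: bottom 10%, middle 80%, top 10%.
tails = pd.qcut(sample_ages, [0, 0.1, 0.9, 1], labels=["low", "mid", "high"])
print(tails.value_counts())
```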
