Profile picture

Simulating Range Lookup Using Pandas

Last updated: July 24th, 20192019-07-24Project preview

First we'll read the Excel file for this lesson and analyze it:

In [18]:
import numpy as np
import pandas as pd
In [19]:
excel_file = pd.ExcelFile('3. Products.xlsx')
In [20]:
excel_file.sheet_names
Out[20]:
['Products list', 'Price categories list', 'Solution']

Now we'll read products and price categories sheets:

In [21]:
products = excel_file.parse('Products list')
In [22]:
products.head()
Out[22]:
product_id price merchant_id brand name
0 AVphzgbJLJeJML43fA0o 0.00 1001 Sanus Sanus VLF410B1 10-Inch Super Slim Full-Motion ...
1 AVpgMuGwLJeJML43KY_c 50.00 1002 Boytone Boytone - 2500W 2.1-Ch. Home Theater System - ...
2 AVpe9FXeLJeJML43zHrq 100.00 1001 DENAQ DENAQ - AC Adapter for TOSHIBA SATELLITE
3 AVpfVJXu1cnluZ0-iwTT 200.00 1001 DreamWave DreamWave - Tremor Portable Bluetooth Speaker ...
4 AVphUeKeilAPnD_x3-Be 244.01 1004 Yamaha NS-SP1800BL 5.1-Channel Home Theater System (B...
In [23]:
products.shape
Out[23]:
(1245, 5)
In [24]:
categories = excel_file.parse('Price categories list')
In [25]:
categories
Out[25]:
price price_category
0 0 A
1 50 B
2 100 C
3 200 D
4 500 E
5 1000 F
6 5000 G
7 10000 H
8 12000 I
9 15000 J
10 20000 K

Having our data loaded, now we'll try to assign to each product its price category using pandas cut() method:

green-divider

Creating categories

Sometimes we need to divide a field with a continuous range of data into discrete categories. For example, you might divide age of users as 0-14, 15-35, 36-60, +60.

Although not directly using grouping constructs, it is worth explaining the process of discretization of continuous data. Discretization is a means of slicing up continuous data into a set of "bins", where each bin represents a range of the continuous sample and the items are then placed into the appropriate bin—hence the term "binning".

Discretization in pandas is performed using the cut() method. To use .cut() we need to provide the "limits" of our bins. In our case, we'll use the categories from the Price categories list:

In [26]:
categories['price']
Out[26]:
0         0
1        50
2       100
3       200
4       500
5      1000
6      5000
7     10000
8     12000
9     15000
10    20000
Name: price, dtype: int64

The bins in this case will be:

  • 0-50
  • 50-100
  • 100-200
  • etc...

 But what is each interval (bin) representing?

  • An open interval does not include its endpoints, and is indicated with parentheses.
  • A closed interval is an interval which includes all its limit points, and is denoted with square brackets.

Example A

The following interval will have numbers from 1 to 50. The 0 is not included:

$$(0 ; 50]$$

Example B

The following interval will have numbers from 0 to 49. The 50 is not included:

$$[0 ; 50)$$

Creating categories

In this case we want to create categories where the rightmost edge of each interval is not included, as we saw on Example B, so we'll need to use the right parameter.

This will let us change whether each interval includes the rightmost edge or not.

Default Excel's range VLOOKUP is NOT including the rightmost edge.

So to achieve the same behaviour as Excel's range VLOOKUP we'll use right=False parameter.

In [40]:
products_categories = pd.cut(products['price'],
                             bins=categories['price'],
                             right=False)
In [41]:
products_categories.head(10)
Out[41]:
0        [0, 50)
1      [50, 100)
2     [100, 200)
3     [200, 500)
4     [200, 500)
5    [500, 1000)
6     [200, 500)
7     [200, 500)
8     [100, 200)
9      [50, 100)
Name: price, dtype: category
Categories (10, interval[int64]): [[0, 50) < [50, 100) < [100, 200) < [200, 500) ... [5000, 10000) < [10000, 12000) < [12000, 15000) < [15000, 20000)]

We see each product was assigned to an interval. There are a total of 10 intervals (price_categories).

We can go a step further and give better names to each interval by using the labels provided in the sheet.

To do that, we'll use the labels parameter within cut().

In [43]:
categories['price_category'][:-1]
Out[43]:
0    A
1    B
2    C
3    D
4    E
5    F
6    G
7    H
8    I
9    J
Name: price_category, dtype: object
In [44]:
products_categories = pd.cut(products['price'],
                             bins=categories['price'],
                             labels=categories['price_category'][:-1],
                             right=False)
In [45]:
products_categories.head(10)
Out[45]:
0    A
1    B
2    C
3    D
4    D
5    E
6    D
7    D
8    C
9    B
Name: price, dtype: category
Categories (10, object): [A < B < C < D ... G < H < I < J]

green-divider

Changing interval borders

With Pandas we can change whether each interval includes the rightmost edge or not with the right parameter.

Let's change it value and see what happens:

In [46]:
pd.cut(products['price'],
       bins=categories['price'],
       right=True).head()
Out[46]:
0           NaN
1       (0, 50]
2     (50, 100]
3    (100, 200]
4    (200, 500]
Name: price, dtype: category
Categories (10, interval[int64]): [(0, 50] < (50, 100] < (100, 200] < (200, 500] ... (5000, 10000] < (10000, 12000] < (12000, 15000] < (15000, 20000]]
In [47]:
including_right = pd.cut(products['price'],
                        bins=categories['price'],
                        labels=categories['price_category'][:-1],
                        right=True)
In [48]:
including_right.head()
Out[48]:
0    NaN
1      A
2      B
3      C
4      D
Name: price, dtype: category
Categories (10, object): [A < B < C < D ... G < H < I < J]

By changing the right parameter value some products will change its category:

In [49]:
pd.cut(products['price'],
       bins=categories['price'],
       right=False).head()
Out[49]:
0       [0, 50)
1     [50, 100)
2    [100, 200)
3    [200, 500)
4    [200, 500)
Name: price, dtype: category
Categories (10, interval[int64]): [[0, 50) < [50, 100) < [100, 200) < [200, 500) ... [5000, 10000) < [10000, 12000) < [12000, 15000) < [15000, 20000)]
In [50]:
including_left = pd.cut(products['price'],
                        bins=categories['price'],
                        labels=categories['price_category'][:-1],
                        right=False)
In [51]:
including_left.head()
Out[51]:
0    A
1    B
2    C
3    D
4    D
Name: price, dtype: category
Categories (10, object): [A < B < C < D ... G < H < I < J]

green-divider

Merging together products and price categories

Finally we'll add price categories to each product:

In [ ]:
products['category'] = including_left
In [ ]:
products.head()

purple-divider

Notebooks AI
Notebooks AI Profile20060