FreeCodeCamp - Pandas Real Life Example

Last updated: April 23rd, 20202020-04-23Project preview

rmotr


Bike store sales

In this class we'll be analyzing sales made on bike stores.

Follow this data in a Google Spreadsheet

purple-divider

Hands on!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

green-divider

Loading our data:

In [2]:
!head data/sales_data.csv
Date,Day,Month,Year,Customer_Age,Age_Group,Customer_Gender,Country,State,Product_Category,Sub_Category,Product,Order_Quantity,Unit_Cost,Unit_Price,Profit,Cost,Revenue
2013-11-26,26,November,2013,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
2015-11-26,26,November,2015,19,Youth (<25),M,Canada,British Columbia,Accessories,Bike Racks,Hitch Rack - 4-Bike,8,45,120,590,360,950
2014-03-23,23,March,2014,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,23,45,120,1366,1035,2401
2016-03-23,23,March,2016,49,Adults (35-64),M,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,20,45,120,1188,900,2088
2014-05-15,15,May,2014,47,Adults (35-64),F,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,238,180,418
2016-05-15,15,May,2016,47,Adults (35-64),F,Australia,New South Wales,Accessories,Bike Racks,Hitch Rack - 4-Bike,5,45,120,297,225,522
2014-05-22,22,May,2014,47,Adults (35-64),F,Australia,Victoria,Accessories,Bike Racks,Hitch Rack - 4-Bike,4,45,120,199,180,379
2016-05-22,22,May,2016,47,Adults (35-64),F,Australia,Victoria,Accessories,Bike Racks,Hitch Rack - 4-Bike,2,45,120,100,90,190
2014-02-22,22,February,2014,35,Adults (35-64),M,Australia,Victoria,Accessories,Bike Racks,Hitch Rack - 4-Bike,22,45,120,1096,990,2086
In [3]:
sales = pd.read_csv(
    'data/sales_data.csv',
    parse_dates=['Date'])

green-divider

The data at a glance:

In [4]:
sales.head()
Out[4]:
Date Day Month Year Customer_Age Age_Group Customer_Gender Country State Product_Category Sub_Category Product Order_Quantity Unit_Cost Unit_Price Profit Cost Revenue
0 2013-11-26 26 November 2013 19 Youth (<25) M Canada British Columbia Accessories Bike Racks Hitch Rack - 4-Bike 8 45 120 590 360 950
1 2015-11-26 26 November 2015 19 Youth (<25) M Canada British Columbia Accessories Bike Racks Hitch Rack - 4-Bike 8 45 120 590 360 950
2 2014-03-23 23 March 2014 49 Adults (35-64) M Australia New South Wales Accessories Bike Racks Hitch Rack - 4-Bike 23 45 120 1366 1035 2401
3 2016-03-23 23 March 2016 49 Adults (35-64) M Australia New South Wales Accessories Bike Racks Hitch Rack - 4-Bike 20 45 120 1188 900 2088
4 2014-05-15 15 May 2014 47 Adults (35-64) F Australia New South Wales Accessories Bike Racks Hitch Rack - 4-Bike 4 45 120 238 180 418
In [5]:
sales.shape
Out[5]:
(113036, 18)
In [6]:
sales.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113036 entries, 0 to 113035
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Date              113036 non-null  datetime64[ns]
 1   Day               113036 non-null  int64         
 2   Month             113036 non-null  object        
 3   Year              113036 non-null  int64         
 4   Customer_Age      113036 non-null  int64         
 5   Age_Group         113036 non-null  object        
 6   Customer_Gender   113036 non-null  object        
 7   Country           113036 non-null  object        
 8   State             113036 non-null  object        
 9   Product_Category  113036 non-null  object        
 10  Sub_Category      113036 non-null  object        
 11  Product           113036 non-null  object        
 12  Order_Quantity    113036 non-null  int64         
 13  Unit_Cost         113036 non-null  int64         
 14  Unit_Price        113036 non-null  int64         
 15  Profit            113036 non-null  int64         
 16  Cost              113036 non-null  int64         
 17  Revenue           113036 non-null  int64         
dtypes: datetime64[ns](1), int64(9), object(8)
memory usage: 15.5+ MB
In [7]:
sales.describe()
Out[7]:
Day Year Customer_Age Order_Quantity Unit_Cost Unit_Price Profit Cost Revenue
count 113036.000000 113036.000000 113036.000000 113036.000000 113036.000000 113036.000000 113036.000000 113036.000000 113036.000000
mean 15.665753 2014.401739 35.919212 11.901660 267.296366 452.938427 285.051665 469.318695 754.370360
std 8.781567 1.272510 11.021936 9.561857 549.835483 922.071219 453.887443 884.866118 1309.094674
min 1.000000 2011.000000 17.000000 1.000000 1.000000 2.000000 -30.000000 1.000000 2.000000
25% 8.000000 2013.000000 28.000000 2.000000 2.000000 5.000000 29.000000 28.000000 63.000000
50% 16.000000 2014.000000 35.000000 10.000000 9.000000 24.000000 101.000000 108.000000 223.000000
75% 23.000000 2016.000000 43.000000 20.000000 42.000000 70.000000 358.000000 432.000000 800.000000
max 31.000000 2016.000000 87.000000 32.000000 2171.000000 3578.000000 15096.000000 42978.000000 58074.000000

green-divider

Numerical analysis and visualization

We'll analyze the Unit_Cost column:

In [8]:
sales['Unit_Cost'].describe()
Out[8]:
count    113036.000000
mean        267.296366
std         549.835483
min           1.000000
25%           2.000000
50%           9.000000
75%          42.000000
max        2171.000000
Name: Unit_Cost, dtype: float64
In [9]:
sales['Unit_Cost'].mean()
Out[9]:
267.296365759581
In [10]:
sales['Unit_Cost'].median()
Out[10]:
9.0
In [11]:
sales['Unit_Cost'].plot(kind='box', vert=False, figsize=(14,6))
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8af096b700>
In [12]:
sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) # kde
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8af0bf0f70>
In [13]:
ax = sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) # kde
ax.axvline(sales['Unit_Cost'].mean(), color='red')
ax.axvline(sales['Unit_Cost'].median(), color='green')
Out[13]:
<matplotlib.lines.Line2D at 0x7f8b17499250>
In [14]:
ax = sales['Unit_Cost'].plot(kind='hist', figsize=(14,6))
ax.set_ylabel('Number of Sales')
ax.set_xlabel('dollars')
Out[14]:
Text(0.5, 0, 'dollars')

green-divider

Categorical analysis and visualization

We'll analyze the Age_Group column:

In [16]:
sales.head()
Out[16]:
Date Day Month Year Customer_Age Age_Group Customer_Gender Country State Product_Category Sub_Category Product Order_Quantity Unit_Cost Unit_Price Profit Cost Revenue
0 2013-11-26 26 November 2013 19 Youth (<25) M Canada British Columbia Accessories Bike Racks Hitch Rack - 4-Bike 8 45 120 590 360 950
1 2015-11-26 26 November 2015 19 Youth (<25) M Canada British Columbia Accessories Bike Racks Hitch Rack - 4-Bike 8 45 120 590 360 950
2 2014-03-23 23 March 2014 49 Adults (35-64) M Australia New South Wales Accessories Bike Racks Hitch Rack - 4-Bike 23 45 120 1366 1035 2401
3 2016-03-23 23 March 2016 49 Adults (35-64) M Australia New South Wales Accessories Bike Racks Hitch Rack - 4-Bike 20 45 120 1188 900 2088
4 2014-05-15 15 May 2014 47 Adults (35-64) F Australia New South Wales Accessories Bike Racks Hitch Rack - 4-Bike 4 45 120 238 180 418
In [15]:
sales['Age_Group'].value_counts()
Out[15]:
Adults (35-64)          55824
Young Adults (25-34)    38654
Youth (<25)             17828
Seniors (64+)             730
Name: Age_Group, dtype: int64
In [17]:
sales['Age_Group'].value_counts().plot(kind='pie', figsize=(6,6))
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade8bc2e0>
In [18]:
ax = sales['Age_Group'].value_counts().plot(kind='bar', figsize=(14,6))
ax.set_ylabel('Number of Sales')
Out[18]:
Text(0, 0.5, 'Number of Sales')

green-divider

Relationship between the columns?

Can we find any significant relationship?

In [19]:
corr = sales.corr()

corr
Out[19]:
Day Year Customer_Age Order_Quantity Unit_Cost Unit_Price Profit Cost Revenue
Day 1.000000 -0.007635 -0.014296 -0.002412 0.003133 0.003207 0.004623 0.003329 0.003853
Year -0.007635 1.000000 0.040994 0.123169 -0.217575 -0.213673 -0.181525 -0.215604 -0.208673
Customer_Age -0.014296 0.040994 1.000000 0.026887 -0.021374 -0.020262 0.004319 -0.016013 -0.009326
Order_Quantity -0.002412 0.123169 0.026887 1.000000 -0.515835 -0.515925 -0.238863 -0.340382 -0.312895
Unit_Cost 0.003133 -0.217575 -0.021374 -0.515835 1.000000 0.997894 0.741020 0.829869 0.817865
Unit_Price 0.003207 -0.213673 -0.020262 -0.515925 0.997894 1.000000 0.749870 0.826301 0.818522
Profit 0.004623 -0.181525 0.004319 -0.238863 0.741020 0.749870 1.000000 0.902233 0.956572
Cost 0.003329 -0.215604 -0.016013 -0.340382 0.829869 0.826301 0.902233 1.000000 0.988758
Revenue 0.003853 -0.208673 -0.009326 -0.312895 0.817865 0.818522 0.956572 0.988758 1.000000
In [20]:
fig = plt.figure(figsize=(8,8))
plt.matshow(corr, cmap='RdBu', fignum=fig.number)
plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical');
plt.yticks(range(len(corr.columns)), corr.columns);
In [21]:
sales.plot(kind='scatter', x='Customer_Age', y='Revenue', figsize=(6,6))
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade7c34f0>
In [22]:
sales.plot(kind='scatter', x='Revenue', y='Profit', figsize=(6,6))
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade7a70a0>
In [23]:
ax = sales[['Profit', 'Age_Group']].boxplot(by='Age_Group', figsize=(10,6))
ax.set_ylabel('Profit')
Out[23]:
Text(0, 0.5, 'Profit')
In [24]:
boxplot_cols = ['Year', 'Customer_Age', 'Order_Quantity', 'Unit_Cost', 'Unit_Price', 'Profit']

sales[boxplot_cols].plot(kind='box', subplots=True, layout=(2,3), figsize=(14,8))
Out[24]:
Year                 AxesSubplot(0.125,0.536818;0.227941x0.343182)
Customer_Age      AxesSubplot(0.398529,0.536818;0.227941x0.343182)
Order_Quantity    AxesSubplot(0.672059,0.536818;0.227941x0.343182)
Unit_Cost               AxesSubplot(0.125,0.125;0.227941x0.343182)
Unit_Price           AxesSubplot(0.398529,0.125;0.227941x0.343182)
Profit               AxesSubplot(0.672059,0.125;0.227941x0.343182)
dtype: object

green-divider

Column wrangling

We can also create new columns or modify existing ones.

Add and calculate a new Revenue_per_Age column

In [25]:
sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age']

sales['Revenue_per_Age'].head()
Out[25]:
0    50.000000
1    50.000000
2    49.000000
3    42.612245
4     8.893617
Name: Revenue_per_Age, dtype: float64
In [26]:
sales['Revenue_per_Age'].plot(kind='density', figsize=(14,6))
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade499d90>
In [27]:
sales['Revenue_per_Age'].plot(kind='hist', figsize=(14,6))
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade3f9100>

Add and calculate a new Calculated_Cost column

Use this formula

$$ Calculated\_Cost = Order\_Quantity * Unit\_Cost $$
In [28]:
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost']

sales['Calculated_Cost'].head()
Out[28]:
0     360
1     360
2    1035
3     900
4     180
Name: Calculated_Cost, dtype: int64
In [29]:
(sales['Calculated_Cost'] != sales['Cost']).sum()
Out[29]:
0

We can see the relationship between Cost and Profit using a scatter plot:

In [30]:
sales.plot(kind='scatter', x='Calculated_Cost', y='Profit', figsize=(6,6))
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade3df790>

Add and calculate a new Calculated_Revenue column

Use this formula

$$ Calculated\_Revenue = Cost + Profit $$
In [31]:
sales['Calculated_Revenue'] = sales['Cost'] + sales['Profit']

sales['Calculated_Revenue'].head()
Out[31]:
0     950
1     950
2    2401
3    2088
4     418
Name: Calculated_Revenue, dtype: int64
In [32]:
(sales['Calculated_Revenue'] != sales['Revenue']).sum()
Out[32]:
0
In [33]:
sales.head()
Out[33]:
Date Day Month Year Customer_Age Age_Group Customer_Gender Country State Product_Category ... Product Order_Quantity Unit_Cost Unit_Price Profit Cost Revenue Revenue_per_Age Calculated_Cost Calculated_Revenue
0 2013-11-26 26 November 2013 19 Youth (<25) M Canada British Columbia Accessories ... Hitch Rack - 4-Bike 8 45 120 590 360 950 50.000000 360 950
1 2015-11-26 26 November 2015 19 Youth (<25) M Canada British Columbia Accessories ... Hitch Rack - 4-Bike 8 45 120 590 360 950 50.000000 360 950
2 2014-03-23 23 March 2014 49 Adults (35-64) M Australia New South Wales Accessories ... Hitch Rack - 4-Bike 23 45 120 1366 1035 2401 49.000000 1035 2401
3 2016-03-23 23 March 2016 49 Adults (35-64) M Australia New South Wales Accessories ... Hitch Rack - 4-Bike 20 45 120 1188 900 2088 42.612245 900 2088
4 2014-05-15 15 May 2014 47 Adults (35-64) F Australia New South Wales Accessories ... Hitch Rack - 4-Bike 4 45 120 238 180 418 8.893617 180 418

5 rows × 21 columns

In [34]:
sales['Revenue'].plot(kind='hist', bins=100, figsize=(14,6))
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ade39afa0>

Modify all Unit_Price values adding 3% tax to them

In [35]:
sales['Unit_Price'].head()
Out[35]:
0    120
1    120
2    120
3    120
4    120
Name: Unit_Price, dtype: int64
In [36]:
#sales['Unit_Price'] = sales['Unit_Price'] * 1.03

sales['Unit_Price'] *= 1.03
In [37]:
sales['Unit_Price'].head()
Out[37]:
0    123.6
1    123.6
2    123.6
3    123.6
4    123.6
Name: Unit_Price, dtype: float64

green-divider

Selection & Indexing:

 Get all the sales made in the state of Kentucky

In [38]:
sales.loc[sales['State'] == 'Kentucky']
Out[38]:
Date Day Month Year Customer_Age Age_Group Customer_Gender Country State Product_Category ... Product Order_Quantity Unit_Cost Unit_Price Profit Cost Revenue Revenue_per_Age Calculated_Cost Calculated_Revenue
156 2013-11-04 4 November 2013 40 Adults (35-64) M United States Kentucky Accessories ... Hitch Rack - 4-Bike 1 45 123.60 63 45 108 2.700 45 108
157 2015-11-04 4 November 2015 40 Adults (35-64) M United States Kentucky Accessories ... Hitch Rack - 4-Bike 1 45 123.60 63 45 108 2.700 45 108
23826 2014-04-16 16 April 2014 40 Adults (35-64) M United States Kentucky Accessories ... Fender Set - Mountain 12 8 22.66 142 96 238 5.950 96 238
23827 2016-04-16 16 April 2016 40 Adults (35-64) M United States Kentucky Accessories ... Fender Set - Mountain 14 8 22.66 165 112 277 6.925 112 277
31446 2014-04-16 16 April 2014 40 Adults (35-64) M United States Kentucky Accessories ... Sport-100 Helmet, Blue 29 13 36.05 537 377 914 22.850 377 914
31447 2016-04-16 16 April 2016 40 Adults (35-64) M United States Kentucky Accessories ... Sport-100 Helmet, Blue 31 13 36.05 574 403 977 24.425 403 977
79670 2014-04-16 16 April 2014 40 Adults (35-64) M United States Kentucky Accessories ... ML Mountain Tire 2 11 30.90 32 22 54 1.350 22 54
79671 2014-04-16 16 April 2014 40 Adults (35-64) M United States Kentucky Accessories ... ML Mountain Tire 21 11 30.90 336 231 567 14.175 231 567
79672 2016-04-16 16 April 2016 40 Adults (35-64) M United States Kentucky Accessories ... ML Mountain Tire 1 11 30.90 16 11 27 0.675 11 27
79673 2016-04-16 16 April 2016 40 Adults (35-64) M United States Kentucky Accessories ... ML Mountain Tire 18 11 30.90 288 198 486 12.150 198 486

10 rows × 21 columns

Get the mean revenue of the Adults (35-64) sales group

In [39]:
sales.loc[sales['Age_Group'] == 'Adults (35-64)', 'Revenue'].mean()
Out[39]:
762.8287654055604

How many records belong to Age Group Youth (<25) or Adults (35-64)?

In [43]:
sales.loc[(sales['Age_Group'] == 'Youth (<25)') | (sales['Age_Group'] == 'Adults (35-64)')].shape[0]
Out[43]:
73652

Get the mean revenue of the sales group Adults (35-64) in United States

In [44]:
sales.loc[(sales['Age_Group'] == 'Adults (35-64)') & (sales['Country'] == 'United States'), 'Revenue'].mean()
Out[44]:
726.7260473588342

 Increase the revenue by 10% to every sale made in France

In [45]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()
Out[45]:
50     787
51     787
52    2957
53    2851
60     626
Name: Revenue, dtype: int64
In [46]:
#sales.loc[sales['Country'] == 'France', 'Revenue'] = sales.loc[sales['Country'] == 'France', 'Revenue'] * 1.1

sales.loc[sales['Country'] == 'France', 'Revenue'] *= 1.1
In [47]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()
Out[47]:
50     865.7
51     865.7
52    3252.7
53    3136.1
60     688.6
Name: Revenue, dtype: float64

purple-divider

Notebooks AI
Notebooks AI Profile20060