
Data Science - Class 1

Last updated: January 13th, 2020

rmotr


Bike store sales

In this class we'll analyze sales made at bike stores.


Hands on!

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


Loading our data:

In [ ]:
!head data/sales_data.csv
In [ ]:
sales = pd.read_csv(
    'data/sales_data.csv',
    parse_dates=['Date'])
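
To see what `parse_dates` changes without needing the real file, here is a minimal sketch that reads a tiny CSV held in memory. The two rows below are invented for illustration and are not taken from `sales_data.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the real file has many more rows and columns.
csv_text = "Date,Revenue\n2013-11-26,950\n2015-11-26,950\n"

# Without parse_dates the Date column is read as plain text.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column becomes a datetime column.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(raw['Date'].dtype)
print(parsed['Date'].dtype)

# Datetime columns expose the .dt accessor, e.g. to extract the year.
years = parsed['Date'].dt.year.tolist()
print(years)
```

Parsing dates at load time is usually more convenient than converting the column afterwards, since date-based filtering and grouping work right away.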


The data at a glance:

In [ ]:
sales.head()
In [ ]:
sales.shape
In [ ]:
sales.info()
In [ ]:
sales.describe()


Numerical analysis and visualization

We'll analyze the Unit_Cost column:

In [ ]:
sales['Unit_Cost'].describe()
In [ ]:
sales['Unit_Cost'].mean()
In [ ]:
sales['Unit_Cost'].median()
In [ ]:
sales['Unit_Cost'].plot(kind='box', vert=False, figsize=(14,6))
In [ ]:
sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) # kde
In [ ]:
ax = sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) # kde
ax.axvline(sales['Unit_Cost'].mean(), color='red')
ax.axvline(sales['Unit_Cost'].median(), color='green')
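
The two vertical lines land in different places when the distribution is skewed. A small sketch with made-up values (not taken from the sales data) shows why the mean and median can disagree:

```python
import pandas as pd

# Made-up unit costs: most are small, one is very large (right-skewed).
costs = pd.Series([10, 12, 13, 15, 500])

mean_cost = costs.mean()      # pulled upward by the single large value
median_cost = costs.median()  # middle value, barely affected by it

print(mean_cost, median_cost)
```

When the mean sits well above the median, as it does here, a few large values are likely stretching the right tail.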
In [ ]:
ax = sales['Unit_Cost'].plot(kind='hist', figsize=(14,6))
ax.set_ylabel('Number of Sales')
ax.set_xlabel('dollars')


Categorical analysis and visualization

We'll analyze the Age_Group column:

In [ ]:
sales['Age_Group'].value_counts()
In [ ]:
sales['Age_Group'].value_counts().plot(kind='pie', figsize=(6,6))
In [ ]:
ax = sales['Age_Group'].value_counts().plot(kind='bar', figsize=(14,6))
ax.set_ylabel('Number of Sales')
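
Raw counts are sometimes easier to compare as proportions. A minimal sketch, using a short made-up series rather than the real column, shows `value_counts` with and without `normalize=True`:

```python
import pandas as pd

# Made-up sample of age groups for illustration only.
groups = pd.Series([
    'Adults (35-64)', 'Youth (<25)', 'Adults (35-64)', 'Adults (35-64)',
])

counts = groups.value_counts()                # absolute frequencies
shares = groups.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```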


Relationship between the columns?

Can we find any significant relationship?

In [ ]:
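# Correlate numeric columns only; recent pandas versions raise an error
# when corr() meets text columns such as Country or Age_Group.
corr = sales.select_dtypes(include='number').corr()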

corr
In [ ]:
fig = plt.figure(figsize=(8,8))
plt.matshow(corr, cmap='RdBu', fignum=fig.number)
plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical');
plt.yticks(range(len(corr.columns)), corr.columns);
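
To make the correlation matrix easier to read, here is a small sketch on a made-up frame (the column values are invented). It keeps only numeric columns before calling `corr()`, then checks one perfectly positive and one perfectly negative relationship:

```python
import pandas as pd

# Made-up data: Revenue rises with Cost, Discount falls as Cost rises.
toy = pd.DataFrame({
    'Country': ['France', 'Canada', 'France', 'Canada'],
    'Cost': [10, 20, 30, 40],
    'Revenue': [15, 30, 45, 60],
    'Discount': [4, 3, 2, 1],
})

# Text columns cannot be correlated, so select the numeric ones first.
numeric = toy.select_dtypes(include='number')
toy_corr = numeric.corr()  # Pearson correlation by default

print(toy_corr)
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none. Correlation alone does not tell us which column drives the other.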
In [ ]:
sales.plot(kind='scatter', x='Customer_Age', y='Revenue', figsize=(6,6))
In [ ]:
sales.plot(kind='scatter', x='Revenue', y='Profit', figsize=(6,6))
In [ ]:
ax = sales[['Profit', 'Age_Group']].boxplot(by='Age_Group', figsize=(10,6))
ax.set_ylabel('Profit')
In [ ]:
boxplot_cols = ['Year', 'Customer_Age', 'Order_Quantity', 'Unit_Cost', 'Unit_Price', 'Profit']

sales[boxplot_cols].plot(kind='box', subplots=True, layout=(2,3), figsize=(14,8))


Column wrangling

We can also create new columns or modify existing ones.

Add and calculate a new Revenue_per_Age column

In [ ]:
sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age']

sales['Revenue_per_Age'].head()
In [ ]:
sales['Revenue_per_Age'].plot(kind='density', figsize=(14,6))
In [ ]:
sales['Revenue_per_Age'].plot(kind='hist', figsize=(14,6))

Add and calculate a new Calculated_Cost column

Use this formula:

$$ Calculated\_Cost = Order\_Quantity \times Unit\_Cost $$
In [ ]:
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost']

sales['Calculated_Cost'].head()
In [ ]:
(sales['Calculated_Cost'] != sales['Cost']).sum()
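
Summing a boolean comparison counts the rows where the two columns disagree, so a result of 0 means the calculated column matches the stored one. A minimal sketch on made-up rows, with one deliberately inconsistent value:

```python
import pandas as pd

# Made-up rows; the last Cost value is deliberately wrong.
toy = pd.DataFrame({
    'Order_Quantity': [2, 1, 3],
    'Unit_Cost': [10, 45, 7],
    'Cost': [20, 45, 22],
})

toy['Calculated_Cost'] = toy['Order_Quantity'] * toy['Unit_Cost']

# True where the stored and calculated values differ.
mismatch = toy['Calculated_Cost'] != toy['Cost']

n_mismatches = int(mismatch.sum())  # True counts as 1, False as 0
bad_rows = toy.loc[mismatch]        # inspect the offending rows

print(n_mismatches)
print(bad_rows)
```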

We can see the relationship between Cost and Profit using a scatter plot:

In [ ]:
sales.plot(kind='scatter', x='Calculated_Cost', y='Profit', figsize=(6,6))

Add and calculate a new Calculated_Revenue column

Use this formula:

$$ Calculated\_Revenue = Cost + Profit $$
In [ ]:
sales['Calculated_Revenue'] = sales['Cost'] + sales['Profit']

sales['Calculated_Revenue'].head()
In [ ]:
(sales['Calculated_Revenue'] != sales['Revenue']).sum()
In [ ]:
sales.head()
In [ ]:
sales['Revenue'].plot(kind='hist', bins=100, figsize=(14,6))

Modify all Unit_Price values by adding a 3% tax to them

In [ ]:
sales['Unit_Price'].head()
In [ ]:
# Long form: sales['Unit_Price'] = sales['Unit_Price'] * 1.03

# Shorthand with the in-place multiplication operator:
sales['Unit_Price'] *= 1.03
In [ ]:
sales['Unit_Price'].head()
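
Because the update overwrites the column, running the cell a second time applies the tax again. A small sketch with made-up prices illustrates the compounding:

```python
import pandas as pd

# Made-up prices for illustration only.
prices = pd.DataFrame({'Unit_Price': [100.0, 200.0]})

prices['Unit_Price'] *= 1.03
once = prices['Unit_Price'].round(2).tolist()   # tax applied once

prices['Unit_Price'] *= 1.03
twice = prices['Unit_Price'].round(2).tolist()  # tax applied twice by mistake

print(once)
print(twice)
```

If a cell like this might be re-run, one option is to write the result to a new column instead of modifying `Unit_Price` in place.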


Selection & Indexing:

Get all the sales made in the state of Kentucky

In [ ]:
sales.loc[sales['State'] == 'Kentucky']

Get the mean revenue of the Adults (35-64) sales group

In [ ]:
sales.loc[sales['Age_Group'] == 'Adults (35-64)', 'Revenue'].mean()

How many records belong to the Youth (<25) or Adults (35-64) age groups?

In [ ]:
sales.loc[(sales['Age_Group'] == 'Youth (<25)') | (sales['Age_Group'] == 'Adults (35-64)')].shape[0]
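
When several values of the same column are combined with `|`, `Series.isin` is a common alternative. The sketch below uses a made-up frame (the `'Other'` label is hypothetical) and confirms that both masks select the same rows:

```python
import pandas as pd

# Made-up sample; 'Other' is a placeholder label, not a real age group.
toy = pd.DataFrame({
    'Age_Group': ['Youth (<25)', 'Adults (35-64)', 'Other', 'Other', 'Youth (<25)'],
})

# Each comparison needs its own parentheses because | binds tighter than ==.
with_or = (toy['Age_Group'] == 'Youth (<25)') | (toy['Age_Group'] == 'Adults (35-64)')

# Same selection expressed as a membership test.
with_isin = toy['Age_Group'].isin(['Youth (<25)', 'Adults (35-64)'])

n_records = toy.loc[with_isin].shape[0]
print(n_records)
```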

Get the mean revenue of the Adults (35-64) sales group in the United States

In [ ]:
sales.loc[(sales['Age_Group'] == 'Adults (35-64)') & (sales['Country'] == 'United States'), 'Revenue'].mean()

Increase the revenue by 10% for every sale made in France

In [ ]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()
In [ ]:
# Long form:
# sales.loc[sales['Country'] == 'France', 'Revenue'] = sales.loc[sales['Country'] == 'France', 'Revenue'] * 1.1

# Shorthand with the in-place multiplication operator:
sales.loc[sales['Country'] == 'France', 'Revenue'] *= 1.1
In [ ]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()
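
The same pattern can be sketched on a made-up frame to confirm that only the selected rows change. The values are invented, and `Revenue` is stored as floats here on the assumption that this avoids a dtype mismatch: depending on the pandas version, writing fractional results into an integer column through `.loc` may emit a warning or fail, in which case casting with `astype(float)` first is one way around it.

```python
import pandas as pd

# Made-up rows; Revenue is float so scaled values fit the column dtype.
toy = pd.DataFrame({
    'Country': ['France', 'Canada', 'France'],
    'Revenue': [100.0, 200.0, 300.0],
})

# Store the mask once so the condition is not repeated.
is_france = toy['Country'] == 'France'

# Row selector = mask, column selector = 'Revenue'; update in place.
toy.loc[is_france, 'Revenue'] *= 1.1

print(toy)
```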
