# Data Science - Class 1

Last updated: January 13th, 2020

# Bike store sales

In this class we'll analyze sales made at bike stores.

## Hands on!

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


## Loading our data:

In [ ]:
!head data/sales_data.csv

In [ ]:
sales = pd.read_csv(
    'data/sales_data.csv',
    parse_dates=['Date'])  # parse the Date column as datetimes
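
To see what `parse_dates` does without depending on the real file, here is a minimal sketch that reads a tiny in-memory CSV. The two rows and the `toy` name are made up for illustration; only the `Date` and `Revenue` column names come from this notebook.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; NOT the contents of data/sales_data.csv
csv_text = "Date,Revenue\n2013-11-26,950\n2015-11-26,950\n"

toy = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

# With parse_dates the column holds datetimes instead of plain strings,
# so the .dt accessor (year, month, ...) becomes available.
is_datetime = pd.api.types.is_datetime64_any_dtype(toy['Date'])
years = toy['Date'].dt.year.tolist()
```

Running `sales.info()` on the real data is the quickest way to confirm that `Date` was parsed the same way.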


## The data at a glance:

In [ ]:
sales.head()

In [ ]:
sales.shape

In [ ]:
sales.info()

In [ ]:
sales.describe()


## Numerical analysis and visualization

We'll analyze the Unit_Cost column:

In [ ]:
sales['Unit_Cost'].describe()

In [ ]:
sales['Unit_Cost'].mean()

In [ ]:
sales['Unit_Cost'].median()

In [ ]:
sales['Unit_Cost'].plot(kind='box', vert=False, figsize=(14,6))

In [ ]:
sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) # kde

In [ ]:
ax = sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) # kde
ax.axvline(sales['Unit_Cost'].mean(), color='red')
ax.axvline(sales['Unit_Cost'].median(), color='green')
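
The red and green lines mark the mean and the median. When the two are far apart, the distribution is skewed: a few extreme values pull the mean, while the median stays near the bulk of the data. A minimal sketch with made-up numbers (the `toy_costs` values are hypothetical, not taken from the sales data):

```python
import pandas as pd

# Hypothetical unit costs with one extreme value
toy_costs = pd.Series([1, 2, 2, 3, 50])

toy_mean = toy_costs.mean()      # pulled upward by the single large value
toy_median = toy_costs.median()  # middle value, unaffected by the outlier
gap = toy_mean - toy_median      # a large positive gap suggests right skew
```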

In [ ]:
ax = sales['Unit_Cost'].plot(kind='hist', figsize=(14,6))
ax.set_ylabel('Number of Sales')
ax.set_xlabel('dollars')


## Categorical analysis and visualization

We'll analyze the Age_Group column:

In [ ]:
sales['Age_Group'].value_counts()

In [ ]:
sales['Age_Group'].value_counts().plot(kind='pie', figsize=(6,6))

In [ ]:
ax = sales['Age_Group'].value_counts().plot(kind='bar', figsize=(14,6))
ax.set_ylabel('Number of Sales')


## Relationship between the columns?

Can we find any significant relationship?

In [ ]:
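# numeric_only=True (available since pandas 1.5) skips text and date columns;
# pandas 2.0+ raises an error on non-numeric columns without it
corr = sales.corr(numeric_only=True)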

corr
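
Correlation is only defined for numeric columns. One version-independent way to make that explicit is to select the numeric columns first and then call `corr()`. The sketch below uses a hypothetical toy frame (`toy`, with invented values) in place of the real data:

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration
toy = pd.DataFrame({
    'Country': ['France', 'Canada', 'France', 'Canada'],
    'Order_Quantity': [1, 2, 3, 4],
    'Revenue': [10, 20, 30, 40],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include='number')
toy_corr = numeric.corr()

# Revenue is an exact multiple of Order_Quantity here, so the
# correlation between the two is 1 (up to floating-point error)
r = toy_corr.loc['Order_Quantity', 'Revenue']
```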

In [ ]:
fig = plt.figure(figsize=(8,8))
plt.matshow(corr, cmap='RdBu', fignum=fig.number)
plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical');
plt.yticks(range(len(corr.columns)), corr.columns);

In [ ]:
sales.plot(kind='scatter', x='Customer_Age', y='Revenue', figsize=(6,6))

In [ ]:
sales.plot(kind='scatter', x='Revenue', y='Profit', figsize=(6,6))

In [ ]:
ax = sales[['Profit', 'Age_Group']].boxplot(by='Age_Group', figsize=(10,6))
ax.set_ylabel('Profit')

In [ ]:
boxplot_cols = ['Year', 'Customer_Age', 'Order_Quantity', 'Unit_Cost', 'Unit_Price', 'Profit']

sales[boxplot_cols].plot(kind='box', subplots=True, layout=(2,3), figsize=(14,8))


## Column wrangling

We can also create new columns or modify existing ones.

### Add and calculate a new Revenue_per_Age column

In [ ]:
sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age']

sales['Revenue_per_Age'].head()

In [ ]:
sales['Revenue_per_Age'].plot(kind='density', figsize=(14,6))

In [ ]:
sales['Revenue_per_Age'].plot(kind='hist', figsize=(14,6))


### Add and calculate a new Calculated_Cost column

Use this formula:

$$\text{Calculated\_Cost} = \text{Order\_Quantity} \times \text{Unit\_Cost}$$

In [ ]:
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost']

sales['Calculated_Cost'].head()

In [ ]:
(sales['Calculated_Cost'] != sales['Cost']).sum()
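
The comparison above builds a boolean Series and sums it, which counts the rows where the two columns disagree; a result of 0 means the calculated column matches the existing one everywhere. A minimal sketch on a hypothetical toy frame, with one row made deliberately inconsistent so the check has something to find:

```python
import pandas as pd

# Hypothetical data; the last Cost value is deliberately wrong
toy = pd.DataFrame({
    'Order_Quantity': [2, 1, 3],
    'Unit_Cost': [10, 45, 5],
    'Cost': [20, 45, 16],
})

toy['Calculated_Cost'] = toy['Order_Quantity'] * toy['Unit_Cost']

# True where the calculated value differs from the stored one
mismatches = toy['Calculated_Cost'] != toy['Cost']
n_mismatches = int(mismatches.sum())  # True counts as 1
bad_rows = toy.loc[mismatches]        # inspect the offending rows
```

Reusing the boolean mask with `.loc` is a convenient way to look at the disagreeing rows rather than just counting them.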


We can see the relationship between Cost and Profit using a scatter plot:

In [ ]:
sales.plot(kind='scatter', x='Calculated_Cost', y='Profit', figsize=(6,6))


### Add and calculate a new Calculated_Revenue column

Use this formula:

$$\text{Calculated\_Revenue} = \text{Cost} + \text{Profit}$$

In [ ]:
sales['Calculated_Revenue'] = sales['Cost'] + sales['Profit']

sales['Calculated_Revenue'].head()

In [ ]:
(sales['Calculated_Revenue'] != sales['Revenue']).sum()

In [ ]:
sales.head()

In [ ]:
sales['Revenue'].plot(kind='hist', bins=100, figsize=(14,6))


### Add a 3% tax to all Unit_Price values

In [ ]:
sales['Unit_Price'].head()

In [ ]:
# Long form: sales['Unit_Price'] = sales['Unit_Price'] * 1.03
# The in-place operator below does the same thing:
sales['Unit_Price'] *= 1.03

In [ ]:
sales['Unit_Price'].head()


## Selection & Indexing:

### Get all the sales made in the state of Kentucky

In [ ]:
sales.loc[sales['State'] == 'Kentucky']


### Get the mean revenue of the Adults (35-64) sales group

In [ ]:
sales.loc[sales['Age_Group'] == 'Adults (35-64)', 'Revenue'].mean()


### How many records belong to Age Group Youth (<25) or Adults (35-64)?

In [ ]:
sales.loc[(sales['Age_Group'] == 'Youth (<25)') | (sales['Age_Group'] == 'Adults (35-64)')].shape[0]
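
When the condition is "the column equals any of several values", `Series.isin` can be a shorter equivalent of chaining `==` comparisons with `|`. A minimal sketch on a hypothetical toy frame (the rows are invented; the other group labels are placeholders):

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({'Age_Group': [
    'Youth (<25)', 'Adults (35-64)', 'Young Adults (25-34)',
    'Seniors (64+)', 'Youth (<25)',
]})

# Explicit OR of two comparisons (parentheses are required around each)
either = (toy['Age_Group'] == 'Youth (<25)') | (toy['Age_Group'] == 'Adults (35-64)')

# Same mask built with isin
via_isin = toy['Age_Group'].isin(['Youth (<25)', 'Adults (35-64)'])

count_or = int(either.sum())
count_isin = int(via_isin.sum())
```

Summing the mask gives the same count as `.shape[0]` on the filtered frame, without building the filtered frame first.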


### Get the mean revenue of the sales group Adults (35-64) in United States

In [ ]:
sales.loc[(sales['Age_Group'] == 'Adults (35-64)') & (sales['Country'] == 'United States'), 'Revenue'].mean()


### Increase the revenue by 10% for every sale made in France

In [ ]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()

In [ ]:
# Long form:
# sales.loc[sales['Country'] == 'France', 'Revenue'] = sales.loc[sales['Country'] == 'France', 'Revenue'] * 1.1
# The in-place operator below does the same thing:
sales.loc[sales['Country'] == 'France', 'Revenue'] *= 1.1

In [ ]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()
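
One caveat, stated as an assumption: if `Revenue` was read as an integer column (`sales.info()` shows the dtype), multiplying only some rows by 1.1 writes floats into an integer column, and recent pandas releases may warn about that implicit dtype change. A minimal sketch that casts to float first, using a hypothetical toy frame with invented values:

```python
import pandas as pd

# Hypothetical data; Revenue starts out as integers
toy = pd.DataFrame({
    'Country': ['France', 'Canada', 'France'],
    'Revenue': [100, 200, 300],
})

# Make the column float up front so the partial update keeps one dtype
toy['Revenue'] = toy['Revenue'].astype('float64')

is_france = toy['Country'] == 'France'
toy.loc[is_france, 'Revenue'] *= 1.1  # only the French rows change

rounded = toy['Revenue'].round(2).tolist()
```

Storing the mask in a variable such as `is_france` also avoids repeating the condition when the same subset is inspected before and after the update.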