Hands on!¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Loading our data:¶
In [2]:
!head data/sales_data.csv
In [3]:
sales = pd.read_csv(
'data/sales_data.csv',
parse_dates=['Date'])
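To see what `parse_dates` does without the real file, here is a minimal sketch that reads a tiny in-memory CSV. The two rows and their values are made up for illustration; only the `Date` column name comes from the cell above.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for data/sales_data.csv.
csv_text = "Date,Revenue\n2013-11-26,950\n2015-11-26,950\n"

# parse_dates converts the listed columns to datetime64 while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(df.dtypes)
print(df["Date"].dt.year.tolist())
```

Once the column is a datetime, the `.dt` accessor (year, month, day name, and so on) becomes available.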
The data at a glance:¶
In [4]:
sales.head()
Out[4]:
In [5]:
sales.shape
Out[5]:
In [6]:
sales.info()
In [7]:
sales.describe()
Out[7]:
In [8]:
sales['Unit_Cost'].describe()
Out[8]:
In [9]:
sales['Unit_Cost'].mean()
Out[9]:
In [10]:
sales['Unit_Cost'].median()
Out[10]:
In [11]:
sales['Unit_Cost'].plot(kind='box', vert=False, figsize=(14,6))
Out[11]:
In [12]:
sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) # kde
Out[12]:
In [13]:
ax = sales['Unit_Cost'].plot(kind='density', figsize=(14,6)) # kde
ax.axvline(sales['Unit_Cost'].mean(), color='red')
ax.axvline(sales['Unit_Cost'].median(), color='green')
Out[13]:
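The red (mean) and green (median) lines drift apart when the distribution has a long tail. A small sketch with made-up unit costs, not taken from the dataset, shows the effect numerically:

```python
import pandas as pd

# Hypothetical unit costs: mostly cheap items plus two expensive ones.
unit_cost = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 13, 1200, 2100])

mean_cost = unit_cost.mean()
median_cost = unit_cost.median()

# A long right tail pulls the mean well above the median.
print(f"mean={mean_cost:.2f}, median={median_cost:.2f}")
```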
In [14]:
ax = sales['Unit_Cost'].plot(kind='hist', figsize=(14,6))
ax.set_ylabel('Number of Sales')
ax.set_xlabel('dollars')
Out[14]:
In [16]:
sales.head()
Out[16]:
In [15]:
sales['Age_Group'].value_counts()
Out[15]:
In [17]:
sales['Age_Group'].value_counts().plot(kind='pie', figsize=(6,6))
Out[17]:
In [18]:
ax = sales['Age_Group'].value_counts().plot(kind='bar', figsize=(14,6))
ax.set_ylabel('Number of Sales')
Out[18]:
In [19]:
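# numeric_only=True (pandas >= 1.5) skips text columns such as Country;
# pandas 2.x raises an error on them otherwise.
corr = sales.corr(numeric_only=True)
corr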
Out[19]:
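Depending on the pandas version, `corr()` may refuse a DataFrame that still contains text columns. The sketch below uses a made-up frame and assumes pandas 1.5 or newer, where the `numeric_only` parameter is available:

```python
import pandas as pd

# Hypothetical frame mixing a text column with two numeric ones.
df = pd.DataFrame({
    "Country": ["France", "France", "Canada", "Canada"],
    "Order_Quantity": [1, 2, 3, 4],
    "Revenue": [10, 20, 30, 40],
})

# numeric_only=True drops non-numeric columns before correlating.
corr = df.corr(numeric_only=True)
print(corr)
```

Here `Revenue` is exactly ten times `Order_Quantity`, so their correlation is 1.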
In [20]:
fig = plt.figure(figsize=(8,8))
plt.matshow(corr, cmap='RdBu', fignum=fig.number)
plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical');
plt.yticks(range(len(corr.columns)), corr.columns);
In [21]:
sales.plot(kind='scatter', x='Customer_Age', y='Revenue', figsize=(6,6))
Out[21]:
In [22]:
sales.plot(kind='scatter', x='Revenue', y='Profit', figsize=(6,6))
Out[22]:
In [23]:
ax = sales[['Profit', 'Age_Group']].boxplot(by='Age_Group', figsize=(10,6))
ax.set_ylabel('Profit')
Out[23]:
In [24]:
boxplot_cols = ['Year', 'Customer_Age', 'Order_Quantity', 'Unit_Cost', 'Unit_Price', 'Profit']
sales[boxplot_cols].plot(kind='box', subplots=True, layout=(2,3), figsize=(14,8))
Out[24]:
In [25]:
sales['Revenue_per_Age'] = sales['Revenue'] / sales['Customer_Age']
sales['Revenue_per_Age'].head()
Out[25]:
In [26]:
sales['Revenue_per_Age'].plot(kind='density', figsize=(14,6))
Out[26]:
In [27]:
sales['Revenue_per_Age'].plot(kind='hist', figsize=(14,6))
Out[27]:
Add and calculate a new Calculated_Cost column¶
Use this formula:
$$ Calculated\_Cost = Order\_Quantity \times Unit\_Cost $$
In [28]:
sales['Calculated_Cost'] = sales['Order_Quantity'] * sales['Unit_Cost']
sales['Calculated_Cost'].head()
Out[28]:
In [29]:
(sales['Calculated_Cost'] != sales['Cost']).sum()
Out[29]:
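The check above counts rows where the derived column disagrees with the stored one, so a result of 0 means every row matches. A minimal sketch on made-up rows, with the last row deliberately inconsistent:

```python
import pandas as pd

# Hypothetical rows; the last Cost value is wrong on purpose.
df = pd.DataFrame({
    "Order_Quantity": [8, 8, 23, 2],
    "Unit_Cost": [45, 45, 45, 10],
    "Cost": [360, 360, 1035, 25],
})

df["Calculated_Cost"] = df["Order_Quantity"] * df["Unit_Cost"]

# Comparing the columns gives booleans; summing them counts the mismatches.
mismatches = (df["Calculated_Cost"] != df["Cost"]).sum()
print(mismatches)
```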
We can see the relationship between Cost and Profit using a scatter plot:
In [30]:
sales.plot(kind='scatter', x='Calculated_Cost', y='Profit', figsize=(6,6))
Out[30]:
Add and calculate a new Calculated_Revenue column¶
Use this formula:
$$ Calculated\_Revenue = Cost + Profit $$
In [31]:
sales['Calculated_Revenue'] = sales['Cost'] + sales['Profit']
sales['Calculated_Revenue'].head()
Out[31]:
In [32]:
(sales['Calculated_Revenue'] != sales['Revenue']).sum()
Out[32]:
In [33]:
sales.head()
Out[33]:
In [34]:
sales['Revenue'].plot(kind='hist', bins=100, figsize=(14,6))
Out[34]:
Modify all Unit_Price values by adding a 3% tax to them¶
In [35]:
sales['Unit_Price'].head()
Out[35]:
In [36]:
#sales['Unit_Price'] = sales['Unit_Price'] * 1.03
sales['Unit_Price'] *= 1.03
In [37]:
sales['Unit_Price'].head()
Out[37]:
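One thing to watch in a notebook: an in-place update like `*= 1.03` is applied again every time the cell is re-run, so the tax compounds. A small sketch with made-up prices:

```python
import pandas as pd

# Hypothetical prices before tax.
unit_price = pd.Series([120.0, 120.0, 35.0])

taxed_once = unit_price * 1.03   # the intended result
taxed_twice = taxed_once * 1.03  # what a second run of the cell would give

print(taxed_once.round(2).tolist())
print(taxed_twice.round(2).tolist())
```

Keeping the original column and writing the taxed values to a new one is one way to make the cell safe to re-run.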
Selection & Indexing:¶
Get all the sales made in the state of Kentucky¶
In [38]:
sales.loc[sales['State'] == 'Kentucky']
Out[38]:
Get the mean revenue of the Adults (35-64) sales group¶
In [39]:
sales.loc[sales['Age_Group'] == 'Adults (35-64)', 'Revenue'].mean()
Out[39]:
How many records belong to Age Group Youth (<25) or Adults (35-64)?¶
In [43]:
sales.loc[(sales['Age_Group'] == 'Youth (<25)') | (sales['Age_Group'] == 'Adults (35-64)')].shape[0]
Out[43]:
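When the same column is compared against several values, `Series.isin` is an equivalent and shorter alternative to chaining `|`. A sketch on a made-up column (the labels are only illustrative):

```python
import pandas as pd

# Hypothetical age-group labels.
df = pd.DataFrame({
    "Age_Group": [
        "Youth (<25)",
        "Adults (35-64)",
        "Young Adults (25-34)",
        "Seniors (64+)",
        "Youth (<25)",
    ]
})

groups = ["Youth (<25)", "Adults (35-64)"]

# Two equality tests joined with | (each side needs its own parentheses).
count_or = ((df["Age_Group"] == groups[0]) | (df["Age_Group"] == groups[1])).sum()

# The same membership test with isin.
count_isin = df["Age_Group"].isin(groups).sum()

print(count_or, count_isin)
```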
Get the mean revenue of the Adults (35-64) sales group in the United States¶
In [44]:
sales.loc[(sales['Age_Group'] == 'Adults (35-64)') & (sales['Country'] == 'United States'), 'Revenue'].mean()
Out[44]:
Increase the revenue by 10% for every sale made in France¶
In [45]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()
Out[45]:
In [46]:
#sales.loc[sales['Country'] == 'France', 'Revenue'] = sales.loc[sales['Country'] == 'France', 'Revenue'] * 1.1
sales.loc[sales['Country'] == 'France', 'Revenue'] *= 1.1
In [47]:
sales.loc[sales['Country'] == 'France', 'Revenue'].head()
Out[47]:
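If `Revenue` is stored as integers, writing the scaled (fractional) values back through `.loc` can trigger a dtype warning in recent pandas releases, and may be rejected in newer ones. A hedged sketch on made-up data that casts the column to float before the conditional update:

```python
import pandas as pd

# Hypothetical sales; Revenue starts out as an integer column.
df = pd.DataFrame({
    "Country": ["France", "Canada", "France"],
    "Revenue": [100, 200, 300],
})

# Cast first so the column can hold the fractional results.
df["Revenue"] = df["Revenue"].astype(float)

# Boolean mask selects the rows; only those rows are scaled.
is_france = df["Country"] == "France"
df.loc[is_france, "Revenue"] *= 1.1

print(df)
```

Rows outside the mask (Canada here) are left unchanged.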