
2.8 - Correlation and Two-Variable Relationships

Last updated: April 3rd, 2019

rmotr


Correlation and two-variable relationships

Exploratory data analysis in many modeling projects, whether in data science or in research, involves examining correlation among predictors, and between predictors and a target variable.

The standard way to visualize the relationship between two measured variables is a scatter plot, with one variable on the X axis and the other on the Y axis.


Hands on!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

We'll use the following McDonald's menu nutrition facts dataset in this lesson.

This dataset provides a nutrition analysis of every menu item on the US McDonald's menu, including breakfast, beef burgers, chicken and fish sandwiches, fries, salads, soda, coffee and tea, milkshakes, and desserts.

In [2]:
df = pd.read_csv('data/mcdonalds_menu.csv')

df.head()
Out[2]:
| | Category | Item | Serving Size | Calories | Calories from Fat | Total Fat | Total Fat (% Daily Value) | Saturated Fat | Saturated Fat (% Daily Value) | Trans Fat | ... | Carbohydrates | Carbohydrates (% Daily Value) | Dietary Fiber | Dietary Fiber (% Daily Value) | Sugars | Protein | Vitamin A (% Daily Value) | Vitamin C (% Daily Value) | Calcium (% Daily Value) | Iron (% Daily Value) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breakfast | Egg McMuffin | 4.8 oz (136 g) | 300 | 120 | 13.0 | 20 | 5.0 | 25 | 0.0 | ... | 31 | 10 | 4 | 17 | 3 | 17 | 10 | 0 | 25 | 15 |
| 1 | Breakfast | Egg White Delight | 4.8 oz (135 g) | 250 | 70 | 8.0 | 12 | 3.0 | 15 | 0.0 | ... | 30 | 10 | 4 | 17 | 3 | 18 | 6 | 0 | 25 | 8 |
| 2 | Breakfast | Sausage McMuffin | 3.9 oz (111 g) | 370 | 200 | 23.0 | 35 | 8.0 | 42 | 0.0 | ... | 29 | 10 | 4 | 17 | 2 | 14 | 8 | 0 | 25 | 10 |
| 3 | Breakfast | Sausage McMuffin with Egg | 5.7 oz (161 g) | 450 | 250 | 28.0 | 43 | 10.0 | 52 | 0.0 | ... | 30 | 10 | 4 | 17 | 2 | 21 | 15 | 0 | 30 | 15 |
| 4 | Breakfast | Sausage McMuffin with Egg Whites | 5.7 oz (161 g) | 400 | 210 | 23.0 | 35 | 8.0 | 42 | 0.0 | ... | 30 | 10 | 4 | 17 | 2 | 21 | 6 | 0 | 25 | 10 |

5 rows × 24 columns


 Correlation

Correlation is one of the most widely used statistical concepts.

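The correlation coefficient is a metric that measures the extent to which two numeric variables are linearly associated with one another. It takes values from -1 (perfect negative association) to +1 (perfect positive association), with values near 0 indicating little or no linear association.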
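To make the definition concrete, here is a minimal sketch of the Pearson correlation coefficient computed by hand and checked against NumPy. The helper name `pearson_r` is hypothetical, introduced only for this illustration; the sample values are the `Total Fat` and `Calories` entries of the five rows printed by `df.head()` above, which is far too few rows for a meaningful estimate.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two equal-length 1-D sequences (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center each variable on its own mean
    dx = x - x.mean()
    dy = y - y.mean()
    # Sum of co-deviations divided by the product of the two spreads
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Values copied from the five rows shown by df.head()
total_fat = [13.0, 8.0, 23.0, 28.0, 23.0]
calories = [300, 250, 370, 450, 400]

r_manual = pearson_r(total_fat, calories)
r_numpy = np.corrcoef(total_fat, calories)[0, 1]  # NumPy's built-in equivalent

print(r_manual, r_numpy)
```

Both numbers should agree up to floating-point error. A perfectly increasing linear pair such as `[1, 2, 3]` and `[2, 4, 6]` gives +1, and reversing the second list gives -1.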

The correlation matrix is a table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.
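As a small illustration of such a table, the sketch below builds a correlation matrix for three columns. The frame name `toy` is hypothetical, and its values are copied from the five rows printed by `df.head()` above, so the coefficients themselves say little about the full menu. The point is the shape of the result: the matrix is symmetric, its diagonal is 1, and a single column of it gives the correlation of every other variable with a chosen target.

```python
import numpy as np
import pandas as pd

# Three numeric columns, first five menu items only (illustrative subset)
toy = pd.DataFrame({
    'Calories': [300, 250, 370, 450, 400],
    'Total Fat': [13.0, 8.0, 23.0, 28.0, 23.0],
    'Protein': [17, 18, 14, 21, 21],
})

toy_corr = toy.corr()  # pairwise Pearson correlation by default

# Every variable correlates perfectly with itself, and corr(a, b) == corr(b, a)
unit_diagonal = np.allclose(np.diag(toy_corr.values), 1.0)
is_symmetric = np.allclose(toy_corr.values, toy_corr.values.T)

# Treat Calories as the target: correlation of each other column with it
vs_calories = toy_corr['Calories'].drop('Calories').sort_values(ascending=False)

print(toy_corr)
print(vs_calories)
```

Selecting one column of the matrix like this is a quick way to look at predictors against a target variable without reading the whole grid.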

In [3]:
df.dtypes
Out[3]:
Category                          object
Item                              object
Serving Size                      object
Calories                           int64
Calories from Fat                  int64
Total Fat                        float64
Total Fat (% Daily Value)          int64
Saturated Fat                    float64
Saturated Fat (% Daily Value)      int64
Trans Fat                        float64
Cholesterol                        int64
Cholesterol (% Daily Value)        int64
Sodium                             int64
Sodium (% Daily Value)             int64
Carbohydrates                      int64
Carbohydrates (% Daily Value)      int64
Dietary Fiber                      int64
Dietary Fiber (% Daily Value)      int64
Sugars                             int64
Protein                            int64
Vitamin A (% Daily Value)          int64
Vitamin C (% Daily Value)          int64
Calcium (% Daily Value)            int64
Iron (% Daily Value)               int64
dtype: object
In [4]:
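# Correlate numeric columns only; Category, Item and Serving Size are strings
corr = df.select_dtypes(include='number').corr()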

sns.heatmap(corr)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc775503748>

Seaborn's heatmap function accepts many parameters that make the plot easier to read:

In [5]:
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy; use the builtin bool
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(18, 10))

sns.heatmap(corr, mask=mask, cmap='coolwarm', vmax=1,
            center=0, square=True, annot=True, fmt=".1f")
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc773360198>
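Because a correlation matrix is symmetric, the upper triangle repeats the lower one, which is why the cell above hides it. The sketch below shows what such a mask looks like on a hypothetical 4x4 matrix (the size `n` is chosen only for illustration). `sns.heatmap` leaves out the cells where the mask is `True`.

```python
import numpy as np

n = 4  # hypothetical number of variables

# Same construction as in the cell above, on an n x n array
mask = np.zeros((n, n), dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Equivalent one-liner: upper triangle (diagonal included) of an all-True array
mask_alt = np.triu(np.ones((n, n), dtype=bool))

hidden = int(mask.sum())      # cells on or above the diagonal
visible = int((~mask).sum())  # cells strictly below the diagonal

# Variant that keeps the diagonal visible by starting one step above it
mask_keep_diag = np.triu(np.ones((n, n), dtype=bool), k=1)

print(mask)
print(hidden, visible)
```

With `n = 4` the mask hides 10 cells and leaves 6 visible, one for each distinct pair of variables.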


 Two-variable relationships

A scatterplot is a plot in which the x-axis is the value of one variable, and the y-axis the value of another.
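A minimal sketch of that idea with plain matplotlib, using the `Total Fat` and `Calories` values of the five rows printed by `df.head()` earlier. The `Agg` backend line is an assumption so the snippet also runs outside a notebook; inside the notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Values copied from the five rows shown by df.head()
total_fat = [13.0, 8.0, 23.0, 28.0, 23.0]
calories = [300, 250, 370, 450, 400]

fig, ax = plt.subplots(figsize=(6, 4))
# One point per menu item: x is its Total Fat, y is its Calories
points = ax.scatter(total_fat, calories)
ax.set_xlabel('Total Fat')
ax.set_ylabel('Calories')
```

Each menu item becomes one point, so a cloud that rises from left to right suggests a positive association between the two variables.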


Numeric - Numeric

In [55]:
fig, axs = plt.subplots(1, 4, figsize=(18, 4))

cols = ['Sodium', 'Carbohydrates', 'Total Fat', 'Calcium (% Daily Value)']

for i, col in enumerate(cols):
    # Shaded 2-D density estimate; recent seaborn needs x=/y= keywords and fill= instead of shade=
    sns.kdeplot(x=df['Cholesterol'], y=df[col], fill=True, ax=axs[i])
    sns.scatterplot(x='Cholesterol', y=col, data=df, size=1, legend=False, ax=axs[i])
In [35]:
plt.figure(figsize=(12, 6))

sns.kdeplot(x=df['Carbohydrates'], y=df['Sodium'], fill=True)

sns.scatterplot(x='Carbohydrates', y='Sodium', data=df, size=1, legend=False)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc7711e6a90>
In [36]:
plt.figure(figsize=(12, 6))

sns.kdeplot(x=df['Carbohydrates'], y=df['Cholesterol'], fill=True)

sns.scatterplot(x='Carbohydrates', y='Cholesterol', data=df, size=1, legend=False)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc771164c50>
In [116]:
sns.pairplot(data=df)
Out[116]:
<seaborn.axisgrid.PairGrid at 0x7f05a8385048>