# Correlation and Two-Variable Relationships

Last updated: January 13th, 2020


Exploratory data analysis in many modeling projects, whether in data science or in research, involves examining correlation among predictors, and between predictors and a target variable.

The standard way to visualize the relationship between two measured variables is a scatter plot, with one variable on the X axis and the other on the Y axis.

## Hands on!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


We'll use the following McDonald's menu nutrition facts dataset in this lesson.

This dataset provides a nutrition analysis of every menu item on the US McDonald's menu, including breakfast, beef burgers, chicken and fish sandwiches, fries, salads, soda, coffee and tea, milkshakes, and desserts.

In [2]:
df = pd.read_csv('data/mcdonalds_menu.csv')

# Show the first five rows
df.head()


Out[2]:
| | Category | Item | Serving Size | Calories | Calories from Fat | Total Fat | Total Fat (% Daily Value) | Saturated Fat | Saturated Fat (% Daily Value) | Trans Fat | ... | Carbohydrates | Carbohydrates (% Daily Value) | Dietary Fiber | Dietary Fiber (% Daily Value) | Sugars | Protein | Vitamin A (% Daily Value) | Vitamin C (% Daily Value) | Calcium (% Daily Value) | Iron (% Daily Value) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breakfast | Egg McMuffin | 4.8 oz (136 g) | 300 | 120 | 13.0 | 20 | 5.0 | 25 | 0.0 | ... | 31 | 10 | 4 | 17 | 3 | 17 | 10 | 0 | 25 | 15 |
| 1 | Breakfast | Egg White Delight | 4.8 oz (135 g) | 250 | 70 | 8.0 | 12 | 3.0 | 15 | 0.0 | ... | 30 | 10 | 4 | 17 | 3 | 18 | 6 | 0 | 25 | 8 |
| 2 | Breakfast | Sausage McMuffin | 3.9 oz (111 g) | 370 | 200 | 23.0 | 35 | 8.0 | 42 | 0.0 | ... | 29 | 10 | 4 | 17 | 2 | 14 | 8 | 0 | 25 | 10 |
| 3 | Breakfast | Sausage McMuffin with Egg | 5.7 oz (161 g) | 450 | 250 | 28.0 | 43 | 10.0 | 52 | 0.0 | ... | 30 | 10 | 4 | 17 | 2 | 21 | 15 | 0 | 30 | 15 |
| 4 | Breakfast | Sausage McMuffin with Egg Whites | 5.7 oz (161 g) | 400 | 210 | 23.0 | 35 | 8.0 | 42 | 0.0 | ... | 30 | 10 | 4 | 17 | 2 | 21 | 6 | 0 | 25 | 10 |

5 rows × 24 columns

### Correlation

Correlation is one of the most widely used statistical concepts.

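The correlation coefficient measures how strongly two numeric variables are linearly associated. It ranges from -1 (perfect negative association) to +1 (perfect positive association), with 0 indicating no linear association. The pandas `corr` method used below computes Pearson's coefficient by default.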
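
As a hedged illustration of what the coefficient measures, the sketch below computes Pearson's r by hand on made-up data and compares it with `np.corrcoef`. The helper name `pearson_r` and the toy arrays are assumptions for illustration only; they are not taken from the menu dataset.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation of two equal-length 1-D sequences (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    # Sum of co-deviations, scaled by the spread of both variables
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Toy data, made up for illustration
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises in lockstep with x
y_down = [10, 8, 6, 4, 2]  # falls in lockstep with x

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_numpy = np.corrcoef(x, y_up)[0, 1]  # NumPy's built-in equivalent

print(r_up, r_down, r_numpy)
```

The two extreme cases land at the ends of the range: +1 for the rising series and -1 for the falling one.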

The correlation matrix is a table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.
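
A minimal sketch of such a matrix, built on a small made-up frame (the column names `a`, `b`, `c`, and `label` are assumptions, not columns of the menu dataset). It selects the numeric columns first, since correlation is only defined for numeric data.

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "label": list("vwxyz"),  # non-numeric column, excluded below
})

# Keep numeric columns only, then compute pairwise Pearson correlations
toy_corr = toy.select_dtypes(include="number").corr()

print(toy_corr)
```

The result is square and symmetric, and every diagonal cell is 1 because each variable is perfectly correlated with itself.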

In [3]:
df.dtypes

Out[3]:
Category                          object
Item                              object
Serving Size                      object
Calories                           int64
Calories from Fat                  int64
Total Fat                        float64
Total Fat (% Daily Value)          int64
Saturated Fat                    float64
Saturated Fat (% Daily Value)      int64
Trans Fat                        float64
Cholesterol                        int64
Cholesterol (% Daily Value)        int64
Sodium                             int64
Sodium (% Daily Value)             int64
Carbohydrates                      int64
Carbohydrates (% Daily Value)      int64
Dietary Fiber                      int64
Dietary Fiber (% Daily Value)      int64
Sugars                             int64
Protein                            int64
Vitamin A (% Daily Value)          int64
Vitamin C (% Daily Value)          int64
Calcium (% Daily Value)            int64
Iron (% Daily Value)               int64
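dtype: object

In [4]:
# numeric_only=True (pandas 1.5+) skips the object columns Category, Item and Serving Size
corr = df.corr(numeric_only=True)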

sns.heatmap(corr)

Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd225bd7470>

Seaborn's heatmap function accepts many parameters that make the plot easier to read:

In [5]:
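# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24; use the builtin bool
mask[np.triu_indices_from(mask)] = True  # assumed step: hide the upper triangle, diagonal included

plt.figure(figsize=(18, 10))

# The start of this call is assumed; the source shows only the trailing arguments
sns.heatmap(corr, mask=mask,
            center=0, square=True, annot=True, fmt=".1f")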

Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd223aa7d68>
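
Because a correlation matrix is symmetric, masking the upper triangle removes only duplicated cells. The sketch below checks this on a small made-up matrix; the values in `m` are assumptions for illustration, and `np.triu` is used here as one common way to build such a mask.

```python
import numpy as np

# Toy symmetric correlation matrix, made up for illustration
m = np.array([
    [1.0, 0.8, -0.2],
    [0.8, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
])

# True marks the cells to hide: the diagonal and everything above it
mask = np.triu(np.ones_like(m, dtype=bool))

# Replace hidden cells with NaN to see what would remain visible
visible = np.where(mask, np.nan, m)

print(mask)
print(visible)
```

Only the three cells below the diagonal stay visible, one per pair of variables.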

### Two-variable relationships

A scatterplot is a plot in which the x-axis is the value of one variable, and the y-axis the value of another.
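
A scatterplot can reveal structure that a single correlation number hides. As a minimal sketch on made-up data (the arrays below are assumptions, not menu values), a perfectly curved relationship can still have a Pearson correlation of about zero:

```python
import numpy as np

# Toy data, made up for illustration
x = np.arange(-3, 4)   # -3, -2, ..., 3
y_linear = 2 * x + 1   # straight-line relationship
y_curved = x ** 2      # U-shaped relationship, symmetric around x = 0

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(f"linear relationship: r = {r_linear:.3f}")
print(f"curved relationship: r = {r_curved:.3f}")
```

Plotting `x` against `y_curved` would show the U shape immediately, which is why the scatterplot and the coefficient are best read together.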

#### Numeric - Numeric

In [6]:
fig, axs = plt.subplots(1, 4, figsize=(18, 4))

cols = ['Sodium', 'Carbohydrates', 'Total Fat', 'Calcium (% Daily Value)']

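for i, col in enumerate(cols):
    # Keyword x/y and fill= replace the positional arguments and shade= that newer seaborn releases no longer accept
    sns.kdeplot(x='Cholesterol', y=col, data=df, fill=True, ax=axs[i])
    sns.scatterplot(x='Cholesterol', y=col, data=df, size=1, legend=False, ax=axs[i])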

In [7]:
plt.figure(figsize=(12, 6))

sns.scatterplot(x='Carbohydrates', y='Sodium', data=df, size=1, legend=False)

Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd2238e4e48>

In [8]:
plt.figure(figsize=(12, 6))

sns.scatterplot(x='Carbohydrates', y='Cholesterol', data=df, size=1, legend=False)

Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd2237b4908>

In [9]:
sns.pairplot(data=df, vars=['Calories', 'Total Fat', 'Sugars', 'Protein'])

Out[9]:
<seaborn.axisgrid.PairGrid at 0x7fd223894588>

In [10]:
plt.figure(figsize=(12, 6))

sns.scatterplot(x='Sugars', y='Protein', data=df)

Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd2216bfd68>

In [11]:
plt.figure(figsize=(14, 6))

sns.regplot(x='Sugars', y='Protein', data=df)

Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd221815978>
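
`regplot` draws the scatter plot and overlays a fitted linear regression line, but it returns the Axes rather than the fitted coefficients. If the slope and intercept are needed, one option is to fit the line separately. The sketch below uses `np.polyfit` on made-up values; the arrays are assumptions for illustration and do not come from the menu dataset.

```python
import numpy as np

# Toy values, made up for illustration
sugars = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
protein = np.array([20.0, 17.0, 14.0, 11.0, 8.0])

# Least-squares fit of a degree-1 polynomial: protein ~ slope * sugars + intercept
slope, intercept = np.polyfit(sugars, protein, deg=1)

# Values on the fitted line at each observed x
predicted = slope * sugars + intercept

print(slope, intercept)
```

A negative slope here corresponds to a downward-sloping line in the plot.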

#### Numeric - Categorical

In [12]:
plt.figure(figsize=(12, 6))

sns.boxplot(x='Category', y='Calories', data=df)

Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd223845390>
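
Each box in a boxplot summarizes one category by its quartiles. As a hedged sketch, the same numbers can be computed with `groupby` and `quantile`; the toy frame below is made up for illustration and only borrows the column names `Category` and `Calories`.

```python
import pandas as pd

# Toy frame, made up for illustration
toy = pd.DataFrame({
    "Category": ["Breakfast", "Breakfast", "Breakfast",
                 "Desserts", "Desserts", "Desserts"],
    "Calories": [300, 400, 500, 150, 250, 350],
})

# One row per category, one column per quartile (0.25, 0.5, 0.75)
quartiles = (
    toy.groupby("Category")["Calories"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(quartiles)
```

The 0.5 column is the median line inside each box, and the 0.25 and 0.75 columns are the box edges.
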
In [13]:
plt.figure(figsize=(12, 6))

sns.scatterplot(x='Category', y='Calories', data=df)

Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd2213c9978>