
2.7 - Outlier Detection Using Boxplots

Last updated: April 3rd, 2019

rmotr


Outlier detection with Boxplots

In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (whiskers) that indicate variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.


Hands on!

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


Boxplot

A box-and-whisker plot, also called a box plot, displays the five-number summary of a data set.

Boxplots are a standardized way of displaying the distribution of data based on that summary: the "minimum", first quartile (Q1), median, third quartile (Q3), and "maximum".

This type of plot makes outliers easy to spot. It also shows whether the data is symmetrical, how tightly it is grouped, and whether and in which direction it is skewed.
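As a rough illustration of reading skew from the quartiles, the sketch below compares a symmetric sample with a right-skewed one. It uses the Bowley (quartile) skewness coefficient, which is not part of this notebook; the helper name `quartile_skew`, the seed, and the two sample distributions are assumptions chosen for the example.

```python
import numpy as np

def quartile_skew(values):
    """Bowley (quartile) skewness: (Q3 + Q1 - 2*median) / (Q3 - Q1).

    Near 0 when the median sits in the middle of the box,
    positive when the box is stretched toward high values.
    """
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return (q3 + q1 - 2 * q2) / (q3 - q1)

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
symmetric = rng.normal(0, 1, 10_000)         # symmetric around 0
right_skewed = rng.exponential(1.0, 10_000)  # long tail to the right

print(f"symmetric:    {quartile_skew(symmetric):+.3f}")
print(f"right-skewed: {quartile_skew(right_skewed):+.3f}")
```

In a boxplot the same information shows up visually: for the skewed sample, the median line sits closer to Q1 than to Q3.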

[Figure: anatomy of a boxplot]

  • median (Q2/50th Percentile): the middle value of the dataset.
  • first quartile (Q1/25th Percentile): the middle value between the smallest value (not the "minimum") and the median of the dataset.
  • third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the "maximum") of the dataset.
  • InterQuartile Range (IQR): the distance from the 25th to the 75th percentile, IQR = Q3 - Q1. It tells how spread out the middle half of the values is.
  • "maximum": Q3 + 1.5*IQR
  • "minimum": Q1 - 1.5*IQR
  • Outliers (shown as green circles): in statistics, an outlier is an observation that is distant from the other observations. In a boxplot, these are the points beyond the "minimum" and "maximum".
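The definitions above can be sketched on a small array. The ten values below are made up for illustration, and `np.percentile` uses its default linear interpolation, so other quartile conventions may give slightly different numbers.

```python
import numpy as np

# Made-up sample with one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# The quoted "minimum" and "maximum" are fences computed from the quartiles,
# not the smallest and largest observations
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# Points beyond the fences are drawn as individual outlier markers
outliers = data[(data < fence_low) | (data > fence_high)]

# matplotlib ends each whisker at the most extreme point still inside the fences
whisker_low = data[data >= fence_low].min()
whisker_high = data[data <= fence_high].max()

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("fences:", fence_low, fence_high)
print("whiskers end at:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Here the fences land at -3.5 and 14.5, so only the value 30 is flagged, and the whiskers stop at 1 and 9 rather than at the fences themselves.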

Not every outlier is a wrong value.


Boxplot of a Normal distribution

In [17]:
normal = np.random.normal(0, 1, 10000) # loc, scale, size
q1, q3 = np.quantile(normal, [0.25, 0.75])

fig, axs = plt.subplots(nrows=2)
fig.set_size_inches(14, 8)

# Boxplot of the normal sample (x= keeps the box horizontal)
plot1 = sns.boxplot(x=normal, ax=axs[0])
plot1.set(xlim=(-4, 4))

# Histogram and density estimate of the same sample
# (sns.distplot is deprecated since seaborn 0.11; histplot replaces it)
plot2 = sns.histplot(x=normal, kde=True, stat='density', ax=axs[1])
plot2.set(xlim=(-4, 4))

# Median line
axs[1].axvline(np.median(normal), color='#e74c3c', linestyle='dashed', linewidth=2)

# Q1 and Q3 lines
for q in (q1, q3):
    axs[1].axvline(q, color='#27ae60', linestyle='dotted', linewidth=2)


Outlier detection

Now that we know how to build a boxplot and visualize outliers (the points outside the whiskers), let's remove them:

In [34]:
plt.figure(figsize=(12,6))

sns.boxplot(x=normal)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f929a016b00>

The boxplot shows many outliers, but are they wrong values?

We can manually remove values below/above a certain value:

In [44]:
 
Out[44]:
array([ 0.46258854,  1.03999792, -0.96583145, ...,  1.36174267,
       -0.7491783 ,  0.34268328])
In [65]:
plt.figure(figsize=(12,6))

sns.boxplot(x=normal[(normal >= -3) & (normal <= 3)])
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9299975518>

Or compute the low and high fences of the boxplot and remove the points outside them:

In [66]:
q1 = np.quantile(normal, 0.25)
q3 = np.quantile(normal, 0.75)
iqr = q3 - q1  # interquartile range

fence_low = q1 - (1.5 * iqr)
fence_high = q3 + (1.5 * iqr)
In [67]:
iqr
Out[67]:
1.3323599334949785
In [68]:
fence_low
Out[68]:
-2.676000401933399
In [69]:
fence_high
Out[69]:
2.6534393320465153
In [70]:
# Number of points outside the boxplot fences
normal[(normal < fence_low) | (normal > fence_high)].shape[0]
Out[70]:
70
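A count of this size is what a normal distribution produces on its own, which is why these points are not necessarily wrong values. As a sanity check, the sketch below computes the theoretical fences of a standard normal and the share of draws expected to fall outside them (roughly 0.7%). It uses `statistics.NormalDist` from the Python standard library (3.8+); the `theo_` variable names are assumptions chosen so they do not overwrite the notebook's own `q1`, `q3`, and fences.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Theoretical quartiles and fences of the standard normal
theo_q1 = std_normal.inv_cdf(0.25)
theo_q3 = std_normal.inv_cdf(0.75)
theo_iqr = theo_q3 - theo_q1
theo_fence_low = theo_q1 - 1.5 * theo_iqr
theo_fence_high = theo_q3 + 1.5 * theo_iqr

# Probability mass in both tails beyond the fences
expected_fraction = (
    std_normal.cdf(theo_fence_low) + (1 - std_normal.cdf(theo_fence_high))
)

print(f"theoretical fences: {theo_fence_low:.3f}, {theo_fence_high:.3f}")
print(f"expected share outside the fences: {expected_fraction:.4%}")
print(f"expected count in 10,000 draws: {10_000 * expected_fraction:.1f}")
```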

Keep only the points inside the fences:

In [73]:
plt.figure(figsize=(12,6))

sns.boxplot(x=normal[(normal >= fence_low) & (normal <= fence_high)])
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f929998db00>
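The fence computation and the filtering step can be wrapped into small reusable helpers. This is a minimal sketch, not part of the original notebook: the names `iqr_fences` and `drop_outliers`, the multiplier argument `k`, and the seeded sample are all assumptions made for the example.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    values = np.asarray(values)
    low, high = iqr_fences(values, k)
    return values[(values >= low) & (values <= high)]

rng = np.random.default_rng(42)  # fixed seed, only for reproducibility
sample = rng.normal(0, 1, 10_000)
trimmed = drop_outliers(sample)

print(len(sample) - len(trimmed), "points removed")
```

A boxplot of the trimmed data recomputes the quartiles from the remaining points, so it can still show a few markers beyond the whiskers even though every point lies inside the original fences.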

