
Outlier Detection Using Boxplots

Last updated: June 13th, 2019

rmotr


Outlier detection with Boxplots

In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.

👉 External resource: To learn more about interpreting box plots, check out this video from Khan Academy: https://youtu.be/oBREri10ZHk


Hands on!

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


Boxplot

A box-and-whisker plot (also called a box plot) displays the five-number summary of a set of data.

Boxplots are a standardized way of displaying the distribution of data based on a five number summary ("minimum", first quartile (Q1), median, third quartile (Q3), and "maximum").

This type of plot makes outliers easy to detect. It can also show whether the data is symmetrical, how tightly it is grouped, and whether and how it is skewed.

[Figure: annotated boxplot]

  • median (Q2/50th Percentile): the middle value of the dataset.
  • first quartile (Q1/25th Percentile): the middle value between the smallest number (not the "minimum") and the median of the dataset.
  • third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the "maximum") of the dataset.
  • InterQuartile Range (IQR): the distance between the 25th and the 75th percentile, IQR = Q3 - Q1. It tells how spread out the middle values are.
  • "maximum": Q3 + 1.5*IQR
  • "minimum": Q1 - 1.5*IQR
  • Outliers (shown as green circles): observations that are distant from the other observations; in a boxplot, the points beyond the "minimum" or the "maximum".
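The definitions above can be checked by hand on a small sample. The sketch below is only an illustration: the `data` array is made up, and `np.percentile` is used with its default linear interpolation, which is one of several common quartile conventions.

```python
import numpy as np

# Hypothetical toy sample; 40 is deliberately far from the rest
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])

# Q1, median (Q2) and Q3
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Boxplot "minimum" and "maximum" (the fences)
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# Points beyond the fences are the ones a boxplot flags as outliers
outliers = data[(data < fence_low) | (data > fence_high)]

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("fences:", fence_low, fence_high)
print("outliers:", outliers)
```

Note that matplotlib and seaborn draw each whisker to the most extreme data point that still lies inside the fence, so a plotted whisker is usually a little shorter than the fence value computed here.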

Not every outlier is a wrong value.


Boxplot of a Normal distribution

In [3]:
normal = np.random.normal(0, 1, 10000) # loc, scale, size
# Q1, median and Q3 (the 1.0 quantile is just the sample maximum, so it is left out)
quartiles = pd.Series(normal).quantile([0.25, 0.5, 0.75])

fig, axs = plt.subplots(nrows=2)
fig.set_size_inches(14, 8)

# Boxplot of Normal distribution
plot1 = sns.boxplot(x=normal, ax=axs[0])
plot1.set(xlim=(-4, 4))

# Normal distribution (histplot replaces the deprecated distplot)
plot2 = sns.histplot(normal, kde=True, stat='density', ax=axs[1])
plot2.set(xlim=(-4, 4))

# Median line
axs[1].axvline(np.median(normal), color='#e74c3c', linestyle='dashed', linewidth=2)

for q in quartiles:
    # Quartile line (Q1, Q2, Q3)
    axs[1].axvline(q, color='#27ae60', linestyle='dotted', linewidth=2)


Outliers: drop them or not

Most parametric statistics, like means, standard deviations, and correlations, and every statistic based on these, are highly sensitive to outliers. Despite this, it is not acceptable to drop an observation just because it is an outlier. Outliers can be legitimate observations, so it is important to investigate the nature of an outlier before deciding what to do with it.

  1. If it is obvious that the outlier is due to incorrectly entered or measured data, you should drop the outlier.
  2. If the outlier does not change the results but does affect assumptions, you may drop the outlier. But note that in a footnote of your paper.
  3. More commonly, the outlier affects both results and assumptions. In this situation, it is not legitimate to simply drop the outlier. You may run the analysis both with and without it, but you should state in at least a footnote the dropping of any such data points and how the results changed.
  4. If the outlier creates a significant association, you should drop the outlier and should not report any significance from your analysis.
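To see why running the analysis both with and without an outlier matters, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the seed, the two independent samples `x` and `y`, the extreme point at 150, and the helper `summary`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
x = rng.normal(50, 5, 30)
y = rng.normal(50, 5, 30)       # generated independently of x

# Add one extreme point to both variables (a made-up value)
x_out = np.append(x, 150)
y_out = np.append(y, 150)

def summary(a, b):
    """Hypothetical helper: a few statistics of a, plus its correlation with b."""
    return {
        "mean": a.mean(),
        "std": a.std(ddof=1),
        "median": np.median(a),
        "corr": np.corrcoef(a, b)[0, 1],
    }

without = summary(x, y)
with_outlier = summary(x_out, y_out)

for key in without:
    print(f"{key:>6}: without={without[key]:8.3f}  with={with_outlier[key]:8.3f}")
```

In this setup the mean, standard deviation and correlation move noticeably when the single point is added, while the median barely changes. The strong correlation that appears is created entirely by one point, which is the situation described in the last item of the list.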

So in those cases where you shouldn't drop the outlier, what do you do?

One option is to try a transformation. Square root and log transformations both pull in high numbers. This can make assumptions work better if the outlier is a dependent variable and can reduce the impact of a single point if the outlier is an independent variable.
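As a rough illustration of how these transformations pull in high numbers, the sketch below compares how far the largest value sits from the median before and after transforming. The `values` array and the helper `max_to_median` are made up for this example.

```python
import numpy as np

# Hypothetical sample with one very large value
values = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 1000.0])

def max_to_median(a):
    """Hypothetical helper: how many times larger the maximum is than the median."""
    return a.max() / np.median(a)

raw_ratio = max_to_median(values)
sqrt_ratio = max_to_median(np.sqrt(values))
log_ratio = max_to_median(np.log(values))

print("raw :", raw_ratio)
print("sqrt:", sqrt_ratio)
print("log :", log_ratio)
```

The ratio shrinks with the square root and shrinks further with the logarithm. Keep in mind that `np.log` needs strictly positive values and `np.sqrt` needs non-negative ones, so data containing zeros or negatives would need a different treatment.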

Another option is to try a different model. This should be done with caution, but it may be that a non-linear model fits better. For example, perhaps an exponential curve fits the data with the outlier intact.

Whichever approach you take, you need to know your data and your research area well. Try different approaches, and see which make theoretical sense.


Removing outliers

Now that we know how to build a boxplot and visualize outliers (points outside the whiskers), let's remove them:

In [4]:
plt.figure(figsize=(12,6))

# Pass the data as x so the box is drawn horizontally
sns.boxplot(x=normal)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3d77b70d68>

The boxplot shows many outliers, but are they wrong values?

We can manually remove values below/above a certain value:

In [5]:
normal[(normal >= -3) & (normal <= 3)]
Out[5]:
array([ 0.03743201,  0.69575473,  0.9202912 , ..., -0.8318123 ,
       -0.36852967,  0.90217951])
In [6]:
plt.figure(figsize=(12,6))

# Keep only the values between -3 and 3
sns.boxplot(x=normal[(normal >= -3) & (normal <= 3)])
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3d75a982b0>

Or use the low and high fences of the boxplot and remove the points outside them:

In [66]:
q1, q3 = np.quantile(normal, [0.25, 0.75])
iqr = q3 - q1 # Interquartile range

fence_low = q1 - (1.5*iqr)
fence_high = q3 + (1.5*iqr)
In [67]:
iqr
Out[67]:
1.3323599334949785
In [68]:
fence_low
Out[68]:
-2.676000401933399
In [69]:
fence_high
Out[69]:
2.6534393320465153
In [70]:
# Number of points outside the boxplot fences
normal[(normal < fence_low) | (normal > fence_high)].shape[0]
Out[70]:
70
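A count of this size is what a perfectly normal sample is expected to produce, which supports the point that not every outlier is a wrong value. The sketch below works out the theoretical fences and the share of values beyond them for a standard normal distribution, using only the standard library. The sample size of 10000 is taken from the cell that generated `normal`.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Theoretical quartiles and fences
q1 = z.inv_cdf(0.25)
q3 = z.inv_cdf(0.75)
iqr = q3 - q1
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# Probability of landing outside either fence
p_outside = z.cdf(fence_low) + (1 - z.cdf(fence_high))
expected_count = 10000 * p_outside

print("theoretical fences:", round(fence_low, 3), round(fence_high, 3))
print("share outside fences:", round(p_outside, 4))
print("expected count in 10000 draws:", round(expected_count, 1))
```

The theoretical fences sit near ±2.7 and roughly 0.7% of draws fall outside them, so a few dozen flagged points in 10000 clean draws are not evidence of bad data.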

Keep just the "inside" boxplot points:

In [73]:
plt.figure(figsize=(12,6))

# Keep only the values between the two fences
sns.boxplot(x=normal[(normal >= fence_low) & (normal <= fence_high)])
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f929998db00>
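The fence-based filter can be wrapped in a small reusable function. This is a sketch under stated assumptions, not part of the original notebook: the name `remove_outliers_iqr`, its parameter `k` and the `sample` list are all hypothetical.

```python
import pandas as pd

def remove_outliers_iqr(values, k=1.5):
    """Hypothetical helper: keep only the values inside the boxplot fences.

    k is the fence multiplier; 1.5 matches the fences used above.
    """
    s = pd.Series(values)
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    # between() is inclusive on both ends by default
    return s[s.between(q1 - k * iqr, q3 + k * iqr)]

# Made-up sample where 40 lies far beyond the upper fence
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
filtered = remove_outliers_iqr(sample)
print(filtered.tolist())
```

Because the quartiles are recomputed on the filtered data, a new boxplot may flag a fresh set of points as outliers. Applying the filter repeatedly keeps trimming the tails, so it is usually applied once and the removal is reported.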
