Intro to Statistics

Last updated: January 13th, 2020

rmotr


Intro to Statistics

Quick overview of basic estimators.

👉 External resource: To learn more about Mean, Median, Mode and Central Tendency, check out this video from Khan Academy: https://youtu.be/h8EYEJ32oQ8

Hands on!

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Probability distribution fitting

The aim of distribution fitting is to predict the probability or to forecast the frequency of occurrence of the magnitude of the phenomenon in a certain interval.

We can understand data by looking at its distribution.
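As a minimal sketch of what "fitting" can mean in practice, the snippet below estimates a normal distribution from a sample and then asks it for the probability of an interval. The simulated sample, the seed, and the interval [8, 12] are assumptions made up for illustration; `NormalDist` comes from Python's standard `statistics` module (3.8+).

```python
import random
from statistics import NormalDist

# Hypothetical sample: 5,000 draws from a normal distribution with
# mean 10 and standard deviation 2 (values chosen only for illustration).
rng = random.Random(42)
sample = [rng.gauss(10, 2) for _ in range(5000)]

# "Fitting" here means estimating the parameters (mean and standard
# deviation) of a normal distribution from the observed sample.
fitted = NormalDist.from_samples(sample)

# Probability that a new observation falls in the interval [8, 12],
# according to the fitted distribution.
prob_8_to_12 = fitted.cdf(12) - fitted.cdf(8)

print(f"fitted mean: {fitted.mean:.2f}, fitted stdev: {fitted.stdev:.2f}")
print(f"P(8 <= x <= 12) is roughly {prob_8_to_12:.2f}")
```

A normal distribution is only one possible choice; real data may call for a different family.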

Basic estimators

We'll try to explain basic estimators without going deeper into difficult formulas yet.

Mean, median, standard deviation, range, ...

Let's look at an example:

Suppose we have several duck families:

  • Duck family 1: 5 baby ducks
  • Duck family 2: 1 baby duck
  • Duck family 3: 3 baby ducks
  • Duck family 4: 5 baby ducks
  • Duck family 5: 2 baby ducks
  • Duck family 6: 4 baby ducks
  • Duck family 7: 22 baby ducks
In [3]:
ducks = [5, 1, 3, 5, 2, 4, 22]

ducks
Out[3]:
[5, 1, 3, 5, 2, 4, 22]

Mean

The mean is used to summarize a data set. It is a measure of the center of a data set.

We can think of the mean as the number of baby ducks each mamma duck would have if they were equally distributed among all the mamma ducks.
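The "equal distribution" picture can be sketched directly: pool all the baby ducks, then hand every mamma duck the same share. The variable names below are illustrative only.

```python
ducks = [5, 1, 3, 5, 2, 4, 22]

total_babies = sum(ducks)   # all baby ducks pooled together
n_families = len(ducks)     # number of mamma ducks

# Hand every mamma duck the same share of the pooled babies.
equal_share = total_babies / n_families
redistributed = [equal_share] * n_families

print(equal_share)
# The redistribution keeps the total number of baby ducks unchanged.
print(sum(redistributed) == total_babies)
```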

In [4]:
ducks
Out[4]:
[5, 1, 3, 5, 2, 4, 22]
In [5]:
ducks_sum = 5 + 1 + 3 + 5 + 2 + 4 + 22

ducks_sum
Out[5]:
42
In [6]:
mean = ducks_sum / len(ducks)

mean
Out[6]:
6.0
In [7]:
np.mean(ducks)
Out[7]:
6.0
In [13]:
plt.figure(figsize=(12,6))

plt.bar(["Family %s" % n for n in range(1, len(ducks)+1)], ducks)

# Mean line
plt.axhline(np.mean(ducks), color='r', linestyle=':', linewidth=2)
Out[13]:
<matplotlib.lines.Line2D at 0x7f395ff35518>

Median

The median is the middle number, found by ordering all data points and picking out the one in the middle (or, if there are two middle numbers, taking the mean of those two numbers).

In [116]:
ducks
Out[116]:
[5, 1, 3, 5, 2, 4, 22]

First we'll order the values:

$$ 1, 2, 3, 4, 5, 5, 22 $$

Then we take the value in the middle: 4

And that is our median.
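The steps above, including the two-middle-numbers case, can be sketched as a small function. `median_by_hand` is a hypothetical helper written for illustration, and the four-element list is a made-up example of an even-sized data set.

```python
def median_by_hand(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle value.
        return ordered[middle]
    # Even count: take the mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

ducks = [5, 1, 3, 5, 2, 4, 22]
print(median_by_hand(ducks))          # odd count
print(median_by_hand([1, 2, 3, 4]))   # even count: mean of 2 and 3
```

For an odd-sized list of integers this helper returns an integer, whereas `np.median` returns a float.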

In [121]:
np.median(ducks)
Out[121]:
4.0
In [14]:
plt.figure(figsize=(12,6))

plt.bar(["Family %s" % n for n in range(1, len(ducks)+1)], ducks)

# Mean line
plt.axhline(np.mean(ducks), color='r', linestyle=':', linewidth=2)

# Median line
plt.axhline(np.median(ducks), color='orange', linestyle=':', linewidth=2)
Out[14]:
<matplotlib.lines.Line2D at 0x7f395feeb1d0>
In [26]:
plt.figure(figsize=(12,6))

pd.Series(ducks).value_counts().sort_index().plot(kind='bar')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f395e51cf98>
In [27]:
plt.figure(figsize=(12,6))

pd.Series(ducks).value_counts().sort_index().reindex(list(range(23))).fillna(0).plot(kind='bar')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f395f0af5f8>
In [164]:
plt.figure(figsize=(14,6))

# kdeplot replaces the deprecated sns.distplot(ducks, hist=False)
sns.kdeplot(ducks)

# Mean line
plt.axvline(np.mean(ducks), color='r', linestyle=':', linewidth=2)

# Median line
plt.axvline(np.median(ducks), color='orange', linestyle=':', linewidth=2)
Out[164]:
<matplotlib.lines.Line2D at 0x7f5e3427b160>

Mode

The mode is the most frequent number, that is, the number that occurs the highest number of times.

In [123]:
ducks
Out[123]:
[5, 1, 3, 5, 2, 4, 22]
In [28]:
plt.figure(figsize=(12,6))

pd.Series(ducks).value_counts().sort_index().reindex(list(range(23))).fillna(0).plot(kind='bar')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f395f151828>
In [31]:
from statistics import mode

mode(ducks)
Out[31]:
5
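To see where that result comes from, here is a minimal sketch that counts occurrences with `collections.Counter`. The `all_modes` helper is hypothetical, and the list with a tie is a made-up example showing that a data set can have more than one mode.

```python
from collections import Counter

def all_modes(values):
    """Return every value that reaches the highest count, in order of first appearance."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

ducks = [5, 1, 3, 5, 2, 4, 22]
print(Counter(ducks)[5])   # how many families have 5 baby ducks
print(all_modes(ducks))

# Hypothetical list with a tie between two values.
print(all_modes([1, 1, 2, 2, 3]))
```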

Comparing ducks with dogs

Now we'll compare our duck families to dog families:

  • Dog family 1: 6 puppies
  • Dog family 2: 3 puppies
  • Dog family 3: 7 puppies
  • Dog family 4: 8 puppies
  • Dog family 5: 4 puppies
  • Dog family 6: 6 puppies
  • Dog family 7: 8 puppies
In [35]:
dogs = [6, 3, 7, 8, 4, 6, 8]

dogs
Out[35]:
[6, 3, 7, 8, 4, 6, 8]

Let's analyze the mean number of puppies per family:

In [36]:
dogs_sum = 6 + 3 + 7 + 8 + 4 + 6 + 8

dogs_sum
Out[36]:
42
In [37]:
dogs_sum / len(dogs)
Out[37]:
6.0
In [38]:
np.mean(dogs)
Out[38]:
6.0
In [39]:
np.mean(ducks)
Out[39]:
6.0
In [40]:
plt.figure(figsize=(12,6))

pd.Series(dogs).value_counts().sort_index().reindex(list(range(max(dogs) + 1))).fillna(0).plot(kind='bar')
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f395f29c630>

Ducks and dogs have the same mean, but do both groups have the same dispersion of babies?


Range

The range is the difference between the lowest and highest values:

In [41]:
print(min(ducks), '---', max(ducks))
1 --- 22
In [42]:
ducks_range = max(ducks) - min(ducks)

ducks_range
Out[42]:
21
In [43]:
print(min(dogs), '---', max(dogs))
3 --- 8
In [44]:
dogs_range = max(dogs) - min(dogs)

dogs_range
Out[44]:
5

We see here that the ducks' range (21) is more than four times the dogs' range (5)!
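The comparison can be reproduced in a few lines. `value_range` is a hypothetical helper; the last line is an extra illustration (not part of the original example) of how strongly the range depends on the single largest value.

```python
def value_range(values):
    """Difference between the highest and the lowest value."""
    return max(values) - min(values)

ducks = [5, 1, 3, 5, 2, 4, 22]
dogs = [6, 3, 7, 8, 4, 6, 8]

ducks_range = value_range(ducks)
dogs_range = value_range(dogs)
ratio = ducks_range / dogs_range

print(ducks_range, dogs_range, ratio)

# The range only looks at the two extremes: without the largest duck
# family, the ducks' range becomes much smaller.
print(value_range(sorted(ducks)[:-1]))
```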

Why stats?

The objective of learning stats is to quickly understand the data you're working with, by just looking at the estimators described above. This requires experience, but after some time you'll be able to "picture" the shape of your data in your brain by just looking at these numbers.

All these estimators describe "the shape" of the distribution, and each one tells you something different about your data:

  • Mean
  • Median
  • Mode
  • Standard deviation
  • Percentiles
  • Ranges
  • IQR
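As a rough sketch, the estimators in the list above can all be computed with Python's standard `statistics` module (3.8+). The `summarize` helper and its key names are hypothetical, and the use of the population standard deviation and of the "inclusive" quantile method are assumptions, since the notebook does not say which variants it has in mind.

```python
from statistics import mean, median, mode, pstdev, quantiles

def summarize(values):
    """Collect the listed estimators for one list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),      # first mode encountered if there is a tie
        "std": pstdev(values),     # population standard deviation
        "range": max(values) - min(values),
        "p25": q1,                 # 25th percentile
        "p75": q3,                 # 75th percentile
        "iqr": q3 - q1,            # interquartile range
    }

ducks = [5, 1, 3, 5, 2, 4, 22]
dogs = [6, 3, 7, 8, 4, 6, 8]

for name, values in (("ducks", ducks), ("dogs", dogs)):
    print(name, summarize(values))
```

Both lists share the same mean, yet the standard deviation and the range separate them clearly.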

We saw in the previous example how two "distributions", ducks and dogs, had the same mean (6) while the numbers looked completely different. Duck families ranged from 1 to 22, while dogs ranged from 3 to 8. This will have a deep impact on your data analysis tasks later. For example, while cleaning data, if you find a dog family with 18 puppies, you'll probably need to investigate a little bit more (18 babies sounds unrealistic for dogs).
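One possible way to turn that "investigate a little bit more" instinct into a check is sketched below. The 1.5 × IQR fence is a common convention assumed here for illustration, not something this notebook prescribes, and `looks_suspicious` is a hypothetical helper.

```python
from statistics import quantiles

dogs = [6, 3, 7, 8, 4, 6, 8]

# Quartiles of the observed dog families.
q1, _, q3 = quantiles(dogs, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention (not from the text above): values further than
# 1.5 * IQR from the quartiles are treated as suspicious.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

def looks_suspicious(puppies):
    """True when a family size falls outside the fences."""
    return puppies < lower_fence or puppies > upper_fence

print(lower_fence, upper_fence)
print(looks_suspicious(18))   # the 18-puppy family from the example
print(looks_suspicious(7))    # a family size similar to the observed ones
```

A flagged value is a prompt to look closer, not proof that the record is wrong.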

Let's see now how changing these values changes the shape of the distributions.

Altering the Mean

In [213]:
plt.figure(figsize=(14,6))

# kdeplot replaces the deprecated sns.distplot(..., hist=False)
sns.kdeplot(np.random.normal(0, 1, 100))
plt.axvline(0, linestyle=':') # mean

sns.kdeplot(np.random.normal(2, 1, 100))
plt.axvline(2, color='orange', linestyle=':') # mean

sns.kdeplot(np.random.normal(-2, 1, 100))
plt.axvline(-2, color='green', linestyle=':') # mean
Out[213]:
<matplotlib.lines.Line2D at 0x7f5e33af6438>
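The same effect can be checked numerically without a plot. This is a sketch using the standard library; the seed, the sample size of 10,000 (larger than the 100 used in the plot, to make the numbers stable) and the `draw` helper are assumptions for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

def draw(mu, sigma, n=10_000):
    """Draw n values from a normal distribution with mean mu and standard deviation sigma."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Same standard deviation, three different means.
samples = {mu: draw(mu, 1) for mu in (-2, 0, 2)}

for mu, values in samples.items():
    # The sample mean follows mu, while the spread stays close to 1.
    print(f"target mean {mu:+d}: sample mean {mean(values):+.2f}, "
          f"sample stdev {stdev(values):.2f}")
```

Changing the mean slides the curve left or right along the axis and leaves its width alone.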

Altering the Standard Deviation

In [214]:
plt.figure(figsize=(14,6))

data_1 = np.random.normal(0, 1, 100)
data_2 = np.random.normal(0, 2, 100)
data_3 = np.random.normal(0, 3, 100)

# kdeplot replaces the deprecated sns.distplot(..., hist=False)
sns.kdeplot(data_1)
sns.kdeplot(data_2)
sns.kdeplot(data_3)

plt.axvline(0, color='r', linestyle=':') # mean
Out[214]:
<matplotlib.lines.Line2D at 0x7f5e33b1cf60>
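A numeric counterpart to the plot, sketched with the standard library: the mean stays at 0 while a larger standard deviation leaves a smaller share of values close to the center. The seed, the sample size of 10,000 and the `share_within_one` helper are assumptions for illustration.

```python
import random
from statistics import stdev

rng = random.Random(1)  # fixed seed so the run is repeatable

def share_within_one(values):
    """Fraction of values that fall between -1 and 1."""
    return sum(-1 <= v <= 1 for v in values) / len(values)

# Same mean (0), three different standard deviations.
data = {sigma: [rng.gauss(0, sigma) for _ in range(10_000)] for sigma in (1, 2, 3)}

for sigma, values in data.items():
    print(f"sigma {sigma}: sample stdev {stdev(values):.2f}, "
          f"share within [-1, 1] {share_within_one(values):.2f}")
```

A larger standard deviation flattens and widens the curve, so fewer values land near the mean.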
