# 1.3. Probability

Last updated: January 13th, 2020
In :
import pandas as pd


# Probability

The two random experiments we have performed so far as examples were to pick a person and print their city, and to pick a person and print their age:

In :
dataset =  pd.DataFrame({
'Person #':[1,2,3,4,5,6,7,8,9,10],
'City':['SF','SF','NY','NY','NY','SF','NY','SF','SF','SF'],
'Age':[41,26,28,53,32,51,65,49,25,33]
})
dataset

Out:
Person # City Age
0 1 SF 41
1 2 SF 26
2 3 NY 28
3 4 NY 53
4 5 NY 32
5 6 SF 51
6 7 NY 65
7 8 SF 49
8 9 SF 25
9 10 SF 33
In :
dataset.sample(1)['City'].values[0]

Out:
'NY'
In :
dataset.sample(1)['Age'].values[0]

Out:
32

Now, even though in both cases we cannot be certain about the outcome, we can clearly see that getting 'NY' is much more likely than getting 51: 'NY' appears in four of the ten rows, while the age 51 appears in only one.
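To make this comparison concrete, here is a minimal sketch that counts how often each value appears in the ten-row dataset. It assumes that `dataset.sample(1)` picks every row with equal chance (the pandas default when no weights are given); the names `city_share` and `age_share` are only illustrative.

```python
import pandas as pd

# Same ten-person dataset as above.
dataset = pd.DataFrame({
    'Person #': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'City': ['SF', 'SF', 'NY', 'NY', 'NY', 'SF', 'NY', 'SF', 'SF', 'SF'],
    'Age': [41, 26, 28, 53, 32, 51, 65, 49, 25, 33],
})

# Share of rows holding each value: the chance of drawing that value
# when every row is equally likely to be picked.
city_share = dataset['City'].value_counts(normalize=True)
age_share = dataset['Age'].value_counts(normalize=True)

print(city_share.loc['NY'])  # 4 of the 10 people live in NY
print(age_share.loc[51])     # only 1 of the 10 people is 51
```

Under that equal-chance assumption, drawing 'NY' is four times as likely as drawing the age 51.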

The likelihood of an outcome is what we call the theoretical probability of that outcome.

Formally, we will write $P(\omega)$ for the probability of an outcome $\omega\in\Omega$.

When we perform an experiment, we can assign this theoretical probability to each outcome, for example the probability of getting heads when we toss a fair coin is $1/2$.

In the following videos we will learn how to calculate these probabilities for different problems.

But if we only have data, it can be difficult to know exactly what the theoretical probability of a given outcome is.

So another way of defining the probability of an outcome is by performing the random experiment a large number of times and calculating the proportion of times we get our outcome, also known as the relative frequency of that outcome.

This is what we call the empirical probability of the outcome.
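As a small illustration of a relative frequency, the sketch below simulates tossing a fair coin with Python's standard `random` module and computes the proportion of heads. The number of tosses and the seed are arbitrary choices made here, not values from the course.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (an arbitrary choice)

n_tosses = 10_000  # how many times we repeat the experiment (an assumption)
tosses = [random.choice(['heads', 'tails']) for _ in range(n_tosses)]

# Relative frequency = times the outcome occurred / number of repetitions.
relative_frequency = tosses.count('heads') / n_tosses
print(relative_frequency)
```

The printed value should land close to the theoretical probability of $1/2$, though it will generally not match it exactly.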

Let's explore this with a bigger dataset:

In :
toy_dataset = pd.read_csv('toy_dataset.csv')
toy_dataset.head()

Out:
Number City Gender Age Income Illness
0 1 Dallas Male 41 40367.0 No
1 2 Dallas Male 54 45084.0 No
2 3 Dallas Male 42 52483.0 No
3 4 Dallas Male 40 40941.0 No
4 5 Dallas Male 46 50289.0 No

In this dataset we not only have the city and age of each person, but also their gender, income and whether they have an illness.

In :
toy_dataset.shape

Out:
(150000, 6)

We can also check that the dataset has $150000$ rows (or people).

So let's explore the notion of empirical probability by carrying out an experiment a large number of times (for example $100$) and calculating the proportion of times we get a particular result. Here the experiment is picking a person, and the result we are interested in is their gender being 'Female':

In :
data = toy_dataset.sample(100)['Gender'].value_counts().to_frame()
data

Out:
Gender
Male 56
Female 44
In :
proportion = data.loc['Female'].iloc[0] / 100
proportion

Out:
0.44

By running this code repeatedly, we can see that the proportion changes from run to run. So we could ask ourselves: how many times counts as "a large number of times"?

In theory, if we could perform the experiment infinitely many times, we would arrive at the theoretical probability (this is a result that we will see in the last course).
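As a hedged aside, this statement is usually called the law of large numbers. Writing $n_A$ for the number of times an outcome $A$ occurs in $n$ independent repetitions of the experiment (notation introduced here only for illustration), it can be sketched as:

$$\frac{n_A}{n} \longrightarrow P(A) \quad \text{as } n \to \infty$$

The convergence is meant in a probabilistic sense, so for any finite $n$ the relative frequency can still differ from $P(A)$.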

Of course, in real life there is no way we can perform an experiment infinitely many times, but let's try to build an intuition for this result by comparing the proportions we get when we perform the experiment an increasing number of times:

In :
times = [100,1000,10000,100000]

proportions = {}

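for t in times:
    experiments = []
    for experiment_id in range(10):
        # Sample t people and count how many of each gender we drew
        data = toy_dataset.sample(t)['Gender'].value_counts().to_frame()
        # Keep the proportion of 'Female' as a plain number
        experiments.append(data.loc['Female'].iloc[0] / t)
    proportions[f"Times: {t}"] = experiments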

results = pd.DataFrame(proportions)
results.index = ['Experiment %s' % i for i in range(1, 11)]

results

Out:
Times: 100 Times: 1000 Times: 10000 Times: 100000
Experiment 1 0.45 0.473 0.4387 0.44093
Experiment 2 0.46 0.437 0.4517 0.44101
Experiment 3 0.45 0.469 0.4463 0.43989
Experiment 4 0.40 0.466 0.4444 0.44060
Experiment 5 0.43 0.444 0.4386 0.44239
Experiment 6 0.40 0.438 0.4421 0.44096
Experiment 7 0.40 0.460 0.4419 0.44106
Experiment 8 0.47 0.440 0.4469 0.44099
Experiment 9 0.41 0.430 0.4387 0.44170
Experiment 10 0.38 0.455 0.4358 0.44050

What we are doing here is calculating the proportion of 'Female' we get when performing the experiment $100$ times, $1000$ times, $10000$ times and $100000$ times, and making $10$ repetitions of each to compare the results.

The code builds a dictionary that maps the number of times we perform the experiment to a list of the $10$ proportions we get, one for each repetition.

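What we can see is that the more times we perform the experiment, the more consistent the proportion gets: with $100$ draws the ten proportions range from $0.38$ to $0.47$, while with $100000$ draws they all fall between $0.4399$ and $0.4424$. This suggests that the proportion approaches the theoretical probability as the number of times we perform the experiment grows towards infinity.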
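One way to put a number on "more consistent" is to measure how far apart the ten proportions are for each sample size. The sketch below does this on a synthetic stand-in for `toy_dataset`, since the CSV file is not bundled here. The frame `synthetic`, its 44% share of 'Female', and the seeds are all assumptions chosen to resemble the outputs above, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for toy_dataset: 150000 rows, 44% 'Female'.
# The 44% share is an assumption chosen to resemble the outputs above.
n_rows = 150_000
n_female = 66_000
synthetic = pd.DataFrame(
    {'Gender': ['Female'] * n_female + ['Male'] * (n_rows - n_female)}
)

# Proportion of 'Female' in the whole (synthetic) dataset.
true_share = (synthetic['Gender'] == 'Female').mean()

spread = {}
for t in [100, 1000, 10000, 100000]:
    # Ten repetitions of "draw t people and compute the share of 'Female'".
    proportions = [
        (synthetic['Gender'].sample(t, random_state=seed) == 'Female').mean()
        for seed in range(10)
    ]
    # Distance between the largest and smallest of the ten proportions.
    spread[t] = max(proportions) - min(proportions)

print(true_share)
print(pd.Series(spread, name='range of the 10 proportions'))
```

The range should shrink as `t` grows, which matches the pattern in the table above. Note that `sample` draws without replacement by default, so with very large `t` the sample covers most of the dataset and its proportion is pulled towards the dataset-wide share.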

Now it's your turn to explore some relative frequencies to estimate probabilities!