```
import pandas as pd
```

# Probability

The two random experiments we have performed so far as examples were to pick a person and print their city, and to pick a person and print their age:

```
dataset = pd.DataFrame({
    'Person #': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'City': ['SF', 'SF', 'NY', 'NY', 'NY', 'SF', 'NY', 'SF', 'SF', 'SF'],
    'Age': [41, 26, 28, 53, 32, 51, 65, 49, 25, 33]
})
dataset
```

```
dataset.sample(1)['City'].values[0]
```

```
dataset.sample(1)['Age'].values[0]
```

Now, even though in both cases we cannot be certain about the outcome, we can clearly see that getting 'NY' is much more likely than getting an age of $51$.
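A quick way to check this intuition is to count how often each outcome appears in the small dataset above (the DataFrame is recreated here so the snippet runs on its own):

```python
import pandas as pd

# Recreate the small dataset from above so this snippet is self-contained
dataset = pd.DataFrame({
    'Person #': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'City': ['SF', 'SF', 'NY', 'NY', 'NY', 'SF', 'NY', 'SF', 'SF', 'SF'],
    'Age': [41, 26, 28, 53, 32, 51, 65, 49, 25, 33]
})

# Proportion of rows where City is 'NY' versus where Age is 51
p_ny = (dataset['City'] == 'NY').mean()  # 4 out of 10 people
p_51 = (dataset['Age'] == 51).mean()     # 1 out of 10 people
print(p_ny, p_51)  # 0.4 0.1
```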

**The likelihood of an outcome** is what we call the *theoretical probability* of that outcome.

Formally, we will write $P(\omega)$ for the probability of an outcome $\omega\in\Omega$, where $\Omega$ is the sample space of the experiment.

When we perform an experiment, we can assign this theoretical probability to each outcome, for example the probability of getting heads when we toss a fair coin is $1/2$.

In the following videos we will learn how to calculate these probabilities for different problems.

But if we only have data it can be difficult to know exactly what the theoretical probability of a given outcome is.

So another way of defining the probability of an outcome is by performing the random experiment a large number of times and calculating the proportion of times we get our outcome, also known as **the relative frequency of that outcome**.

This is what we call the *empirical probability* of the outcome.
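As a minimal sketch of this idea (independent of the datasets in this notebook), we can simulate tossing a fair coin many times and compute the relative frequency of heads, which should land close to the theoretical probability of $1/2$:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

n_tosses = 100_000
# Each toss is heads with probability 1/2
heads = sum(random.random() < 0.5 for _ in range(n_tosses))

# Relative frequency of heads: the empirical probability
relative_frequency = heads / n_tosses
print(relative_frequency)  # close to 0.5
```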

Let's explore this with a bigger dataset:

```
toy_dataset = pd.read_csv('toy_dataset.csv')
toy_dataset.head()
```

In this dataset we not only have the city and age of each person, but also their gender, income and whether they have an illness.

```
toy_dataset.shape
```

We can also check that the dataset has $150000$ rows (or people).

So let's explore the notion of empirical probability by carrying out an experiment a large number of times (for example $100$), and calculating the proportion of times we get a particular result, for example picking a person and the gender being 'Female':

```
data = toy_dataset.sample(100, replace=True)['Gender'].value_counts().to_frame()
data
```

```
proportion = data.loc['Female'].iloc[0]/100
proportion
```

We can see, by repeatedly running this code, that the proportion changes from run to run. So we could ask ourselves: what counts as a *large* number of times?

In theory, if we could perform the experiment infinitely many times, we would arrive at the theoretical probability (actually this is a result that we will see in the last course).

Of course in real life there is no way we can perform an experiment infinitely many times, but let's try to get an intuition of this result, by comparing the proportions we get when we perform the experiment an increasing number of times:

```
times = [100, 1000, 10000, 100000]
proportions = {}
for t in times:
    experiments = []
    for experiment_id in range(10):
        data = toy_dataset.sample(t, replace=True)['Gender'].value_counts().to_frame()
        experiments.append(data.loc['Female'].iloc[0] / t)
    proportions[f"Times: {t}"] = experiments
results = pd.DataFrame(proportions)
results.index = ['Experiment %s' % i for i in range(1, 11)]
results
```

What we are doing here is calculating the proportion of 'Female' we get when performing the experiment $100$ times, $1000$ times, $10000$ times and $100000$ times, and repeating each of these $10$ times to compare the results.

The code builds a dictionary that maps the number of times we perform the experiment to the list of the $10$ proportions we obtain.

What we can see is that the more times we perform the experiment, the more consistent the proportion gets, which suggests that it approaches the theoretical probability as the number of times we perform the experiment grows towards infinity.
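The same effect can be reproduced without the CSV file, using the small dataset from the beginning of this notebook (recreated here so the snippet runs on its own). There the true proportion of 'NY' is $0.4$, and sampled proportions tighten around it as the sample size grows:

```python
import pandas as pd

# Recreate the 'City' column from the small dataset at the start
city = pd.Series(['SF', 'SF', 'NY', 'NY', 'NY', 'SF', 'NY', 'SF', 'SF', 'SF'])

true_p = (city == 'NY').mean()  # 0.4, the theoretical probability of 'NY'

# Sample with replacement using increasing sample sizes
estimates = {}
for n in [100, 10000, 1000000]:
    sample = city.sample(n, replace=True, random_state=0)
    estimates[n] = (sample == 'NY').mean()

print(true_p, estimates)  # estimates approach 0.4 as n grows
```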

Now it's your turn to explore some relative frequencies to estimate probabilities!