Assignment 1.3.1: Estimating Probabilities

Last updated: March 31st, 2020
In [32]:
import pandas as pd
import numpy as np
In [33]:
toy_dataset = pd.read_csv('toy_dataset.csv')
toy_dataset.head()
Out[33]:
Number City Gender Age Income Illness
0 1 Dallas Male 41 40367.0 No
1 2 Dallas Male 54 45084.0 No
2 3 Dallas Male 42 52483.0 No
3 4 Dallas Male 40 40941.0 No
4 5 Dallas Male 46 50289.0 No

Example 1: Let's estimate the probability of picking a person who lives in Dallas by calculating the proportion of times we get this result. We select, for example, 120 rows at random:

In [34]:
data = toy_dataset.sample(120)
data
Out[34]:
Number City Gender Age Income Illness
83629 83630 Los Angeles Male 43 99712.0 No
73105 73106 Los Angeles Female 60 93121.0 No
119389 119390 Boston Male 34 101764.0 Yes
30482 30483 New York City Female 44 97664.0 No
43293 43294 New York City Female 32 84656.0 No
... ... ... ... ... ... ...
16749 16750 Dallas Female 26 41147.0 No
10310 10311 Dallas Male 48 47675.0 No
126321 126322 Washington D.C. Female 53 66423.0 No
74379 74380 Los Angeles Female 41 82529.0 No
130398 130399 Washington D.C. Female 65 65366.0 No

120 rows × 6 columns

We keep the rows where the city is Dallas:

In [35]:
data1 = data[data["City"]=="Dallas"]
In [36]:
data1.shape
Out[36]:
(13, 6)

There are data1.shape[0] rows that meet this condition (13 in this sample).

In [37]:
proportion = data1.shape[0]/120
proportion
Out[37]:
0.10833333333333334
In [38]:
round(proportion, 2)  # round the answer to 2 decimal places
Out[38]:
0.11

or, equivalently, using len():

In [39]:
proportion_ = len(data1)/120
round(proportion_, 2)
Out[39]:
0.11
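
The same estimate can be obtained without building an intermediate DataFrame, because the mean of a boolean mask is the proportion of True values. A minimal sketch, assuming a small made-up frame (`toy`, with hypothetical values) in place of `toy_dataset.csv`, and dividing by the sample's own length instead of a hard-coded 120:

```python
import pandas as pd

# Hypothetical stand-in for toy_dataset.csv: 10 rows, 3 of them from Dallas.
toy = pd.DataFrame({
    "City": ["Dallas", "Boston", "Dallas", "Austin", "Boston",
             "Dallas", "Austin", "Boston", "Austin", "Boston"],
})

# Comparing a column to a value gives a boolean Series;
# its mean is the fraction of rows that satisfy the condition.
is_dallas = toy["City"] == "Dallas"
proportion_mask = is_dallas.mean()

# Equivalent filter-and-count approach, without hard-coding the sample size.
proportion_count = len(toy[is_dallas]) / len(toy)

print(round(proportion_mask, 2), round(proportion_count, 2))
```

Using `len(toy)` as the denominator keeps the proportion correct if the sample size is changed later.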

Example 2: Now we can calculate the proportion of people (rows of data) who are older than 30:

In [40]:
data2 = data[data["Age"]>30]
In [41]:
Proportion = len(data2)/120
round(Proportion, 2)  # round the answer to 2 decimal places
Out[41]:
0.82

Exercise 1: Using 1000 trials, estimate the probability of picking a person who earns less than $50000. Round your answer to 1 decimal place.

In [59]:
data_1000 = toy_dataset.sample(1000)
data3 = data_1000[data_1000["Income"]<50000]
proportion = data3.shape[0]/1000
print("Empirical probability: ",round(proportion, 1))
Empirical probability:  0.1
In [58]:
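def people_earning_less_than_50000(sample):
    # Count the rows of a one-column DataFrame whose value is below 50000.
    income_50000 = 0
    for v in sample.values:  # each v is a one-element row array
        if v[0]<50000:
            income_50000 += 1
    return income_50000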
data4 = toy_dataset.sample(1000)['Income'].to_frame()
proportion = people_earning_less_than_50000(data4)/1000
print("Empirical probability: ",round(proportion,1))
Empirical probability:  0.1
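
The row-by-row loop and a vectorized comparison should give the same count. A short sketch that checks this on synthetic data: the `incomes` frame, its normal distribution, the helper name `proportion_below`, and the seeds are all assumptions made for illustration, not part of the assignment's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes standing in for the Income column of toy_dataset.csv.
rng = np.random.default_rng(0)
incomes = pd.DataFrame({"Income": rng.normal(90_000, 25_000, size=10_000)})

def proportion_below(frame, threshold, n_trials, seed=None):
    """Estimate P(Income < threshold) from n_trials randomly sampled rows."""
    sample = frame.sample(n_trials, random_state=seed)
    # Loop version, mirroring the row-by-row count.
    count = 0
    for value in sample["Income"].values:
        if value < threshold:
            count += 1
    looped = count / n_trials
    # Vectorized version: mean of the boolean mask.
    vectorized = (sample["Income"] < threshold).mean()
    return looped, vectorized

looped, vectorized = proportion_below(incomes, 50_000, 1000, seed=42)
print(round(looped, 1), round(vectorized, 1))
```

Passing an integer `random_state` makes the sample repeatable, which helps when comparing two ways of computing the same estimate.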

Exercise 2: Using 1000 trials, estimate the probability of picking three people with at least one of them being from Boston. Round your answer to 1 decimal place.

In [57]:
at_least_one_from_Boston = 0
for n in range(1000):
    if 'Boston' in toy_dataset.sample(3)['City'].values:
        at_least_one_from_Boston += 1
print("Number of experiments that have at least one from Boston:",at_least_one_from_Boston)
proportion_Boston = at_least_one_from_Boston/1000
print("Empirical probability: ",proportion_Boston)
Number of experiments that have at least one from Boston: 151
Empirical probability:  0.151
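
The simulation can also be wrapped in a function so that the group size, the number of trials, and the random seed are explicit. This is only a sketch: the `cities` frame (6% Boston, everything else lumped as "Other") and the name `estimate_at_least_one` are hypothetical, chosen so the example runs without `toy_dataset.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical city column: 60 Boston rows out of 1000.
cities = pd.DataFrame({"City": ["Boston"] * 60 + ["Other"] * 940})

def estimate_at_least_one(frame, city, group_size, n_trials, seed=0):
    """Fraction of trials in which a random group contains `city`."""
    # One RandomState shared by all trials keeps the whole run reproducible.
    rng = np.random.RandomState(seed)
    hits = 0
    for _ in range(n_trials):
        group = frame.sample(group_size, random_state=rng)
        if (group["City"] == city).any():
            hits += 1
    return hits / n_trials

estimate = estimate_at_least_one(cities, "Boston", group_size=3, n_trials=1000)
print("Empirical probability:", estimate)
```

With a fixed seed, rerunning the cell gives the same estimate, so changes in the result can be attributed to changes in the code and not to sampling noise.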
In [46]:
# Using a method we haven't seen yet (value_counts)
data6 = toy_dataset.sample(1000)['City'].value_counts().to_frame()
data6
Out[46]:
City
New York City 319
Los Angeles 196
Dallas 150
Mountain View 92
Austin 88
Boston 57
Washington D.C. 55
San Diego 43
In [47]:
proportion_Boston = data6.loc['Boston'].iloc[0]/1000
proportion_Boston
Out[47]:
0.057
In [48]:
proportion_NoBoston = data6.drop(index='Boston').iloc[:, 0].sum()/1000
proportion_NoBoston
Out[48]:
0.943
In [49]:
# P(at least one of three from Boston) = 1 - P(none from Boston);
# equivalently 3*p*q**2 + 3*p**2*q + p**3 (the binomial coefficients are required)
round(1 - proportion_NoBoston**3, 3)
Out[49]:
0.161
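
For reference, the probability of at least one Boston resident among three independent picks can be written either through the complement rule or as a binomial sum, and the two must agree. A small sketch, taking p = 0.057 from the sample proportion shown above and treating the three picks as independent (an approximation, since `sample(3)` draws without replacement from a large table):

```python
from math import comb

p = 0.057          # sample proportion of Boston rows
q = 1 - p          # proportion of rows from any other city
n = 3              # people picked per trial

# Complement rule: 1 - P(nobody is from Boston).
complement = 1 - q**n

# Binomial sum over k = 1..n people from Boston; comb(n, k) counts
# the orderings in which exactly k of the n picks are from Boston.
binomial_sum = sum(comb(n, k) * p**k * q**(n - k) for k in range(1, n + 1))

print(round(complement, 3), round(binomial_sum, 3))
```

Both values come out near 0.16, which is in line with the 0.151 obtained from the 1000-trial simulation.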