# Assignment 1.3.1: Estimating Probabilities

Last updated: March 10th, 2020


In :
import pandas as pd
import numpy as np

In :
toy_dataset = pd.read_csv('toy_dataset.csv')
toy_dataset.head()  # show the first five rows

Out:
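|   | Number | City | Gender | Age | Income | Illness |
|---|---|---|---|---|---|---|
| 0 | 1 | Dallas | Male | 41 | 40367.0 | No |
| 1 | 2 | Dallas | Male | 54 | 45084.0 | No |
| 2 | 3 | Dallas | Male | 42 | 52483.0 | No |
| 3 | 4 | Dallas | Male | 40 | 40941.0 | No |
| 4 | 5 | Dallas | Male | 46 | 50289.0 | No |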

Let's estimate the probability of picking a person who lives in Dallas by calculating the proportion of times we get this result. For example, we select 120 rows at random:

In :
data = toy_dataset.sample(120)
data

Out:
|   | Number | City | Gender | Age | Income | Illness |
|---|---|---|---|---|---|---|
| 21108 | 21109 | New York City | Male | 41 | 107470.0 | No |
| 77104 | 77105 | Los Angeles | Male | 63 | 92314.0 | No |
| 146451 | 146452 | Austin | Female | 55 | 77330.0 | No |
| 50498 | 50499 | New York City | Male | 54 | 103988.0 | No |
| 15727 | 15728 | Dallas | Female | 43 | 45464.0 | No |
| ... | ... | ... | ... | ... | ... | ... |
| 10840 | 10841 | Dallas | Male | 35 | 53448.0 | Yes |
| 10945 | 10946 | Dallas | Male | 34 | 48523.0 | Yes |
| 32569 | 32570 | New York City | Female | 41 | 94292.0 | No |
| 86539 | 86540 | Los Angeles | Female | 47 | 69357.0 | No |
| 141948 | 141949 | Austin | Male | 43 | 84142.0 | No |

120 rows × 6 columns
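
Because `sample` draws rows at random, each run of the cell above returns a different subset, so the counts that follow will vary from run to run. A minimal sketch of making the draw repeatable with the `random_state` argument of `DataFrame.sample`; the small `demo` frame is a made-up stand-in for `toy_dataset`, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for toy_dataset (the real CSV is not loaded here).
demo = pd.DataFrame({
    "City": ["Dallas", "Boston", "Austin", "Dallas", "Boston", "Austin"] * 50,
    "Age": list(range(25, 55)) * 10,
})

# The same random_state yields the same rows on every run.
first = demo.sample(120, random_state=0)
second = demo.sample(120, random_state=0)

print(first.equals(second))  # True
print(len(first))            # 120
```

Fixing the seed is useful when the numbers quoted in the text need to match the output; leaving it out is fine when only a rough estimate is wanted.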

We keep the rows where the city is Dallas:

In :
data1 = data[data["City"]=="Dallas"]

In :
len(data1)

Out:
18

There are 18 rows that meet this condition.

In :
proportion = len(data1)/120

In :
proportion

Out:
0.15
In :
round(proportion, 2) #round the answer to 2 decimal places

Out:
0.15
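
The filter, count, and divide steps above can also be collapsed into one: the mean of a boolean Series is the fraction of `True` values. A small sketch on made-up data (the `demo` frame is an assumption for illustration), which also divides by `len(demo)` rather than a hard-coded sample size:

```python
import pandas as pd

# Made-up rows for illustration; not taken from toy_dataset.
demo = pd.DataFrame({"City": ["Dallas", "Austin", "Dallas", "Boston", "Austin"]})

is_dallas = demo["City"] == "Dallas"

proportion_long = len(demo[is_dallas]) / len(demo)  # filter, count, divide
proportion_short = is_dallas.mean()                 # same value in one step

print(proportion_long, proportion_short)
```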

Now we can calculate the proportion of people (rows of data) who are more than 30 years old:

In :
data2 = data[data["Age"]>30]

In :
len(data2)

Out:
103
In :
proportion = len(data2)/120

In :
round(proportion, 2) #round the answer to 2 decimal places

Out:
0.86
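
The Dallas and age estimates follow the same pattern: draw a sample, test a condition, take the fraction of rows that pass. One way to capture that pattern is a small helper. This is only a sketch; `estimate_probability` and its parameters are hypothetical names, and `demo` is made-up data:

```python
import pandas as pd

def estimate_probability(df, condition, n_rows, seed=None):
    """Sample n_rows rows and return the fraction for which condition holds.

    condition is a function that takes the sampled DataFrame and
    returns a boolean Series.
    """
    sample = df.sample(n_rows, random_state=seed)
    return condition(sample).mean()

# Made-up ages for illustration; five of the eight are above 30.
demo = pd.DataFrame({"Age": [25, 31, 40, 28, 55, 62, 30, 45]})

p_over_30 = estimate_probability(demo, lambda s: s["Age"] > 30, n_rows=8, seed=1)
print(p_over_30)
```

With the real data, the call would look like `estimate_probability(toy_dataset, lambda s: s["Age"] > 30, n_rows=120)`.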

Estimate, with 1000 trials, the probability of picking a person who earns less than \$50000. Round your answer to 1 decimal place.

In :
data_1000 = toy_dataset.sample(1000)
data3 = data_1000[data_1000["Income"]<50000]
proportion = len(data3)/1000
round(proportion, 1)

Out:
0.1
In :
data4 = toy_dataset.sample(1000)['Income'].to_frame()
data4

Out:
|   | Income |
|---|---|
| 42581 | 98466.0 |
| 88623 | 116629.0 |
| 83446 | 96775.0 |
| 58112 | 77553.0 |
| 21035 | 83526.0 |
| ... | ... |
| 107981 | 133505.0 |
| 74728 | 83549.0 |
| 43111 | 86438.0 |
| 108641 | 114697.0 |
| 2865 | 54306.0 |

1000 rows × 1 columns

In :
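def people_earning_less_than_50000(sample):
    # Count the rows whose Income is below 50000, looking up each row by its index label
    income_50000 = 0
    for i in sample.index:
        # Select the scalar Income value; comparing a whole row inside `if` raises a ValueError
        if sample.loc[i, 'Income'] < 50000:
            income_50000 += 1
    return income_50000

data4 = toy_dataset.sample(1000)['Income'].to_frame()
proportion = people_earning_less_than_50000(data4)/1000
round(proportion, 1)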

Out:
0.1
In :
# This version is faster because it iterates over the underlying NumPy values
# instead of doing a label lookup with .loc for every row
def people_earning_less_than_50000(sample):
    income_50000 = 0
    for v in sample['Income'].values:
        if v < 50000:
            income_50000 += 1
    return income_50000

data4 = toy_dataset.sample(1000)['Income'].to_frame()
proportion = people_earning_less_than_50000(data4)/1000
round(proportion, 1)

Out:
0.1
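
Both loops above count qualifying rows one at a time. A vectorized comparison gives the same count without an explicit Python loop. The sketch below checks that on synthetic incomes; the normal distribution and its parameters are assumptions chosen only for illustration, not properties of the real dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic incomes for illustration only.
incomes = pd.DataFrame({"Income": rng.normal(90_000, 25_000, size=1000)})

def count_with_loop(sample):
    """Count incomes below 50000 one value at a time."""
    count = 0
    for v in sample["Income"].to_numpy():
        if v < 50000:
            count += 1
    return count

loop_count = count_with_loop(incomes)

# The comparison produces a boolean Series; summing it counts the True values.
vector_count = int((incomes["Income"] < 50000).sum())

print(loop_count == vector_count)
print(round(vector_count / len(incomes), 1))
```

Note that rounding to 1 decimal place is coarse: any proportion between 0.05 and 0.15 is reported as 0.1.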

Estimate, with 1000 trials, the probability of picking three people with at least one of them being from Boston. Round your answer to 1 decimal place.

In :
at_least_one_from_Boston = 0
for n in range(1000):
    if 'Boston' in toy_dataset.sample(3)['City'].values:
        at_least_one_from_Boston += 1
at_least_one_from_Boston
prop = at_least_one_from_Boston/1000
prop

Out:
0.151
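
The simulation above can be sanity-checked on data where the answer is known in advance. The sketch below uses a made-up frame in which exactly 1 row in 10 is from Boston (an assumption for illustration, not the real city mix) and seeds the draws with a `numpy.random.RandomState` so the run is repeatable:

```python
import numpy as np
import pandas as pd

# Hypothetical city mix: 100 of 1000 rows are Boston.
demo = pd.DataFrame({"City": ["Boston"] * 100 + ["Dallas"] * 900})

rng = np.random.RandomState(42)  # shared generator, advanced by each draw
n_trials = 1000
hits = 0
for _ in range(n_trials):
    trio = demo.sample(3, random_state=rng)
    if (trio["City"] == "Boston").any():
        hits += 1

estimate = hits / n_trials

# Treating the three picks as independent, P(at least one) = 1 - 0.9**3.
expected = 1 - 0.9 ** 3

print(estimate, round(expected, 3))
```

The estimate should land near 0.271, with some run-to-run noise from using only 1000 trials.
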
In :
data6 = toy_dataset.sample(1000)['City'].value_counts().to_frame()
data6

Out:
|   | City |
|---|---|
| New York City | 333 |
| Los Angeles | 190 |
| Dallas | 124 |
| Mountain View | 99 |
| Austin | 95 |
| Boston | 73 |
| Washington D.C. | 52 |
| San Diego | 34 |
In :
proportion_Boston = data6.loc['Boston']/1000
proportion_Boston

Out:
0.073
In :
proportion_NoBoston = (data6.loc['New York City']+data6.loc['Los Angeles']+data6.loc['Dallas']+data6.loc['Mountain View']+data6.loc['Austin']+data6.loc['Washington D.C.']+data6.loc['San Diego'])/1000
proportion_NoBoston

Out:
0.927
In :
# P(at least one of three from Boston) = 3*p*q^2 + 3*p^2*q + p^3 = 1 - q^3
3*(proportion_Boston)*(proportion_NoBoston)**2+3*(proportion_Boston)**2*(proportion_NoBoston)+(proportion_Boston)**3

Out:
0.203402
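
For reference, if the three picks are treated as independent with Boston probability p and q = 1 - p, then "at least one of three from Boston" is the complement of "none from Boston", so the probability is 1 - q³. Expanding by cases (exactly one, two, or three from Boston) gives 3pq² + 3p²q + p³, which is the same number. A quick numeric check using the p = 0.073 observed in the sample above:

```python
import math

p = 0.073  # Boston proportion from the sample above
q = 1 - p  # proportion from any other city

by_complement = 1 - q ** 3
by_cases = 3 * p * q**2 + 3 * p**2 * q + p**3  # exactly 1, 2, or 3 from Boston

print(round(by_complement, 4))                # 0.2034
print(math.isclose(by_complement, by_cases))  # True
```

This value depends on the sampled p, which is itself a noisy estimate, so it need not match the result of the direct simulation.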