# Atlas California

Data science problem from Atlas

Last updated: May 14th, 2020

# Data exploration

Data exploration is done to ensure that the data set does not contain anything unexpected and, if it does, to clean it up. This step also helps ensure that we choose representative data for our training.

We start off by importing some libraries and loading the data.

In [34]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the data set (the file name here is an assumption; adjust to the actual path)
allData = pd.read_csv('housing.csv')


In [35]:
desc=allData.describe()

print (desc)

         Unnamed: 0     longitude      latitude  housing_median_age  \
count  20640.000000  20640.000000  20640.000000        20640.000000
mean   10319.500000   -119.569704     35.631861           28.639486
std     5958.399114      2.003532      2.135952           12.585558
min        0.000000   -124.350000     32.540000            1.000000
25%     5159.750000   -121.800000     33.930000           18.000000
50%    10319.500000   -118.490000     34.260000           29.000000
75%    15479.250000   -118.010000     37.710000           37.000000
max    20639.000000   -114.310000     41.950000           52.000000

total_rooms  total_bedrooms   total_pools    population    households  \
count  20640.000000    20433.000000  20640.000000  20640.000000  20640.000000
mean    2635.763081      537.870553     49.457946   1425.476744    499.539680
std     2181.615252      421.385070     42.641988   1132.462122    382.329753
min        2.000000        1.000000      0.000000      3.000000      1.000000
25%     1447.750000      296.000000     24.000000    787.000000    280.000000
50%     2127.000000      435.000000     39.000000   1166.000000    409.000000
75%     3148.000000      647.000000     61.000000   1725.000000    605.000000
max    39320.000000     6445.000000    890.000000  35682.000000   6082.000000

median_income  median_house_value
count   20640.000000        20640.000000
mean        3.870671       206855.816909
std         1.899822       115395.615874
min         0.499900        14999.000000
25%         2.563400       119600.000000
50%         3.534800       179700.000000
75%         4.743250       264725.000000
max        15.000100       500001.000000


The first thing we notice is that we have an unnamed column; this is most likely because whoever saved the data set also saved the index as a column. So we remove that column.

In [36]:
allData=allData.drop('Unnamed: 0',axis=1)
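Alternatively, the stray index column can be consumed at load time by passing `index_col=0` to `read_csv` (the inline CSV below is a toy example, not the real data):

```python
import io
import pandas as pd

# A CSV saved together with its index comes back as an 'Unnamed: 0'
# column on reload; index_col=0 turns it back into the index instead.
csv_text = ",longitude,latitude\n0,-122.23,37.88\n1,-122.22,37.86\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
```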


Next we check whether there are any empty fields/NaN values in the data set.

In [37]:
allData.isna().sum()

Out[37]:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
total_pools             0
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

Now we can see that the column total_bedrooms contains 207 empty fields/NaN. Since this is such a small percentage of the data, we choose to remove those rows and reindex the data set. If the fraction were much larger we would most likely have to do something else, such as imputing new values, going back to the supplier of the data set and asking, or removing the column entirely.

In [38]:
allData.dropna(inplace = True)
allData= allData.reset_index(drop = True)
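For reference, the imputation alternative mentioned above could be sketched like this (the frame below is a toy example, not the real data):

```python
import pandas as pd

# Hypothetical alternative to dropping rows: fill missing
# total_bedrooms values with the column median.
df = pd.DataFrame({"total_bedrooms": [296.0, None, 435.0, 647.0]})
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].median())
```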


Let's now perform a visual inspection of the data by plotting all of the columns.

In [39]:
headers=list(allData)

# Plot each column against its row index for a visual inspection
for header in headers:
    plt.figure()
    plt.plot(allData[header], '.')
    plt.title(header)
plt.show()


There are a couple of things we can see from these plots.

Looking at the longitude and latitude plots, we can see that the data is most likely ordered by the time/date a house was sold/evaluated: rows that are adjacent in the data set are also geographically close. This is most clearly visible where the x-axis is around 5000. What this means for our training is that when choosing the validation and test sets we need to randomly sample from the whole data set rather than just use the last 10%. It also makes it possible for us to create a new feature, a timeID.
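Since the rows appear to be time-ordered, a shuffled split could be sketched as follows (the frame and the 80/10/10 ratios are illustrative, not from the notebook):

```python
import pandas as pd

# Shuffle the rows, then slice into train/validation/test partitions
# so that each partition samples from the whole time range.
df = pd.DataFrame({"x": range(100)})  # stand-in for allData
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
n_train, n_val = int(0.8 * len(shuffled)), int(0.1 * len(shuffled))
train = shuffled.iloc[:n_train]
val = shuffled.iloc[n_train:n_train + n_val]
test = shuffled.iloc[n_train + n_val:]
```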

To improve the data set we also want to remove outliers that are not representative of it. In the total_pools plot we can see one outlier around 850. In population we have two outliers, one around 28,000 and one around 35,000. We can also see that there is only a single instance of ISLAND in the ocean_proximity plot. All of these rows are removed below.

In [40]:
allData=allData[allData.total_pools < 800]
allData=allData[allData.population < 25000]
allData=allData[allData.ocean_proximity != 'ISLAND']
allData= allData.reset_index(drop = True)

Finally, we plot the cleaned data again to verify the result and save it to a new file.

In [41]:
headers=list(allData)

# Re-plot the cleaned columns to verify the outliers are gone
for header in headers:
    plt.figure()
    plt.plot(allData[header], '.')
    plt.title(header)
plt.show()
allData.to_csv('cleanedData.csv', index=False)