Atlas California

Last updated: May 14th, 2020

Data exploration

We explore the data to make sure the data set does not contain anything unexpected and, if it does, to clean it up. This step also helps ensure that we choose representative data for our training.

We start off by importing some libraries and loading the data.

In [34]:
import pandas as pd
import matplotlib.pyplot as plt

allData=pd.read_csv('californiahousing.csv')
In [35]:
desc=allData.describe()

print (desc)
         Unnamed: 0     longitude      latitude  housing_median_age  \
count  20640.000000  20640.000000  20640.000000        20640.000000   
mean   10319.500000   -119.569704     35.631861           28.639486   
std     5958.399114      2.003532      2.135952           12.585558   
min        0.000000   -124.350000     32.540000            1.000000   
25%     5159.750000   -121.800000     33.930000           18.000000   
50%    10319.500000   -118.490000     34.260000           29.000000   
75%    15479.250000   -118.010000     37.710000           37.000000   
max    20639.000000   -114.310000     41.950000           52.000000   

        total_rooms  total_bedrooms   total_pools    population    households  \
count  20640.000000    20433.000000  20640.000000  20640.000000  20640.000000   
mean    2635.763081      537.870553     49.457946   1425.476744    499.539680   
std     2181.615252      421.385070     42.641988   1132.462122    382.329753   
min        2.000000        1.000000      0.000000      3.000000      1.000000   
25%     1447.750000      296.000000     24.000000    787.000000    280.000000   
50%     2127.000000      435.000000     39.000000   1166.000000    409.000000   
75%     3148.000000      647.000000     61.000000   1725.000000    605.000000   
max    39320.000000     6445.000000    890.000000  35682.000000   6082.000000   

       median_income  median_house_value  
count   20640.000000        20640.000000  
mean        3.870671       206855.816909  
std         1.899822       115395.615874  
min         0.499900        14999.000000  
25%         2.563400       119600.000000  
50%         3.534800       179700.000000  
75%         4.743250       264725.000000  
max        15.000100       500001.000000  

The first thing we notice is that we have an unnamed column. This most likely appeared because whoever saved the data set also saved the index as a column, so we remove it.

In [36]:
allData=allData.drop('Unnamed: 0',axis=1)

The next thing we look at is whether there are any empty fields/NaN values in the data set.

In [37]:
allData.isna().sum()
Out[37]:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
total_pools             0
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

Now we can see that the column total_bedrooms contains 207 empty fields/NaN values. Since this is such a small percentage of the data, we choose to remove those rows and reindex the dataset. If the proportion were much larger we would most likely have to do something else, such as imputing new values, going back to the supplier of the data set and asking, or removing the column entirely.
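For illustration, if we had instead chosen to impute the missing values, a minimal sketch using the column median could look like the following (this is not run here; we drop the rows below instead):

# Hypothetical alternative to dropping rows: fill the 207 missing
# total_bedrooms values with the column median.
medianBedrooms = allData['total_bedrooms'].median()
allData['total_bedrooms'] = allData['total_bedrooms'].fillna(medianBedrooms)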

In [38]:
allData.dropna(inplace = True)  
allData= allData.reset_index(drop = True)

Let's now perform a visual inspection of the data by plotting all of the columns.

In [39]:
headers=list(allData)

for header in headers:
    plt.figure()
    plt.plot(allData[header],'*')
    plt.title(header)
    plt.show()

There are a couple of things we can see from these plots.

When looking at the longitude and latitude plots we can see that the data is most likely ordered by the time/date at which a house was sold or evaluated: houses from the same area appear very close to each other in the data, most clearly around x = 5000. For our training this means that when choosing the validation and test sets we need to sample randomly from the whole dataset rather than just take the last 10%. It also makes it possible to create a new feature, a timeID.
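As a sketch of what such a random split could look like with pandas alone (the variable names and the 10% split below are purely illustrative, and the split is not performed as part of this cleaning):

# Randomly sample 10% of the rows as a held-out set instead of taking
# the last 10%, which would be biased by the apparent time ordering.
testData = allData.sample(frac=0.1, random_state=42)
trainData = allData.drop(testData.index)

Fixing random_state makes the sample reproducible between runs.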

To improve the data set we also want to remove outliers that are not representative of it. In the total_pools plot we can see one outlier near the column maximum of 890, and in population we have two outliers, one around 28000 and one around 35000. We can also see that there is only a single instance of ISLAND in the ocean_proximity plot. All of these rows are removed below.

In [40]:
allData=allData[allData.total_pools < 800]
allData=allData[allData.population < 25000]
allData=allData[allData.ocean_proximity != 'ISLAND']
allData= allData.reset_index(drop = True)
In [41]:
headers=list(allData)

for header in headers:
    plt.figure()
    plt.plot(allData[header],'*')
    plt.title(header)
    plt.show()
    

After plotting the data again we can see that we have successfully removed the outliers and completed a quick cleaning of the data. We can now save it and use it to train our models. Of course, we could analyze the data further to extract more information and create new informative features for our models, but we will stop here.
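As an illustration of the kind of derived feature we could add later (a hypothetical example, not included in the saved data set):

# Example of a possible derived feature: average number of rooms per
# household in each block group. We only inspect it here, we do not add it.
roomsPerHousehold = allData['total_rooms'] / allData['households']
print(roomsPerHousehold.describe())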

In [43]:
allData.to_csv('cleanedData.csv', index=False)