Data exploration
Data exploration is done to make sure the data set does not contain anything unexpected and, if it does, to clean it up. This step also helps ensure that we choose representative data for our training.
We start off by importing some libraries and loading the data.
import pandas as pd
import matplotlib.pyplot as plt
# Load the data set and print summary statistics for a first look
allData = pd.read_csv('californiahousing.csv')
desc = allData.describe()
print(desc)
The first thing we notice is that we have an unnamed column. This is most likely because whoever saved the data set also saved the index as a column, so we remove that column.
# Drop the stray index column
allData = allData.drop('Unnamed: 0', axis=1)
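As an aside, the stray index column could also have been avoided at load time; a minimal sketch of that alternative (assuming the first column in the file really is the saved index):
# Alternative to dropping the column afterwards: use the first CSV column as the index
allData = pd.read_csv('californiahousing.csv', index_col=0)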
The next thing we look at is whether there are any empty fields/NaN values in the data set.
allData.isna().sum()
Now we can see that the column total_bedrooms contains 207 empty fields/NaN. Since this is such a small percentage of the data, we choose to remove those rows and reindex the dataset. If the share were much larger we would most likely have to do something else, such as imputing new values, going back to the supplier of the dataset and asking, or removing the column entirely.
# Drop the rows with missing values and reset the index
allData.dropna(inplace=True)
allData = allData.reset_index(drop=True)
Let's now perform a visual inspection of the data by plotting all of the columns.
# Plot every column against its row index for a visual inspection
headers = list(allData)
for header in headers:
    plt.figure()
    plt.plot(allData[header], '*')
    plt.title(header)
    plt.show()
There are a couple of things we can see from these plots.
Looking at the longitude and latitude plots, the data appears to be ordered by the time a house was sold/evaluated: rows from the same area sit right next to each other, which is most obvious around x = 5000. For our training this means that the validation and test sets must be randomly sampled from the whole dataset rather than simply taking the last 10%. It also makes it possible to create a new feature, timeID, based on the row order.
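As a rough sketch of what that could look like later on (the use of scikit-learn's train_test_split and the timeID feature derived from the row order are assumptions for illustration, not part of the cleaning itself):
from sklearn.model_selection import train_test_split
# Hypothetical timeID feature: the row order seems to follow when a house was
# sold/evaluated, so the running index can serve as a rough time proxy
splitData = allData.copy()
splitData['timeID'] = splitData.index
# Randomly sample 10% of the rows as a held-out set instead of taking the last 10%;
# random_state is fixed only so the sketch is reproducible
trainData, testData = train_test_split(splitData, test_size=0.1, random_state=42)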
To improve the dataset we also want to remove outliers that are not representative of it. In the total_pools plot we can see one outlier at around 850. In population we have two outliers, one at around 28000 and one at around 35000. We can also see that there is only a single instance of ISLAND in the ocean_proximity plot. All of these rows are removed below.
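Before removing them we can double-check these observations numerically; a quick sketch using pandas (just one way to confirm what the plots show):
# Largest values in the columns where we suspect outliers
print(allData['total_pools'].nlargest(3))
print(allData['population'].nlargest(3))
# Number of rows per ocean_proximity category
print(allData['ocean_proximity'].value_counts())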
# Remove the outliers identified above and reset the index
allData = allData[allData.total_pools < 800]
allData = allData[allData.population < 25000]
allData = allData[allData.ocean_proximity != 'ISLAND']
allData = allData.reset_index(drop=True)
# Plot all columns again to verify that the outliers are gone
headers = list(allData)
for header in headers:
    plt.figure()
    plt.plot(allData[header], '*')
    plt.title(header)
    plt.show()
Plotting the data again confirms that the outliers have been removed, which completes our quick cleaning of the data. We can now save the data and use it to train our models. We could of course analyze the data further to extract more information and create new, informative features, but we will stop here.
allData.to_csv('cleanedData.csv', index=False)