Tuning a diabetes prediction model¶
In this project, we'll focus on two key concepts, cross-validation and hyper-parameter tuning, to achieve the best accuracy for the model.
We will continue working with the Diabetes dataset, which has 8 numeric features plus a 0-1 class label.
Hands on!¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load the diabetes_3.csv dataset and store it in diabetes_df.¶
Invalid observations have already been removed from this file, and the class labels are balanced.
# your code goes here
diabetes_df = None
diabetes_df = pd.read_csv('diabetes_3.csv')
diabetes_df.head()
Show the shape of the resulting diabetes_df.¶
# your code goes here
diabetes_df.shape
Separate the features (X) from the label (y).¶
# your code goes here
X = None
y = None
X = diabetes_df.drop(['label'], axis=1)
y = diabetes_df['label']
Standardize the features¶
Use the StandardScaler to standardize the features (X) before moving on to model creation.
from sklearn.preprocessing import StandardScaler
# your code goes here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
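KNN is distance-based, so features on larger scales would otherwise dominate the distance computation. As a quick optional sanity check, each standardized column should now have mean ≈ 0 and standard deviation ≈ 1:
# Sanity check: standardized features should have mean ~0 and std ~1
print(X.mean(axis=0).round(2))
print(X.std(axis=0).round(2))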
Model creation and cross-validation evaluation¶
Build a get_kneighbors_score function that receives:
- X: features
- y: label
- k: neighbors

This function should train a KNeighborsClassifier and return the mean and standard deviation of the scores of a 4-fold cross-validation.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
def get_kneighbors_score(X, y, k):
# your code goes here
pass
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
def get_kneighbors_score(X, y, k):
    # Train a KNN with k neighbors and evaluate it with 4-fold cross-validation
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=4)
    # Return the mean and standard deviation of the fold accuracies
    return scores.mean(), scores.std()
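Under the hood, cross_val_score with an integer cv and a classifier uses stratified folds. If you want to see the mechanics, here is a minimal hand-rolled sketch; manual_cv_score is a hypothetical helper, not part of the exercise, and should produce essentially the same numbers:
from sklearn.model_selection import StratifiedKFold

def manual_cv_score(X, y, k, n_splits=4):
    # Hand-rolled equivalent of cross_val_score for a classifier:
    # fit on each training fold, score on the held-out fold
    y = np.asarray(y)
    model = KNeighborsClassifier(n_neighbors=k)
    fold_scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        model.fit(X[train_idx], y[train_idx])
        fold_scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(fold_scores), np.std(fold_scores)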
Test your function¶
Use the whole dataset to test your get_kneighbors_score() function.
Print the scores obtained using 5, 10 and 15 neighbors (k).
# your code goes here
print(f"Using 5 neighbors: {get_kneighbors_score(X, y, 5)}")
print(f"Using 10 neighbors: {get_kneighbors_score(X, y, 10)}")
print(f"Using 15 neighbors: {get_kneighbors_score(X, y, 15)}")
Let's try to find the best k value.
Getting the best amount of neighbors¶
Train a KNN to test different values of k. Keep using a KNeighborsClassifier estimator and 4-fold cross-validation.

Test the following k values:

parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]
# your code goes here
def get_kneighbors_score(k):
model = None
scores = None
return None
ACC_dev = None
#for k in None:
# scores=None
# ACC_dev.append(scores)
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

def get_kneighbors_score(k):
    # Redefined to take only k and return the mean 4-fold CV accuracy
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=4)
    return scores.mean()

ACC_dev = []
for k in parameters:
    # Mean cross-validated accuracy for each candidate k
    ACC_dev.append(get_kneighbors_score(k))
print(ACC_dev)
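For reference, scikit-learn can automate this kind of search with GridSearchCV. Here is a rough equivalent of the loop above, assuming the same estimator and 4-fold cross-validation:
from sklearn.model_selection import GridSearchCV

# Grid search over the same candidate k values with 4-fold CV
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': parameters},
                    cv=4)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)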
Getting the Validation curves¶
Plot the validation curve (testing accuracy versus k). Which is the best k parameter?
# your code goes here
# Find the k with the highest mean CV accuracy
best_idx = int(np.argmax(ACC_dev))
best_k = parameters[best_idx]

f, ax = plt.subplots(figsize=(10, 5))
plt.plot(parameters, ACC_dev, 'o-', label='CV accuracy')
plt.axvline(x=best_k, color='k', label=f'best k = {best_k}')
plt.xlabel('Neighbors')
plt.ylabel('Accuracy')
plt.grid()
plt.legend()
plt.show()

print('Best parameter:', best_k, 'Accuracy:', round(ACC_dev[best_idx], 2))
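scikit-learn also provides validation_curve, which computes both training and cross-validation scores over a parameter range. A minimal sketch that reproduces the curve above and adds the training curve, assuming the same 4-fold setup:
from sklearn.model_selection import validation_curve

# Training and CV accuracy for each candidate k
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name='n_neighbors', param_range=parameters, cv=4)

plt.plot(parameters, train_scores.mean(axis=1), 'o-', label='training accuracy')
plt.plot(parameters, test_scores.mean(axis=1), 'o-', label='CV accuracy')
plt.xlabel('Neighbors')
plt.ylabel('Accuracy')
plt.legend()
plt.show()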