MLF - Tuning Diabetes Prediction Model

Last updated: August 12th, 2020

rmotr


Tuning diabetes prediction model

In this project, we'll focus on two key concepts: cross-validation and hyper-parameter tuning, in order to achieve the best model accuracy.

We will continue working with the Diabetes dataset, which has 8 numeric features plus a 0-1 class label.


Hands on!

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline


 Load the data/diabetes_3.csv file, and store it into diabetes_df DataFrame.

Erroneous observations have already been removed from this file, and the classes are balanced.

In [ ]:
# your code goes here
diabetes_df = None
In [ ]:
diabetes_df = pd.read_csv('data/diabetes_3.csv')

diabetes_df.head()

Show the shape of the resulting diabetes_df.

In [ ]:
# your code goes here
In [ ]:
diabetes_df.shape


Data preparation

Before modeling, prepare the data:

Create features $X$ and labels $y$

In [ ]:
# your code goes here
X = None
y = None
In [ ]:
X = diabetes_df.drop(['label'], axis=1)
y = diabetes_df['label']

Standardize the features

Use the StandardScaler to standardize the features (X) before moving to model creation.

In [ ]:
from sklearn.preprocessing import StandardScaler

# your code goes here
In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(X)
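As a quick sanity check, standardized features should end up with mean ≈ 0 and standard deviation ≈ 1 per column. Here is a minimal sketch using synthetic data (since the diabetes CSV may not be available outside the course environment):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features with a non-zero mean and non-unit spread
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

X_scaled = StandardScaler().fit_transform(X_demo)

# Each column is now centered at 0 with unit variance
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True
```

This matters for KNN in particular, because the classifier relies on distances: an unscaled feature with a large range would otherwise dominate the neighbor computation.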


 Model creation and cross-validation evaluation

Build a get_kneighbors_score function that receives:

  • X: features
  • y: label
  • k: neighbors

This function should train a KNeighborsClassifier and return the mean and standard deviation of the scores from a 4-fold cross-validation.

In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def get_kneighbors_score(X, y, k):
    # your code goes here
    pass
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def get_kneighbors_score(X, y, k):
    model = KNeighborsClassifier(n_neighbors=k)

    scores = cross_val_score(model, X, y, cv=4)
    
    return (scores.mean(), scores.std())
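To see what the 4-fold cross-validation actually returns, here is a self-contained sketch on a synthetic binary-classification dataset (make_classification is used purely for demonstration; the real exercise uses the diabetes data):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the diabetes data: 8 features, binary label
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)

model = KNeighborsClassifier(n_neighbors=5)

# cv=4 splits the data into 4 folds and returns one accuracy per fold
scores = cross_val_score(model, X_demo, y_demo, cv=4)

print(scores.shape)  # (4,) — one score per fold
print(scores.mean(), scores.std())
```

The mean summarizes overall accuracy, while the standard deviation indicates how stable that accuracy is across folds.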

Test your function

Use the whole data to test your get_kneighbors_score() function.

Print scores obtained by using 5, 10 and 15 neighbors (k).

In [ ]:
# your code goes here
In [ ]:
print(f"Using 5 neighbors: {get_kneighbors_score(X, y, 5)}")
print(f"Using 10 neighbors: {get_kneighbors_score(X, y, 10)}")
print(f"Using 15 neighbors: {get_kneighbors_score(X, y, 15)}")

Let's try to get the best k value.


Getting the best amount of neighbors

Train a KNN to test different values of k.

Keep using a KNeighborsClassifier estimator and a 4-fold Cross-validation.

Test the following k values:

In [ ]:
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

# your code goes here
def get_kneighbors_score(k):
    model = None
    scores = None
    return None

ACC_dev = None

#for k in None:
#    scores=None
#    ACC_dev.append(scores)
In [ ]:
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

def get_kneighbors_score(k):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=4)
    return scores.mean()

ACC_dev = []
for k in parameters:
    scores = get_kneighbors_score(k)
    ACC_dev.append(scores)
    
ACC_dev
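An equivalent, more idiomatic way to sweep k values is scikit-learn's GridSearchCV, which runs the same 4-fold cross-validation for each candidate and keeps track of the best one. A sketch on synthetic data (the parameter grid here is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, same shape as the diabetes features
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)

# Try several k values, scoring each with 4-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [1, 3, 5, 8, 10, 15]},
                    cv=4)
grid.fit(X_demo, y_demo)

print(grid.best_params_)  # the k with the highest mean CV accuracy
print(grid.best_score_)
```

The manual loop above is useful for learning what happens under the hood; GridSearchCV becomes more convenient once you tune several hyper-parameters at once.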

Getting the validation curve

Plot the validation curve (testing accuracy versus k). Which is the best k parameter?

In [ ]:
# your code goes here
In [ ]:
parameters
In [ ]:
ACC_dev
In [ ]:
f, ax = plt.subplots(figsize=(10,5))

plt.plot(parameters, ACC_dev, 'o-')
plt.xlabel('Neighbors')
plt.ylabel('Accuracy')

plt.grid()
plt.show()
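Rather than reading the best k off the plot by eye, you can also select it programmatically with np.argmax (shown here with a small made-up list of scores, since the real ACC_dev values depend on your run):

```python
import numpy as np

parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]
# Hypothetical accuracies, one per k, for illustration only
acc_demo = [0.68, 0.70, 0.72, 0.73, 0.74, 0.75, 0.76, 0.75,
            0.75, 0.74, 0.73, 0.71, 0.70, 0.69, 0.68, 0.67]

# Index of the highest accuracy maps back to the best k
best_idx = int(np.argmax(acc_demo))
best_k = parameters[best_idx]
print(best_k)  # → 15
```

Applied to your own ACC_dev list, the same two lines return the k value at the peak of the validation curve.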

