MLF - Balancing Diabetes Observations

Last updated: July 13th, 2020



Balancing diabetes observations

Now we will continue using the Diabetes dataset, which has 8 numeric features plus a 0-1 class label.

We'll check whether the data is balanced before training our model, and then look at the kinds of errors the model makes.


Hands on!

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline


 Load the diabetes_2.csv dataset, and store it into diabetes_df.

Invalid observations have already been removed from this file.

In [ ]:
# your code goes here
diabetes_df = None
In [ ]:
diabetes_df = pd.read_csv('diabetes_2.csv')

diabetes_df.head()

Show the shape of the resulting diabetes_df.

In [ ]:
# your code goes here
In [ ]:
diabetes_df.shape


Analyze label distribution

Are observations well balanced?

How many observations are there for 0 (no diabetes) and 1 (yes diabetes)?

In [ ]:
# your code goes here
In [ ]:
diabetes_df['label'].value_counts()
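
Raw counts are often easier to judge as proportions. Here is a minimal sketch, assuming a made-up toy `label` column (the values below are for illustration only and are not taken from `diabetes_2.csv`):

```python
import pandas as pd

# Hypothetical toy labels standing in for diabetes_df['label'].
toy_labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name='label')

# Absolute number of observations per class.
toy_counts = toy_labels.value_counts()

# Relative frequency per class (the values sum to 1).
toy_proportions = toy_labels.value_counts(normalize=True)

# How many majority observations there are per minority observation.
toy_imbalance_ratio = toy_counts.max() / toy_counts.min()

print(toy_counts)
print(toy_proportions)
print(toy_imbalance_ratio)
```

A ratio close to 1 suggests the classes are roughly balanced; the larger it gets, the stronger the imbalance.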

Show a barplot displaying these values:

In [ ]:
# your code goes here
In [ ]:
diabetes_df['label'].value_counts().plot(kind='bar', figsize=(14,6))


Balancing data

As observations are imbalanced, you will need to balance them.

Your task: down-sample the majority class by randomly removing 0 (no diabetes) observations.

Step 1

Separate observations from each class:

In [ ]:
# your code goes here
In [ ]:
no_diabetes = diabetes_df[diabetes_df['label'] == 0]
yes_diabetes = diabetes_df[diabetes_df['label'] == 1]

no_diabetes.shape, yes_diabetes.shape

 Step 2

Resample the majority class (no diabetes) without replacement to match the number of samples of the minority class.

In [ ]:
from sklearn.utils import resample

# your code goes here
In [ ]:
from sklearn.utils import resample

no_diabetes_downsampled = resample(no_diabetes, 
                                   replace=False,
                                   n_samples=yes_diabetes.shape[0],
                                   random_state=1)

 Step 3

Concatenate the minority class and the new re-sampled majority class.

In [ ]:
# your code goes here
In [ ]:
diabetes_df = pd.concat([no_diabetes_downsampled, yes_diabetes])

 Step 4

Analyze label distribution again to validate that your data is now balanced.

In [ ]:
# your code goes here
In [ ]:
diabetes_df['label'].value_counts().plot(kind='bar', figsize=(14,6))
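
A plot can be complemented with a numeric check. The sketch below repeats steps 1 to 3 on a small made-up DataFrame (the `toy_df` data and the `toy_`/`majority`/`minority` names are assumptions for illustration) and confirms that both classes end up with the same number of rows:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 8 observations of class 0 and 3 of class 1.
toy_df = pd.DataFrame({
    'feature': range(11),
    'label': [0] * 8 + [1] * 3,
})

# Step 1: separate the classes.
majority = toy_df[toy_df['label'] == 0]
minority = toy_df[toy_df['label'] == 1]

# Step 2: down-sample the majority class without replacement.
majority_downsampled = resample(majority,
                                replace=False,
                                n_samples=len(minority),
                                random_state=1)

# Step 3: concatenate both classes again.
balanced_df = pd.concat([majority_downsampled, minority])

# Step 4: both classes should now have the same count.
balanced_counts = balanced_df['label'].value_counts()
print(balanced_counts)
```

Because `replace=False` is used, no majority row can appear twice in the balanced data.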


Modeling with the balanced data

We will keep using a k-nearest neighbors classifier.

Now that the diabetes observations are balanced, let's use them to train our model and check whether it improves.

Create features $X$ and labels $y$

In [ ]:
# your code goes here
X = None
y = None
In [ ]:
X = diabetes_df.drop(['label'], axis=1)
y = diabetes_df['label']

Split the dataset

As we now have fewer observations to work with, we will use a smaller test set containing only 10% of them.

In [ ]:
from sklearn.model_selection import train_test_split

# your code goes here
In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=10)
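
With a test set this small, a purely random split may not keep the 50/50 class balance. One optional variation, not used in this project, is the `stratify` parameter of `train_test_split`. A minimal sketch on made-up data (the `toy_` arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 20 observations per class.
X_toy = np.arange(80).reshape(40, 2)
y_toy = np.array([0] * 20 + [1] * 20)

# stratify=y_toy keeps the class proportions the same in both splits.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy,
    test_size=0.1,
    random_state=10,
    stratify=y_toy)

# Count how many observations of each class landed in the test set.
print(np.bincount(y_test_toy))
```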

Standardize the features

Use the StandardScaler to standardize the features (X_train and X_test) before moving to model creation.

In [ ]:
from sklearn.preprocessing import StandardScaler

# your code goes here
In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
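
Note that the scaler is fitted on the training data only, so no statistics from the test set leak into training. A minimal sketch on made-up data (the random `toy` arrays are assumptions for illustration) shows the effect: the training features end up with mean 0 and standard deviation 1, while the test features are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales.
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=[100, 5], scale=[20, 2], size=(50, 2))
X_test_toy = rng.normal(loc=[100, 5], scale=[20, 2], size=(10, 2))

# Learn the mean and standard deviation from the training data only.
toy_scaler = StandardScaler()
toy_scaler.fit(X_train_toy)

# Apply the same transformation to both sets.
X_train_toy_scaled = toy_scaler.transform(X_train_toy)
X_test_toy_scaled = toy_scaler.transform(X_test_toy)

print(X_train_toy_scaled.mean(axis=0), X_train_toy_scaled.std(axis=0))
print(X_test_toy_scaled.mean(axis=0), X_test_toy_scaled.std(axis=0))
```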

Build and fit a k-nearest neighbors classifier

Use 10 neighbors.

For training use X_train and y_train.

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# your code goes here
In [ ]:
from sklearn.neighbors import KNeighborsClassifier

k = 10
model = KNeighborsClassifier(n_neighbors=k)

model.fit(X_train, y_train)

Evaluating the model

Now use your model to get the predictions for the X_test set:

In [ ]:
# your code goes here
y_pred = None
In [ ]:
y_pred = model.predict(X_test)

Get the score of the model using the X_test and y_test data:

In [ ]:
# your code goes here
In [ ]:
model.score(X_test, y_test)

Get the accuracy of your predictions:

In [ ]:
from sklearn.metrics import accuracy_score

# your code goes here
In [ ]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
accuracy
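
For a classifier, `model.score` returns the mean accuracy, so both cells above should give the same number. A minimal sketch on synthetic data (generated with `make_classification`; the `toy_` names are assumptions for illustration) checks this:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data with 8 features, like the Diabetes dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.1,
                                          random_state=10)

toy_model = KNeighborsClassifier(n_neighbors=10)
toy_model.fit(X_tr, y_tr)

# Mean accuracy computed by the classifier itself.
score_value = toy_model.score(X_te, y_te)

# Accuracy computed from the predictions.
accuracy_value = accuracy_score(y_te, toy_model.predict(X_te))

print(score_value, accuracy_value)
```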


Confusion matrix

Show a confusion matrix to understand the outputs of the model.

In [ ]:
from sklearn.metrics import confusion_matrix

# your code goes here
In [ ]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix
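
In scikit-learn's `confusion_matrix`, rows correspond to the true labels and columns to the predicted labels, both in sorted label order. A minimal sketch with made-up labels (the `toy_` values and the row/column names are assumptions for illustration) makes the layout explicit:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1]

# Rows are true labels, columns are predicted labels.
toy_matrix = confusion_matrix(toy_true, toy_pred, labels=[0, 1])

# Wrap the matrix in a DataFrame to name rows and columns.
toy_labelled = pd.DataFrame(
    toy_matrix,
    index=['true 0 (no diabetes)', 'true 1 (diabetes)'],
    columns=['predicted 0', 'predicted 1'])

print(toy_labelled)
```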

Separate the values above into tp, fn, fp and tn, treating "No diabetes" (label 0) as the positive class.

In [ ]:
# your code goes here
In [ ]:
tp, fn, fp, tn = conf_matrix.ravel()

Go ahead and manually calculate the precision and recall for the "No diabetes" class.

Precision

In [ ]:
# your code goes here
In [ ]:
no_diabetes_precision = tp / (tp + fp)
no_diabetes_precision

Recall

In [ ]:
# your code goes here
In [ ]:
no_diabetes_recall = tp / (tp + fn)
no_diabetes_recall
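
The manual formulas can be cross-checked against scikit-learn's `precision_score` and `recall_score` by passing `pos_label=0`, so that "No diabetes" is treated as the positive class. A minimal sketch with made-up labels (the `toy_` values are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels.
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1])

# With label 0 as the positive class, ravel() yields tp, fn, fp, tn.
tp0, fn0, fp0, tn0 = confusion_matrix(toy_true, toy_pred).ravel()

# Manual precision and recall for class 0.
manual_precision = tp0 / (tp0 + fp0)
manual_recall = tp0 / (tp0 + fn0)

# The same metrics computed by scikit-learn for class 0.
sk_precision = precision_score(toy_true, toy_pred, pos_label=0)
sk_recall = recall_score(toy_true, toy_pred, pos_label=0)

print(manual_precision, sk_precision)
print(manual_recall, sk_recall)
```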

Finally, call the classification_report function and check that its precision and recall values match the ones you calculated.

In [ ]:
from sklearn.metrics import classification_report

# your code goes here
model_report = None
In [ ]:
from sklearn.metrics import classification_report

model_report = classification_report(y_test, y_pred)

print('Model report: \n', model_report)

Compare the results of this project with those of the Diabetes analysis project.

