Balancing diabetes observations¶
We will continue using the Diabetes dataset, which has 8 numeric features plus a 0-1 class label.
We'll check whether the data is balanced before training our model, and then examine the kinds of errors the model makes.
Hands on!¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load the diabetes_2.csv dataset and store it in diabetes_df.¶
Invalid observations have already been removed from this file.
# your code goes here
diabetes_df = None
diabetes_df = pd.read_csv('diabetes_2.csv')
diabetes_df.head()
Show the shape of the resulting diabetes_df.¶
# your code goes here
diabetes_df.shape
Analyze label distribution¶
Are observations well balanced?
How many observations are there for 0 (no diabetes) and 1 (diabetes)?
# your code goes here
diabetes_df['label'].value_counts()
Show a barplot displaying these values:
# your code goes here
diabetes_df['label'].value_counts().plot(kind='bar', figsize=(14,6))
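Class counts are often easier to judge as proportions. The sketch below uses a small made-up Series (toy_labels is invented for illustration and is not the diabetes data) to show how value_counts(normalize=True) reports the share of each class.

```python
import pandas as pd

# Toy labels, invented for illustration only (not the diabetes data).
toy_labels = pd.Series([0, 0, 0, 1], name='label')

# Absolute number of rows per class.
counts = toy_labels.value_counts()

# Share of rows per class; the shares sum to 1.
proportions = toy_labels.value_counts(normalize=True)

print(counts)
print(proportions)
```

On the real data, the same call on diabetes_df['label'] would give the class shares directly.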
# your code goes here
no_diabetes = diabetes_df[diabetes_df['label'] == 0]
yes_diabetes = diabetes_df[diabetes_df['label'] == 1]
no_diabetes.shape, yes_diabetes.shape
Step 2¶
Resample the majority class (no diabetes) without replacement to match the number of samples of the minority class.
from sklearn.utils import resample
# your code goes here
no_diabetes_downsampled = resample(no_diabetes,
                                   replace=False,
                                   n_samples=yes_diabetes.shape[0],
                                   random_state=1)
Step 3¶
Concatenate the minority class and the new re-sampled majority class.
# your code goes here
diabetes_df = pd.concat([no_diabetes_downsampled, yes_diabetes])
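Steps 2 and 3 can be sketched end to end on a tiny frame. The data and the names toy_df, majority, minority, and balanced are made up for illustration; only resample and pd.concat come from the exercise itself.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame, invented for illustration: 6 majority rows (0) and 2 minority rows (1).
toy_df = pd.DataFrame({'feature': range(8),
                       'label': [0, 0, 0, 0, 0, 0, 1, 1]})

majority = toy_df[toy_df['label'] == 0]
minority = toy_df[toy_df['label'] == 1]

# Sample without replacement so that no majority row appears twice.
majority_down = resample(majority,
                         replace=False,
                         n_samples=len(minority),
                         random_state=1)

# Stack the downsampled majority rows on top of the untouched minority rows.
balanced = pd.concat([majority_down, minority])
print(balanced['label'].value_counts())
```

Because replace=False, the downsampled rows are a subset of the original majority rows, so some majority observations are simply discarded.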
Step 4¶
Analyze label distribution again to validate that your data is now balanced.
# your code goes here
diabetes_df['label'].value_counts().plot(kind='bar', figsize=(14,6))
Modeling with the balanced data¶
We will keep using a k-nearest neighbors classifier.
Now that the diabetes observations are balanced, let's train our model on them and check whether it improves.
Create features $X$ and labels $y$¶
# your code goes here
X = None
y = None
X = diabetes_df.drop(['label'], axis=1)
y = diabetes_df['label']
Split the dataset¶
Because we now have fewer observations, we will use a smaller test set containing only 10% of them.
from sklearn.model_selection import train_test_split
# your code goes here
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=10)
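With a test set this small, a purely random split can leave the two classes unevenly represented in the test rows. One optional variation, not part of the original exercise, is to pass stratify so that both splits keep the class proportions. The sketch below uses made-up arrays (X_toy, y_toy) to illustrate the idea.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy balanced data, invented for illustration: 10 rows per class.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# stratify=y_toy asks the split to preserve the class proportions
# in both the training and the test portion.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.1,
                                          stratify=y_toy,
                                          random_state=10)

# Count the rows per class in each split.
print(np.bincount(y_tr), np.bincount(y_te))
```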
Standardize the features¶
Use the StandardScaler to standardize the features (X_train and X_test) before moving to model creation.
from sklearn.preprocessing import StandardScaler
# your code goes here
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
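The scaler is fitted on the training data only and then applied to both sets, so the test rows are scaled with the training mean and standard deviation. A minimal sketch with made-up numbers (toy_train, toy_test, and toy_scaler are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)  # learns the per-column mean and std from the training rows only

toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# Training columns end up with mean 0 and standard deviation 1.
print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
# Test values are expressed in training-set units and need not be centered.
print(toy_test_scaled)
```

This matters for k-nearest neighbors in particular, because the distance computation would otherwise be dominated by features with large numeric ranges.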
Build and fit a k-nearest neighbors classifier¶
Use 10 neighbors.
For training use X_train and y_train.
from sklearn.neighbors import KNeighborsClassifier
# your code goes here
k = 10
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train, y_train)
Evaluating the model¶
Now use your model to get the predictions for the X_test set:
# your code goes here
y_pred = None
y_pred = model.predict(X_test)
Get the score of the model using the X_test and y_test data:
# your code goes here
model.score(X_test, y_test)
Get the accuracy of your predictions:
from sklearn.metrics import accuracy_score
# your code goes here
accuracy = accuracy_score(y_test, y_pred)
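For scikit-learn classifiers, score returns the mean accuracy on the given data, so it should agree with accuracy_score computed from the predictions. A small sketch on made-up one-dimensional data (all names and values below are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Toy training data, invented for illustration: two well-separated groups.
X_fit = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_fit = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_fit, y_fit)

# Toy evaluation data.
X_eval = np.array([[0.05], [1.15], [0.15], [1.05]])
y_eval = np.array([0, 1, 1, 1])

# Both lines compute the fraction of correctly classified rows.
score_value = knn.score(X_eval, y_eval)
accuracy_value = accuracy_score(y_eval, knn.predict(X_eval))

print(score_value, accuracy_value)
```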
from sklearn.metrics import confusion_matrix
# your code goes here
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix
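In the matrix returned by confusion_matrix, rows are the actual classes and columns are the predicted classes, both in sorted label order. A small sketch with made-up labels (y_true_toy and y_pred_toy are invented) makes the layout concrete:

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration.
y_true_toy = [0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 0]

# Row i holds the observations whose actual class is i;
# column j holds the observations predicted as class j.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
print(cm)

# cm[0, 0]: actual 0, predicted 0    cm[0, 1]: actual 0, predicted 1
# cm[1, 0]: actual 1, predicted 0    cm[1, 1]: actual 1, predicted 1
```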
Separate the values above into tp, fn, fp and tn.
# your code goes here
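# Class 0 ("No diabetes") is treated as the positive class here, so the
# row-major order of the matrix is tp, fn, fp, tn.
tp, fn, fp, tn = conf_matrix.ravel()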
Go ahead and manually calculate the precision and recall for the "No diabetes" class.
Precision¶
# your code goes here
no_diabetes_precision = tp / (tp + fp)
no_diabetes_precision
Recall¶
# your code goes here
no_diabetes_recall = tp / (tp + fn)
no_diabetes_recall
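The manual formulas can be cross-checked against scikit-learn by telling precision_score and recall_score that class 0 is the positive class (pos_label=0). The labels below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels, invented for illustration.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 0, 0, 1, 1]

# With class 0 as the positive class, the row-major order is tp, fn, fp, tn.
tp_toy, fn_toy, fp_toy, tn_toy = confusion_matrix(y_true_toy, y_pred_toy,
                                                  labels=[0, 1]).ravel()

manual_precision = tp_toy / (tp_toy + fp_toy)  # correct "0" predictions / all "0" predictions
manual_recall = tp_toy / (tp_toy + fn_toy)     # correct "0" predictions / all actual "0" rows

# The same metrics from scikit-learn, with class 0 as the positive label.
sk_precision = precision_score(y_true_toy, y_pred_toy, pos_label=0)
sk_recall = recall_score(y_true_toy, y_pred_toy, pos_label=0)

print(manual_precision, sk_precision)
print(manual_recall, sk_recall)
```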
Finally, call the classification_report function and validate the precision and recall values of your model.
from sklearn.metrics import classification_report
# your code goes here
model_report = None
model_report = classification_report(y_test, y_pred)
print('Model report: \n', model_report)
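To compare the report with the manual values in code rather than by eye, classification_report can return a dictionary when called with output_dict=True. The sketch below uses made-up labels, and the per-class entries are keyed by the label converted to a string:

```python
from sklearn.metrics import classification_report

# Toy labels, invented for illustration.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 0, 0, 1, 1]

# output_dict=True returns nested dictionaries instead of a formatted string.
report_dict = classification_report(y_true_toy, y_pred_toy, output_dict=True)

# Metrics for class 0 ("No diabetes" in the exercise).
class0_precision = report_dict['0']['precision']
class0_recall = report_dict['0']['recall']

print(class0_precision, class0_recall)
```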
Compare the results of this project with those of the Diabetes analysis project.