Diabetes analysis¶
Now we will put into practice what we learned in the previous lessons.
Our final goal is to create a model that predicts whether a person has diabetes, based on information about the patient such as blood pressure, body mass index (BMI), age, etc.
We will use the Diabetes dataset, which has 8 numeric features plus a 0-1 class label:
- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)
- Class variable (0:No or 1:Yes)
# your code goes here
%matplotlib inline
# Solution
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load the diabetes.csv dataset and store it in diabetes_df.¶
# your code goes here
diabetes_df = None
diabetes_df = pd.read_csv('diabetes.csv', sep=';')
diabetes_df.head()
Show the shape of the resulting diabetes_df.¶
# your code goes here
diabetes_df.shape
Data exploration, visualization and relationships¶
Let's first see some descriptive statistics of the data:
# your code goes here
diabetes_df.describe()
Show information about the data types, columns, null value counts, memory usage, etc.
# your code goes here
## solution
diabetes_df.info(verbose=True)
Show the count of zeros per column
# your code goes here
## solutions
print(diabetes_df.isin([0]).sum())
Do you see something wrong?
Remove wrong observations¶
Remove patients with bloodPress, glucose or massIndex equal to 0.
# your code goes here
diabetes_df = diabetes_df.loc[diabetes_df['bloodPress'] > 0, :]
diabetes_df = diabetes_df.loc[diabetes_df['glucose'] > 0, :]
diabetes_df = diabetes_df.loc[diabetes_df['massIndex'] > 0, :]
diabetes_df.shape
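The three filters above can also be written as a single boolean mask. The following is a minimal sketch on a small made-up table: the values in toy_df are invented for illustration, and only the column names come from this notebook.

```python
import pandas as pd

# Toy data invented for illustration; not taken from diabetes.csv.
toy_df = pd.DataFrame({
    'glucose':    [148, 0, 183, 89, 137],
    'bloodPress': [72, 66, 0, 66, 40],
    'massIndex':  [33.6, 26.6, 23.3, 0.0, 43.1],
})

cols = ['bloodPress', 'glucose', 'massIndex']

# Keep only the rows where every listed column is strictly positive.
mask = (toy_df[cols] > 0).all(axis=1)
clean_df = toy_df.loc[mask]

print(clean_df.shape)
```

Building the mask once makes it easy to count how many rows will be dropped (for example with (~mask).sum()) before actually filtering.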
Let's see again the descriptive statistics of the data:
# your code goes here
# solution
diabetes_df.describe()
Plot a scatter_matrix showing age, glucose and massIndex relationships¶
You can also color each point by its label value, indicating whether the patient has diabetes or not.
# your code goes here
from pandas.plotting import scatter_matrix
ax = scatter_matrix(diabetes_df[['age', 'glucose', 'massIndex']],
                    c=diabetes_df['label'],
                    cmap=plt.cm.Spectral,
                    figsize=(12, 12))
plt.legend([plt.plot([], [], color=plt.get_cmap('Spectral')(i / 1.),
                     ls='', marker='o', markersize=10)[0] for i in range(2)],
           ['No diabetes', 'Yes diabetes'],
           loc=(1.03, 2.84))
Can you draw any insights from the plot?
Let's see the correlation matrix for all the variables¶
Which variables are positively and negatively correlated? Analyze the correlation matrix.
# your code goes here
corr_metrics = diabetes_df.corr()
corr_metrics.style.background_gradient(cmap="bwr")
The age and numPregnant variables are highly correlated, which makes sense: older patients have had more time to become pregnant.
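To read a correlation matrix more systematically, it can help to list each pair of variables once, sorted by the strength of the correlation. The sketch below does this on synthetic data: the numbers are generated for illustration only (numPregnant is deliberately built to follow age), so the printed values say nothing about the real dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data for illustration only; not the real diabetes measurements.
rng = np.random.default_rng(0)
age = rng.uniform(21, 70, size=200)
toy_df = pd.DataFrame({
    'age': age,
    'numPregnant': age / 10 + rng.normal(0, 1, size=200),  # built to follow age
    'glucose': rng.normal(120, 30, size=200),               # independent of age
})

corr = toy_df.corr()

# Collect each unordered pair once and sort by absolute correlation.
pairs = sorted(
    ((a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for a, b, r in pairs:
    print(f'{a} vs {b}: {r:+.2f}')
```

The sign of each value tells you whether the pair is positively or negatively correlated, and the sorted order puts the strongest relationships first.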
Create features $X$ and labels $y$¶
# your code goes here
X = None
y = None
X = diabetes_df.drop(['label'], axis=1)
y = diabetes_df['label']
from sklearn.model_selection import train_test_split
# your code goes here
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=10)
Standardize the features¶
Use the StandardScaler to standardize the features (X_train and X_test) before moving to model creation.
from sklearn.preprocessing import StandardScaler
# your code goes here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
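Note that the scaler is fitted on the training set only and then applied to both sets, so no information from the test set leaks into the preprocessing. As a rough sketch of what the transformation does, the example below reproduces it by hand on synthetic matrices (toy_train and toy_test are invented for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrices for illustration only.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=[120.0, 30.0], scale=[30.0, 6.0], size=(100, 2))
toy_test = rng.normal(loc=[120.0, 30.0], scale=[30.0, 6.0], size=(20, 2))

toy_scaler = StandardScaler().fit(toy_train)

# The same transformation by hand, using training statistics only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population standard deviation (ddof=0)
manual_test = (toy_test - mean) / std

# True if the manual result matches StandardScaler's output.
print(np.allclose(manual_test, toy_scaler.transform(toy_test)))
```

After scaling, each training column has mean 0 and standard deviation 1; the test columns will only be approximately so, because they are scaled with the training statistics.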
Build and fit a k-nearest neighbors classifier¶
Use 4 neighbors. For training, use X_train and y_train.
from sklearn.neighbors import KNeighborsClassifier
# your code goes here
from sklearn.neighbors import KNeighborsClassifier
k = 4
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train, y_train)
# your code goes here
y_pred = None
y_pred = model.predict(X_test)
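To get an intuition for what the classifier does at prediction time, here is a simplified sketch of k-nearest neighbors written with NumPy: for each query point it finds the k closest training points (Euclidean distance) and takes a majority vote over their labels. The helper knn_predict and the toy points are hypothetical, written only for this illustration; it is not how scikit-learn implements the algorithm internally.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points invented for illustration.
toy_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.2],
                  [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.3, 6.0]])
toy_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
queries = np.array([[0.5, 0.4], [5.2, 5.5]])


def knn_predict(X_fit, y_fit, X_query, k):
    """Predict by majority vote among the k nearest training points."""
    predictions = []
    for point in X_query:
        distances = np.linalg.norm(X_fit - point, axis=1)  # Euclidean
        nearest = np.argsort(distances)[:k]                # indices of k closest
        votes = np.bincount(y_fit[nearest])                # count labels
        predictions.append(votes.argmax())
    return np.array(predictions)


manual_pred = knn_predict(toy_X, toy_y, queries, k=4)
sklearn_pred = KNeighborsClassifier(n_neighbors=4).fit(toy_X, toy_y).predict(queries)

print(manual_pred, sklearn_pred)
```

With an even k such as 4 and two classes, the vote can end in a 2-2 tie; this sketch resolves it in favor of the lower label because argmax returns the first maximum. An odd k avoids such ties in binary problems.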
Get the accuracy of your predictions:
from sklearn.metrics import accuracy_score
# your code goes here
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
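Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (toy_true and toy_pred are invented for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels invented for illustration.
toy_true = np.array([0, 1, 1, 0, 1])
toy_pred = np.array([0, 1, 0, 0, 1])

# 4 of the 5 predictions match the true label.
manual_accuracy = np.mean(toy_true == toy_pred)

print(manual_accuracy, accuracy_score(toy_true, toy_pred))
```

In the notebook, the accuracy variable is only assigned, so print it (or evaluate it on the last line of a cell) to see the value.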
Finally, create a full model report using the classification_report function:
from sklearn.metrics import classification_report
# your code goes here
model_report = None
from sklearn.metrics import classification_report
model_report = classification_report(y_test, y_pred)
print('Model report: \n', model_report)
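The report lists precision, recall and F1-score per class. As a sketch of where those numbers come from, the example below derives them for the positive class from a confusion matrix, using made-up labels (toy_true and toy_pred are invented for illustration).

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels invented for illustration.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(toy_true, toy_pred))
print(recall, recall_score(toy_true, toy_pred))
print(f1, f1_score(toy_true, toy_pred))
```

For a medical screening task like this one, recall on the positive class is often worth checking alongside accuracy, since a missed diabetic patient counts as a false negative.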
What is your model's accuracy?