
MLF - Diabetes Analysis

Last updated: July 11th, 2020

rmotr


Diabetes analysis

Now we will put into practice what we learned in the previous lessons.

Our final goal is to build a model that predicts whether a person has diabetes, based on information about the patient such as blood pressure, body mass index (BMI), and age.

We will use the Diabetes dataset, which has 8 numeric features plus a 0-1 class label:

  1. Number of times pregnant
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skin fold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)^2)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (0: No, 1: Yes)


Hands on!

Import libraries

In [ ]:
# your code goes here

%matplotlib inline
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline


Load the data/diabetes.csv file and store it in a diabetes_df DataFrame.

In [ ]:
# your code goes here

diabetes_df = None
In [ ]:
diabetes_df = pd.read_csv('data/diabetes.csv', sep=';')

diabetes_df.head()

Show the shape of the resulting diabetes_df.

In [ ]:
# your code goes here
In [ ]:
diabetes_df.shape


Data exploration, visualization and relationships

Let's first see some descriptive statistics of the data:

In [ ]:
# your code goes here
In [ ]:
diabetes_df.describe()

Show information about the data types, columns, null value counts, memory usage, etc.

In [ ]:
# your code goes here
In [ ]:
diabetes_df.info(verbose=True)

Show the count of zeros per column

In [ ]:
# your code goes here
In [ ]:
diabetes_df.isin([0]).sum()

Do you see something wrong?
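
Zeros in columns like glucose, bloodPress, or massIndex are not physically plausible, so they almost certainly encode missing measurements. A minimal optional sketch to quantify this, using the column names that appear later in this notebook:

In [ ]:
# Share of zero values (in %) in columns where zero is not a plausible measurement.
suspect_cols = ['glucose', 'bloodPress', 'massIndex']

(diabetes_df[suspect_cols] == 0).mean() * 100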


Remove invalid observations

Remove patients with bloodPress, glucose or massIndex equal to 0.

In [ ]:
# your code goes here
In [ ]:
diabetes_df = diabetes_df.loc[diabetes_df['bloodPress'] > 0, :]
diabetes_df = diabetes_df.loc[diabetes_df['glucose'] > 0, :]
diabetes_df = diabetes_df.loc[diabetes_df['massIndex'] > 0, :]

diabetes_df.shape
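
As a side note, the same filter can be written as a single boolean mask over the three columns; this is just an equivalent, more compact variant of the cell above:

In [ ]:
# Equivalent one-step filter: keep rows where all three measurements are positive.
cols = ['bloodPress', 'glucose', 'massIndex']
diabetes_df = diabetes_df[(diabetes_df[cols] > 0).all(axis=1)]

diabetes_df.shape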

Let's look again at the descriptive statistics of the data:

In [ ]:
# your code goes here
In [ ]:
diabetes_df.describe()


Plot a scatter_matrix showing age, glucose and massIndex relationships

You can also color each point by its label value, indicating whether the patient has diabetes or not.

In [ ]:
# your code goes here
In [ ]:
from pandas.plotting import scatter_matrix

ax = scatter_matrix(diabetes_df[['age', 'glucose', 'massIndex']],
                    c=diabetes_df['label'],
                    cmap=plt.cm.Spectral,
                    figsize=(12,12))

# scatter_matrix does not draw a legend, so build proxy markers
# (one per class color in the 'Spectral' colormap) and label them manually.
plt.legend([plt.plot([], [], color=plt.get_cmap('Spectral')(i / 1.),
                     ls='', marker='o', markersize=10)[0] for i in range(2)],
           ['No diabetes', 'Yes diabetes'],
           loc=(1.03, 2.84))

Can you see any insights?


Let's see the correlation matrix for all the variables

Which variables are positively and negatively correlated? Make an analysis of the correlation matrix.

In [ ]:
# your code goes here
In [ ]:
corr_metrics = diabetes_df.corr()

corr_metrics.style.background_gradient(cmap="bwr")

The age and numPregnant variables have a high correlation, which makes sense.
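
To make that analysis easier, you can also sort the correlations of every feature against the label column; a small optional sketch (it reuses corr_metrics from the cell above):

In [ ]:
# Correlation of each feature with the target, ordered by absolute strength.
label_corr = corr_metrics['label'].drop('label')
label_corr.reindex(label_corr.abs().sort_values(ascending=False).index)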


Create features $X$ and labels $y$

In [ ]:
# your code goes here
X = None
y = None
In [ ]:
X = diabetes_df.drop(['label'], axis=1)
y = diabetes_df['label']


Split the dataset

The test set should have 20% of the observations.

In [ ]:
from sklearn.model_selection import train_test_split

# your code goes here
In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=10)
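
Optionally, if the classes are imbalanced (there are usually fewer diabetic than non-diabetic patients in this dataset), passing stratify=y keeps the same class proportions in both splits. This variant is not required by the exercise:

In [ ]:
# Optional: stratified split preserves the 0/1 class ratio in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=10,
                                                    stratify=y)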


Standardize the features

Use the StandardScaler to standardize the features (X_train and X_test) before moving to model creation.

In [ ]:
from sklearn.preprocessing import StandardScaler

# your code goes here
In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
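
A quick sanity check: after standardization each training feature should have mean ≈ 0 and standard deviation ≈ 1 (the test set will only be approximately so, since it was scaled with the training statistics):

In [ ]:
# Training features should now have mean ~0 and standard deviation ~1.
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))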


Build and fit a k-nearest neighbors classifier

Use 4 neighbors.

For training use X_train and y_train.

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# your code goes here
In [ ]:
from sklearn.neighbors import KNeighborsClassifier

k = 4
model = KNeighborsClassifier(n_neighbors=k)

model.fit(X_train, y_train)
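
The exercise fixes k = 4. If you want to check how sensitive the result is to that choice, a small cross-validated sweep over n_neighbors is one way to do it; this sketch is optional and not part of the original solution:

In [ ]:
from sklearn.model_selection import GridSearchCV

# Cross-validate a range of k values on the (already standardized) training set.
param_grid = {'n_neighbors': list(range(1, 16))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

grid.best_params_, grid.best_score_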


Evaluating the model

Now use your model to get the predictions for the X_test set:

In [ ]:
# your code goes here
y_pred = None
In [ ]:
y_pred = model.predict(X_test)

y_pred

Get the accuracy of your predictions:

In [ ]:
from sklearn.metrics import accuracy_score

# your code goes here
In [ ]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)

accuracy

Finally, create a full model report using the classification_report method:

In [ ]:
from sklearn.metrics import classification_report

# your code goes here
model_report = None
In [ ]:
from sklearn.metrics import classification_report

model_report = classification_report(y_test, y_pred)

print('Model report: \n', model_report)

What is your model's accuracy?
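
If you also want to see where the model errs (false positives vs. false negatives), a confusion matrix complements the report; this extra check is not required by the exercise:

In [ ]:
from sklearn.metrics import confusion_matrix

# Rows are the true classes (0, 1), columns are the predicted classes (0, 1).
confusion_matrix(y_test, y_pred)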

