# MLF - Diabetes Analysis

Last updated: July 11th, 2020

# Diabetes analysis

Now we will put into practice what we learned in the previous lessons.

Our final goal will be creating a model to predict whether a person has diabetes or not, based on information about the patient such as blood pressure, body mass index (BMI), age, etc.

We will use the Diabetes dataset, which has 8 numeric features plus a 0-1 class label:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0:No or 1:Yes)

### Import libraries

In [ ]:
# your code goes here

%matplotlib inline

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline


### Load the data/diabetes.csv file and store it in the diabetes_df dataframe

In [ ]:
# your code goes here

diabetes_df = None

In [ ]:
diabetes_df = pd.read_csv('data/diabetes.csv', sep=';')



Show the shape of the resulting diabetes_df.

In [ ]:
# your code goes here

In [ ]:
diabetes_df.shape


## Data exploration, visualization and relationships

Let's first see some descriptive statistics of the data:

In [ ]:
# your code goes here

In [ ]:
diabetes_df.describe()


Show information about the data types, columns, null value counts, memory usage, etc.

In [ ]:
# your code goes here

In [ ]:
diabetes_df.info(verbose=True)


Show the count of zeros per column

In [ ]:
# your code goes here

In [ ]:
diabetes_df.isin([0]).sum()


Do you see something wrong?

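### Remove wrong observations

A value of 0 is not plausible for blood pressure, glucose or body mass index, so those zeros most likely mark missing measurements. Remove patients with bloodPress, glucose or massIndex equal to 0.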

In [ ]:
# your code goes here

In [ ]:
diabetes_df = diabetes_df.loc[diabetes_df['bloodPress'] > 0, :]
diabetes_df = diabetes_df.loc[diabetes_df['glucose'] > 0, :]
diabetes_df = diabetes_df.loc[diabetes_df['massIndex'] > 0, :]

diabetes_df.shape
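
The three filters above can also be expressed as a single boolean mask. The following is only a minimal sketch: `toy_df` and its values are made up to stand in for `diabetes_df`, and `cols`, `mask` and `clean_df` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data standing in for diabetes_df (values are made up).
toy_df = pd.DataFrame({
    'bloodPress': [72, 0, 66, 64],
    'glucose': [148, 85, 0, 89],
    'massIndex': [33.6, 26.6, 23.3, 0.0],
    'label': [1, 0, 1, 0],
})

# Keep only rows where all three measurements are strictly positive.
cols = ['bloodPress', 'glucose', 'massIndex']
mask = (toy_df[cols] > 0).all(axis=1)
clean_df = toy_df.loc[mask, :]

print(clean_df.shape)  # (1, 4): only the first row has no zeros
```

Applied to the real dataframe, the same mask should keep exactly the rows that survive the three sequential filters.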


Let's see again the descriptive statistics of the data:

In [ ]:
# your code goes here

In [ ]:
diabetes_df.describe()


### Plot a scatter_matrix showing age, glucose and massIndex relationships

You can also color each point by its label value to show whether the patient has diabetes.

In [ ]:
# your code goes here

In [ ]:
from pandas.plotting import scatter_matrix

ax = scatter_matrix(diabetes_df[['age', 'glucose', 'massIndex']],
                    c=diabetes_df['label'],
                    cmap=plt.cm.Spectral,
                    figsize=(12, 12))

# Build legend handles from the two ends of the Spectral colormap (label 0 and label 1)
plt.legend([plt.plot([], [], color=plt.get_cmap('Spectral')(i / 1.),
                     ls='', marker='o', markersize=10)[0] for i in range(2)],
           ['No diabetes', 'Yes diabetes'],
           loc=(1.03, 2.84))


Can you spot any patterns?

### Let's see the correlation matrix for all the variables

Which variables are positively and negatively correlated? Analyze the correlation matrix.

In [ ]:
# your code goes here

In [ ]:
corr_metrics = diabetes_df.corr()

corr_metrics



The age and numPregnant variables are strongly positively correlated, which makes sense: older patients have had more time to have pregnancies.
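
A correlation matrix is often easier to read as a heatmap. The sketch below uses only matplotlib and a small synthetic frame: `toy_df` and its values are made up, with `numPregnant` deliberately built to increase with `age`. It is one possible way to plot the matrix, not the notebook's own solution.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for diabetes_df (values are made up).
rng = np.random.default_rng(0)
age = rng.integers(21, 70, size=50)
toy_df = pd.DataFrame({
    'age': age,
    # Built to grow with age so the heatmap shows a clear positive cell.
    'numPregnant': (age - 21) // 8 + rng.integers(0, 3, size=50),
    'glucose': rng.integers(70, 200, size=50),
})

corr = toy_df.corr()

# Draw the matrix as an image, fixing the color scale to [-1, 1].
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.values, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha='right')
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label='Pearson correlation')
```

With the real data, replacing `toy_df` with `diabetes_df` should give one cell per pair of variables, with warm colors for positive and cool colors for negative correlations.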

### Create features $X$ and labels $y$

In [ ]:
# your code goes here
X = None
y = None

In [ ]:
X = diabetes_df.drop(['label'], axis=1)
y = diabetes_df['label']


### Split the dataset

The test set should have 20% of the observations.

In [ ]:
from sklearn.model_selection import train_test_split


In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=10)
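
To check that a split has the expected sizes, it can help to try it on data whose size you know. The sketch below uses made-up arrays (`X_toy`, `y_toy`) and also passes `stratify`, an optional argument that the solution above does not use, to keep the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, 30% positive labels (made up).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=10,
    stratify=y_toy,  # optional: keeps the class ratio equal in both splits
)

print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)
```

Whether stratification matters for this dataset depends on how imbalanced the labels are, which you can check with `y.value_counts()`.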


### Standardize the features

Use the StandardScaler to standardize the features (X_train and X_test) before moving to model creation.

In [ ]:
from sklearn.preprocessing import StandardScaler


In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
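
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. As a minimal sketch of what that means, the example below reproduces the transformation by hand on made-up arrays (`X_train_toy`, `X_test_toy` and `toy_scaler` are illustrative names).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (values are made up).
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_toy = np.array([[2.5, 25.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(X_train_toy)

# Manual equivalent: subtract the training mean, divide by the training
# standard deviation (population std, ddof=0, which is NumPy's default).
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)
manual_test = (X_test_toy - mean) / std

print(np.allclose(toy_scaler.transform(X_test_toy), manual_test))  # True
```

Because the test row equals the training mean in this toy example, it is mapped to zeros in both columns.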


### Build and fit a k-nearest neighbors classifier

Use 4 neighbors.

For training use X_train and y_train.

In [ ]:
from sklearn.neighbors import KNeighborsClassifier


In [ ]:
from sklearn.neighbors import KNeighborsClassifier

k = 4
model = KNeighborsClassifier(n_neighbors=k)

model.fit(X_train, y_train)
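
To get an intuition for what the classifier does at prediction time, here is a simplified sketch of k-nearest neighbors for a single point: compute the distances to all training points, take the k closest, and return the most common label among them. The function `knn_predict_one` and the toy arrays are hypothetical, and tie handling here is arbitrary and may differ from scikit-learn's (ties are possible with an even k such as 4).

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x, k):
    """Predict the label of one point x by majority vote of its k nearest neighbors."""
    # Euclidean distance from x to every training point.
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups (values are made up).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(X_toy, y_toy, np.array([0.5, 0.5]), k=3))  # 0
print(knn_predict_one(X_toy, y_toy, np.array([5.5, 5.5]), k=3))  # 1
```

This also shows why standardizing matters: the vote depends on distances, so a feature with a large numeric range would otherwise dominate them.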


### Evaluating the model

Now use your model to get the predictions for the X_test set:

In [ ]:
# your code goes here
y_pred = None

In [ ]:
y_pred = model.predict(X_test)

y_pred


Get the accuracy of your predictions:

In [ ]:
from sklearn.metrics import accuracy_score


In [ ]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)

accuracy
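
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on made-up label arrays (`y_true_toy` and `y_pred_toy` are hypothetical) and also prints a confusion matrix, which shows where the errors fall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (values are made up).
y_true_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# Fraction of positions where prediction and truth agree.
manual_accuracy = (y_true_toy == y_pred_toy).mean()
print(manual_accuracy)                          # 0.75
print(accuracy_score(y_true_toy, y_pred_toy))   # 0.75

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
# [[3 1]
#  [1 3]]
```

In this toy example there is one false positive (top right) and one false negative (bottom left).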


Finally, create a full model report using the classification_report function:

In [ ]:
from sklearn.metrics import classification_report


from sklearn.metrics import classification_report