Credit card applications¶
In this project you will create a model to predict whether a credit card application should be approved or not.
To train the model you will use the Credit Card Approval dataset from the UCI Machine Learning Repository.
This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.
Here are the possible values for each variable:
- A1: b, a.
- A2: continuous.
- A3: continuous.
- A4: u, y, l, t.
- A5: g, p, gg.
- A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
- A7: v, h, bb, j, n, z, dd, ff, o.
- A8: continuous.
- A9: t, f.
- A10: t, f.
- A11: continuous.
- A12: t, f.
- A13: g, p, s.
- A14: continuous.
- A15: continuous.
- A16: +,- (class attribute)
Hands on!¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load the credit_approval.csv dataset, and store it into applications_df.¶
This file already has wrong observations removed, and it is balanced.
# your code goes here
applications_df = None
applications_df = pd.read_csv('credit_approval.csv', header=None)
applications_df.head()
According to this blog the probable feature names could be Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and ApprovalStatus.
cols = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel',
        'Ethnicity', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore',
        'DriversLicense', 'Citizen', 'ZipCode', 'Income', 'ApprovalStatus']
applications_df.columns = cols
Drop unused columns¶
The DriversLicense and ZipCode columns are not as important as the other features for our goal of predicting whether to approve an application or not.
Let's remove them.
# your code goes here
applications_df.drop(['DriversLicense', 'ZipCode'], axis=1, inplace=True)
Show the shape of the resulting applications_df.¶
# your code goes here
applications_df.shape
Data exploration¶
Let's first see a quick summary of the DataFrame and some descriptive statistics of the data.
# your code goes here
print(applications_df.info())
applications_df.describe()
The dataset contains both numeric and non-numeric data.
# your code goes here
applications_df.isna().sum()
Detecting incorrect values¶
Although isna() reports no missing values, there are probably incorrect values.
Let's check the unique values per column:
# your code goes here
for col in applications_df.columns:
    print(applications_df[col].unique())
Labeled missing values¶
There are many missing values labeled with a '?' character.
Let's replace these question marks with NaN values.
# your code goes here
# np.nan is the supported spelling (the np.NaN alias was removed in NumPy 2.0)
applications_df.replace('?', np.nan, inplace=True)
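As an alternative sketch, the '?' markers could also be converted while the file is being read, using the na_values parameter of pd.read_csv. The tiny inline CSV below is made up for illustration and is not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like the real file (not actual data).
sample_csv = "b,30.83,0,u\n?,58.67,4.46,u\na,?,0.5,y\n"

# na_values tells pandas to treat '?' as missing while parsing.
sample_df = pd.read_csv(io.StringIO(sample_csv), header=None, na_values='?')

print(sample_df.isna().sum())  # missing values per column
print(sample_df.dtypes)        # column 1 is parsed as float directly
```

With this approach a column such as Age is already numeric after loading, so a separate type conversion may not be needed.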
Wrong column type¶
The Age column should be of type float; fix it.
# your code goes here
applications_df = applications_df.astype({'Age': 'float'})
Handling missing values¶
If we simply remove the rows with missing values, our machine learning model may miss out on information that would be useful for its training. On the other hand, many models cannot handle missing values on their own.
So, to avoid this problem, we are going to impute the missing values with a mean imputation strategy.
# your code goes here
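# numeric_only=True restricts the means to numeric columns (required in pandas 2.x)
applications_df.fillna(applications_df.mean(numeric_only=True), inplace=True)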
But this mean imputation strategy only works on numeric data. So... what about the non-numeric columns?
We are going to impute these non-numeric columns with the most frequent values as present in the respective columns.
# your code goes here
for col in applications_df.columns:
    if applications_df[col].dtypes == 'object':
        # fill each column with its own most frequent value
        applications_df[col] = applications_df[col].fillna(
            applications_df[col].value_counts().index[0])
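The two imputation strategies above can be sketched together on a small frame. The toy_df data below is invented for illustration; the point is that each column is filled with a statistic computed from that same column only.

```python
import numpy as np
import pandas as pd

# Made-up toy frame with one numeric and one categorical column.
toy_df = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Married': ['u', 'u', np.nan],
})

for col in toy_df.columns:
    if pd.api.types.is_numeric_dtype(toy_df[col]):
        # numeric column: impute with the column mean
        toy_df[col] = toy_df[col].fillna(toy_df[col].mean())
    else:
        # non-numeric column: impute with the most frequent value
        toy_df[col] = toy_df[col].fillna(toy_df[col].mode()[0])

print(toy_df)
```

Assigning the result back to toy_df[col] keeps the fill value of one column from leaking into the others.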
Finally, verify the number of NaNs again.
# your code goes here
applications_df.isna().sum()
Numeric variables analysis¶
Let's plot histograms for each numeric variable.
First define a plot_hist function that receives a column name as a parameter and plots a histogram of that column:
def plot_hist(col):
    # your code goes here
    pass
def plot_hist(col):
    applications_df.loc[:, col].plot(kind='hist', title=col)
    plt.show()
Now use the function above to show a histogram for each numeric column.
numeric_cols = ['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income']
# your code goes here
numeric_cols = ['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income']
for col in numeric_cols:
    plot_hist(col)
Now create a scatter matrix to see if there is any important relationship.
# your code goes here
from pandas.plotting import scatter_matrix
ax = scatter_matrix(applications_df[['Age', 'Debt', 'YearsEmployed',
'CreditScore', 'Income']],
figsize=(12,12))
Finally, create a correlation matrix for all the numeric variables.
# your code goes here
# numeric_only=True skips the non-numeric columns (required in pandas 2.x)
corr_metrics = applications_df.corr(numeric_only=True)
corr_metrics.style.background_gradient(cmap="bwr")
These numeric columns are not strongly correlated with each other.
The highest correlation is between Age and YearsEmployed, which makes sense up to a point: older applicants tend to have been employed for longer.
Non-numeric variables analysis¶
Let's plot bar plots for each non-numeric variable.
First define a plot_bar function that receives a column name as a parameter and plots a bar plot of that column:
def plot_bar(col):
    # your code goes here
    pass
def plot_bar(col):
    applications_df.loc[:, col].value_counts().plot(kind='bar', title=col)
    plt.show()
Now use the function above to show a bar plot for each non-numeric column.
non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
'Ethnicity', 'PriorDefault', 'Employed', 'Citizen',
'ApprovalStatus']
# your code goes here
non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
'Ethnicity', 'PriorDefault', 'Employed', 'Citizen',
'ApprovalStatus']
for col in non_numeric_cols:
    plot_bar(col)
Create features $X$ and labels $y$¶
Separate features and labels into different $X$ and $y$ variables.
# your code goes here
X = None
y = None
X = applications_df.drop(['ApprovalStatus'], axis=1)
y = applications_df['ApprovalStatus']
Convert non-numeric data into numeric¶
Let's use OrdinalEncoder to encode categorical features ($X$) into integer values.
from sklearn.preprocessing import OrdinalEncoder
non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
'Ethnicity', 'PriorDefault', 'Employed', 'Citizen']
# your code goes here
from sklearn.preprocessing import OrdinalEncoder
non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
'Ethnicity', 'PriorDefault', 'Employed', 'Citizen']
enc = OrdinalEncoder().fit(X[non_numeric_cols])
new_values = enc.transform(X[non_numeric_cols])
# assign whole columns so they take the numeric dtype of the encoded values
X[non_numeric_cols] = new_values
X.head()
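To see what the encoder does, here is a minimal sketch on a made-up two-column frame (toy_X is not part of the dataset). OrdinalEncoder sorts the categories of each column and replaces each value with its position in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented example values, reusing two of the column names for readability.
toy_X = pd.DataFrame({'Married': ['u', 'y', 'u'],
                      'Employed': ['t', 'f', 't']})

toy_enc = OrdinalEncoder().fit(toy_X)
print(toy_enc.categories_)       # sorted categories learned per column

encoded = toy_enc.transform(toy_X)
print(encoded)                   # one float code per original value

# inverse_transform maps the codes back to the original symbols
decoded = toy_enc.inverse_transform(encoded)
print(decoded)
```

Note that the integer codes imply an order between categories that may not exist in the data; distance-based models such as KNN can be sensitive to that.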
Scale the feature values to a uniform range¶
Let's use StandardScaler to rescale the features so that each one has zero mean and unit variance ($\mu=0$ and $\sigma=1$), where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean. Note that this changes only the location and scale of each feature, not the shape of its distribution.
from sklearn.preprocessing import StandardScaler
# your code goes here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
X = scaler.transform(X)
X
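The transformation applied by StandardScaler is $z = (x - \mu) / \sigma$, computed per column. The sketch below checks this on a small invented array (toy is not dataset data) by comparing the scaler output with the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented 3x2 array whose columns have very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

scaled = StandardScaler().fit_transform(toy)

# Same computation by hand: subtract the column mean, divide by the column std.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```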
Target variable analysis¶
The ApprovalStatus column is our target variable (label). It has two possible values:
y.values[0:100]
plot_bar('ApprovalStatus')
Let's use LabelEncoder to normalize its values so that they contain only the values 0 and 1.
from sklearn.preprocessing import LabelEncoder
# your code goes here
from sklearn.preprocessing import LabelEncoder
label_enc = LabelEncoder().fit(y)
y = label_enc.transform(y)
y[0:100]
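A short sketch of how LabelEncoder assigns the codes, using an invented list of labels (toy_y). The classes are sorted first, so '+' comes before '-' and receives code 0.

```python
from sklearn.preprocessing import LabelEncoder

# Invented labels using the same two symbols as the class attribute.
toy_y = ['+', '-', '-', '+']

toy_label_enc = LabelEncoder().fit(toy_y)
print(toy_label_enc.classes_)   # sorted classes; the index is the code

codes = toy_label_enc.transform(toy_y)
print(codes)
```

The classes_ attribute is the place to look whenever it is unclear which original symbol a code stands for.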
Modeling¶
Create a get_cv_scores function that receives a model parameter with a scikit-learn model and returns the CV scores of that model.
You should use a StratifiedKFold cross-validator with 5 splits and a random_state seed to always get the same partitions (the seed only takes effect with shuffle=True).
5 scores should be returned.
from sklearn.model_selection import StratifiedKFold, cross_val_score
def get_cv_scores(model):
    # your code goes here
    pass
from sklearn.model_selection import StratifiedKFold, cross_val_score
def get_cv_scores(model):
    # shuffle=True is required for random_state to be accepted
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
    return cross_val_score(model, X, y, cv=cv)
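The following sketch illustrates what the cross-validator does, on invented balanced data (toy_X, toy_y). It assumes shuffle=True, since recent scikit-learn versions reject a random_state when shuffling is off.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Invented data: 20 samples, 10 per class.
toy_X = np.arange(20).reshape(-1, 1)
toy_y = np.array([0] * 10 + [1] * 10)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)

# Count how many samples of each class land in every test fold.
fold_counts = []
for train_idx, test_idx in cv.split(toy_X, toy_y):
    fold_counts.append(np.bincount(toy_y[test_idx]).tolist())
print(fold_counts)

# With an integer seed, splitting twice yields the same partitions.
first = [test.tolist() for _, test in cv.split(toy_X, toy_y)]
second = [test.tolist() for _, test in cv.split(toy_X, toy_y)]
print(first == second)
```

Every test fold keeps the class proportions of the full data, which is the reason to prefer a stratified splitter for classification.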
Spot-check algorithms¶
Create each of the following models and call the get_cv_scores function using each model to get its CV scores.
Save the resulting scores in the results_df to compare them at the end.
results_df = pd.DataFrame()
K Nearest Neighbors¶
from sklearn.neighbors import KNeighborsClassifier
# your code goes here
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
results_df['KNN'] = get_cv_scores(model)
Support Vector Machines¶
from sklearn import svm
# your code goes here
from sklearn import svm
model = svm.SVC(gamma='auto',
random_state=10)
results_df['SVM'] = get_cv_scores(model)
Naive Bayes Classifier¶
from sklearn.naive_bayes import GaussianNB
# your code goes here
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
results_df['Naive Bayes'] = get_cv_scores(model)
Gradient Boost Classifier¶
from sklearn.ensemble import GradientBoostingClassifier
# your code goes here
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=10)
results_df['GBC'] = get_cv_scores(model)
AdaBoost Classifier (Adaptive Boosting)¶
from sklearn.ensemble import AdaBoostClassifier
# your code goes here
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(random_state=10)
results_df['AdaBoost'] = get_cv_scores(model)
Presenting results and evaluating performance¶
Show a boxplot per algorithm using the data you saved in results_df.
Which one performs the best? And the worst?
# your code goes here
results_df.boxplot(figsize=(14,6), grid=False)
Let's see if we can do better. We can select the best model and perform a grid search of the model parameters to improve the model's ability to predict credit card approvals.
Finding the best performing model¶
Train several KNeighborsClassifier models with different k values and calculate the accuracy of these models.
Keep using a KNeighborsClassifier estimator and a StratifiedKFold cross-validator with 5 splits.
Test the following k values:
# your code goes here
def get_kneighbors_score(k):
    None
    return None
ACC_dev = []
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]
for k in parameters:
    None
def get_kneighbors_score(k):
    model = KNeighborsClassifier(n_neighbors=k)
    # an integer cv uses StratifiedKFold for classifiers
    scores = cross_val_score(model, X, y, cv=5)
    return scores.mean()
ACC_dev = []
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]
for k in parameters:
    scores = get_kneighbors_score(k)
    ACC_dev.append(scores)
Getting the best parameters¶
# your code goes here
# This is one possible solution
ACC_dev=pd.DataFrame(ACC_dev)
ACC_dev.rename(columns={0: 'Accuracy'}, inplace=True)
ACC_dev['parameters']=parameters
ACC_dev.loc[ACC_dev['Accuracy']==ACC_dev['Accuracy'].max()]
# your code goes here
model = None
model = KNeighborsClassifier(n_neighbors=8)
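The manual loop over k values could also be expressed with scikit-learn's GridSearchCV. The sketch below runs on synthetic data from make_classification (demo_X and demo_y are stand-ins, not the credit data), so the best k it reports says nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, used only to keep the example self-contained.
demo_X, demo_y = make_classification(n_samples=200, n_features=5,
                                     random_state=10)

param_grid = {'n_neighbors': [1, 3, 5, 8, 10, 12, 15]}

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=10),
    scoring='accuracy',
)
search.fit(demo_X, demo_y)

print(search.best_params_)  # k with the highest mean CV accuracy
print(search.best_score_)   # that mean CV accuracy
```

After fitting, search.cv_results_ holds the mean score for every candidate, which plays the same role as the ACC_dev table.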
Get model CV predictions¶
Generate cross-validated estimates for each input data point.
Use a StratifiedKFold cross-validator with 5 splits and a random_state seed.
from sklearn.model_selection import cross_val_predict
y_pred = None
from sklearn.model_selection import cross_val_predict
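from sklearn.model_selection import StratifiedKFold
# shuffle=True is required for random_state to be accepted
y_pred = cross_val_predict(model, X, y,
                           cv=StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=10))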
Classification report¶
Show a classification_report using the y_pred predictions.
Remember that our labels were encoded as follows:
| type | code |
|---|---|
| + | 0 |
| - | 1 |
from sklearn.metrics import classification_report
# your code goes here
from sklearn.metrics import classification_report
print(classification_report(y, y_pred))
Confusion matrix¶
Show a confusion_matrix using the y_pred predictions.
from sklearn.metrics import confusion_matrix
# your code goes here
from sklearn.metrics import confusion_matrix
confusion_matrix(y, y_pred, labels=[0, 1])
The first element of the first row of the confusion matrix denotes the true positives, meaning the number of positive instances (approved applications) predicted correctly by the model.
The last element of the second row of the confusion matrix denotes the true negatives, meaning the number of negative instances (denied applications) predicted correctly by the model.
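The layout described above can be checked on a few invented predictions (toy_true and toy_pred are made up). Rows are true labels and columns are predicted labels, in the order given by labels=[0, 1]; the unpacking below assumes code 0 (approved) is treated as the positive class, as in the text.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 0 = approved (+), 1 = denied (-).
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
print(cm)

# Row 0 holds the truly approved cases, row 1 the truly denied ones.
tp, fn, fp, tn = cm.ravel()
print(tp, fn, fp, tn)
```

Be aware that many scikit-learn examples unpack the same matrix as tn, fp, fn, tp because they treat label 1 as the positive class; the numbers are identical, only the naming differs.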