MLF - Credit Card Applications

Last updated: July 22nd, 2020

Credit card applications¶

In this project you will create a model to predict whether a credit card application should be approved.

To train the model you will use the Credit Card Approval dataset from the UCI Machine Learning Repository.

This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.

Here are the possible values for each variable:

• A1: b, a.
• A2: continuous.
• A3: continuous.
• A4: u, y, l, t.
• A5: g, p, gg.
• A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
• A7: v, h, bb, j, n, z, dd, ff, o.
• A8: continuous.
• A9: t, f.
• A10: t, f.
• A11: continuous.
• A12: t, f.
• A13: g, p, s.
• A14: continuous.
• A15: continuous.
• A16: +,- (class attribute)

Hands on!¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline


Load the credit_approval.csv dataset, and store it into applications_df.¶

This file already has wrong observations removed, and it is balanced.

In [ ]:
# your code goes here
applications_df = None

In [ ]:
applications_df = pd.read_csv('credit_approval.csv', header=None)



According to this blog the probable feature names could be Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and ApprovalStatus.

In [ ]:
cols = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel',
        'Ethnicity', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore',
        'DriversLicense', 'Citizen', 'ZipCode', 'Income', 'ApprovalStatus']

applications_df.columns = cols


Drop unused columns¶

The DriversLicense and ZipCode columns are less relevant than the other features for predicting whether an application should be approved.

Let's remove them.

In [ ]:
# your code goes here

In [ ]:
applications_df.drop(['DriversLicense', 'ZipCode'], axis=1, inplace=True)


Show the shape of the resulting applications_df.¶

In [ ]:
# your code goes here

In [ ]:
applications_df.shape


Data exploration¶

Let's first see a quick summary of the DataFrame and some descriptive statistics of the data.

In [ ]:
# your code goes here

In [ ]:
print(applications_df.info())

applications_df.describe()


The dataset contains both numeric and non-numeric data.

Detecting missing values¶

Check per column if there is any missing value.

In [ ]:
# your code goes here

In [ ]:
applications_df.isna().sum()


Detecting incorrect values¶

Although no missing values were detected, some columns probably contain incorrect values.

Let's check the unique values per column:

In [ ]:
# your code goes here

In [ ]:
for col in applications_df.columns:
    print(applications_df[col].unique())


Labeled missing values¶

There are many missing values labeled with a '?' character.

Let's replace these question marks with NaN values.

In [ ]:
# your code goes here

In [ ]:
# np.nan is the spelling that works across NumPy versions (np.NaN was removed in NumPy 2.0)
applications_df.replace('?', np.nan, inplace=True)


Wrong column type¶

The Age column should be of type float. Fix it.
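
To see why the column ended up with the wrong type, here is a minimal sketch on a made-up column (`toy_age` is a hypothetical name, not part of the dataset). A single '?' placeholder is enough to make pandas store every value as text, so the placeholder has to be replaced before the cast to float can succeed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: the '?' placeholder forces pandas to keep strings.
toy_age = pd.Series(['30.83', '?', '24.50'])

# Replace the placeholder with a real missing value, then cast to float.
toy_age_clean = toy_age.replace('?', np.nan).astype('float')

print(toy_age_clean.dtype)
print(toy_age_clean.isna().sum())
```

The same two steps, applied to the real DataFrame, are what the cells around this one ask for.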

In [ ]:
# your code goes here

In [ ]:
applications_df = applications_df.astype({'Age': 'float'})


Handling missing values¶

If we simply drop the rows with missing values, our machine learning model may miss out on information that would be useful for training. At the same time, many models cannot handle missing values at all.

So, to avoid this problem, we are going to impute the missing values with a mean imputation strategy.

In [ ]:
# your code goes here

In [ ]:
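# numeric_only=True restricts the mean to numeric columns (required by recent pandas versions)
applications_df.fillna(applications_df.mean(numeric_only=True), inplace=True)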


But this mean imputation strategy only works on numeric data. So... what about the non-numeric columns?

We are going to impute each non-numeric column with the most frequent value found in that column.
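
The two strategies can be compared side by side on a tiny made-up frame. This is only a sketch: `toy_df` and its values are hypothetical, and it uses `mode()` to get the most frequent value, which is one of several equivalent ways to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy_df = pd.DataFrame({
    'Debt': [1.0, np.nan, 3.0],
    'Married': ['u', 'u', np.nan],
})

for col in toy_df.columns:
    if pd.api.types.is_numeric_dtype(toy_df[col]):
        # Numeric column: fill with the mean of that column.
        toy_df[col] = toy_df[col].fillna(toy_df[col].mean())
    else:
        # Categorical column: fill with the most frequent value of that column.
        toy_df[col] = toy_df[col].fillna(toy_df[col].mode()[0])

print(toy_df)
```

Note that each column is filled with a value computed from that same column, never from another one.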

In [ ]:
# your code goes here

In [ ]:
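for col in applications_df.columns:
    if applications_df[col].dtypes == 'object':
        # Fill only this column, using its own most frequent value
        most_frequent = applications_df[col].value_counts().index[0]
        applications_df[col] = applications_df[col].fillna(most_frequent)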


Finally, verify the number of NaNs again.

In [ ]:
# your code goes here

In [ ]:
applications_df.isna().sum()


Numeric variables analysis¶

Let's plot histograms for each numeric variable.

First define a plot_hist function that receives a column name as a parameter and plots a histogram of that column:

In [ ]:
def plot_hist(col):
    pass

In [ ]:
def plot_hist(col):
    applications_df.loc[:, col].plot(kind='hist', title=col)
    plt.show()


Now use the function above to show a histogram for each numeric column.

In [ ]:
numeric_cols = ['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income']


In [ ]:
numeric_cols = ['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income']

for col in numeric_cols:
    plot_hist(col)


Now create a scatter matrix to see if there is any important relationship.

In [ ]:
# your code goes here

In [ ]:
from pandas.plotting import scatter_matrix

ax = scatter_matrix(applications_df[['Age', 'Debt', 'YearsEmployed',
                                     'CreditScore', 'Income']],
                    figsize=(12, 12))


Finally, create a correlation matrix for all the numeric variables.

In [ ]:
# your code goes here

In [ ]:
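# Select the numeric columns explicitly; recent pandas versions raise on non-numeric columns
corr_metrics = applications_df[numeric_cols].corr()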



These numeric columns are not strongly correlated with each other.

The highest correlation is between Age and YearsEmployed, which makes sense: older applicants have had more time to accumulate years of employment.

Non-numeric variables analysis¶

Let's plot bar plots for each non-numeric variable.

First define a plot_bar function that receives a column name as a parameter and plots a bar plot of that column:

In [ ]:
def plot_bar(col):
    pass

In [ ]:
def plot_bar(col):
    applications_df.loc[:, col].value_counts().plot(kind='bar', title=col)
    plt.show()


Now use the function above to show a bar plot for each non-numeric column.

In [ ]:
non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
'Ethnicity', 'PriorDefault', 'Employed', 'Citizen',
'ApprovalStatus']


In [ ]:
non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
'Ethnicity', 'PriorDefault', 'Employed', 'Citizen',
'ApprovalStatus']

for col in non_numeric_cols:
    plot_bar(col)


Create features $X$ and labels $y$¶

Separate features and labels into different $X$ and $y$ variables.

In [ ]:
# your code goes here
X = None
y = None

In [ ]:
X = applications_df.drop(['ApprovalStatus'], axis=1)
y = applications_df['ApprovalStatus']


Convert non-numeric data into numeric¶

Let's use OrdinalEncoder to encode categorical features ($X$) into integer values.
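
Before applying it to the real features, here is a minimal sketch of what OrdinalEncoder does, using a made-up single-column frame (`toy_X` and `toy_enc` are hypothetical names). Each distinct category is mapped to an integer according to its sorted position.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy feature with three distinct categories.
toy_X = pd.DataFrame({'Married': ['u', 'y', 'l', 'u']})

toy_enc = OrdinalEncoder().fit(toy_X)

# Categories are sorted, so 'l' -> 0, 'u' -> 1, 'y' -> 2.
print(toy_enc.categories_)

toy_encoded = toy_enc.transform(toy_X)
print(toy_encoded)
```

Keep in mind that the resulting integers imply an order that the original symbols may not have; one-hot encoding is a common alternative when that matters.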

In [ ]:
from sklearn.preprocessing import OrdinalEncoder

non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
'Ethnicity', 'PriorDefault', 'Employed', 'Citizen']


In [ ]:
from sklearn.preprocessing import OrdinalEncoder

non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
'Ethnicity', 'PriorDefault', 'Employed', 'Citizen']

enc = OrdinalEncoder().fit(X[non_numeric_cols])

new_values = enc.transform(X[non_numeric_cols])

X.loc[:, non_numeric_cols] = new_values



Scale the feature values to a uniform range¶

Let's use StandardScaler to rescale the features so that they'll have the properties of a standard normal distribution with $\mu=0$ and $\sigma=1$, where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean.
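
In other words, every value x is replaced by z = (x - mean) / std, computed per column. The sketch below checks this on a made-up column (`toy_values` is a hypothetical name) by comparing the scaler's output with the formula applied by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature column.
toy_values = np.array([[1.0], [2.0], [3.0], [4.0]])

toy_scaled = StandardScaler().fit_transform(toy_values)

# Same computation by hand: subtract the column mean, divide by the column
# standard deviation (population standard deviation, NumPy's default).
manual = (toy_values - toy_values.mean(axis=0)) / toy_values.std(axis=0)

print(np.allclose(toy_scaled, manual))
print(toy_scaled.mean(), toy_scaled.std())
```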

In [ ]:
from sklearn.preprocessing import StandardScaler


In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)

X = scaler.transform(X)

X


Target variable analysis¶

The ApprovalStatus is our target variable (label). It has two possible values:

In [ ]:
y.values[0:100]

In [ ]:
plot_bar('ApprovalStatus')


Let's use LabelEncoder to encode its values so that they contain only the values 0 and 1.
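
As a quick sketch of the mapping, the made-up array below (`toy_y` is a hypothetical name) uses the same two symbols as the target column. LabelEncoder sorts the classes, so '+' comes first and becomes 0, while '-' becomes 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels using the same symbols as ApprovalStatus.
toy_y = np.array(['+', '-', '-', '+'])

toy_label_enc = LabelEncoder().fit(toy_y)

# classes_ lists the original labels; a label's position is its integer code.
print(toy_label_enc.classes_)

toy_y_encoded = toy_label_enc.transform(toy_y)
print(toy_y_encoded)
```

If you ever need the original symbols back, `inverse_transform` reverses the mapping.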

In [ ]:
from sklearn.preprocessing import LabelEncoder


In [ ]:
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder().fit(y)

y = label_enc.transform(y)

y[0:100]


Modeling¶

Create a get_cv_scores function that receives a model parameter with a scikit-learn model and returns the CV scores of that model.

You should use a StratifiedKFold cross-validator with 5 splits, shuffle=True and a random_state seed so that you always get the same partitions (scikit-learn only accepts random_state when shuffling is enabled).

5 scores should be returned.
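
To get a feeling for what "stratified" means, here is a small sketch on made-up balanced labels (`toy_X`, `toy_y` and `fold_counts` are hypothetical names). Each test fold should keep the same class proportions as the full data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 20 samples, 10 per class.
toy_X = np.zeros((20, 1))
toy_y = np.array([0] * 10 + [1] * 10)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)

fold_counts = []
for fold, (train_idx, test_idx) in enumerate(cv.split(toy_X, toy_y)):
    # Count how many samples of each class landed in this test fold.
    counts = np.bincount(toy_y[test_idx], minlength=2)
    fold_counts.append(counts.tolist())
    print(fold, counts)
```

With 10 samples per class and 5 splits, every test fold holds 2 samples of each class.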

In [ ]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

def get_cv_scores(model):
    pass

In [ ]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

def get_cv_scores(model):
    # shuffle=True is required for random_state to be accepted
    return cross_val_score(model, X, y,
                           cv=StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=10))


Spot-check algorithms¶

Create each of the following models and call the get_cv_scores function using each model to get its CV scores.

Save the resulting scores in the results_df to compare them at the end.

In [ ]:
results_df = pd.DataFrame()


K Nearest Neighbors¶

In [ ]:
from sklearn.neighbors import KNeighborsClassifier


In [ ]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

results_df['KNN'] = get_cv_scores(model)


Support Vector Machines¶

In [ ]:
from sklearn import svm


In [ ]:
from sklearn import svm

model = svm.SVC(gamma='auto',
random_state=10)

results_df['SVM'] = get_cv_scores(model)


Naive Bayes Classifier¶

In [ ]:
from sklearn.naive_bayes import GaussianNB


In [ ]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

results_df['Naive Bayes'] = get_cv_scores(model)


Gradient Boosting Classifier¶

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier


In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=10)

results_df['GBC'] = get_cv_scores(model)


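AdaBoost Classifier¶

In [ ]:
from sklearn.ensemble import AdaBoostClassifier


In [ ]:
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(random_state=10)

results_df['AdaBoost'] = get_cv_scores(model)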



Present results and evaluating performance¶

Show a boxplot per algorithm using the data you saved in results_df.

Which one performs the best? And the worst?

In [ ]:
# your code goes here

In [ ]:
results_df.boxplot(figsize=(14,6), grid=False)


Let's see if we can do better. We can select the best model and perform a grid search of the model parameters to improve the model's ability to predict credit card approvals.
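
For reference, scikit-learn ships a GridSearchCV helper that automates this kind of search. The sketch below is not the solution to the exercise that follows: it runs on synthetic data from make_classification, and `toy_X`, `toy_y`, `param_grid` and `search` are hypothetical names chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, NOT the credit card dataset.
toy_X, toy_y = make_classification(n_samples=200, n_features=5,
                                   random_state=10)

# Candidate values for the number of neighbors.
param_grid = {'n_neighbors': [1, 3, 5, 8, 10]}

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=10),
)
search.fit(toy_X, toy_y)

# Best parameter value and its mean cross-validated accuracy.
print(search.best_params_, search.best_score_)
```

In the next section you will do the same thing by hand, which makes it easier to see what the search is actually computing.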

Finding the best performing model¶

Train several KNeighborsClassifier models with different k values and calculate the accuracy of each model.

Keep using a KNeighborsClassifier estimator and a StratifiedKFold cross-validator with 5 splits.

Test the following k values:

In [ ]:
# your code goes here

def get_kneighbors_score(k):
    None
    return None

ACC_dev = []
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

for k in parameters:
    None

In [ ]:
def get_kneighbors_score(k):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5)
    return scores.mean()

ACC_dev = []
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

for k in parameters:
    scores = get_kneighbors_score(k)
    ACC_dev.append(scores)


Getting the best parameters¶

In [ ]:
# your code goes here

In [ ]:
# This is one possible solution
ACC_dev=pd.DataFrame(ACC_dev)
ACC_dev.rename(columns={0: 'Accuracy'}, inplace=True)
ACC_dev['parameters']=parameters

ACC_dev.loc[ACC_dev['Accuracy']==ACC_dev['Accuracy'].max()]


Evaluating our final model¶

Create the final model with the tuned parameter.

In [ ]:
# your code goes here

model = None

In [ ]:
model = KNeighborsClassifier(n_neighbors=8)


Get model CV predictions¶

Generate cross-validated estimates for each input data point.

Use a StratifiedKFold cross-validator with 5 splits, shuffle=True and a random_state seed.

In [ ]:
from sklearn.model_selection import cross_val_predict

y_pred = None

In [ ]:
from sklearn.model_selection import cross_val_predict

# shuffle=True is required for random_state to be accepted
y_pred = cross_val_predict(model, X, y,
                           cv=StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=10))


Classification report¶

Show a classification_report using the y_pred predictions.

Remember that our labels were encoded as follows:

type    code
+       0
-       1

In [ ]:
from sklearn.metrics import classification_report


In [ ]:
from sklearn.metrics import classification_report

print(classification_report(y, y_pred))


Confusion matrix¶

Show a confusion_matrix using the y_pred predictions.
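
As a reminder of how to read the result, here is a small sketch with made-up labels (`toy_true` and `toy_pred` are hypothetical). In scikit-learn's confusion_matrix, rows correspond to the true classes and columns to the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions.
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

# Row i, column j counts samples of true class i predicted as class j.
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)
```

Here the first row says that, of the two samples whose true class is 0, one was predicted as 0 and one as 1.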

In [ ]:
from sklearn.metrics import confusion_matrix


In [ ]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y, y_pred))