MLF - Credit Card Applications

Last updated: July 22nd, 2020



Credit card applications

In this project you will create a model to predict whether a credit card application should be approved or not.

To train the model you will use the Credit Card Approval dataset from the UCI Machine Learning Repository.

This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data.

Here are the possible values for each variable:

  • A1: b, a.
  • A2: continuous.
  • A3: continuous.
  • A4: u, y, l, t.
  • A5: g, p, gg.
  • A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
  • A7: v, h, bb, j, n, z, dd, ff, o.
  • A8: continuous.
  • A9: t, f.
  • A10: t, f.
  • A11: continuous.
  • A12: t, f.
  • A13: g, p, s.
  • A14: continuous.
  • A15: continuous.
  • A16: +, - (class attribute).

separator2

Hands on!

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

green-divider

Load the credit_approval.csv dataset and store it in applications_df.

This file already has the incorrect observations removed, and the classes are balanced.

In [ ]:
# your code goes here
applications_df = None
In [ ]:
applications_df = pd.read_csv('credit_approval.csv', header=None)

applications_df.head()

According to this blog the probable feature names could be Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and ApprovalStatus.

In [ ]:
cols = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel',
        'Ethnicity', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore',
        'DriversLicence', 'Citizen', 'ZipCode', 'Income', 'ApprovalStatus']

applications_df.columns = cols

Drop unused columns

The DriversLicence and ZipCode columns are not as important as the other features for our goal of predicting whether to approve an application or not.

Let's remove them.

In [ ]:
# your code goes here
In [ ]:
applications_df.drop(['DriversLicence', 'ZipCode'], axis=1, inplace=True)

Show the shape of the resulting applications_df.

In [ ]:
# your code goes here
In [ ]:
applications_df.shape

green-divider

Data exploration

Let's first see a quick summary of the DataFrame and some descriptive statistics of the data.

In [ ]:
# your code goes here
In [ ]:
print(applications_df.info())

applications_df.describe()

The dataset contains both numeric and non-numeric data.

green-divider

Detecting missing values

Check whether each column has any missing values.

In [ ]:
# your code goes here
In [ ]:
applications_df.isna().sum()

green-divider

Detecting incorrect values

Although we don't have missing values, there are probably incorrect values.

Let's check the unique values per column:

In [ ]:
# your code goes here
In [ ]:
for col in applications_df.columns:
    print(col, applications_df[col].unique())

Labeled missing values

There are many missing values labeled with a '?' character.

Let's replace these question marks with NaN values.

In [ ]:
# your code goes here
In [ ]:
applications_df.replace('?', np.nan, inplace=True)

Wrong column type

The Age column should be of type float; fix it.

In [ ]:
# your code goes here
In [ ]:
applications_df = applications_df.astype({'Age': 'float'})

green-divider

Handling missing values

If we simply remove the rows with missing values, our machine learning model may miss out on information about the dataset that could be useful for its training. Also, many models cannot handle missing values implicitly.

So, to avoid this problem, we are going to impute the missing values with a mean imputation strategy.

In [ ]:
# your code goes here
In [ ]:
applications_df.fillna(applications_df.mean(numeric_only=True), inplace=True)

But this mean imputation strategy only works on numeric data. So... what about the non-numeric columns?

We are going to impute these non-numeric columns with the most frequent value in each respective column.

In [ ]:
# your code goes here
In [ ]:
for col in applications_df.columns:
    if applications_df[col].dtype == 'object':
        # fill only this column's NaNs with its most frequent value
        applications_df[col] = applications_df[col].fillna(
            applications_df[col].value_counts().index[0])
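
A roughly equivalent approach uses scikit-learn's SimpleImputer with the 'most_frequent' strategy. The sketch below is an added illustration, not part of the original exercise, and assumes the object-typed columns selected via select_dtypes are the ones to impute.

In [ ]:
from sklearn.impute import SimpleImputer

# alternative sketch: impute every object-typed column with its most frequent value
object_cols = applications_df.select_dtypes(include='object').columns
imputer = SimpleImputer(strategy='most_frequent')
applications_df[object_cols] = imputer.fit_transform(applications_df[object_cols])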

Finally, verify the number of NaNs again.

In [ ]:
# your code goes here
In [ ]:
applications_df.isna().sum()

green-divider

Numeric variables analysis

Let's plot histograms for each numeric variable.

First define a plot_hist function that receives a column name as a parameter and plots a histogram of that column:

In [ ]:
def plot_hist(col):
    # your code goes here
    pass
In [ ]:
def plot_hist(col):
    applications_df.loc[:,col].plot(kind='hist', title=col)
    plt.show()

Now use the function above to show a histogram for each numeric column.

In [ ]:
numeric_cols = ['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income']

# your code goes here
In [ ]:
numeric_cols = ['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income']

for col in numeric_cols:
    plot_hist(col)

Now create a scatter matrix to see if there is any important relationship.

In [ ]:
# your code goes here
In [ ]:
from pandas.plotting import scatter_matrix

ax = scatter_matrix(applications_df[['Age', 'Debt', 'YearsEmployed',
                                     'CreditScore', 'Income']],
                    figsize=(12,12))

Finally, create a correlation matrix for all the numeric variables.

In [ ]:
# your code goes here
In [ ]:
corr_metrics = applications_df.corr(numeric_only=True)

corr_metrics.style.background_gradient(cmap="bwr")

These numeric columns don't have strong correlations between them.

The highest one indicates that a higher Age is associated with more YearsEmployed, which makes sense.
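
One way to confirm this programmatically is sketched below; it is an added illustration, not part of the original exercise, and simply extracts the strongest off-diagonal correlation from the matrix above.

In [ ]:
# mask the diagonal, then find the pair of columns with the strongest absolute correlation
corr = applications_df.corr(numeric_only=True)
strongest_pair = corr.where(~np.eye(len(corr), dtype=bool)).abs().stack().idxmax()

print(strongest_pair, corr.loc[strongest_pair])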

green-divider

Non-numeric variables analysis

Let's plot bar plots for each non-numeric variable.

First define a plot_bar function that receives a column name as a parameter and plots a bar plot of that column:

In [ ]:
def plot_bar(col):
    # your code goes here
    pass
In [ ]:
def plot_bar(col):
    applications_df.loc[:,col].value_counts().plot(kind='bar', title=col)
    plt.show()

Now use the function above to show a bar plot for each non-numeric column.

In [ ]:
non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
                    'Ethnicity', 'PriorDefault', 'Employed', 'Citizen',
                    'ApprovalStatus']

# your code goes here
In [ ]:
non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
                    'Ethnicity', 'PriorDefault', 'Employed', 'Citizen',
                    'ApprovalStatus']

for col in non_numeric_cols:
    plot_bar(col)

green-divider

Create features $X$ and labels $y$

Separate features and labels into different $X$ and $y$ variables.

In [ ]:
# your code goes here
X = None
y = None
In [ ]:
X = applications_df.drop(['ApprovalStatus'], axis=1)
y = applications_df['ApprovalStatus']

green-divider

Convert non-numeric data into numeric

Let's use OrdinalEncoder to encode categorical features ($X$) into integer values.

In [ ]:
from sklearn.preprocessing import OrdinalEncoder

non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
                    'Ethnicity', 'PriorDefault', 'Employed', 'Citizen']

# your code goes here
In [ ]:
from sklearn.preprocessing import OrdinalEncoder

non_numeric_cols = ['Gender', 'Married', 'BankCustomer', 'EducationLevel',
                    'Ethnicity', 'PriorDefault', 'Employed', 'Citizen']

enc = OrdinalEncoder().fit(X[non_numeric_cols])

new_values = enc.transform(X[non_numeric_cols])

X.loc[:, non_numeric_cols] = new_values

X.head()

green-divider

Scale the feature values to a uniform range

Let's use StandardScaler to rescale the features so that they'll have the properties of a standard normal distribution with $\mu=0$ and $\sigma=1$, where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean.
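
In other words, each feature value $x$ is transformed as $z = \frac{x - \mu}{\sigma}$.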

In [ ]:
from sklearn.preprocessing import StandardScaler

# your code goes here
In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)

X = scaler.transform(X)

X

green-divider

Target variable analysis

The ApprovalStatus is our target variable (label). It has two possible values:

In [ ]:
y.values[0:100]
In [ ]:
plot_bar('ApprovalStatus')

Let's use LabelEncoder to normalize its values such that they contain only the values 0 and 1.

In [ ]:
from sklearn.preprocessing import LabelEncoder

# your code goes here
In [ ]:
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder().fit(y)

y = label_enc.transform(y)

y[0:100]

green-divider

 Modeling

Create a get_cv_scores function that receives a model parameter with a scikit-learn model and returns the CV scores of that model.

You should use a StratifiedKFold cross-validator with 5 splits and a random_state seed so you always get the same partitions.

5 scores should be returned.

In [ ]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

def get_cv_scores(model):
    # your code goes here
    pass
In [ ]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

def get_cv_scores(model):
    # shuffle=True is required for random_state to have an effect
    return cross_val_score(model, X, y,
                           cv=StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=10))

green-divider

Spot-check algorithms

Create each of the following models and call the get_cv_scores function using each model to get its CV scores.

Save the resulting scores in results_df to compare them at the end.

In [ ]:
results_df = pd.DataFrame()

K Nearest Neighbors

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# your code goes here
In [ ]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

results_df['KNN'] = get_cv_scores(model)

Support Vector Machines

In [ ]:
from sklearn import svm

# your code goes here
In [ ]:
from sklearn import svm

model = svm.SVC(gamma='auto',
                random_state=10)

results_df['SVM'] = get_cv_scores(model)

Naive Bayes Classifier

In [ ]:
from sklearn.naive_bayes import GaussianNB

# your code goes here
In [ ]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

results_df['Naive Bayes'] = get_cv_scores(model)

 Gradient Boost Classifier

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

# your code goes here
In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=10)

results_df['GBC'] = get_cv_scores(model)

AdaBoost Classifier (Adaptive Boosting)

In [ ]:
from sklearn.ensemble import AdaBoostClassifier

# your code goes here
In [ ]:
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(random_state=10)

results_df['AdaBoost'] = get_cv_scores(model)

green-divider

Present results and evaluating performance

Show a boxplot per algorithm using the data you saved in results_df.

Which one performs the best? And the worst?

In [ ]:
# your code goes here
In [ ]:
results_df.boxplot(figsize=(14,6), grid=False)
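
To complement the boxplot, the mean CV score per model can also be compared numerically; this is an added illustration, not part of the original exercise.

In [ ]:
# mean cross-validation accuracy per algorithm, best first
results_df.mean().sort_values(ascending=False)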

Let's see if we can do better. We can select the best model and perform a grid search of the model parameters to improve the model's ability to predict credit card approvals.
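
For reference, such a grid search could look like the hedged sketch below; GridSearchCV, the choice of KNeighborsClassifier, and the parameter grid are assumptions for illustration, not part of the original exercise.

In [ ]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# hypothetical grid of k values to try
param_grid = {'n_neighbors': [1, 3, 5, 8, 10, 15, 20, 30, 50]}

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid,
                    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=10),
                    scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)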

green-divider

Finding the best performing model

Train several KNeighborsClassifier models with different k values and calculate the accuracy of these models.

Keep using a KNeighborsClassifier estimator and a StratifiedKFold cross-validator with 5 splits.

Test the following k values:

In [ ]:
# your code goes here

def get_kneighbors_score(k):
    None
    return None

ACC_dev = []
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]
    
for k in parameters:
    None
In [ ]:
def get_kneighbors_score(k):
    # mean accuracy over 5-fold stratified CV (the default for classifiers when cv=5)
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5)
    return scores.mean()

ACC_dev = []
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

for k in parameters:
    score = get_kneighbors_score(k)
    ACC_dev.append(score)
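
A quick way to read these results is to plot accuracy against k; this is an added illustration, not part of the original exercise.

In [ ]:
plt.plot(parameters, ACC_dev, marker='o')
plt.xlabel('n_neighbors (k)')
plt.ylabel('Mean CV accuracy')
plt.title('KNN accuracy vs. k')
plt.show()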

Getting the best parameters

In [ ]:
# your code goes here
In [ ]:
# This is one possible solution
ACC_dev = pd.DataFrame(ACC_dev)
ACC_dev.rename(columns={0: 'Accuracy'}, inplace=True)
ACC_dev['parameters'] = parameters

ACC_dev.loc[ACC_dev['Accuracy'] == ACC_dev['Accuracy'].max()]

green-divider

Evaluating our final model

Create the final model with the tuned parameter.

In [ ]:
# your code goes here

model = None
In [ ]:
model = KNeighborsClassifier(n_neighbors=8)

Get model CV predictions

Generate cross-validated estimates for each input data point.

Use a StratifiedKFold cross-validator with 5 splits and a random_state seed.

In [ ]:
from sklearn.model_selection import cross_val_predict

y_pred = None
In [ ]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(model, X, y,
                           cv=StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=10))

Classification report

Show a classification_report using the y_pred predictions.

Remember that our labels were encoded as follows:

  • +: 0
  • -: 1
In [ ]:
from sklearn.metrics import classification_report

# your code goes here
In [ ]:
from sklearn.metrics import classification_report

print(classification_report(y, y_pred))

 Confusion matrix

Show a confusion_matrix using the y_pred predictions.

In [ ]:
from sklearn.metrics import confusion_matrix

# your code goes here
In [ ]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y, y_pred, labels=[0, 1])

The first element of the first row of the confusion matrix denotes the true positives, meaning the number of positive instances (approved applications) that the model predicted correctly.

The last element of the second row of the confusion matrix denotes the true negatives, meaning the number of negative instances (denied applications) that the model predicted correctly.
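
Since '+' was encoded as 0 and '-' as 1, the four cells can also be unpacked directly; the sketch below is an added illustration, not part of the original exercise.

In [ ]:
# rows are true labels, columns are predictions; with labels=[0, 1] the diagonal
# holds the correctly predicted approved (+) and denied (-) applications
approved_correct, approved_wrong, denied_wrong, denied_correct = \
    confusion_matrix(y, y_pred, labels=[0, 1]).ravel()

accuracy = (approved_correct + denied_correct) / len(y)
print(approved_correct, denied_correct, round(accuracy, 3))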

purple-divider
