
MLF - Cross-Validation and Parameter Tuning

Last updated: April 10th, 2020



Cross-validation and parameter tuning

In this lesson we will continue the machine learning application from the previous lesson. Along the way, we will introduce some core machine learning concepts and terms.

green-divider

Remember the problem

Over the past few years, streaming services with huge catalogs have become the primary means through which most people listen to their favorite music. But at the same time, the sheer amount of music on offer can leave users feeling overwhelmed when trying to find new music that suits their tastes.

For this reason, streaming services have looked into means of categorizing music to allow for personalized recommendations. One method involves direct analysis of the raw audio information in a given song, scoring the raw data on a variety of metrics.

In this lesson we'll be examining data compiled by a research group known as The Echo Nest.

Our goal is to look through this dataset and classify songs as being either 'Hip-Hop' or 'Rock' - all without listening to a single one ourselves. In doing so, we will learn how to clean our data and do some exploratory data visualization towards the goal of feeding our data through a simple machine learning algorithm.

green-divider

Get the data and our latest model

We will load the tracks data as we left it in the previous lesson, and use it to train a KNeighborsClassifier model as we did before.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tracks = pd.read_csv('tracks_3.csv')
In [2]:
tracks.head()
Out[2]:
track_id acousticness danceability energy instrumentalness liveness speechiness tempo valence genre_top genre_top_code
0 3681 0.624076 0.294289 0.856591 0.891003 0.115368 0.041453 117.741 0.250994 Rock 1
1 37637 0.628078 0.361321 0.825037 0.847185 0.196486 0.039852 139.139 0.599086 Rock 1
2 41302 0.801959 0.506456 0.688173 0.888620 0.141702 0.047079 92.209 0.653078 Rock 1
3 40948 0.846783 0.227621 0.469203 0.933570 0.080574 0.029546 80.122 0.945517 Rock 1
4 837 0.405611 0.244638 0.837132 0.725711 0.095129 0.049809 142.916 0.307880 Rock 1
In [3]:
tracks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1820 entries, 0 to 1819
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_id          1820 non-null   int64  
 1   acousticness      1820 non-null   float64
 2   danceability      1820 non-null   float64
 3   energy            1820 non-null   float64
 4   instrumentalness  1820 non-null   float64
 5   liveness          1820 non-null   float64
 6   speechiness       1820 non-null   float64
 7   tempo             1820 non-null   float64
 8   valence           1820 non-null   float64
 9   genre_top         1820 non-null   object 
 10  genre_top_code    1820 non-null   int64  
dtypes: float64(8), int64(2), object(1)
memory usage: 156.5+ KB

Select Features ($X$) and Labels ($y$)

In [4]:
X = tracks.drop(['track_id', 'genre_top', 'genre_top_code'], axis=1)
y = tracks['genre_top_code']

Train and Test sets

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=10)

Data normalization

We will use the StandardScaler to standardize the features (X_train and X_test) before moving to model creation.
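
Standardization rescales each feature to zero mean and unit variance: $z = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are computed from the training set only, so no information from the test set leaks into the scaling.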

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Build and train the model

In [7]:
from sklearn.neighbors import KNeighborsClassifier

k = 5
model = KNeighborsClassifier(n_neighbors=k)
In [8]:
model.fit(X_train, y_train)
Out[8]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

Make predictions

In [9]:
y_pred = model.predict(X_test)

Evaluate the model

In [10]:
from sklearn.metrics import classification_report

model_report = classification_report(y_test, y_pred)

print("Model report: \n", model_report)
Model report: 
               precision    recall  f1-score   support

           0       0.87      0.80      0.84       184
           1       0.81      0.88      0.84       180

    accuracy                           0.84       364
   macro avg       0.84      0.84      0.84       364
weighted avg       0.84      0.84      0.84       364

green-divider

Cross-validation to evaluate our model

Cross-validation (CV) works by defining multiple experiments to run on our sample data. It is more resource intensive, but it gives us a more reliable evaluation of our parameters and model.

CV splits the data in multiple ways and tests the model on each of the splits.


Performing cross-validation

We will use what's known as K-fold CV, which first splits the data into K equally sized subsets ("folds"). It then iterates over the folds, using each one as a test set while the remaining data serves as the training set.

First we define the strategy to split the dataset, selecting one of the many built-in splitters.

In this case, we will use k-fold with k=5 folds.

In [11]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=10)
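
As a quick sanity check (a hypothetical inspection snippet, not part of the original lesson), we can iterate over the splitter to see the fold sizes; with 1820 rows and 5 folds, each test fold holds 364 rows:

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # each fold trains on ~4/5 of the rows and tests on the remaining 1/5
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}")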

We import cross_val_score, which will compute the score for our estimator on each fold.

In [12]:
from sklearn.model_selection import cross_val_score

We now build the estimator with the parameters that we want to evaluate.

In [13]:
model = KNeighborsClassifier(n_neighbors=5)

We now use the entire dataset, as it will be split internally by the cross-validator. Passing cv=kf tells cross_val_score to use our k-fold split strategy.

In [14]:
X = np.concatenate((X_train, X_test))

X.shape
Out[14]:
(1820, 8)
In [15]:
y = np.concatenate((y_train, y_test))

y.shape
Out[15]:
(1820,)
In [16]:
scores = cross_val_score(model, X, y, cv=kf)

scores
Out[16]:
array([0.87087912, 0.85164835, 0.86813187, 0.85164835, 0.86813187])

Finally, we aggregate the results from each fold into a single model performance score.

In [17]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.86 (+/- 0.02)
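
The ± value is twice the standard deviation of the five fold scores, giving a rough uncertainty interval around the mean accuracy.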

Important: Since cross-validation tests our model on different splits of the data, the lower the standard deviation of the cross-validation scores, the more robust our model is.


Cross-validation predictions

We can also generate a cross-validated prediction for each input data point using cross_val_predict.

In [18]:
from sklearn.model_selection import cross_val_predict

model = KNeighborsClassifier(n_neighbors=5)

y_pred = cross_val_predict(model, X, y,
                           cv=KFold(n_splits=5, shuffle=True, random_state=10))

y_pred
Out[18]:
array([1, 0, 1, ..., 1, 0, 0])
In [19]:
model_report = classification_report(y, y_pred)

print("Model report: \n", model_report)
Model report: 
               precision    recall  f1-score   support

           0       0.88      0.84      0.86       910
           1       0.84      0.89      0.87       910

    accuracy                           0.86      1820
   macro avg       0.86      0.86      0.86      1820
weighted avg       0.86      0.86      0.86      1820

In [20]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y, y_pred)
conf_matrix
Out[20]:
array([[760, 150],
       [101, 809]])
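
In scikit-learn's convention each row of the confusion matrix corresponds to a true class and each column to a predicted class, so the diagonal counts correct predictions: 760 + 809 = 1569 of the 1820 tracks, which matches the ~86% accuracy above.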

Cross-validate

cross_validate is a similar function to cross_val_score, but a bit more versatile and informative.

In addition to the test scores, it returns other useful information, such as fit and score times, and it can optionally return the trained models. It also allows evaluating more than one metric at a time.

In [21]:
from sklearn.model_selection import cross_validate
In [22]:
scores = cross_validate(model, X, y, cv=kf)

scores
Out[22]:
{'fit_time': array([0.00275207, 0.00151062, 0.00141931, 0.00142884, 0.00142288]),
 'score_time': array([0.02452016, 0.0208497 , 0.02061152, 0.02489614, 0.02783132]),
 'test_score': array([0.87087912, 0.85164835, 0.86813187, 0.85164835, 0.86813187])}

The next cell prints a list of all the metrics that we can use to evaluate with cross_validate:

In [23]:
import sklearn

sorted(sklearn.metrics.SCORERS.keys())
Out[23]:
['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'max_error',
 'mutual_info_score',
 'neg_brier_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'neg_root_mean_squared_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'roc_auc_ovo',
 'roc_auc_ovo_weighted',
 'roc_auc_ovr',
 'roc_auc_ovr_weighted',
 'v_measure_score']
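
As a sketch of multi-metric evaluation (reusing the model, X, y and kf defined above; the choice of accuracy and f1 is just an illustration), we can pass several of these scorer names through the scoring parameter:

multi_scores = cross_validate(model, X, y, cv=kf,
                              scoring=['accuracy', 'f1'])

# with a list of scorers, the result dict has one
# 'test_<scorer>' key per metric instead of 'test_score'
print(multi_scores['test_accuracy'].mean())
print(multi_scores['test_f1'].mean())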

green-divider

Model parameters tuning

In previous sections we trained a KNeighborsClassifier with 5 neighbors using the n_neighbors=5 parameter, but is this the best number of neighbors? Can we boost our model by tuning this parameter?

All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. The generalization power of an estimator often critically depends on a few parameters.

Quite often, it is not clear what the exact values of model parameters should be since they depend on the data at hand.

In [24]:
def get_kneighbors_score(k):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5)
    return "Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2)
In [25]:
get_kneighbors_score(5)
Out[25]:
'Accuracy: 0.862 (+/- 0.019)'
In [26]:
get_kneighbors_score(2)
Out[26]:
'Accuracy: 0.818 (+/- 0.029)'
In [27]:
get_kneighbors_score(15)
Out[27]:
'Accuracy: 0.865 (+/- 0.030)'
In [28]:
get_kneighbors_score(40)
Out[28]:
'Accuracy: 0.851 (+/- 0.032)'

Learning curve

Now we can visualize the relation between the hyperparameter k and the accuracy.

In [29]:
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

def get_kneighbors_score(k):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=4)
    return scores.mean()

ACC_dev = []
for k in parameters:
    score = get_kneighbors_score(k)
    ACC_dev.append(score)
    
ACC_dev
Out[29]:
[0.8252747252747252,
 0.8576923076923078,
 0.8571428571428572,
 0.8598901098901099,
 0.8642857142857143,
 0.8598901098901099,
 0.8582417582417583,
 0.8598901098901099,
 0.8626373626373627,
 0.8587912087912088,
 0.8554945054945056,
 0.8505494505494506,
 0.8510989010989012,
 0.8478021978021979,
 0.8450549450549449,
 0.8417582417582417]

Let's plot the accuracy versus the number of neighbors:

In [30]:
f, ax = plt.subplots(figsize=(10,5))

plt.plot(parameters, ACC_dev,'o-', label='testing')
plt.axvline(x=10, ymin=0, ymax=1, color='k')
plt.xlabel('Neighbors')
plt.ylabel('Accuracy')

plt.grid()
plt.legend()
plt.show()

best_idx = np.argmax(ACC_dev)
print(f"Best parameter: {parameters[best_idx]}, Accuracy: {ACC_dev[best_idx]:.3f}")
Best parameter: 10, Accuracy: 0.864

The learning curve shows the training and validation score of an estimator for varying numbers of neighbors (strictly speaking this is a validation curve, since we vary a hyperparameter rather than the training set size). To fully develop the curve we need cross_validate, which returns the training scores in addition to the test scores.

Now we can visualize the relation between the hyperparameter k and the accuracy on both the training and test folds using cross_validate.

In [31]:
from sklearn.model_selection import cross_validate
knn_train_scores_mean = []
knn_train_scores_std = []
knn_test_scores_mean = []
knn_test_scores_std = []

k = np.arange(1,50,1)

for neighbors in k:
    clf = KNeighborsClassifier(n_neighbors=neighbors)
    knn_scores = cross_validate(clf, X, y, cv=5,
                                return_train_score=True, n_jobs=-1)
    
    knn_train_scores_mean.append(knn_scores['train_score'].mean())
    knn_train_scores_std.append(knn_scores['train_score'].std())
    
    knn_test_scores_mean.append(knn_scores['test_score'].mean())
    knn_test_scores_std.append(knn_scores['test_score'].std())

knn_train_scores_mean = np.array(knn_train_scores_mean)
knn_train_scores_std = np.array(knn_train_scores_std)
knn_test_scores_mean = np.array(knn_test_scores_mean)
knn_test_scores_std = np.array(knn_test_scores_std)

Plot the accuracy versus the number of neighbors

In [32]:
plt.fill_between(k, knn_train_scores_mean - knn_train_scores_std,
                 knn_train_scores_mean + knn_train_scores_std, alpha=0.1,
                 color="r")
plt.fill_between(k, knn_test_scores_mean - knn_test_scores_std,
                 knn_test_scores_mean + knn_test_scores_std, alpha=0.1, color="g")

plt.plot(k, knn_train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(k, knn_test_scores_mean, 'o-', color="g",
         label="Test score")

plt.legend()
plt.ylabel('Accuracy')
plt.xlabel('neighbors')
plt.show()

What is the best k value? The vertical line in the plot below marks k=18 as one reasonable choice.

In [33]:
f, ax = plt.subplots(figsize=(10,5))

plt.axvline(x=18, ymin=0, ymax=1, color='k')
plt.fill_between(k, knn_train_scores_mean - knn_train_scores_std,
                 knn_train_scores_mean + knn_train_scores_std, alpha=0.1,
                 color="r")
plt.fill_between(k, knn_test_scores_mean - knn_test_scores_std,
                 knn_test_scores_mean + knn_test_scores_std, alpha=0.1, color="g")

plt.plot(k, knn_train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(k, knn_test_scores_mean, 'o-', color="g",
         label="Test score")

plt.legend()
plt.ylabel('Accuracy')
plt.xlabel('neighbors')
plt.show()

Scikit-learn also provides tools to automatically find the best parameter combinations (via cross-validation).
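
For example, GridSearchCV exhaustively evaluates a grid of candidate values with cross-validation and keeps the best one. A minimal sketch, reusing X, y and kf from above (the grid values are just an illustration):

from sklearn.model_selection import GridSearchCV

# score every candidate n_neighbors value with k-fold CV,
# then refit the best model on the full dataset
param_grid = {'n_neighbors': [5, 10, 15, 18, 25, 50]}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=kf)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)

Because refit=True by default, grid.best_estimator_ is already retrained on the full dataset and can be used directly for predictions.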

purple-divider
