Cross-validation and parameter tuning¶
In this lesson we will continue the machine learning application from the previous lesson. Along the way, we will introduce some core machine learning concepts and terms.
Remember the problem¶
Over the past few years, streaming services with huge catalogs have become the primary means through which most people listen to their favorite music. At the same time, the sheer amount of music on offer can leave users overwhelmed when trying to find new music that suits their tastes.
For this reason, streaming services have looked into means of categorizing music to allow for personalized recommendations. One method involves direct analysis of the raw audio information in a given song, scoring the raw data on a variety of metrics.
In this lesson we'll be examining data compiled by a research group known as The Echo Nest.
Our goal is to look through this dataset and classify songs as being either 'Hip-Hop' or 'Rock' - all without listening to a single one ourselves. In doing so, we will learn how to clean our data and do some exploratory data visualization towards the goal of feeding our data through a simple machine learning algorithm.
Get the data and our latest model¶
We will load the tracks data as we left it in the previous lesson, and use it to train a KNeighborsClassifier model as we did before.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset prepared in the previous lesson
tracks = pd.read_csv('tracks_3.csv')
tracks.head()
tracks.info()
Select Features ($X$) and Labels ($y$)¶
# Features: every column except the identifier and the label columns
X = tracks.drop(['track_id', 'genre_top', 'genre_top_code'], axis=1)
# Target: the encoded genre
y = tracks['genre_top_code']
Train and Test sets¶
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=10)
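As a quick, illustrative check, we can print the shapes of the resulting splits:
print(X_train.shape, X_test.shape)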
Data normalization¶
We will use the StandardScaler to standardize the features (X_train and X_test) before moving on to model creation.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training set only, so that no information
# from the test set leaks into the model
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
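As a quick sanity check (illustrative), each standardized training feature should now have mean ≈ 0 and standard deviation ≈ 1:
print(X_train.mean(axis=0).round(2))  # values close to 0
print(X_train.std(axis=0).round(2))   # values close to 1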
Build and train the model¶
from sklearn.neighbors import KNeighborsClassifier
# Start with k=5 neighbors, as in the previous lesson
k = 5
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train, y_train)
Make predictions¶
y_pred = model.predict(X_test)
Evaluate the model¶
from sklearn.metrics import classification_report
model_report = classification_report(y_test, y_pred)
print("Model report: \n", model_report)
Cross-validation to evaluate our model¶
Cross-validation (CV) works by defining multiple train/test experiments on our sample data. It is a bit more resource-intensive than a single split, but it gives a more reliable evaluation of our parameters and model.
CV splits the data in multiple ways and tests the model on each of the splits.
Performing cross-validation¶
We will use what is known as K-fold CV, which first splits the data into K different, equally sized subsets. It then iteratively uses each subset as the test set, while using the remainder of the data as the training set.
First we define the strategy used to split the dataset, selecting one of the many built-in options. In this case, we will use k-fold with k=5 "folds".
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=10)
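To see what the splitter produces, we can iterate over the index arrays it yields (an illustrative check):
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {i}: {len(train_idx)} train rows, {len(test_idx)} test rows")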
We import cross_val_score, which will compute the score of our estimator on each fold.
from sklearn.model_selection import cross_val_score
We now need to build the estimator with the parameters we want to evaluate.
model = KNeighborsClassifier(n_neighbors=5)
We now use the entire dataset, as it will be split internally by the cross-validator. Passing cv=kf tells scikit-learn to use the k-fold split strategy defined above.
X = np.concatenate((X_train, X_test))  # rebuild the full feature matrix
X.shape
y = np.concatenate((y_train, y_test))  # and the full label vector
y.shape
scores = cross_val_score(model, X, y, cv=kf)
scores
Finally, we can then aggregate the results from each fold for a final model performance score.
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Important: Since cross-validation tests our model on several different splits of the data, the lower the standard deviation across the cross-validation scores, the more robust our model is.
Cross-validation predictions¶
We can also generate a cross-validated prediction for each input data point: with cross_val_predict, every point is predicted by a model that was trained without it.
from sklearn.model_selection import cross_val_predict
model = KNeighborsClassifier(n_neighbors=5)
y_pred = cross_val_predict(model, X, y,
                           cv=KFold(n_splits=5, shuffle=True, random_state=10))
y_pred
model_report = classification_report(y, y_pred)
print("Model report: \n", model_report)
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y, y_pred)
conf_matrix
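As an optional visual aid (a sketch, assuming a reasonably recent scikit-learn), we can render the matrix with ConfusionMatrixDisplay:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay(confusion_matrix=conf_matrix).plot(cmap='Blues')
plt.show()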
Cross-validate¶
cross_validate is similar to cross_val_score, but a bit more versatile and informative. In addition to the test scores, it returns other useful information, such as fit and score times (and, optionally, the trained models and the training scores). It also allows evaluating more than one metric at a time.
from sklearn.model_selection import cross_validate
scores = cross_validate(model, X, y, cv=kf)
scores
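For example (an illustrative sketch), we can request several metrics at once, along with the training scores:
scores = cross_validate(model, X, y, cv=kf,
                        scoring=['accuracy', 'f1_macro'],
                        return_train_score=True)
sorted(scores.keys())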
The next cell prints a list of all the metric names that can be passed to cross_validate:
from sklearn.metrics import get_scorer_names
# In older scikit-learn versions this list lived in sklearn.metrics.SCORERS
sorted(get_scorer_names())
Model parameter tuning¶
In previous sections we trained a KNeighborsClassifier with 5 neighbors via the n_neighbors=5 parameter, but is this the best number of neighbors? Can we boost our model by tuning this parameter?
All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. The generalization power of an estimator often critically depends on a few parameters.
Quite often, it is not clear what the exact values of model parameters should be since they depend on the data at hand.
def get_kneighbors_score(k):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5)
    return "Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2)
get_kneighbors_score(5)
get_kneighbors_score(2)
get_kneighbors_score(15)
get_kneighbors_score(40)
Validation curve¶
Now we can visualize the relation between the hyper-parameter k and the accuracy.
parameters = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30, 50, 60, 80, 90, 100]

def get_kneighbors_score(k):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=4)
    return scores.mean()

ACC_dev = []
for k in parameters:
    scores = get_kneighbors_score(k)
    ACC_dev.append(scores)
ACC_dev
Let's plot the accuracy versus the number of neighbors:
f, ax = plt.subplots(figsize=(10, 5))
plt.plot(parameters, ACC_dev, 'o-', label='testing')
# Mark the best-scoring value of k instead of hard-coding it
best_k = parameters[int(np.argmax(ACC_dev))]
plt.axvline(x=best_k, ymin=0, ymax=1, color='k')
plt.xlabel('Neighbors')
plt.ylabel('Accuracy')
plt.grid()
plt.legend()
plt.show()
print(f"Best parameter: {best_k}, Accuracy: {max(ACC_dev):.2f}")
The validation curve shows the validation and training score of an estimator for varying values of a hyper-parameter. To fully develop the curve we need to use cross_validate with return_train_score=True, so that we can visualize the relation between the hyper-parameter k and the accuracy on both the training and the test folds.
from sklearn.model_selection import cross_validate
knn_train_scores_mean = []
knn_train_scores_std = []
knn_test_scores_mean = []
knn_test_scores_std = []
k = np.arange(1, 50, 1)
for neighbors in k:
    clf = KNeighborsClassifier(n_neighbors=neighbors)
    knn_scores = cross_validate(clf, X, y, cv=5,
                                return_train_score=True, n_jobs=-1)
    knn_train_scores_mean.append(knn_scores['train_score'].mean())
    knn_train_scores_std.append(knn_scores['train_score'].std())
    knn_test_scores_mean.append(knn_scores['test_score'].mean())
    knn_test_scores_std.append(knn_scores['test_score'].std())
knn_train_scores_mean = np.array(knn_train_scores_mean)
knn_train_scores_std = np.array(knn_train_scores_std)
knn_test_scores_mean = np.array(knn_test_scores_mean)
knn_test_scores_std = np.array(knn_test_scores_std)
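As an aside, scikit-learn also ships a helper, validation_curve, that computes the same train/test score arrays in a single call; a minimal sketch equivalent to the loop above:
from sklearn.model_selection import validation_curve

train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name='n_neighbors', param_range=k, cv=5, n_jobs=-1)
# Both arrays have shape (len(k), n_folds); averaging over axis 1
# reproduces the mean curves computed above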
Plot the accuracy versus the number of neighbors:
plt.fill_between(k, knn_train_scores_mean - knn_train_scores_std,
                 knn_train_scores_mean + knn_train_scores_std,
                 alpha=0.1, color="r")
plt.fill_between(k, knn_test_scores_mean - knn_test_scores_std,
                 knn_test_scores_mean + knn_test_scores_std,
                 alpha=0.1, color="g")
plt.plot(k, knn_train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(k, knn_test_scores_mean, 'o-', color="g", label="Test score")
plt.legend()
plt.ylabel('Accuracy')
plt.xlabel('Neighbors')
plt.show()
So which value of k is best?
f, ax = plt.subplots(figsize=(10, 5))
# Vertical line marking the chosen value of k
plt.axvline(x=18, ymin=0, ymax=1, color='k')
plt.fill_between(k, knn_train_scores_mean - knn_train_scores_std,
                 knn_train_scores_mean + knn_train_scores_std,
                 alpha=0.1, color="r")
plt.fill_between(k, knn_test_scores_mean - knn_test_scores_std,
                 knn_test_scores_mean + knn_test_scores_std,
                 alpha=0.1, color="g")
plt.plot(k, knn_train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(k, knn_test_scores_mean, 'o-', color="g", label="Test score")
plt.legend()
plt.ylabel('Accuracy')
plt.xlabel('Neighbors')
plt.show()
Scikit-learn also provides tools to automatically find the best parameter combinations (via cross-validation).
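For example, GridSearchCV exhaustively evaluates a grid of parameter values with cross-validation and reports the best combination; a minimal sketch for our classifier:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': np.arange(1, 50)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=kf)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)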