Spot-checking algorithms on Tracks data¶
Your task will be to find the best algorithm to classify songs as being either 'Hip-Hop' or 'Rock'.
To do that, you will spot-check several algorithms to discover which one might work best.
We will use The Echo Nest song dataset, which contains tracks alongside the track metrics.
Hands on!¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load the tracks_3.csv dataset, and store it into tracks_df.¶
Invalid observations have already been removed from this file, and its classes are balanced.
# your code goes here
tracks_df = None
tracks_df = pd.read_csv('tracks_3.csv')
tracks_df.head()
Show the shape of the resulting tracks_df.¶
# your code goes here
tracks_df.shape
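Since the file is described as balanced, a quick sanity check is to count the rows per genre. The sketch below uses a small toy DataFrame as a stand-in for tracks_df; only the genre_top column name comes from this notebook, and the tempo column and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for tracks_df; only the 'genre_top' column name comes from the notebook.
toy_df = pd.DataFrame({
    'genre_top': ['Rock', 'Hip-Hop', 'Rock', 'Hip-Hop', 'Rock', 'Hip-Hop'],
    'tempo': [120.0, 95.5, 133.2, 88.1, 140.7, 101.3],  # hypothetical feature
})

# Count rows per class; a perfectly balanced dataset has equal counts.
class_counts = toy_df['genre_top'].value_counts()
is_balanced = class_counts.nunique() == 1

print(class_counts)
print('balanced:', is_balanced)
```

On the real data, the same value_counts() call on tracks_df['genre_top'] should show whether both genres have about the same number of tracks.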
# your code goes here
X = None
y = None
X = tracks_df.drop(['genre_top', 'genre_top_code'], axis=1)
y = tracks_df['genre_top_code']
Standardize the features¶
Use the StandardScaler to standardize the features (X) before moving to model creation.
from sklearn.preprocessing import StandardScaler
# your code goes here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
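To make the effect of standardization concrete, the sketch below compares StandardScaler with the equivalent manual computation: subtract each column's mean, then divide by its population standard deviation. The toy_X array is randomly generated for illustration and is not part of the tracks data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales (not the tracks data).
rng = np.random.default_rng(0)
toy_X = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(200, 2))

# StandardScaler learns each column's mean and standard deviation, then rescales.
scaled = StandardScaler().fit_transform(toy_X)

# The same transformation by hand (NumPy's default std is the population std, ddof=0).
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)
```

After scaling, every column has a mean of approximately 0 and a standard deviation of approximately 1, which matters for distance-based models such as KNN and SVM.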
Define an evaluation function¶
Create a get_cv_scores function that receives a model parameter with a scikit-learn model and returns the CV scores of that model.
You should use 5-fold cross-validation, so 5 scores should be returned.
from sklearn.model_selection import cross_val_score
def get_cv_scores(model):
    # your code goes here
    pass
from sklearn.model_selection import cross_val_score
def get_cv_scores(model):
    # 5-fold cross-validation on the standardized features; returns 5 scores
    return cross_val_score(model, X, y, cv=5)
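One caveat with get_cv_scores as written: X was standardized on the full dataset, so each validation fold has already influenced the scaler. A possible alternative, sketched below, is to put the scaler and the model in a pipeline so that scaling is refit on the training part of every fold. The function name get_cv_scores_pipeline and the synthetic data from make_classification are assumptions for illustration, not part of this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the tracks features.
toy_X, toy_y = make_classification(n_samples=200, n_features=8, random_state=10)

def get_cv_scores_pipeline(model, X, y):
    """Hypothetical variant of get_cv_scores that scales inside each CV fold."""
    pipeline = make_pipeline(StandardScaler(), model)
    # For classifiers, an integer cv uses stratified folds; the default score is accuracy.
    return cross_val_score(pipeline, X, y, cv=5)

scores = get_cv_scores_pipeline(KNeighborsClassifier(), toy_X, toy_y)
print(scores)
```

This variant expects the unscaled features as input. For a quick spot-check the difference in scores is usually small, but the pipeline keeps the evaluation free of information from the held-out fold.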
Spot-check algorithms¶
Create each of the following models, and call the get_cv_scores function using each model to get its CV scores.
Save the resulting scores in the results_df to compare them at the end.
results_df = pd.DataFrame()
K Nearest Neighbors¶
from sklearn.neighbors import KNeighborsClassifier
# your code goes here
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
results_df['KNN'] = get_cv_scores(model)
Decision Trees¶
from sklearn.tree import DecisionTreeClassifier
# your code goes here
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
results_df['Decision Trees'] = get_cv_scores(model)
Support Vector Machines¶
from sklearn import svm
# your code goes here
from sklearn import svm
model = svm.SVC(gamma='auto', random_state=10)
results_df['SVM'] = get_cv_scores(model)
Naive Bayes Classifier¶
from sklearn.naive_bayes import GaussianNB
# your code goes here
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
results_df['Naive Bayes'] = get_cv_scores(model)
Random Forest¶
from sklearn.ensemble import RandomForestClassifier
# your code goes here
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
results_df['Random Forest'] = get_cv_scores(model)
Gradient Boost Classifier¶
from sklearn.ensemble import GradientBoostingClassifier
# your code goes here
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=10)
results_df['GBC'] = get_cv_scores(model)
AdaBoost Classifier (Adaptive Boosting)¶
from sklearn.ensemble import AdaBoostClassifier
# your code goes here
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(random_state=10)
results_df['AdaBoost'] = get_cv_scores(model)
Present results¶
Show a boxplot per algorithm using the data you saved in results_df.
Which algorithm performs best, and which performs worst?
# your code goes here
results_df.boxplot(figsize=(14,6), grid=False)
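A numeric summary can complement the boxplot when the boxes overlap. The sketch below ranks models by mean CV score and also reports the spread across folds. The toy_results values and the model names are made-up placeholders, not actual results from this notebook.

```python
import pandas as pd

# Placeholder scores (one column per model, one row per fold); not real results.
toy_results = pd.DataFrame({
    'Model A': [0.80, 0.82, 0.79, 0.81, 0.83],
    'Model B': [0.90, 0.88, 0.91, 0.89, 0.92],
    'Model C': [0.70, 0.85, 0.60, 0.90, 0.75],
})

# Mean and standard deviation per model, sorted from best to worst mean score.
summary = toy_results.agg(['mean', 'std']).T.sort_values('mean', ascending=False)
best_model = summary.index[0]
worst_model = summary.index[-1]

print(summary)
print('best:', best_model, '| worst:', worst_model)
```

Applying the same agg call to results_df gives one row per algorithm. A high mean with a low standard deviation suggests a model that is both accurate and stable across folds.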