
MLF - Spot-Checking Algorithms on Tracks Data

Last updated: July 11th, 2020

rmotr


Spot-checking algorithms on Tracks data

Your task is to find the best algorithm for classifying songs as either 'Hip-Hop' or 'Rock'.

To do that, you will spot-check several algorithms to discover which one is likely to work best.

We will use the song dataset from The Echo Nest, which contains tracks along with their metrics.


Hands on!

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline


Load the data/tracks_3.csv file and store it in the tracks_df DataFrame.

This file already has invalid observations removed, and its classes are balanced.

In [ ]:
# your code goes here
tracks_df = None
In [ ]:
tracks_df = pd.read_csv('data/tracks_3.csv')

tracks_df.head()

Show the shape of the resulting tracks_df.

In [ ]:
# your code goes here
In [ ]:
tracks_df.shape
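
The claim that the data is balanced can be checked by counting the rows per genre. The sketch below uses a tiny hypothetical DataFrame as a stand-in for tracks_df (only the genre_top column name comes from this notebook; the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for tracks_df; the real file is not loaded here.
demo_df = pd.DataFrame({
    "genre_top": ["Rock", "Hip-Hop", "Rock", "Hip-Hop"],
})

# Count observations per class; equal counts mean the classes are balanced.
class_counts = demo_df["genre_top"].value_counts()
is_balanced = class_counts.nunique() == 1

print(class_counts)
print("balanced:", is_balanced)
```

Running the same two lines on tracks_df['genre_top'] should show whether both genres have the same number of tracks.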


Data preparation

Before modeling, prepare the data.

Create the features $X$ and the labels $y$

In [ ]:
# your code goes here
X = None
y = None
In [ ]:
X = tracks_df.drop(['genre_top', 'genre_top_code'], axis=1)
y = tracks_df['genre_top_code']

Standardize the features

Use StandardScaler to standardize the features (X) before moving on to model creation.

In [ ]:
from sklearn.preprocessing import StandardScaler

# your code goes here
In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(X)
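
To see what standardization does, here is a minimal sketch on a small hypothetical feature matrix (the numbers are made up and are not taken from the tracks data). After scaling, every column should have mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix whose two columns are on very different scales.
raw = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

scaled = StandardScaler().fit_transform(raw)

# Each column is centered at 0 and has unit (population) standard deviation.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

The same check can be applied to the standardized X from the cell above.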


 Define an evaluation function

Create a get_cv_scores function that takes a scikit-learn model as its model parameter and returns that model's cross-validation (CV) scores.

Use 5-fold cross-validation, so the function should return 5 scores.

In [ ]:
from sklearn.model_selection import cross_val_score

def get_cv_scores(model):
    # your code goes here
    pass
In [ ]:
from sklearn.model_selection import cross_val_score

def get_cv_scores(model):
    return cross_val_score(model, X, y, cv=5)
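
One caveat: in this notebook the scaler is fit on all of X before cross-validation, so each validation fold has already influenced the scaling statistics. A common alternative is to wrap the scaler and the model in a pipeline, so the scaler is re-fit on the training part of every fold. The sketch below illustrates the idea on synthetic data from make_classification, which is only a stand-in for the tracks features; the names X_demo, y_demo and pipeline are assumptions made for this example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the track features and labels (not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=10)

# The pipeline fits the scaler on the training part of each fold only,
# then applies it to that fold's validation part before scoring.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# For a classifier, cv=5 gives stratified folds and accuracy scores by default.
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(pipeline_scores)
```

For a quick spot-check the difference is usually small, but the pipeline version gives a cleaner estimate of how each model would behave on unseen data.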


Spot-check algorithms

Create each of the following models and call the get_cv_scores function using each model to get its CV scores.

Save the resulting scores in the results_df to compare them at the end.

In [ ]:
results_df = pd.DataFrame()

K Nearest Neighbors

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# your code goes here
In [ ]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

results_df['KNN'] = get_cv_scores(model)

Decision Trees

In [ ]:
from sklearn.tree import DecisionTreeClassifier

# your code goes here
In [ ]:
from sklearn.tree import DecisionTreeClassifier

# fixed seed so the CV scores are reproducible between runs
model = DecisionTreeClassifier(random_state=10)

results_df['Decision Trees'] = get_cv_scores(model)

Support Vector Machines

In [ ]:
from sklearn import svm

# your code goes here
In [ ]:
from sklearn import svm

model = svm.SVC(gamma='auto',
                random_state=10)

results_df['SVM'] = get_cv_scores(model)

Naive Bayes Classifier

In [ ]:
from sklearn.naive_bayes import GaussianNB

# your code goes here
In [ ]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

results_df['Naive Bayes'] = get_cv_scores(model)

 Random Forest

In [ ]:
from sklearn.ensemble import RandomForestClassifier

# your code goes here
In [ ]:
from sklearn.ensemble import RandomForestClassifier

# fixed seed so the CV scores are reproducible between runs
model = RandomForestClassifier(n_estimators=100, random_state=10)

results_df['Random Forest'] = get_cv_scores(model)

Gradient Boosting Classifier

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

# your code goes here
In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=10)

results_df['GBC'] = get_cv_scores(model)

AdaBoost Classifier (Adaptive Boosting)

In [ ]:
from sklearn.ensemble import AdaBoostClassifier

# your code goes here
In [ ]:
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(random_state=10)

results_df['AdaBoost'] = get_cv_scores(model)


Present results

Show a boxplot per algorithm using the data you saved in results_df.

Which one performs the best? And the worst?

In [ ]:
# your code goes here
In [ ]:
results_df.boxplot(figsize=(14,6), grid=False)
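
Alongside the boxplot, a numeric summary can make the ranking easier to read. The following is a sketch that runs a reduced spot-check (three of the models above) on synthetic data from make_classification, then sorts the algorithms by mean CV score. The data and the names demo_models, demo_results and summary are assumptions for illustration, so the printed numbers say nothing about the tracks dataset:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the track features and labels (not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=10)

# A reduced set of the models spot-checked in this notebook.
demo_models = {
    "KNN": KNeighborsClassifier(),
    "Decision Trees": DecisionTreeClassifier(random_state=10),
    "Naive Bayes": GaussianNB(),
}

# One column of 5 CV scores per model, mirroring the layout of results_df.
demo_results = pd.DataFrame({
    name: cross_val_score(model, X_demo, y_demo, cv=5)
    for name, model in demo_models.items()
})

# Mean and standard deviation per algorithm, best mean first.
summary = demo_results.agg(["mean", "std"]).T.sort_values("mean",
                                                          ascending=False)
print(summary)
```

Applied to results_df, the same agg call gives one row per algorithm: the first row is the best performer on average and the last row the worst, while the std column shows how stable each one is across folds.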
