MLF - Our First Machine Learning Project

Last updated: April 28th, 2020



Our first Machine Learning project

In this lesson we will go through a simple machine learning application and create our first model. Along the way, we will introduce some core machine learning concepts and terms.


Define the problem

Over the past few years, streaming services with huge catalogs have become the primary means through which most people listen to their favorite music. At the same time, the sheer amount of music on offer can leave users feeling overwhelmed when trying to find new music that suits their tastes.

For this reason, streaming services have looked into means of categorizing music to allow for personalized recommendations. One method involves direct analysis of the raw audio information in a given song, scoring the raw data on a variety of metrics.

In this lesson we'll be examining data compiled by a research group known as The Echo Nest.

Our goal is to look through this dataset and classify songs as being either 'Hip-Hop' or 'Rock' - all without listening to a single one ourselves. In doing so, we will do some exploratory data visualization and prepare the data so it can be fed through a simple machine learning algorithm.

We will introduce new concepts that will be analyzed in detail in other lessons. There is no need to understand the whole lesson in detail; just focus on the key concepts.


Descriptive analysis

To begin with, let's load the tracks.csv dataset, which contains the tracks alongside the track metrics compiled by The Echo Nest, into a pandas DataFrame.

Each row represents a song, while each column represents one attribute (measurement) of that song.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset of tracks and their Echo Nest audio metrics.
tracks = pd.read_csv('tracks.csv')

A song is about more than its title, artist, and number of listens. We will analyze musical features of each track, such as danceability and acousticness, each scored on a scale from 0 to 1.

Let's take a look at the first few rows of the data:

In [2]:
tracks.head()
Out[2]:
track_id acousticness danceability energy instrumentalness liveness speechiness tempo valence genre_top
0 2 0.416675 0.675894 0.634476 0.010628 0.177647 0.159310 165.922 0.576661 Hip-Hop
1 3 0.374408 0.528643 0.817461 0.001851 0.105880 0.461818 126.957 0.269240 Hip-Hop
2 5 0.043567 0.745566 0.701470 0.000697 0.373143 0.124595 100.260 0.621661 Hip-Hop
3 134 0.452217 0.513238 0.560410 0.019443 0.096567 0.525519 114.290 0.894072 Hip-Hop
4 153 0.988306 0.255661 0.979774 0.973006 0.121342 0.051740 90.241 0.034018 Rock
In [3]:
tracks.shape
Out[3]:
(4802, 10)

We have measurements for 4802 different songs. In machine learning, each individual item is called a sample, and its properties are called features.

The shape of the data is (number of samples, number of columns): in this case, 4802 rows and 10 columns.
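
As a small sketch of what this means in practice (using the column names shown in the preview above), we could already separate the eight numeric audio features from the genre_top label we eventually want to predict:

# Feature matrix X: one row per sample, one column per audio feature.
feature_columns = ['acousticness', 'danceability', 'energy', 'instrumentalness',
                   'liveness', 'speechiness', 'tempo', 'valence']
X = tracks[feature_columns]

# Target vector y: one genre label per sample.
y = tracks['genre_top']

X.shape, y.shape   # (4802, 8) and (4802,)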

We should also get an idea of the data types of the attributes we have:

In [4]:
tracks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4802 entries, 0 to 4801
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_id          4802 non-null   int64  
 1   acousticness      4802 non-null   float64
 2   danceability      4802 non-null   float64
 3   energy            4802 non-null   float64
 4   instrumentalness  4802 non-null   float64
 5   liveness          4802 non-null   float64
 6   speechiness       4802 non-null   float64
 7   tempo             4802 non-null   float64
 8   valence           4802 non-null   float64
 9   genre_top         4802 non-null   object 
dtypes: float64(8), int64(1), object(1)
memory usage: 375.3+ KB
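
As a quick extra check at this point, we might also count how many songs of each genre the dataset contains; a strong class imbalance would matter when we train a model later:

# Number of samples per genre ('Hip-Hop' vs 'Rock').
tracks['genre_top'].value_counts()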

Statistical Summary

We can take a look at a summary of each attribute.

This includes the mean, the min and max values, as well as some percentiles (25th, 50th, also known as the median, and 75th), i.e. the values at these points if we ordered all the values of an attribute.
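
For example, a minimal sketch of what those percentiles mean for a single attribute (tempo, chosen here only for illustration), before looking at the full summary below:

# 25% of the songs have a tempo below the first value returned,
# 50% below the second (the median), and 75% below the third.
tracks['tempo'].quantile([0.25, 0.50, 0.75])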

In [5]:
tracks.describe()
Out[5]:
track_id acousticness danceability energy instrumentalness liveness speechiness tempo valence
count 4802.000000 4.802000e+03 4802.000000 4802.000000 4802.000000 4802.000000 4802.000000 4802.000000 4802.000000
mean 30164.871720 4.870600e-01 0.436556 0.625126 0.604096 0.187997 0.104877 126.687944 0.453413
std 28592.013796 3.681396e-01 0.183502 0.244051 0.376487 0.150562 0.145934 34.002473 0.266632
min 2.000000 9.491000e-07 0.051307 0.000279 0.000000 0.025297 0.023234 29.093000 0.014392
25% 7494.250000 8.351236e-02 0.296047 0.450757 0.164972 0.104052 0.036897 98.000750 0.224617
50% 20723.500000 5.156888e-01 0.419447 0.648374 0.808752 0.123080 0.049594 124.625500 0.446240
75% 44240.750000 8.555765e-01 0.565339 0.837016 0.915472 0.215151 0.088290 151.450000 0.666914
max 124722.000000 9.957965e-01 0.961871 0.999768 0.993134 0.971392 0.966177 250.059000 0.983649
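
Since our goal is to separate 'Hip-Hop' from 'Rock', a quick sketch we can also try is comparing the average value of each numeric feature per genre; features whose means differ noticeably between the two genres are likely to be informative later:

# Mean of each numeric column, split by genre (track_id is dropped
# because it is just an identifier, not an audio feature).
tracks.drop(columns='track_id').groupby('genre_top').mean()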


Visualizing data and its relationships

Before diving into the creation of a machine learning model, it is often a good idea to inspect the data, to see if the task is easily solvable without machine learning, or if the desired information might not be contained in the data.

Now we can look at the interactions between the variables.

In [6]:
from pandas.plotting import scatter_matrix

# Pairwise scatter plots for four of the audio features.
ax = scatter_matrix(tracks[['acousticness', 'danceability', 'liveness', 'speechiness']],
                    figsize=(12,12))
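
A more compact, numerical way to inspect the same pairwise relationships is a correlation matrix; here is a minimal sketch using the same four features:

# Pearson correlation between each pair of the four plotted features.
tracks[['acousticness', 'danceability', 'liveness', 'speechiness']].corr()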

We can repeat the plot above, but this time coloring the 'Hip-Hop' and 'Rock' observations differently:

In [7]:
from pandas.plotting import scatter_matrix

# Encode the genre as a number so it can be used as a color value:
# 0 for 'Hip-Hop', 1 for 'Rock'.
colors = [0 if track == 'Hip-Hop' else 1 for track in tracks['genre_top']]

ax = scatter_matrix(tracks[['acousticness', 'danceability', 'liveness', 'speechiness']],
                    c=colors,
                    cmap=plt.cm.Spectral,
                    figsize=(12,12))

# Build the legend manually: draw one empty marker per genre using the two
# extremes of the 'Spectral' colormap (0.0 and 1.0), matching the colors above.
plt.legend([plt.plot([], [], color=plt.get_cmap('Spectral')(i/1.),
                     ls='', marker='o', markersize=10)[0] for i in range(2)],
           ['Hip-Hop', 'Rock'],
           loc=(1.02, 3.8))
Out[7]:
<matplotlib.legend.Legend at 0x7f822e7c5730>
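
To tie this back to the question of whether the task is easily solvable without machine learning, here is a rough sketch of a single-feature baseline; the feature and the threshold are picked purely for illustration, not taken from the lesson:

# Crude baseline: call a song 'Hip-Hop' whenever its speechiness exceeds a
# hypothetical threshold, 'Rock' otherwise, and measure how often that is right.
threshold = 0.1   # illustrative value only
baseline_pred = np.where(tracks['speechiness'] > threshold, 'Hip-Hop', 'Rock')
(baseline_pred == tracks['genre_top']).mean()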