Our first Machine Learning project
In this lesson we will walk through a simple machine learning application and build our first model. Along the way, we will introduce some core machine learning concepts and terms.
Define the problem
Over the past few years, streaming services with huge catalogs have become the primary means through which most people listen to their favorite music. At the same time, the sheer amount of music on offer can leave users overwhelmed when they look for new music that suits their tastes.
For this reason, streaming services have looked into means of categorizing music to allow for personalized recommendations. One method involves direct analysis of the raw audio information in a given song, scoring the raw data on a variety of metrics.
In this lesson we'll be examining data compiled by a research group known as The Echo Nest.
Our goal is to look through this dataset and classify songs as being either 'Hip-Hop' or 'Rock', all without listening to a single one ourselves. In doing so, we will do some exploratory data visualization and prepare our data so that it can be fed to a simple machine learning algorithm.
We will introduce new concepts that will be covered in detail in other lessons. There is no need to understand every detail of this lesson; just focus on the key concepts.
Descriptive analysis
To begin with, let's load the tracks.csv dataset into a pandas DataFrame. It contains the tracks alongside the track metrics compiled by The Echo Nest.
Each row represents a song, while each column holds one property of that song.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
tracks = pd.read_csv('tracks.csv')
A song is about more than its title, artist, and number of listens. We will analyze musical features of each track, such as danceability and acousticness, on a scale from -1 to 1.
Let's look at the first few observations of the data:
tracks.head()
tracks.shape
We have measurements for 4802 different songs. In machine learning, each individual item is called a sample, and its properties are called features.
The shape of the data array is a pair: the number of samples (rows) by the number of features (columns).
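To make the shape convention concrete, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its values are assumptions for illustration only; they are not rows from tracks.csv.

```python
import pandas as pd

# Hypothetical toy table: 3 samples (rows) and 2 features (columns).
toy = pd.DataFrame({
    "danceability": [0.61, 0.35, 0.72],
    "acousticness": [0.12, 0.88, 0.05],
})

# .shape is a (rows, columns) tuple, so it unpacks into two counts.
n_samples, n_features = toy.shape
print(n_samples, n_features)
```

The same unpacking works on `tracks.shape`, where the first number is the count of songs.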
We should also get an idea of the data type of each attribute:
tracks.info()
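As a complement to `info()`, the sketch below shows one way to separate numeric columns from the rest. It runs on a made-up table: the `toy` DataFrame, its values, and the `track_id` column name are assumptions for illustration, while `genre_top` is the label column used later in this lesson.

```python
import pandas as pd

# Hypothetical toy table mixing numeric columns with a text label column.
toy = pd.DataFrame({
    "track_id": [1, 2, 3],
    "danceability": [0.61, 0.35, 0.72],
    "genre_top": ["Rock", "Hip-Hop", "Rock"],
})

# One data type per column.
print(toy.dtypes)

# Keep only the numeric columns, which is what most plots and models expect.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```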
Statistical Summary
We can take a look at a summary of each attribute.
This includes the mean, the min and max values, and some percentiles: the 25th, the 50th (the median) and the 75th, i.e. the values found at those positions if we sorted all the values of an attribute.
tracks.describe()
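To see how the percentiles in that summary relate to the sorted values, here is a minimal sketch on a handful of invented numbers (the `values` Series is an assumption for illustration, not data from tracks.csv).

```python
import pandas as pd

# Hypothetical feature values, already sorted to make the positions easy to see.
values = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5])

# describe() reports count, mean, std, min, the three percentiles and max.
summary = values.describe()
print(summary)

# The same percentiles can be requested directly with quantile().
q25, q50, q75 = values.quantile([0.25, 0.5, 0.75])
print(q25, q50, q75)
```

The 50th percentile is the median, so `q50` matches `values.median()`.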
Visualizing data and its relationships
Before diving into the creation of a machine learning model, it is often a good idea to inspect the data, to see if the task is easily solvable without machine learning, or if the desired information might not be contained in the data.
Now we can look at the interactions between the variables.
from pandas.plotting import scatter_matrix
ax = scatter_matrix(tracks[['acousticness', 'danceability', 'liveness', 'speechiness']],
figsize=(12,12))
We can repeat the plot above, this time coloring each observation by its genre ('Hip-Hop' or 'Rock'):
from pandas.plotting import scatter_matrix
colors = [0 if track == 'Hip-Hop' else 1 for track in tracks['genre_top']]
ax = scatter_matrix(tracks[['acousticness', 'danceability', 'liveness', 'speechiness']],
c=colors,
cmap=plt.cm.Spectral,
figsize=(12,12))
plt.legend([plt.plot([],[],color=plt.get_cmap('Spectral')(i/1.),
ls='', marker='o', markersize=10)[0] for i in range(2)],
['Hip-Hop', 'Rock'],
loc=(1.02, 3.8))
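A numeric check can back up what the colored scatter matrix suggests. The sketch below counts the samples in each genre and compares the average feature values between the two genres. It uses a made-up table: the `toy` DataFrame and its values are assumptions for illustration and say nothing about the real dataset, while the column names match those used in the plots above.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the tracks DataFrame.
toy = pd.DataFrame({
    "genre_top": ["Hip-Hop", "Hip-Hop", "Rock", "Rock"],
    "danceability": [0.8, 0.7, 0.4, 0.3],
    "speechiness": [0.30, 0.20, 0.05, 0.05],
})

# How many samples belong to each class?
class_counts = toy["genre_top"].value_counts()
print(class_counts)

# Average value of each feature within each genre.
genre_means = toy.groupby("genre_top")[["danceability", "speechiness"]].mean()
print(genre_means)
```

If the per-genre averages of a feature differ clearly, that feature is likely to help a classifier tell the two genres apart. Running the same two calls on `tracks` would show whether that holds for the real data.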