Our first Machine Learning project
In this lesson we will walk through a simple machine learning application and build our first model. Along the way, we will introduce some core machine learning concepts and terms.
Define the problem
Over the past few years, streaming services with huge catalogs have become the primary means through which most people listen to their favorite music. At the same time, the sheer amount of music on offer can leave users overwhelmed when they look for new music that suits their tastes.
For this reason, streaming services have looked into means of categorizing music to allow for personalized recommendations. One method involves direct analysis of the raw audio information in a given song, scoring the raw data on a variety of metrics.
In this lesson we'll be examining data compiled by a research group known as The Echo Nest.
Our goal is to look through this dataset and classify songs as being either 'Hip-Hop' or 'Rock' - all without listening to a single one ourselves. In doing so, we will do some exploratory data visualization and prepare the data so that it can be fed to a simple machine learning algorithm.
We will introduce new concepts that will be covered in detail in other lessons. There is no need to understand every detail of this lesson; just focus on the key concepts.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tracks = pd.read_csv('tracks.csv')
A song is about more than its title, artist, and number of listens. We will analyze musical features of each track such as danceability and acousticness on a scale from -1 to 1.
Let's look at the first few observations in the data:
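A preview like this is typically obtained with `DataFrame.head()`. The sketch below is an assumption about how the preview might be produced. It runs on a tiny made-up frame (`toy_tracks`, a hypothetical stand-in) rather than on `tracks.csv`, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for `tracks`: the column names follow the lesson's
# dataset, but every value here is made up for illustration.
toy_tracks = pd.DataFrame({
    "track_id": [1, 2, 3, 4, 5, 6],
    "danceability": [0.61, 0.35, 0.72, 0.28, 0.66, 0.40],
    "acousticness": [0.12, 0.80, 0.05, 0.91, 0.20, 0.55],
    "genre_top": ["Hip-Hop", "Rock", "Hip-Hop", "Rock", "Hip-Hop", "Rock"],
})

# head(n) returns the first n rows; without an argument it returns 5.
preview = toy_tracks.head(3)
print(preview)
```

On the real data, `tracks.head()` would play the same role.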
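We have measurements for 4802 different songs. In machine learning, each individual item is called a sample, and its properties are called features.
The shape of the data array is the number of samples by the number of features, i.e. (n_samples, n_features). Here the DataFrame has 4802 rows and 10 columns: `track_id`, eight numeric audio features, and the `genre_top` label we want to predict.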
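As a minimal sketch of reading the shape, the example below uses a small made-up frame. It assumes that `track_id` is only an identifier and that `genre_top` is the label, so both are dropped before counting features; the name `toy_tracks` and all values are hypothetical.

```python
import pandas as pd

# Made-up values; only the layout (one row per song) mirrors the lesson's data.
toy_tracks = pd.DataFrame({
    "track_id": [10, 11, 12, 13],
    "energy": [0.9, 0.4, 0.7, 0.3],
    "tempo": [95.0, 120.5, 88.2, 140.0],
    "genre_top": ["Hip-Hop", "Rock", "Hip-Hop", "Rock"],
})

# .shape is a (rows, columns) tuple, not a product of the two numbers.
n_rows, n_cols = toy_tracks.shape

# Assumption: track_id is an identifier and genre_top is the label,
# so neither counts as an input feature.
features = toy_tracks.drop(columns=["track_id", "genre_top"])
n_samples, n_features = features.shape

print(toy_tracks.shape, features.shape)
```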
We should also get an idea of the types of the attributes we have:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4802 entries, 0 to 4801
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   track_id          4802 non-null   int64
 1   acousticness      4802 non-null   float64
 2   danceability      4802 non-null   float64
 3   energy            4802 non-null   float64
 4   instrumentalness  4802 non-null   float64
 5   liveness          4802 non-null   float64
 6   speechiness       4802 non-null   float64
 7   tempo             4802 non-null   float64
 8   valence           4802 non-null   float64
 9   genre_top         4802 non-null   object
dtypes: float64(8), int64(1), object(1)
memory usage: 375.3+ KB
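A summary in this format is what `DataFrame.info()` prints. Once the types are known, one possible next step is to separate numeric columns from text columns. The sketch below does this with `select_dtypes` on a made-up frame; `toy_tracks` and its values are hypothetical.

```python
import pandas as pd

# Made-up values; the column names are borrowed from the lesson's dataset.
toy_tracks = pd.DataFrame({
    "track_id": [1, 2, 3],
    "liveness": [0.11, 0.35, 0.09],
    "genre_top": ["Rock", "Hip-Hop", "Rock"],
})

# dtypes lists one type per column; text columns appear as object
# (or as a dedicated string dtype in newer pandas versions).
print(toy_tracks.dtypes)

# Split the column names by kind of data they hold.
numeric_cols = toy_tracks.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy_tracks.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, non_numeric_cols)
```

Note that `track_id` counts as numeric here even though it is an identifier, so a dtype check alone does not tell us which columns are meaningful features.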
We can take a look at a summary of each attribute.
This includes the mean, the minimum and maximum values, and some percentiles (the 25th, the 50th or median, and the 75th), i.e. the values found at those positions if we sorted all the values of an attribute.
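To make the percentiles concrete, here is a small sketch, assuming the summary comes from `describe()`. It uses a made-up series of five values and checks that the 50% row matches the median.

```python
import numpy as np
import pandas as pd

# Five made-up, already sorted values standing in for one attribute.
valence = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5], name="valence")

# describe() reports count, mean, std, min, 25%, 50%, 75% and max.
summary = valence.describe()
print(summary)

# The same percentiles computed directly with NumPy.
q25, q50, q75 = np.percentile(valence, [25, 50, 75])

# The 50th percentile is the median.
median_matches = np.isclose(summary["50%"], valence.median())
```

With five sorted values, the 25th, 50th and 75th percentiles fall exactly on the second, third and fourth values.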
Visualizing data and its relationships
Before diving into the creation of a machine learning model, it is often a good idea to inspect the data, to see if the task is easily solvable without machine learning, or if the desired information might not be contained in the data.
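One quick, non-visual way to inspect the data is to count the samples per class and compare per-class feature averages. The sketch below is only an illustration on made-up values (`toy_tracks` is hypothetical). A large gap between the class means hints that a feature carries useful information, but it does not prove that the classes are separable.

```python
import pandas as pd

# Made-up values chosen so the two genres differ clearly.
toy_tracks = pd.DataFrame({
    "speechiness": [0.30, 0.05, 0.25, 0.04, 0.35, 0.06],
    "danceability": [0.70, 0.40, 0.65, 0.35, 0.75, 0.45],
    "genre_top": ["Hip-Hop", "Rock", "Hip-Hop", "Rock", "Hip-Hop", "Rock"],
})

# How many samples of each class do we have?
class_counts = toy_tracks["genre_top"].value_counts()

# Average of every numeric feature within each class.
genre_means = toy_tracks.groupby("genre_top").mean(numeric_only=True)

print(class_counts)
print(genre_means)
```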
Now we can look at the interactions between the variables.
from pandas.plotting import scatter_matrix

ax = scatter_matrix(tracks[['acousticness', 'danceability', 'liveness', 'speechiness']], figsize=(12,12))
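A numeric complement to the scatter matrix is the table of pairwise correlations. The sketch below uses `DataFrame.corr()` on made-up values that were built to be perfectly linear, so the coefficients come out at the extremes; real features would land somewhere in between.

```python
import pandas as pd

# Made-up values constructed to be exactly linear in each other.
toy = pd.DataFrame({
    "acousticness": [0.1, 0.2, 0.3, 0.4],
    "energy": [0.8, 0.6, 0.4, 0.2],        # falls as acousticness rises
    "danceability": [0.2, 0.4, 0.6, 0.8],  # rises with acousticness
})

# Pearson correlation between every pair of columns, in [-1, 1].
corr = toy.corr()
print(corr)
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.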
We can repeat the plot above, this time using a different color for the 'Hip-Hop' and 'Rock' observations:
from pandas.plotting import scatter_matrix

# Encode the genre as 0 (Hip-Hop) or 1 (Rock) so it can drive the colormap
colors = [0 if genre == 'Hip-Hop' else 1 for genre in tracks['genre_top']]
ax = scatter_matrix(tracks[['acousticness', 'danceability', 'liveness', 'speechiness']], c=colors, cmap=plt.cm.Spectral, figsize=(12,12))

# Empty proxy lines, one per genre, serve as legend handles
handles = [plt.plot([], [], color=plt.get_cmap('Spectral')(i/1.), ls='', marker='o', markersize=10)[0] for i in range(2)]
plt.legend(handles, ['Hip-Hop', 'Rock'], loc=(1.02, 3.8))