
Scikit-learn introduction

This lesson illustrates some of the main features that the scikit-learn library provides.


 What is scikit-learn?

Scikit-learn, or sklearn, is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

  • Simple and efficient tools for predictive data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license

Where did it come from?

Scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later, Matthieu Brucher joined the project and started to use it as part of his thesis work. In 2010, INRIA got involved, and the first public release (v0.1 beta) was published in late January 2010.


 What can I do with scikit-learn?

scikit-learn offers many important features for machine learning, such as classification, regression, and clustering algorithms (including support vector machines, random forests, gradient boosting, k-means, and DBSCAN), and is designed to interoperate with Python numerical and scientific libraries like NumPy and SciPy.

It provides a large list of tools and algorithms, divided into six categories (a representative import for each is sketched after the list):

  • Classification, to identify which category an object belongs to.
  • Regression, to predict a continuous-valued attribute associated with an object.
  • Clustering, to perform automatic grouping of similar objects into sets.
  • Dimensionality reduction, to reduce the number of random variables to consider.
  • Model selection, to compare, validate and choose parameters and models.
  • Preprocessing, to extract and normalize features.
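
As a quick illustration, here is one possible estimator or tool per category; these picks are arbitrary, and scikit-learn offers many alternatives in each:

from sklearn.svm import SVC                         # classification
from sklearn.linear_model import LinearRegression   # regression
from sklearn.cluster import KMeans                  # clustering
from sklearn.decomposition import PCA               # dimensionality reduction
from sklearn.model_selection import GridSearchCV    # model selection
from sklearn.preprocessing import StandardScaler    # preprocessing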


 Estimator Interface

All algorithms in scikit-learn, whether for preprocessing, supervised learning, or unsupervised learning, are implemented as classes.

These classes are called Estimators in scikit-learn. To apply an algorithm, you first have to instantiate an object of the particular class.

For example, let's see how we can perform a linear regression:

In [1]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model
Out[1]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

The estimator class contains the algorithm, and also stores the model that is learned from data using the algorithm.

You should set any parameters of the model when constructing the model object. These parameters include regularization strength, complexity control, the number of clusters to find, and so on.
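
A minimal sketch (the parameter values below are arbitrary, chosen only for illustration):

from sklearn.linear_model import Ridge
from sklearn.cluster import KMeans

# Regularization strength is fixed at construction time...
ridge = Ridge(alpha=0.5)

# ...and so is the number of clusters to find.
kmeans = KMeans(n_clusters=3)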


 Fit method

estimator.fit(X_train, [y_train])

All estimators have a fit method, which is used to build the model.

The fit method always requires as its first argument the data X, represented as a NumPy array or a SciPy sparse matrix with continuous (floating-point) entries, where each row represents a single data point.

Supervised algorithms also require a y argument: a one-dimensional NumPy array containing the target values (the known labels or responses) for regression or classification. As the brackets above suggest, unsupervised algorithms are fit on X alone.
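
A minimal sketch of the unsupervised case (KMeans is just one possible example, with toy data made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

X_unsup = np.array([[1.0], [1.1], [5.0], [5.2]])

# Unsupervised: fit takes the data only, no y argument.
kmeans = KMeans(n_clusters=2)
kmeans.fit(X_unsup)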

In [2]:
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100,
                       n_features=1,
                       noise=25)

X.shape, y.shape
Out[2]:
((100, 1), (100,))
In [3]:
X[0], y[0]
Out[3]:
(array([-1.61716347]), -151.39520459506852)
In [4]:
X[1], y[1]
Out[4]:
(array([0.29028644]), -0.7213009345535681)
In [5]:
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(14,6))

plt.scatter(X, y, color='black')
Out[5]:
<matplotlib.collections.PathCollection at 0x7f70a7e6c5e0>
In [6]:
model.fit(X, y)
Out[6]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

 Predict method

estimator.predict(X_test)

There are two main ways to apply a learned model in scikit-learn.

To create a prediction in the form of a new output like $y$, you use the predict method.

To create a new representation of the input data, you use the transform method.

We will cover the transform method in detail in upcoming lessons; a brief preview follows below.
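
As a quick preview, a minimal sketch using StandardScaler (one transformer among many), reusing the X generated above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)                    # learn each column's mean and standard deviation

X_scaled = scaler.transform(X)   # new representation: zero mean, unit variance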

In [7]:
y_pred = model.predict(X)

y_pred
Out[7]:
array([-124.30222251,   23.49389157,   87.448622  ,   37.9598034 ,
         84.76407932,   14.39316367,  -47.2993499 , -100.43274483,
        -75.01985418,  107.3650417 , -117.39496237,   -6.8525841 ,
         62.37559499,  -22.84183179,    9.57862592,   12.67465757,
         57.83220658, -102.59697027, -153.1154867 ,   24.04226766,
         28.1974725 ,   89.24846365,  -21.17538135,  -13.06641259,
         85.75124509,   -2.10622036,   19.05662316,   88.73288309,
         13.44865715,   -0.92037503, -115.67689674,  -49.37605784,
          2.62303945,  -69.27220774,   -3.72821624,  -37.24657019,
         93.0794118 ,  -42.17860679,   82.67925489,  -65.58794094,
         37.77842644,  138.34598659,    4.79427828,  -37.21171195,
       -198.96458907,  -53.54455543,   57.82043685, -143.11790603,
        -46.53490505,  -54.43135314,  152.45886583,   63.44158927,
        -51.86542567,   35.51213373, -139.03836804,   -4.63276832,
         40.5960386 ,  105.12961839,  -78.94291145,  -63.77235295,
        -78.13775808,   80.06498191,  -63.58363576,   19.73101283,
         -2.36598812,  -77.70151343,  -58.65186543,   26.43025775,
        -19.1102709 ,   72.43879097,   13.88420069,   -3.77813869,
        -99.40136123,   24.11972087,  -25.52054068,   20.49193469,
        -11.01544966,  -54.82231538,  -86.92262497,   93.02984362,
         59.70492452,   -8.66283064,   86.93310337,  -18.95831007,
         51.90274059,  154.95776401,  -95.92804778,  -90.96178283,
         67.55699472,  145.98523492,   46.18895552,   42.52497165,
        -13.56359046, -156.85927145,  -78.68504164,  129.94395328,
         18.57455349,   16.34228373,   62.36433032,   -8.50779037])
In [8]:
X[0], y[0], y_pred[0]
Out[8]:
(array([-1.61716347]), -151.39520459506852, -124.30222251036545)
In [9]:
X[1], y[1], y_pred[1]
Out[9]:
(array([0.29028644]), -0.7213009345535681, 23.493891565544452)
In [10]:
plt.figure(figsize=(14,6))

plt.scatter(X, y, color='black')

plt.plot(X, y_pred, color='red', linewidth=2)
Out[10]:
[<matplotlib.lines.Line2D at 0x7f70ac17be20>]

 Score method

estimator.score(X_test, y_test)

Additionally, all supervised models have a score method that evaluates the model on given data. For regressors such as LinearRegression, the default score is the coefficient of determination, $R^2$.

In [11]:
model.score(X, y)
Out[11]:
0.8971245611086263
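
Equivalently, a quick sanity check using r2_score from sklearn.metrics:

from sklearn.metrics import r2_score

# For a regressor, score() is the R^2 of its predictions:
r2_score(y, y_pred)   # same value as model.score(X, y)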


 Choosing the right estimator

Often the hardest part of solving a machine learning problem is finding the right estimator for the job.

Different estimators are better suited for different types of data and different problems.

The scikit-learn "Choosing the right estimator" flowchart, available in the scikit-learn documentation, gives a rough guide on which estimators to try on your data.
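
In practice, comparing a few candidate estimators with cross-validation is a common starting point. A minimal sketch (the candidates and cv=5 are arbitrary illustrative choices), reusing X and y from above:

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Fit and score each candidate on 5 cross-validation splits,
# then report its mean score.
for candidate in [LinearRegression(), Ridge(alpha=1.0)]:
    scores = cross_val_score(candidate, X, y, cv=5)
    print(candidate.__class__.__name__, scores.mean())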
