# MLF - What Is Machine Learning?

Last updated: April 9th, 2020

# What is Machine Learning?

What is Machine Learning? The only thing we know for sure is that Machine Learning is a broad term whose meaning depends on the organization or company and the people involved.

## Data Science ecosystem

The diagram below is a good representation of what Machine Learning and Data Science involve. It was created by Drew Conway to explain the "data science" term, as he thinks it is a bit of a misnomer.

The difficulty in defining these skills is that the difference between substance and methodology is ambiguous, and as such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and where data science fits.

What is clear, however, is that you should acquire multiple skills if you aim to become a fully competent data scientist. Simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add his own thoughts to what is already a crowded market of ideas, Conway presents the Data Science Venn Diagram:

Source

The SAS Institute also gives its own definition of Data Science with this diagram:

Although, this one might be a little bit more accurate:

As a "Data Scientist" you'll be combining your domain knowledge (the "science") with large amounts of data, using programming and computers. Your work will range from highly technical tasks like Machine Learning to boring and repetitive operational tasks like scraping a website for data and importing it into a database.

## What does a Data Scientist do?

It's probably easier to define the tasks that are most common to Data Scientists than to give a general definition that applies to everybody.

Source from recommended article

### Getting the data

Depending on your company/organization, getting the data can be as simple as a SQL query or as difficult as scraping entire websites. The problem with this task is that it's not standardized.
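As a minimal sketch of the "simple SQL query" end of that spectrum, the snippet below uses Python's built-in `sqlite3` module with an in-memory database. The `orders` table, its columns, and its rows are invented for illustration; a real project would connect to whatever database the organization actually uses.

```python
import sqlite3

# In-memory database so the example is self-contained (an assumption for the demo).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
connection.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("ana", 25.0), ("luis", 40.0), ("ana", 12.5)],
)

# "Getting the data" is then a single query: total spent per customer.
rows = connection.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
connection.close()

print(rows)
```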

### Parsing and Cleaning the Data

Depending on your sources, you'll need to do a little bit of preparation: excluding outliers, filling null values, translating values, etc.
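Two of those preparation steps, filling null values and excluding outliers, can be sketched in plain Python. The function name `clean_ages`, the sample values, and the plausible range of 0 to 120 are assumptions made for this example, not a recommended rule.

```python
from statistics import median


def clean_ages(raw_ages, low=0, high=120):
    """Fill missing values with the median, then drop implausible outliers."""
    known = [age for age in raw_ages if age is not None]
    fill_value = median(known)  # the median is robust to a few extreme values
    filled = [fill_value if age is None else age for age in raw_ages]
    return [age for age in filled if low <= age <= high]


raw = [34, None, 29, 410, 41, None, 38]  # invented data: two nulls, one outlier
cleaned = clean_ages(raw)
print(cleaned)
```

Whether to fill with the median, drop the row, or flag it for review depends on the data set; the median is only one common choice.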

### Merging, combining data

If the data comes from different sources, merging it can be tedious, especially if it's hard to identify the piece of information that relates the different data sources.
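The sketch below shows the idea with two small, invented sources that share a `customer_id` field. The helper `inner_join` is hypothetical and only keeps rows whose key appears in both sources.

```python
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Luis"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 12.5},
    {"order_id": 12, "customer_id": 3, "total": 40.0},  # no matching customer
]


def inner_join(left, right, key):
    """Combine rows from both sources that share the same value for `key`."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in right:
        for match in index.get(row[key], []):
            merged.append({**match, **row})
    return merged


merged = inner_join(customers, orders, "customer_id")
print(merged)
```

Order 12 is silently dropped because customer 3 is unknown, which is exactly the kind of mismatch that makes merging tedious in practice.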

### Doing the analysis

This involves your own domain expertise plus the tools available for the job. For example, you need to know the principles of statistics, and you can also use statsmodels to simplify your job. The analysis part is usually iterative and involves other subtasks such as visualizations, validation testing, etc.

### Building models

The whole point of the analysis part is finding patterns in particular cases to build general models. Your models can be predictions, clusterings, or just automated reports. In a general sense, they are the result of all the previous phases.

### Deploying it

Perfect analyses and models are useless if they're not repeatable and scalable. This phase depends on your own models, but it will usually involve a cloud provider. The result can range from simple reports (emailed every day at 3 AM from an AWS Lambda) to a Machine Learning model (built on top of TensorFlow and deployed on Google's cloud infrastructure).

## The only thing that's certain

We're sorry we can't give you better answers to the question "What is Data Science?", but this is a new discipline and we're putting together definitions, scopes and responsibilities of different actors. But there's one thing that's certain: You need to know how to code...

Source

## Why Machine Learning?

In the early days of "intelligent" applications, many systems used handcoded rules of "if" and "else" decisions to process data or adjust to user input.

Think of a spam filter whose job is to move the appropriate incoming email messages to a spam folder. You could make up a blacklist of words that would result in an email being marked as spam. This would be an example of using an expert-designed rule system to design an "intelligent" application.
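A hand-coded rule of this kind might look like the following sketch. The words in `BLACKLIST` and the function name `is_spam` are invented for illustration.

```python
BLACKLIST = {"lottery", "winner", "viagra"}  # hypothetical expert-chosen words


def is_spam(message, blacklist=BLACKLIST):
    """Flag a message as spam if it contains any blacklisted word."""
    words = {word.strip(".,!?:;").lower() for word in message.split()}
    return not words.isdisjoint(blacklist)


print(is_spam("You are a WINNER!"))   # contains a blacklisted word
print(is_spam("Lunch at noon?"))      # contains none
print(is_spam("You are a w1nner!"))   # a trivial misspelling slips through
```

The last call shows how brittle such rules are: every new trick used by spammers requires someone to edit the list by hand.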

Manually crafting decision rules is feasible for some applications, particularly those in which humans have a good understanding of the process to model. However, using handcoded rules to make decisions has two major disadvantages:

• The logic required to make a decision is specific to a single domain and task. Changing the task even slightly might require a rewrite of the whole system.
• Designing rules requires a deep understanding of how a decision should be made by a human expert.

One example of where this handcoded approach will fail is in detecting faces in images. Today, every smartphone can detect a face in an image. However, face detection was an unsolved problem until as recently as 2001. The main problem is that the way in which pixels (which make up an image in a computer) are "perceived" by the computer is very different from how humans perceive a face. This difference in representation makes it basically impossible for a human to come up with a good set of rules to describe what constitutes a face in a digital image.

Using machine learning, however, simply presenting a program with a large collection of images of faces is enough for an algorithm to determine what characteristics are needed to identify a face.

## What is not Machine Learning?

Machine Learning, Data Science, Statistical Inference, and Artificial Intelligence are terms that are often mixed up, or even used as if they were synonyms, when they really aren't.

However, most definitions of Machine Learning coincide on certain key aspects:

• It focuses on the use / creation of algorithms.
• It applies to data sets in which we want to find "something".
• The ultimate goal is (usually) to carry out a prediction.
• There is no need to explicitly program rules that take us directly to the final results.

### Statistical inference

It is a field that comes from mathematics, based on the assumption that a data set has been "generated" by a certain probability distribution.

Statistical inference seeks to model the origin of the data so that it can infer behaviors in other data sets that have not been studied.

### Data Science

As we saw before, Data Science seeks to generate insights from a data set. The focus is on understanding the content of the data.

👉 Conclusions and decisions are delegated to a human agent.

### Artificial Intelligence

It seeks to generate actions from a dataset. The focus is on the benefit generated by the actions.

👉 Conclusions and decisions are automated.

### Machine Learning

It is a field that comes from computing, based on the search for relationships within the data under study without assuming anything about them.

Machine Learning seeks to model these relationships so that the model can generalize to other data that has not been studied.

It also seeks to generate predictions from a dataset. The focus is on the accuracy of the predictions made.

👉 Conclusions are automated, decisions are delegated to a human agent.

## Different things, yes... but complementary

Let's see an example. Suppose we are working on creating an autonomous car, specifically to get it to stop at a stop sign.

Machine learning: we need a model / algorithm that can recognize, from an image, whether there is a stop sign in it. This means that we need to predict whether the content of an image is a stop sign or not.

Artificial intelligence: once we are able to recognize a stop sign, it will be necessary to decide whether to stop, when to do it, how to do it, etc.

Data science: before building our model / algorithm, we will have to understand the data we will use to train it. Once it is trained, we will have to be able to evaluate the results obtained and detect any anomaly in the data that may affect model performance.

## What is Machine Learning?

There is no single definition of what Machine Learning is.

In many ways, Machine Learning is the primary means by which data science manifests itself to the broader world. Machine Learning is where the computational and algorithmic skills of data science meet statistical thinking, and the result is a collection of approaches to inference and data exploration.

In summary, and in very basic terms, any Machine Learning process consists of the following steps:

1. Take a dataset that serves as an example of a particular situation about which we want to draw conclusions or make predictions in the future.
2. Provide this data to an algorithm that, internally and automatically, will detect and learn patterns in it. That is, "train" an algorithm with data on a given situation.
3. Use the trained algorithm to obtain conclusions/predictions on new data, different from the data used for training.
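The three steps can be sketched in plain Python with a deliberately simple nearest-centroid classifier. This is a toy technique chosen for brevity, not a method prescribed by this lesson, and the data set (length in cm, weight in kg) and the function names `train` and `predict` are invented for illustration.

```python
from statistics import mean

# Step 1: a dataset of examples, each a (features, label) pair.
dataset = [
    ((20, 4), "cat"),
    ((25, 5), "cat"),
    ((60, 30), "dog"),
    ((70, 35), "dog"),
]


def train(examples):
    """Step 2: learn one centroid (average point) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(model, features):
    """Step 3: return the label whose centroid is closest to the new point."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))

    return min(model, key=lambda label: squared_distance(model[label]))


model = train(dataset)
print(predict(model, (23, 4)))   # a point not seen during training
print(predict(model, (65, 28)))
```

No rule such as "lighter than 10 kg means cat" was written by hand; the boundary between the two labels comes entirely from the training examples.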

Depending on the content of the data provided and the conclusions/predictions that we want to obtain, there will be different types of machine learning that we'll analyze later.

### More definitions

"Machine learning is a type of artificial intelligence that allows software applications to become more accurate in predicting outcomes without being explicitly programmed". WhatIs.com

"Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed". Wikipedia

"Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem". Medium

"Machine Learning at its most basic is the practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world". Nvidia

## Machine Learning Workflow

Previously, we discussed the Data Science workflow in the "What does a Data Scientist do?" section. In this section we will focus on the Machine Learning workflow.

### Getting the data

Depending on your company/organization, getting the data can be as simple as a SQL query or as difficult as scraping entire websites. The problem with this task is that it's not standardized.

### Data pre-processing

Data pre-processing is one of the most important steps in machine learning, because it helps in building more accurate models. It is the process of cleaning raw data: data collected in the real world is converted into a clean data set. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and it is likely to contain many errors. Data pre-processing addresses such issues through the following steps:

• Parsing and Cleaning the Data
• Merging, combining data
• Data Reduction

### Doing the analysis

This involves your own domain expertise plus the tools available for the job. For example, you need to know the principles of statistics, and you can also use statsmodels to simplify your job. The analysis part is usually iterative and involves other subtasks such as visualizations, validation testing, etc.

#### Exploratory Data Analysis

Exploratory Data Analysis (EDA) (Tukey, 1977) is used to understand the major characteristics of the predictors and outcome so that any particular challenges associated with the data can be discovered prior to modeling.

‘Understanding the dataset’ can refer to a number of things including but not limited to…

• Extracting important variables and leaving behind useless variables
• Identifying outliers, missing values, or human error
• Understanding the relationship(s), or lack of, between variables
• Ultimately, maximizing your insights into a dataset and minimizing potential error later in the process
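A first pass over those points could look like the sketch below, which uses Pandas on a small invented housing data set. The column names, the values, and the 1.5 × IQR rule for flagging outliers are assumptions made for this example.

```python
import pandas as pd

# Invented data: one missing area and one suspiciously large area.
df = pd.DataFrame({
    "rooms": [2, 3, 3, 4, 5, 4],
    "area_m2": [50.0, 72.0, None, 95.0, 130.0, 400.0],
    "price_k": [110, 150, 155, 210, 290, 230],
})

# Missing values per column.
missing_per_column = df.isna().sum()

# Summary statistics (count, mean, std, quartiles, ...) for each column.
summary = df.describe()

# Flag outliers in one column with the common 1.5 * IQR rule of thumb.
q1, q3 = df["area_m2"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["area_m2"] < q1 - 1.5 * iqr) | (df["area_m2"] > q3 + 1.5 * iqr)]

# Pairwise correlations hint at relationships (or the lack of them).
correlations = df.corr()

print(missing_per_column, summary, outliers, correlations, sep="\n\n")
```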

### Building models

The whole point of the analysis part is finding patterns in particular cases to build general models. This phase includes training, evaluating, and tuning the model:

• Training a Machine Learning (ML) model involves providing an ML algorithm (that is, the learning algorithm) with training data to learn from. The term ML model refers to the model artifact that is created by the training process.
• Model evaluation is an integral part of the model development process. It helps to find the model that best represents our data and to estimate how well the chosen model will work in the future.
• Tuning is the process of maximizing a model's performance.
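The three activities above can be sketched with scikit-learn. The choice of the bundled Iris data set, a k-nearest-neighbors classifier, and the candidate values for `n_neighbors` are assumptions made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set so evaluation uses data the model has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Training: fit the learning algorithm on the training data.
baseline = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Evaluation: accuracy on the held-out test set.
baseline_accuracy = baseline.score(X_test, y_test)

# Tuning: try several hyperparameter values with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5
)
search.fit(X_train, y_train)
tuned_accuracy = search.score(X_test, y_test)

print(baseline_accuracy, search.best_params_, tuned_accuracy)
```

The search only ever sees the training split; the test split is kept aside so the final score remains an honest estimate.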

### Deploying it

Perfect analyses and models are useless if they're not repeatable and scalable. This phase depends on your own models, but it will usually involve a cloud provider. The result can range from simple reports (emailed every day at 3 AM from an AWS Lambda) to a Machine Learning model (built on top of TensorFlow and deployed on Google's cloud infrastructure).

### Monitoring the predictions on an ongoing basis

Once a model has made it into production, it must be monitored to ensure that everything is working properly. Monitoring a machine learning model requires attention from many different perspectives to ensure that each aspect of the model is running accurately and efficiently.
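One of those perspectives, checking whether accuracy in production has dropped below what was measured before deployment, can be sketched as follows. The function name `needs_attention`, the tolerance of 0.05, and the sample outcomes are hypothetical; real monitoring usually also tracks input drift, latency, and errors.

```python
def needs_attention(recent_outcomes, baseline_accuracy, tolerance=0.05):
    """Return True if recent accuracy fell below the baseline minus a tolerance.

    recent_outcomes is a list of (predicted_label, actual_label) pairs
    collected once the true labels become known.
    """
    if not recent_outcomes:
        return False  # nothing to judge yet
    correct = sum(1 for predicted, actual in recent_outcomes if predicted == actual)
    recent_accuracy = correct / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance


healthy = [("spam", "spam")] * 9 + [("spam", "ham")]    # 90% correct
degraded = [("spam", "spam")] * 7 + [("spam", "ham")] * 3  # 70% correct

print(needs_attention(healthy, baseline_accuracy=0.92))
print(needs_attention(degraded, baseline_accuracy=0.92))
```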

## Introducing Problems that Machine Learning can solve

The most successful machine learning algorithms are those that automate decision-making processes by generalizing from known examples.

The classic example is spam classification, where the user provides a large number of emails, together with information about whether each of these emails is spam. Given a new email, the algorithm predicts whether or not it is spam. This approach is known as Supervised learning (we will cover this later). Other examples are:

• Identifying the zip code from handwritten digits on an envelope
• Determining whether a tumor is benign based on a medical image
• Detecting fraudulent activity in credit card transactions
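The spam example can be sketched with scikit-learn as a supervised learner trained on labeled messages. The six emails, their labels, and the choice of a bag-of-words representation with a naive Bayes classifier are assumptions made for this example; a real filter would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Known examples: inputs (emails) together with the desired outputs (labels).
emails = [
    "win a free prize now",
    "free lottery winner claim prize",
    "claim your free money",
    "meeting agenda for monday",
    "lunch with the project team",
    "project report attached for review",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Turn each email into word counts, then fit a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Predict labels for emails that were not part of the training data.
predictions = model.predict(
    ["free prize waiting, claim now", "monday project meeting"]
)
print(list(predictions))
```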

There is another type of algorithm that we will cover in this course, known as Unsupervised learning, where only the input is known. For example:

• Identifying topics in a set of blog posts
• Segmenting customers into groups with similar preferences
• Detecting abnormal access patterns to a website
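Customer segmentation, for instance, can be sketched with k-means clustering from scikit-learn. The features (visits per month, average basket value), the six customers, and the choice of two clusters are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Only inputs are known: there is no label saying which group a customer is in.
customers = np.array([
    [2, 15.0],
    [3, 18.0],
    [1, 12.0],
    [20, 90.0],
    [22, 95.0],
    [19, 85.0],
])

# Ask for two groups; the algorithm decides which customers belong together.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print(segments)
```

The cluster numbers themselves are arbitrary; what matters is which customers end up in the same group.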

However, it is important to remark that machine learning is not necessarily accurate in every situation. In some cases the use of machine learning is simply unnecessary, and in others its implementation can get you into difficulties.

## Understanding the dataset and the problem

One of the most important parts in the machine learning process is understanding the data you are working with and how it relates to the task you want to solve. If you understand the problem clearly, you should be able to list some potential solutions to test in order to generate the best model.

Understand that you will likely have to try out a few solutions before you land on a good working model.

Key questions to keep in mind:

• What question(s) am I trying to answer?
• Do I think the data collected can answer that question?
• Have I collected enough data to represent the problem I want to solve?
• What features of the data should I extract?
• How will I measure the performance in my application?

## Essential libraries and tools

Today, Python is one of the most popular programming languages for this task, and it has replaced many other languages in the industry. One of the reasons is its vast collection of libraries.

Python has libraries for data loading, visualization, statistics, natural language processing, image processing, and more. The Python libraries most commonly used in Machine Learning are:

• NumPy
• Pandas
• Matplotlib
• Seaborn
• SciPy
• Scikit-learn

NumPy is a very popular Python library for processing large multi-dimensional arrays and matrices, and it provides a large collection of high-level mathematical functions to operate on them. It is one of the fundamental packages for scientific computing in Python and is very useful for fundamental scientific computations in Machine Learning. It is particularly useful for linear algebra, Fourier transforms, and random number generation.
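A brief sketch of the array, linear algebra, and random number features mentioned above; the matrix and vector values are arbitrary.

```python
import numpy as np

matrix = np.array([[2.0, 1.0],
                   [1.0, 3.0]])
vector = np.array([3.0, 5.0])

# Linear algebra: solve the system matrix @ x = vector.
solution = np.linalg.solve(matrix, vector)

# Vectorized math over a whole axis, with no explicit Python loop.
column_means = matrix.mean(axis=0)

# Random numbers from a seeded generator, so runs are reproducible.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

print(solution, column_means, samples.shape)
```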

Pandas is a popular Python library for data analysis. As we know, a dataset should be prepared before it is used to train a model. This is where Pandas comes in handy, as it was developed specifically for data extraction and preparation. It provides high-level data structures and a wide variety of tools for data analysis, including many built-in methods for grouping, combining, and filtering data.
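Grouping, combining, and filtering can each be shown in a line or two. The `sales` and `stores` tables below are invented for illustration.

```python
import pandas as pd

sales = pd.DataFrame({"store": ["A", "A", "B", "B"], "units": [10, 5, 7, 12]})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Lima", "Quito"]})

# Grouping: total units sold per store.
units_per_store = sales.groupby("store")["units"].sum()

# Combining: attach each store's city to every sale.
combined = sales.merge(stores, on="store", how="left")

# Filtering: keep only the sales of at least 10 units.
big_sales = combined[combined["units"] >= 10]

print(units_per_store, big_sales, sep="\n\n")
```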

Matplotlib is a Python library for data visualization. It is useful for visualizing patterns in the data, as it is a plotting library used mainly for creating 2D graphs and plots. A module named pyplot makes plotting easy, as it provides features to control line styles, font properties, axis formatting, etc.
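A minimal pyplot sketch follows. It selects the non-interactive `Agg` backend and writes the figure to an in-memory buffer so that it runs without a display; both choices, like the plotted data, are assumptions made for this example. In a notebook you would typically just call `plt.show()`.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

fig, ax = plt.subplots()
ax.plot(x, y, linestyle="--", marker="o", label="y = x^2")  # line style control
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render the figure as PNG bytes instead of showing it on screen.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```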

Seaborn provides an API on top of Matplotlib that offers several choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas DataFrames.

Scikit-learn is one of the most popular Machine Learning libraries. It is an open source project that is constantly being developed and improved. It is built on top of two basic Python libraries, NumPy and SciPy. Scikit-learn supports most of the classical supervised and unsupervised learning algorithms. It can also be used for data mining and data analysis, which makes it a great tool for anyone starting out with machine learning. It is widely used in industry and academia.

SciPy is a very popular library among Machine Learning enthusiasts, as it contains modules for mathematical functions, optimization, advanced linear algebra, signal processing, integration, and statistics.
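Two of those modules, optimization and integration, are sketched below. The functions being minimized and integrated are arbitrary examples.

```python
from scipy import integrate, optimize

# Optimization: find the x that minimizes (x - 2)^2 + 1.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# Integration: numerically integrate x^2 from 0 to 3 (exact value is 9).
area, estimated_error = integrate.quad(lambda x: x ** 2, 0.0, 3.0)

print(result.x, area)
```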