MLF - Machine Learning Methodologies

Last updated: April 9th, 2020

Machine Learning methodologies

A machine learning project may not be linear, but it has a number of well-known steps, so it is clearly necessary to have explicitly designed methodologies for working through these kinds of projects.

We'll explore different methodologies that come from software development and have been adapted to the specific elements of machine learning.

We can use these methodologies as templates for our projects, across platforms and programming languages.


Knowledge Discovery in Databases (KDD)

KDD is a methodology organized as a cycle of 5 distinct phases:

  1. Selection: creation of the dataset that will serve as the basis for building models.
  2. Preprocessing: cleaning the dataset to obtain data that are consistent and aligned with the objectives.
  3. Transformation: preparation/modification of the dataset to guide it towards specific objectives.
  4. Data mining: search for patterns in the data that allow us to build models, usually predictive.
  5. Interpretation: interpretation and evaluation of the obtained results.

Main Features

  • It is an iterative methodology that allows the establishment of short-term goals and their gradual refinement toward the final achievement of the overall objective.
  • It offers the possibility of going from the final state of each iteration back to ANY of the previous steps, without requiring complete cycles to be run.
  • It only allows iterating from the final state of the process: although we can return to any previous step, it then forces us to continue the 'waterfall' from that point.
  • The definition of the objective being pursued is not included in the process.
  • The production/operation of the systems we are looking to build is not included in the process either.


Sample, Explore, Modify, Model, Assess (SEMMA)

SEMMA is a work methodology designed by SAS, organized as a cycle of 5 work phases:

  1. Sample: obtaining a sample of a dataset large enough to be representative and small enough to allow working in an agile way.
  2. Explore: data exploration to detect anomalies and trends so we can reach conclusions and ideas that facilitate the following phases.
  3. Modify: preparation/modification of the dataset to guide it towards specific objectives.
  4. Model: creation of models that allow reaching the objective set.
  5. Assess: evaluation of the results obtained by the models so we can understand their weaknesses and strengths to set the objectives of the following iterations.

Main Features

  • It is, essentially, SAS's interpretation of KDD.
  • It is more focused on modeling and prediction than on data exploration and obtaining conclusions.
  • ALL iterations have to be complete.
  • Only allows iterating from the final state of the process.
  • The definition of the objective being pursued is not included in the process.
  • The production/operation of the systems we are looking to build is not included in the process either.


Cross Industry Standard Process for Data Mining (CRISP-DM)

CRISP-DM is a methodology of work organized as a 6-phase cycle:

  1. Business understanding: understanding the business objectives and translating them into machine learning metrics.
  2. Data understanding: obtaining data, exploratory analysis, understanding of data quality, first insights.
  3. Data preparation: construction of the final dataset to be used, through the cleaning and transformation of the original dataset.
  4. Modeling: training of different machine learning models and their configuration/parameterization for optimal results.
  5. Evaluation: evaluation of the results obtained not only from the point of view of machine learning, but from the business point of view.
  6. Deployment: creation of the final deliverable (e.g. software, report, etc.) to be offered to the business and that must meet its objectives.

Main Features

  • It puts its initial focus on defining a clear business objective.
  • It keeps in mind that the final result of the process is NOT just a model: it must be possible to deploy that model (or its results) in production.
  • It allows jumping between practically ALL of its phases.
  • It does not establish a fixed order: conclusions reached in any phase can be used to "refine" previous phases, even within the same iteration.
  • Although it is the "most realistic" approach in terms of operation, it is also the most complex to plan and manage since, potentially, ANY step of the process can mean a redefinition of objectives.


The Jason Brownlee approach

Jason Brownlee is the creator of the Machine Learning Mastery blog, focused on helping software developers (like him) enter the world of machine learning.

Among its contents is "his vision" of the process to follow when addressing a machine learning project. It is not an "official" methodology but a summary based on his experience.

The process he proposes for carrying out machine learning projects is organized as a 5-phase cycle:

  1. Define the problem: clearly define the problem you want to solve, its scope and its conditions of success.
  2. Prepare data: prepare the dataset(s) to be used to solve the problem.
  3. Spot check algorithms: test and evaluate a "broad" set of Machine learning algorithms to try to solve the problem.
  4. Improve results: improve the results obtained through different techniques.
  5. Present results: present the final results obtained and/or develop the final deliverable of the process.

Main Features

  • Fully business-oriented (like CRISP-DM), it begins its process with the definition of the business objective and ends with verifying that the objectives have been met and exploiting the results.
  • It joins the tasks focused on data processing (selection, preprocessing and transformation) into a single step.
  • It assumes that you cannot know in advance which algorithm/model will work best and proposes the use of a broad set in the modeling phase.
  • It includes, explicitly, a step to carry out the improvement of the results of the algorithms/models.
  • Although it proposes that, as far as possible, cycles should be completed, it gives some freedom to jump between phases.
  • It emphasizes that all jumps MUST be supported by a decision based on the business. There is a lot of flexibility in this process.


Phases of a Machine Learning project

There are few standardized best practices across teams and companies in industry and academia.

In the first lecture, we briefly discussed the Machine Learning Workflow. This section begins our journey through the phases of a machine learning system. These phases will serve as a good foundational framework to help you think through a problem, giving us a common language to talk about each step and to go deeper in the future.


Phase 1: Define the problem

The first step in addressing any machine learning project is to clearly define the problem you are trying to solve and the business objective you want to achieve.

In this way, all the agents involved in the process (e.g. data scientists, data engineers, business users, etc.) will have a clear and common vision of the objectives.

The main difficulty of this first phase will be translating the business requirements into specific machine learning metrics and setting the conditions for the success of the process.

“One man’s constant is another man’s variable” Alan J. Perlis

In order to get this comprehensive definition of the problem we want to solve, we must ask ourselves the following questions:

  • What is the problem we want to solve? Describe the problem informally and formally and list assumptions and similar problems.
  • Why is it necessary to solve it? List your motivation for solving the problem, the benefits a solution provides and how the solution will be used.
  • How do we think we can solve it? Describe how the problem would be solved manually to flush domain knowledge.


Phase 2: Prepare the data

Once we are clear about the problem we are trying to solve, the next step will be the collection and preparation of all the necessary data to solve it.

The preparation of data in any machine learning problem is CRUCIAL and will be, without a doubt, the most time-consuming phase of the whole process.

“Garbage in, gospel out” vs. “Garbage in, garbage out”

This data preparation task can be subdivided into 3 well-delimited blocks:

  1. Data selection: review of the data we have available to solve the problem posed, and its collection.
  2. Data preprocessing: organization, exploration, cleaning and formatting of the selected data.
  3. Data transformation: transformation of the preprocessed data to adapt it for the direct application of machine learning algorithms (see the sketch below).
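
To make these blocks concrete, here is a minimal sketch assuming pandas and scikit-learn are available; the file name (data.csv) and the column names (age, income, target) are hypothetical placeholders, not part of this course's material:

```python
# A minimal sketch of the three data preparation blocks with pandas/scikit-learn.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Data selection: collect the data and keep only the columns relevant to the problem.
df = pd.read_csv("data.csv")            # hypothetical source file
df = df[["age", "income", "target"]]    # hypothetical columns

# 2. Data preprocessing: organize, clean and format the selected data.
df = df.drop_duplicates().dropna()

# 3. Data transformation: scale numeric features so algorithms can be applied directly.
features = ["age", "income"]
df[features] = StandardScaler().fit_transform(df[features])
print(df.head())
```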


Phase 3: Spot check algorithms

Once we have a correctly processed, clean and prepared dataset, we are ready to start the modeling phase.

It is important to understand that (at the moment) THERE IS NO ALGORITHM THAT ALWAYS OFFERS OPTIMAL RESULTS regardless of the problem, objective or data to which it is applied.

It is necessary to test multiple algorithms on our data and compare the results obtained. By doing this we will have (somewhat more) confidence that our final model is optimal, and we will have enough arguments to reject the other alternatives.

“There is no silver bullet” Fred Brooks

This is also a good time to do any pertinent visualizations and preliminary analysis (Exploratory Data Analysis) of your dataset, to help you see if there are any relevant relationships between different variables you can take advantage of, as well as to show you whether there are any data imbalances.
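
As a minimal EDA sketch, assuming pandas, seaborn and matplotlib are available; the data.csv file and the target column are hypothetical placeholders:

```python
# Quick exploratory look at a dataset: summary stats, class balance, correlations.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")          # hypothetical source file

print(df.describe())                  # per-column summary statistics
print(df["target"].value_counts())    # exposes class imbalance, if any

# Correlation heatmap to spot relationships between numeric variables
# (numeric_only requires a reasonably recent pandas).
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```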

Once you understand the need to run multiple tests on different algorithms, we must answer the following questions:

What algorithms should we include in our test?

How do we evaluate and compare the results obtained by each of them?

  1. Divide the dataset into two groups, train and test, trying to ensure that both subsets maintain the properties of the original (e.g. the same class balance). Although there is no standard, this partition is usually between 70/30 and 80/20 (see the sketch after this list).
  2. Train each model using only the train dataset. It is important that ALL algorithms receive EXACTLY the same data for training, otherwise the results will not be comparable.
  3. Evaluate the performance of the trained algorithms by making predictions about all observations of the test data and calculating a metric of global error/success of our predictions.
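
A minimal sketch of these three steps, assuming scikit-learn; the toy dataset, the three models and the accuracy metric are illustrative choices, not prescribed by the methodology:

```python
# Spot-check several algorithms on the same train/test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 1. One stratified 80/20 split, shared by every model so results are comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
}

# 2-3. Train each model on the SAME training data, then score it on the SAME test data.
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracy:.3f}")
```

Because every model sees exactly the same training and test data, the printed scores are directly comparable.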

How do we ensure that the results obtained will be maintained on new data?

To ensure this ability to generalize, the minimum we can do is carry out a good division of our dataset between train and test, ensuring that both fragments maintain the characteristics of the global dataset.

However, achieving that equality between the two fragments is not simple, and it is possible that the results are affected by an unequal partition (e.g. imbalance, bias, etc.).

For this reason, it is highly recommended to use a more sophisticated approach to partitioning between train and test sets, called cross validation (CV), which we will learn about later.
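
As a preview, here is a minimal sketch of k-fold cross validation, assuming scikit-learn; the dataset and model are illustrative:

```python
# k-fold cross validation: every observation is used for both training and testing.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds (stratified by default for classifiers); the mean smooths out an unlucky split.
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy per fold: {scores}")
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Each observation ends up in the test fold exactly once, so the mean score is far less sensitive to a single unequal partition.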


Phase 4: Improve results

After obtaining the results of our "battery" of algorithms, we begin to squeeze and tune them to achieve a final result that optimizes our objectives.

At this point the correct definition of the problem to be solved and the setting of realistic objectives take on even more importance, since the time needed for this optimization will generally be high.

“20% of the code, consumes 80% of the time” Pareto principle

“The first 90% of the code consumes 90% of the time. The remaining 10% consumes the remaining 90%.” Tom Cargill & Jon Bentley

We mainly have at our disposal three well-differentiated strategies for gradually improving the results obtained by any machine learning algorithm:

  1. Review and improvement of datasets used for training.
  2. Optimization of the parameterization of the used learning algorithm.
  3. Composition of multiple algorithms (ensembles) to combine the strengths of their results.

It is important to bear in mind that, generally, making modifications to the dataset used for training is the alternative that yields the greatest improvements in the results of the algorithms.
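
As a minimal sketch of the second strategy (optimizing the algorithm's parameterization), assuming scikit-learn; the model and the parameter grid are illustrative choices, not recommendations:

```python
# Strategy 2: tune hyperparameters with an exhaustive grid search + cross validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid: every combination below is trained and cross-validated.
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

Note how this combines naturally with the cross-validation idea from Phase 3: each parameter combination is scored with CV, and the best one is kept.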


Phase 5: Present results

If the process began with the definition of a business problem, it must end with the verification that the problem has been solved, and with the elements necessary to put that solution "in production".

At this point in the process, all the agents involved must have a unique, clear and common understanding of what has been achieved, so that everyone has the same perception of the degree of success of the finished process.

“All models are wrong but some of them are useful” George E. P. Box

“My mind is made up, don’t confuse me with the facts!” Unknown


Report of results

As part of this final phase of presenting results, there are several elements we should incorporate when presenting to the different agents, especially those associated with the business.

  1. Context (Why): Define the environment in which the problem exists and set up the motivation for the research question.
  2. Problem (Question): Concisely describe the problem as a question that you went out and answered.
  3. Solution (Answer): Concisely describe the solution as an answer to the question you posed in the previous section. Be specific.
  4. Findings: Bulleted lists of discoveries you made along the way that will interest the audience. They may be discoveries in the data, methods that did or did not work, or the model performance benefits you achieved along your journey.
  5. Limitations: Consider where the model does not work, or questions that the model does not answer. Do not shy away from these questions; defining where the model excels is more trusted if you can also define where it does not excel.
  6. Conclusions (Why+Question+Answer): Revisit the "why", research question and the answer you discovered in a tight little package that is easy to remember and repeat for yourself and others.
