Friday, September 20, 2024

ML Zoomcamp - Module 01 (Introduction to Machine Learning)

Module 01: Introduction to Machine Learning

 

1.1 Introduction to Machine Learning

Machine Learning (ML) is the field of study where computers learn from data to make predictions or decisions without being explicitly programmed. It powers applications like recommendation systems, fraud detection, and predictive maintenance.

The concept of ML can be illustrated by predicting the price of a car. An ML model learns from data made up of features (such as year and mileage) and a target variable (here, the car's price), and extracts the patterns that relate the features to the target.

Then, the model is given new data (without the target) about cars and predicts their price (target).

In summary, ML is a process of extracting patterns from data. The data comes in two kinds:

- features (information about the object) and

- target (property to predict for unseen objects).

Therefore, new feature values are presented to the model, and it makes predictions from the learned patterns. 
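
The car price example can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the numbers are made up, a single feature (car age) is used, and a straight line fitted by ordinary least squares stands in for "the model"; the notes above do not prescribe any particular algorithm.

```python
# Toy data (made-up numbers, for illustration only).
# Feature: car age in years. Target: price in thousands of dollars.
ages = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [20.0, 18.0, 16.0, 14.0, 12.0]


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# "Training": extract the pattern (a line) from features and targets.
intercept, slope = fit_line(ages, prices)


def predict(age):
    """Use the learned pattern on a new feature value (no target given)."""
    return intercept + slope * age


# "Prediction": a car the model has never seen.
predicted_price = predict(6.0)
print(f"Predicted price for a 6-year-old car: {predicted_price:.1f}")
```

With this toy data the fitted line loses 2 (thousand) per year of age, so the prediction for a 6-year-old car is 10.0.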

 

1.2 ML vs Rule-Based Systems

- Rule-Based Systems: Operate using predefined rules crafted by humans, which often leads to rigid solutions that do not scale. Take a spam filter as an example: a traditional rule-based system relies on a set of hand-written characteristics (keywords, email length, etc.) to decide whether an email is spam. Because spam emails keep changing over time, the rules must be updated constantly, and code maintenance becomes intractable as the system grows.

- Machine Learning: Learns patterns from data, allowing models to generalize better, adapt to new inputs, and improve over time.

ML can be used to solve the spam detection problem with the following steps:

1. Get data

Emails from the user's spam folder and inbox give examples of spam and non-spam. 

2. Define and calculate features

Rules/characteristics from rule-based systems can be used as a starting point to define features for the ML model. The value of the target variable for each email can be defined based on where the email was obtained from (spam folder or inbox). 

Each email can be encoded (converted) to the values of its features and target. 

3. Train and use the model

A machine learning algorithm can then be applied to the encoded emails to build a model that predicts whether a new email is spam. The model outputs a probability for each email, so a threshold (for example, 0.5) must be chosen to turn that probability into a spam or not-spam decision.
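
Steps 2 and 3 can be sketched as follows. Everything here is hypothetical and chosen only for illustration: the three features, the function names `encode_email` and `classify`, and the list of probabilities, which stands in for the output of a trained model that is not shown.

```python
def encode_email(text, came_from_spam_folder):
    """Encode one email as (features, target).

    The three binary features are hypothetical examples of rules
    borrowed from a rule-based system.
    """
    lowered = text.lower()
    features = [
        1 if "free" in lowered else 0,    # contains the keyword "free"
        1 if "winner" in lowered else 0,  # contains the keyword "winner"
        1 if len(text) > 100 else 0,      # body longer than 100 characters
    ]
    # The target comes from where the email was found.
    target = 1 if came_from_spam_folder else 0
    return features, target


def classify(probabilities, threshold=0.5):
    """Turn predicted spam probabilities into 0/1 decisions."""
    return [1 if p >= threshold else 0 for p in probabilities]


features, target = encode_email(
    "You are a WINNER! Claim your free prize now", came_from_spam_folder=True
)
print(features, target)  # [1, 1, 0] 1

# Stand-in for the probabilities a trained model would output.
predicted_probabilities = [0.8, 0.1, 0.55]
decisions = classify(predicted_probabilities)
print(decisions)  # [1, 0, 1]
```

Raising the threshold makes the filter more conservative: with `threshold=0.6` the third email would no longer be flagged as spam.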

 

1.3 Supervised Machine Learning

Supervised learning is a subset of ML where the model learns from labeled data.

In Supervised Machine Learning (SML), every training example pairs a set of features with a label (the target).

Once the model has been trained on these pairs, it can make predictions for new feature values. In this sense, the model is taught by example.

Feature matrix (X): made of observations or objects (rows) and features (columns).

Target variable (y): a vector with the target information we want to predict. For each row of X there's a value in y.

The model can be represented as a function g that takes the feature matrix X as its argument and tries to predict values as close as possible to the targets y, that is, g(X) ≈ y. Finding the function g is what is called training.
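
A minimal sketch of this notation, using plain Python lists. The data values and the coefficients inside `g` are made up and hand-picked; a real `g` would come out of training, which is not shown here.

```python
# Feature matrix X: one row per car, columns are [year, mileage_km].
X = [
    [2015, 80000],
    [2018, 40000],
    [2020, 15000],
]

# Target vector y: one price per row of X.
y = [9000, 14000, 20000]


def g(X):
    """Hypothetical model: hand-picked coefficients, not a trained result."""
    return [1000 * (year - 2005) - 0.02 * mileage for year, mileage in X]


y_pred = g(X)

# How close are the predictions to the targets? (mean absolute error)
mean_abs_error = sum(abs(p - t) for p, t in zip(y_pred, y)) / len(y)
print(y_pred)
print(mean_abs_error)
```

Training amounts to searching for the coefficients of `g` that make an error measure like this one as small as possible.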

Types of SML problems

- Regression: Predicting continuous values (e.g., house prices, car prices).

- Classification: Predicting categorical labels (e.g., spam vs. non-spam emails).

              - Binary: there are two categories.

              - Multiclass: there are more than two categories.

- Ranking: the output is a score for each item, and the items with the top scores are returned. It is applied in recommender systems.

In summary, SML is about teaching the model by showing different examples, and the goal is to come up with a function that takes the feature matrix as a parameter and makes predictions as close as possible to the y targets. 

 

1.4 CRISP-DM (Cross-Industry Standard Process for Data Mining)

CRISP-DM is a methodology used in data science and ML projects, involving six phases:

1. Business understanding

An important question is whether the project needs ML at all. The goal of the project has to be measurable.

2. Data understanding

Analyse the available data sources and decide whether more data is required.

3. Data preparation

Clean the data and remove noise by applying pipelines, then convert the data to a tabular format that an ML model can consume.

4. Modelling

Train different models and choose the best one. Based on the results of this step, decide whether new features need to be added or data issues need to be fixed.

5. Evaluation

Measure how well the model performs and whether it solves the business problem.

6. Deployment

Roll the model out to production for all users. Evaluation and deployment often happen together; this is known as online evaluation.

 

1.5 The Modelling Step (Model Selection Process)

The central question of the modelling step is which model to choose:

- Logistic regression

- Decision tree

- Neural network

- Or many others

To choose among them, part of the data is held out as a validation dataset, which is not used in training. Both the training and validation datasets have a feature matrix and a y vector. Each model is fitted on the training data and then used to predict the y values for the validation feature matrix. The predicted y values (probabilities) are then compared with the actual y values.
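
The comparison on the validation dataset can be sketched like this. The model names and their predicted probabilities are made up, accuracy with a 0.5 threshold is just one possible metric, and the fitting of the models is not shown.

```python
def accuracy(y_true, y_pred_proba, threshold=0.5):
    """Fraction of examples where the thresholded prediction matches the label."""
    decisions = [1 if p >= threshold else 0 for p in y_pred_proba]
    correct = sum(1 for t, d in zip(y_true, decisions) if t == d)
    return correct / len(y_true)


# Actual labels of the validation dataset.
y_val = [1, 0, 1, 0]

# Hypothetical probabilities predicted by two already-fitted models
# for the validation feature matrix.
candidates = {
    "model_a": [0.9, 0.2, 0.4, 0.1],
    "model_b": [0.8, 0.3, 0.7, 0.2],
}

scores = {name: accuracy(y_val, proba) for name, proba in candidates.items()}
best_model = max(scores, key=scores.get)
print(scores)      # {'model_a': 0.75, 'model_b': 1.0}
print(best_model)  # model_b
```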

Multiple comparisons problem (MCP): when many models are evaluated on the same validation dataset, one of them can get lucky and score well just by chance, because the predictions are probabilistic.

A test set helps guard against the MCP. The best model is selected using the training and validation datasets, while the test dataset is used to confirm that the selected model really performs well.

1. Split the dataset into training, validation, and test sets, e.g. 60%, 20%, and 20% respectively

2. Train the models

3. Evaluate the models

4. Select the best model

5. Apply the best model to the test dataset

6. Compare the performance metrics of validation and test 

Note: the validation data can be reused. After selecting the best model (step 4), the training and validation datasets can be combined into a single training dataset, and the chosen model is retrained on it before being evaluated on the test set.
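
A minimal sketch of the 60/20/20 split, using only the standard library. The function name `split_indices`, the fixed seed, and the dataset size of 100 are assumptions made for illustration; the idea is to shuffle the row indices once and then cut them into three disjoint parts.

```python
import random


def split_indices(n, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle the indices 0..n-1 and split them into train/val/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility

    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test  # the remainder goes to training

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20

# After model selection, training and validation rows can be merged
# to retrain the chosen model before the final check on the test set.
full_train_idx = train_idx + val_idx
print(len(full_train_idx))  # 80
```

Shuffling before splitting matters when the rows are stored in some meaningful order (for example, sorted by date or price), since otherwise the three parts would not be comparable.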

 

1.6 Setting up the Environment

We need:

- Python 3.11

- NumPy, Pandas and Scikit-Learn (latest available versions)

- Matplotlib and Seaborn

- Jupyter notebooks (for interactive coding and documentation)

 

Create the environment for the course

- Install Anaconda

- Create the ml-zoomcamp environment

              conda create -n ml-zoomcamp python=3.11

- Activate the environment

              conda activate ml-zoomcamp

- Install the libraries

              conda install numpy pandas scikit-learn seaborn jupyter

 

1.7 Introduction to NumPy

NumPy is the foundational package for numerical computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical operations. 
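
A short sketch of what working with NumPy arrays looks like. The array values are arbitrary and chosen only to show creation, element-wise operations, broadcasting, and aggregation along an axis.

```python
import numpy as np

# A one-dimensional array and a 2x3 matrix.
a = np.array([1, 2, 3])
m = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Element-wise operations work on the whole array, no loops needed.
doubled = a * 2                 # [2, 4, 6]

# Broadcasting: the 1-D array is added to every row of the matrix.
summed = m + a                  # [[1, 3, 5], [4, 6, 8]]

# Aggregation along an axis: axis=0 sums over rows, giving one value per column.
col_sums = m.sum(axis=0)        # [3, 5, 7]

print(m.shape)   # (2, 3)
print(doubled)
print(summed)
print(col_sums)
```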

 

1.8 Linear Algebra Refresher

Linear algebra is essential in ML for operations involving vectors and matrices. Key concepts include: 

- Vector operations

- Multiplication

              - Vector-vector multiplication

              - Matrix-vector multiplication

              - Matrix-matrix multiplication

- Identity matrix

- Inverse
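
The concepts listed above can be tried out with NumPy. The vectors and matrices below are arbitrary examples; the matrix A was picked to be invertible (its determinant is 1) so that the inverse exists.

```python
import numpy as np

u = np.array([2.0, 4.0, 5.0])
v = np.array([1.0, 0.0, 2.0])

# Vector operations: element-wise sum and scaling.
w = u + 2 * v                    # [4, 4, 9]

# Vector-vector multiplication (dot product): 2*1 + 4*0 + 5*2 = 12.
dot = u.dot(v)

# Matrix-vector multiplication: one dot product per row of U.
U = np.array([[2.0, 4.0, 5.0],
              [1.0, 2.0, 1.0]])
Uv = U.dot(v)                    # [12, 3]

# Matrix-matrix multiplication: one matrix-vector product per column of B.
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
B = np.array([[1.0, 0.0],
              [2.0, 1.0]])
AB = A.dot(B)                    # [[4, 1], [3, 1]]

# Identity matrix: multiplying by it changes nothing.
I = np.eye(2)

# Inverse: A multiplied by its inverse gives the identity matrix.
A_inv = np.linalg.inv(A)

print(dot, Uv, AB, sep="\n")
print(np.allclose(A.dot(A_inv), I))
```

Comparing with `np.allclose` rather than `==` is deliberate, since the computed inverse may differ from the exact one by floating-point rounding.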

 

1.9 Introduction to Pandas

Pandas is a powerful Python library for data manipulation and analysis.

It introduces two main data structures:

- Series: One-dimensional labeled arrays.

- DataFrames: Two-dimensional, tabular data (like Excel spreadsheets).

Pandas simplifies data cleaning, exploration, and manipulation tasks. 
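
A small sketch of both data structures and of typical cleaning and filtering steps. The table contents are made up for illustration, including the missing price.

```python
import pandas as pd

# A DataFrame: tabular data with named columns (values are made up).
df = pd.DataFrame({
    "make": ["Nissan", "Hyundai", "Lotus"],
    "year": [2011, 2017, 2010],
    "price": [19000, 24000, None],  # one missing value
})

# A single column is a Series.
prices = df["price"]

# Exploration: count the missing values in the column.
n_missing = int(prices.isnull().sum())

# Cleaning: drop the rows where the price is missing.
cleaned = df.dropna(subset=["price"])

# Manipulation: filter rows with a boolean condition, then aggregate.
recent = cleaned[cleaned["year"] >= 2011]
mean_recent_price = recent["price"].mean()

print(n_missing)          # 1
print(len(cleaned))       # 2
print(mean_recent_price)  # 21500.0
```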

 

1.10 Summary

This module introduced the basics of ML, the differences between rule-based systems and ML, supervised learning, CRISP-DM, and essential tools like NumPy and Pandas.

A foundational understanding of these topics is crucial for moving forward in ML. 

 

