Machine Learning Techniques - 2

ML Techniques and Good Practices

Table of Contents

  • A: All models are wrong
  • B: Learning and Evaluation
  • C: ML techniques

A: All models are wrong

In-sample and Out-sample

In-sample error: Calculated from data used for training.

Out-sample error: Calculated from unseen data (not used in training).

Two samplings with n = 10: illustrates the importance of validating on unseen data.

Two samplings with n = 50, then n = 500: the differences between in-sample and out-sample error become smaller as the sample size grows. A minimal sketch follows.
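
A minimal sketch of the idea, assuming a made-up linear signal with noise: the same polynomial fit is scored on the data it was fitted on (in-sample) and on a second, independent sampling of the same process (out-sample).

```python
# Minimal sketch: in-sample vs out-sample MSE for two samplings of the same
# process. The linear "true" signal and the noise level are illustrative
# assumptions, not from the course material.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0, 10, n)
    y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=n)  # signal + random error
    return x, y

for n in (10, 50, 500):
    x_fit, y_fit = sample(n)   # data used for training
    x_new, y_new = sample(n)   # unseen data (second sampling)
    coeffs = np.polyfit(x_fit, y_fit, deg=3)
    in_mse = np.mean((y_fit - np.polyval(coeffs, x_fit)) ** 2)
    out_mse = np.mean((y_new - np.polyval(coeffs, x_new)) ** 2)
    print(f"n={n:4d}  in-sample MSE={in_mse:.2f}  out-sample MSE={out_mse:.2f}")
```

With n = 10 the two errors can differ a lot; with n = 500 they are close.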

Mean Squared Error Decomposition

\(\text{MSE} = \mathbb{E}[(Y - \hat{Y})^2]\)

This can be further decomposed as:

\[ \text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]

Where:

  • \(\text{Bias}(\hat{Y}) = \mathbb{E}[\hat{Y}] - \mathbb{E}[Y]\)
  • \(\text{Variance} = \mathbb{E}[(\hat{Y} - \mathbb{E}[\hat{Y}])^2]\)
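
A short derivation, under the usual assumption that \(Y = f(X) + \varepsilon\) with \(\mathbb{E}[\varepsilon] = 0\), \(\text{Var}(\varepsilon) = \sigma^2\), and \(\varepsilon\) independent of \(\hat{Y}\):

\[ \mathbb{E}[(Y - \hat{Y})^2] = \underbrace{\left(\mathbb{E}[\hat{Y}] - f(X)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{Y} - \mathbb{E}[\hat{Y}])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}} \]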

Bias-Variance Decomposition

Under and Overfitting

  • degree 1: underfit
    • Misses the true pattern in the data
    • Higher Bias
  • degree 8-9: overfit
    • Learns the random error pattern
    • Higher Variance

Under-, Over- and “Okay”-fitting

If the out-sample error were decomposed, we would see:

  • increased variance at high degrees
  • increased bias at low degrees
  • We need a “bias-variance tradeoff”

The Okay-fit is where the model:

  • Learns data pattern
  • Can generalize on unseen data
  • Does not learn the random error pattern
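
A minimal sketch of the three regimes, assuming a made-up cubic signal: training and test MSE are compared for polynomial degrees 1, 3 and 9.

```python
# Minimal sketch: training vs test MSE as the polynomial degree grows.
# The cubic "true" signal, the noise level and the degrees tried are
# illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=(40, 1))
y = x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.5, size=40)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=1)

for degree in (1, 3, 9):  # underfit, "okay", overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    print(f"degree={degree}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(x_tr)):.2f}  "
          f"test MSE={mean_squared_error(y_te, model.predict(x_te)):.2f}")
```

Degree 1 typically shows high error on both splits (bias); degree 9 typically shows a low training error but a higher test error (variance).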

WARNING: The bias-variance decomposition does not always make sense.

Mandatory ML workflow

  • Training data → in-sample error

  • Test data → out-sample error

  • Test data should be without bias

  • No bias = the same underlying error distribution as the training data

  • Not required but recommended: adversarial datasets

Importance of bias in datasets

What is the pattern of wrong predictions?

Importance of Explainable AI

Testing is good, but explaining is better.

Importance of adversarial datasets

IMPORTANT: Adversarial datasets should come after testing.

Proper testing is done with the same underlying error.

  • Why? We are looking at a different kind of bias:
    • one related to over-learning on the random error (standard testing)
    • one related not to the error itself but to new data patterns (adversarial testing)
  • Adversarial datasets are datasets where we can expect a new pattern in the data
  • Examples:
    • Wolf Images from Zoo
    • People from other countries
    • Process in another factory

ML workflow with model selection

VALIDATION IS NOT TESTING

  • Model selection
  • Hyperparameter tuning
  • Training parameters
  • Introduction of validation data
  • Training split
  • Re-training the selected model
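
A minimal sketch of this workflow with scikit-learn; the toy dataset and hyperparameter grid are illustrative assumptions. Cross-validation on the training split plays the role of validation data, and the held-out test split is only used once at the end.

```python
# Minimal sketch: validation (model selection) vs testing.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Validation: cross-validation on the training split selects the hyperparameter,
# then the selected model is re-trained on the whole training split (refit).
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Testing: the test split is used once, for the out-sample error only.
print("selected C :", search.best_params_["C"])
print("test score :", search.score(X_test, y_test))
```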

Data-Driven vs Theory Driven

  • Data-driven
    • Require lots of data
    • Leverage lots of algorithms
    • Require lots of computing power
    • Importance of Testing and Validation Framework
    • Hardly explainable
    • Optimizing
    • Deep Learning / Scikit-learn Pipelines
    • Business intelligence / NLP / Image
  • Theory-driven
    • Can work with little data
    • Understanding of a problem
    • Require less computing power
    • Limited importance of Testing and Validation
    • Easily explainable
    • Modelling
    • Field-specific methods and algorithms
    • Aerodynamics / Molecule modeling / Genomics

B: Learning and Evaluation

ML tasks for prediction

  • Classification
  • Regression
  • Clustering
  • Association

Clustering examples

  • K-means: find groups that minimize the within-cluster sum of squares (distance to the centroid)
  • Hierarchical clustering:
    • Compute a distance matrix (Euclidean)
    • Apply an agglomerative clustering (neighbor joining)
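
A minimal sketch of both examples on toy data. Neighbor joining itself is not available in scipy or scikit-learn, so average-linkage agglomerative clustering on a Euclidean distance matrix stands in for it here.

```python
# Minimal sketch of the two clustering examples above (toy blob data).
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# K-means: find groups minimizing the within-cluster sum of squares
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: Euclidean distance matrix + agglomerative linkage
# (average linkage used here as a stand-in for neighbor joining)
distances = pdist(X, metric="euclidean")      # condensed distance matrix
tree = linkage(distances, method="average")   # agglomerative merge tree
hier_labels = fcluster(tree, t=3, criterion="maxclust")
```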

ML tasks for data transformation

  • Data Encoding: e.g. one-hot, ordinal

  • Data Embedding:

    • Vector representation of complex objects
    • e.g. Word2Vec / encoder deep-learning architectures
  • Data projection: Project onto another space

    • Often based on dimensionality reduction techniques
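
A minimal sketch of the encoding task only, on a made-up categorical feature:

```python
# Minimal sketch: one-hot vs ordinal encoding of a categorical feature.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])

one_hot = OneHotEncoder().fit_transform(colors).toarray()  # one binary column per category
ordinal = OrdinalEncoder().fit_transform(colors)           # a single integer column
print(one_hot)
print(ordinal.ravel())
```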

Learning strategy

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning
  • Genetic algorithms
  • Transfer learning

Evaluation / Binary Classification

Source: Wikipedia

Summary metrics for Binary outcome

  • Balanced single score: e.g. F1 \[ F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]

  • AUC: Area under the curve

  • ROC curve: Recall (TPR) = f(FPR) = f(1 - specificity)

  • PR curve (PRC): Precision = f(Recall)
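
A minimal sketch with scikit-learn, using made-up labels and predicted scores:

```python
# Minimal sketch: F1, ROC AUC, and the ROC / precision-recall curves.
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.55])
y_pred = (y_score >= 0.5).astype(int)           # hard labels at a 0.5 threshold

print("F1      :", f1_score(y_true, y_pred))
print("ROC AUC :", roc_auc_score(y_true, y_score))

fpr, tpr, _ = roc_curve(y_true, y_score)                        # ROC: recall = f(FPR)
precision, recall, _ = precision_recall_curve(y_true, y_score)  # PRC: precision = f(recall)
```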

Classification metrics

  • Confusion Matrix
  • Rand Index

Regression metrics

  • Correlation (Pearson / Spearman)
  • Distance / error metrics (e.g. MSE, MAE)

Clustering Metrics

  • Silhouette Score
  • Davies-Bouldin Index

Association Metrics

  • Support measures the frequency of the rule in the dataset. \[ \text{Support}(A \Rightarrow B) = \mathbb{P}(A \cap B) \]

  • Confidence measures how often items in \(B\) appear in transactions that contain \(A\).

\[ \text{Confidence}(A \Rightarrow B) = \mathbb{P}(B | A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)} \]

  • Lift: how much more often \(A\) and \(B\) occur together than expected if they were statistically independent.

\[ \text{Lift}(A \Rightarrow B) = \frac{\mathbb{P}(B | A)}{\mathbb{P}(B)} = \frac{\mathbb{P}(B \cap A)}{\mathbb{P}(A) \mathbb{P}(B)} \]
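
A minimal sketch computing the three measures directly from their definitions, on a made-up list of transactions:

```python
# Minimal sketch: support, confidence and lift of the rule A => B,
# computed from their definitions on toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def prob(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"bread"}, {"milk"}
support = prob(A | B)               # P(A and B): both itemsets present
confidence = prob(A | B) / prob(A)  # P(B | A)
lift = confidence / prob(B)         # P(B | A) / P(B)
print(support, confidence, lift)
```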

Projection / Mapping metrics

  • Continuity: are local neighborhoods preserved?

  • Mean K-Nearest Neighbors (KNN) Error: are distances to the centroids preserved before and after the projection?

Global Structure Preservation

  • Correlation/Error over distance matrix

  • Percentage of Variance Explained

C: ML techniques

Feature Engineering

  • Feature Normalization
  • Feature Selection/Extraction
  • Feature Transformation
  • Feature Categorization
  • Feature Embedding/Encoding

Dimensionality Reduction

Curse of dimensionality: more features = more parameters

  • PCA: Principal component analysis
  • Principal components (PCs) are linear combinations of the features
  • PCs are fitted so that the sample variance is maximal on the first components

Dimensionality reduction

  • Two components are nearly enough…
  • The target is not used in the PCA
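
A minimal sketch with scikit-learn on the iris dataset (an illustrative choice):

```python
# Minimal sketch: PCA projection onto the first two principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)             # the target y is NOT used by PCA
X_scaled = StandardScaler().fit_transform(X)  # normalize the features first

pca = PCA(n_components=2).fit(X_scaled)
X_2d = pca.transform(X_scaled)

print(pca.explained_variance_ratio_)  # share of variance carried by each PC
print(pca.components_)                # linear combinations (loadings) of the features
```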

Working with Non-Regular Data

  • Imbalanced Data
    • Population bias or sample bias?
    • Downsample?
    • Weighting for training
    • Use an adapted algorithm (tree-based)
  • Data with Uncertainties
    • Leverage statistical models (Bayesian)
    • Sampling
    • Averaging
  • Missing Data
    • Remove samples
    • Use algorithm tolerating missing data
    • Predict missing data
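
A minimal sketch of two of the options above (class weighting for imbalanced data, simple imputation for missing data), on made-up data:

```python
# Minimal sketch: class weighting and mean imputation (toy data).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.1).astype(int)   # imbalanced target, ~10% positives

# Imbalanced data: re-weight classes inversely to their frequency
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Missing data: replace missing entries with the column mean
X_missing = X.copy()
X_missing[::7, 0] = np.nan
X_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)
```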

Model Engineering: Training Setup

  • Loss function and Weights
  • Solver / Optimizer
  • Weighting
  • Other Options (e.g. tree/splits)

Validation and test

  • Training, Validation and Test Dataset
  • Cross-validation (K-Fold, Stratification) [scikit-learn]
  • Challenging or Adversarial Test Dataset

k-Fold cross-validation

Class and group in cross-validation

  • Class: the target variable (stratification keeps class proportions in each fold)
  • Group: a feature (grouping keeps all samples of a group in the same fold)
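
A minimal sketch contrasting stratification and grouping on made-up data:

```python
# Minimal sketch: StratifiedKFold keeps the class (target) proportions in each
# fold; GroupKFold keeps all samples of a group (a feature, e.g. one site or
# one patient) in the same fold. Data below is made up.
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

X = np.arange(24).reshape(12, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])        # class = target
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])   # group = feature

for _, test_idx in StratifiedKFold(n_splits=4).split(X, y):
    print("stratified test classes:", y[test_idx])        # same class ratio everywhere

for _, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    print("grouped test groups   :", groups[test_idx])    # one whole group per fold
```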

Regularization

  • Loss penalty

    • L1, L2 and Elastic-Net
    • Max Norm Regularization
  • (Multivariate) Boundaries for optimization

  • Early Stopping (for complex models)

  • Dropout (in deep learning)
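
A minimal sketch of the loss penalties and of early stopping, on a toy regression problem:

```python
# Minimal sketch: L1, L2 and Elastic-Net penalties, plus early stopping for an
# iterative (SGD) model. The toy dataset and alpha values are illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge, SGDRegressor

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

l2 = Ridge(alpha=1.0).fit(X, y)                     # L2 penalty
l1 = Lasso(alpha=1.0).fit(X, y)                     # L1 penalty (sparse weights)
en = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

# Early stopping: hold out part of the training data and stop training when
# the validation score no longer improves.
sgd = SGDRegressor(early_stopping=True, validation_fraction=0.2,
                   n_iter_no_change=5, random_state=0).fit(X, y)
```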

Ensemble Learning Methods

  • Principle: Aggregate Predictions
  • Bagging: Bootstrap aggregating; averages predictions to reduce overfitting. (Random Forest)
  • Boosting: Sequentially focuses on misclassified instances to improve accuracy. (Gradient Boosting)
  • Voting: Multiple models vote on the output; the majority or the average wins. (Decision Forest)
  • Stacking: A meta-model learns from the base models' predictions to make the final prediction.
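
A minimal sketch of the four strategies with scikit-learn estimators on a toy dataset:

```python
# Minimal sketch: bagging, boosting, voting and stacking (toy data).
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

bagging = RandomForestClassifier(random_state=0).fit(X, y)       # bootstrap aggregating
boosting = GradientBoostingClassifier(random_state=0).fit(X, y)  # sequential boosting
voting = VotingClassifier([("tree", DecisionTreeClassifier(random_state=0)),
                           ("lr", LogisticRegression(max_iter=1000))]).fit(X, y)
stacking = StackingClassifier([("tree", DecisionTreeClassifier(random_state=0)),
                               ("lr", LogisticRegression(max_iter=1000))],
                              final_estimator=LogisticRegression()).fit(X, y)
```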

Stochastic Methods

  • Changing the seed in a stochastic algorithm (e.g. SGD)
  • Changing the seed for random initialization
  • Bootstrapping
  • Others: Monte Carlo simulation (Bayesian sampling)