ML Techniques and Good Practices
In-sample error: calculated from the data used for training.
Out-of-sample error: calculated from unseen data (not used for training).
Two samplings with n = 10
Importance of validating on unseen data (a code sketch follows below).
Two samplings with n = 50
Differences are lower with larger samples.
Two samplings with n = 500
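A minimal sketch of this experiment, assuming a toy data-generating process (y = sin(x) plus Gaussian noise) and a decision tree as the model; both are placeholders, not the original setup. It trains on a sample of size n and compares the in-sample MSE with the out-of-sample MSE on fresh data.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def draw_sample(n):
    # assumed toy process: y = sin(x) + Gaussian noise
    x = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(x).ravel() + rng.normal(scale=0.3, size=n)
    return x, y

for n in (10, 50, 500):
    x_train, y_train = draw_sample(n)      # data used for training
    x_test, y_test = draw_sample(1000)     # unseen data
    model = DecisionTreeRegressor(max_depth=5).fit(x_train, y_train)
    in_err = mean_squared_error(y_train, model.predict(x_train))
    out_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"n={n:4d}  in-sample MSE={in_err:.3f}  out-of-sample MSE={out_err:.3f}")

With n = 10 the two errors typically differ sharply; with n = 500 the gap narrows, matching the observation above.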
Mean squared error of a prediction \(\hat{Y}\): \(\text{MSE} = \mathbb{E}[(Y - \hat{Y})^2]\)
This can be further decomposed as:
\[ \text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]
Where:
Bias: the systematic error of the average prediction, \(\mathbb{E}[\hat{Y}] - \mathbb{E}[Y]\).
Variance: how much \(\hat{Y}\) fluctuates from one training sample to another, \(\mathbb{E}[(\hat{Y} - \mathbb{E}[\hat{Y}])^2]\).
Irreducible error: the noise in \(Y\) itself, which no model can remove.
If decomposed, we would see the bias decrease and the variance increase as model complexity grows.
The okay fit is where the model balances bias and variance, neither underfitting nor overfitting.
WARNING: The bias-variance decomposition does not always make sense.
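As a concrete illustration, a minimal Monte-Carlo sketch of the decomposition at a single query point x0, assuming the true function f and the noise level are known (only possible on synthetic data); the model and all numbers are placeholders.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = np.sin                      # assumed "true" function (synthetic data only)
noise_sd = 0.3
x0 = np.array([[2.0]])          # fixed query point

preds, sq_errors = [], []
for _ in range(500):            # retrain on fresh samples of size 50
    x = rng.uniform(0, 6, size=(50, 1))
    y = f(x).ravel() + rng.normal(scale=noise_sd, size=50)
    pred = DecisionTreeRegressor(max_depth=5).fit(x, y).predict(x0)[0]
    y0 = f(x0[0, 0]) + rng.normal(scale=noise_sd)   # fresh target at x0
    preds.append(pred)
    sq_errors.append((y0 - pred) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0[0, 0])) ** 2
variance = preds.var()
irreducible = noise_sd ** 2
print("MSE at x0            :", np.mean(sq_errors))
print("bias^2 + var + noise :", bias2 + variance + irreducible)

The two printed values should approximately agree, which is exactly the decomposition above.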
Training data → in-sample error
Test data → out-of-sample error
Test data should be without bias.
No bias = the test data follows the same underlying error model (the same distribution) as the data the model will face.
Not required but recommended: adversarial datasets
What is the pattern of wrong predictions?
Testing is good, but explaining is better:
IMPORTANT: Adversarial datasets should come after testing.
Proper testing is done under the same underlying error model.
VALIDATION IS NOT TESTING
Data encoding: e.g. one-hot, ordinal (see the sketch after this list)
Data embedding: map items into a dense vector space learned by a model
Data projection: project onto another space (e.g. PCA)
Source: Wikipedia
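A minimal sketch of encoding and projection with scikit-learn (embedding is skipped, since it usually requires a learned model); the toy color column and the random matrix are made up for illustration.

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.decomposition import PCA

colors = [["red"], ["green"], ["blue"], ["green"]]          # toy categorical column

onehot = OneHotEncoder().fit_transform(colors).toarray()    # one binary column per category
ordinal = OrdinalEncoder().fit_transform(colors)            # one integer column (implies an order!)

X = np.random.default_rng(0).normal(size=(100, 5))          # toy numeric data
X_2d = PCA(n_components=2).fit_transform(X)                 # projection onto a 2-D space

print(onehot)
print(ordinal)
print(X_2d.shape)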
Balanced single score, e.g. F1: \[ \text{F1} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]
AUC: Area under curve
ROC: Recall = f(FPR) = f(1-specificity)
PRC: Precision = f(Recall)
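A minimal sketch of these scores on made-up binary predictions; scikit-learn's average_precision_score is used here as an approximation of the area under the PRC.

from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                     # ground truth
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                     # hard labels, for F1
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.3, 0.7, 0.2]     # scores, for the curves

print("F1     :", f1_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_score))            # area under Recall = f(FPR)
print("PRC AUC:", average_precision_score(y_true, y_score))  # ~ area under Precision = f(Recall)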
Support measures the frequency of the rule in the dataset. \[ \text{Support}(A \Rightarrow B) = \mathbb{P}(A \cap B) \]
Confidence measures how often items in \(B\) appear in transactions that contain \(A\).
\[ \text{Confidence}(A \Rightarrow B) = \mathbb{P}(B | A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)} \]
\[ \text{Lift}(A \Rightarrow B) = \frac{\mathbb{P}(B | A)}{\mathbb{P}(B)} = \frac{\mathbb{P}(B \cap A)}{\mathbb{P}(A) \mathbb{P}(B)} \]
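A minimal sketch estimating the three quantities from a made-up list of transactions, using the empirical probabilities directly.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
A, B = {"bread"}, {"milk"}
n = len(transactions)

p_a = sum(A <= t for t in transactions) / n           # P(A)
p_b = sum(B <= t for t in transactions) / n           # P(B)
p_ab = sum((A | B) <= t for t in transactions) / n    # P(A and B)

support = p_ab                 # Support(A => B)
confidence = p_ab / p_a        # Confidence(A => B) = P(B | A)
lift = confidence / p_b        # Lift(A => B)
print(support, confidence, lift)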
Continuity: are local neighborhoods preserved?
Mean k-nearest-neighbors (KNN) error: distance to the centroids before and after the projection?
Correlation/Error over distance matrix
Percentage of Variance Explained
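A minimal sketch of two of these checks on a PCA projection: the percentage of variance explained, and scikit-learn's trustworthiness score as one way to quantify whether local neighborhoods are preserved; the random data is a placeholder.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

X = np.random.default_rng(0).normal(size=(200, 10))    # placeholder data

pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

print("variance explained:", pca.explained_variance_ratio_.sum())
print("trustworthiness   :", trustworthiness(X, X_2d, n_neighbors=5))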
Curse of dimensionality: more features = more parameters
2 is nearly enough…
Loss penalty
(Multivariate) Boundaries for optimization
Early Stopping (for complex models)
Dropout (in deep learning)
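A minimal sketch of two of these techniques with scikit-learn: a loss penalty (Ridge's L2 term) and early stopping on a held-out validation fraction. Dropout is omitted since it lives in deep-learning frameworks; the synthetic regression data is a placeholder.

import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                         # placeholder regression data
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=500)

ridge = Ridge(alpha=1.0).fit(X, y)                     # loss penalty on ||w||^2
sgd = SGDRegressor(early_stopping=True,                # stop when the score on a
                   validation_fraction=0.2,            # held-out fraction stops improving
                   n_iter_no_change=5).fit(X, y)
print(ridge.score(X, y), sgd.score(X, y))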