Mathematical foundations for Modelling and ML
Cumulative distribution function: \(F_X(x) = \mathbb{P}(X \leq x)\)
Density function for continuous variables: \(f_X(x) = \frac{d}{dx} F_X(x)\)
Expectation: \(E(X) = \int_{-\infty}^{+\infty} x \, f_X(x) \, dx\)
The nth raw moment: \(\mu'_n = E(X^n)\)
The nth central moment: \(\mu_n = E[(X-E(X))^n]\)
Variance: second central moment \(\mu_2\)
Standard deviation \(\sigma\) is defined such that \(\sigma^2 = \text{Var}(X)\)
The nth standardized moment: \(\gamma_n = \frac{\mu_n}{\sigma^n}\)
\(\gamma_1 = 0\), \(\gamma_2 = 1\)
Skewness is the 3rd standardized moment: \(\gamma_3 = \frac{\mu_3}{\sigma^3}\)
Kurtosis is the 4th standardized moment: \(\gamma_4 = \frac{\mu_4}{\sigma^4}\)
Moment | Value |
---|---|
\(E(X)\) | \(\mu\) |
\(\text{Var}(X)\) | \(\sigma^2\) |
Skewness | 0 |
Kurtosis | 3 |
\(f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\)
Moment | Value |
---|---|
\(E(X)\) | \(\frac{\alpha}{\lambda}\) |
\(\text{Var}(X)\) | \(\frac{\alpha}{\lambda^2}\) |
Skewness | \(\frac{2}{\sqrt{\alpha}}\) |
Kurtosis | \(3 + \frac{6}{\alpha}\) |
\(f(x) = \frac{\lambda^\alpha x^{\alpha-1}}{\Gamma(\alpha)} e^{-\lambda x}\)
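As a quick sanity check, the sketch below (assuming NumPy and SciPy are available) draws samples from a Normal and a Gamma distribution and compares the empirical moments to the closed-form values in the tables above; the parameter values are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Normal(mu=2, sigma=3): expected mean 2, variance 9, skewness 0, kurtosis 3
x = rng.normal(loc=2, scale=3, size=100_000)
print(x.mean(), x.var(), stats.skew(x), stats.kurtosis(x, fisher=False))

# Gamma(alpha=2, lambda=0.5): mean alpha/lambda = 4, variance alpha/lambda^2 = 8,
# skewness 2/sqrt(alpha) ~ 1.41, kurtosis 3 + 6/alpha = 6
y = rng.gamma(shape=2, scale=1 / 0.5, size=100_000)
print(y.mean(), y.var(), stats.skew(y), stats.kurtosis(y, fisher=False))
```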
Definition: as the number of independent trials increases, the sample average converges to the expected value.
For any \(\epsilon > 0\): \(\lim_{n \to \infty} \mathbb{P}\left( |\bar{X}_n - \mu| > \epsilon \right) = 0\)
Model Training: Given more training data, the model’s performance on the training data (e.g. the loss) tends to stabilize, providing a more reliable estimate of its generalization to unseen data (assuming the sample is unbiased).
Evaluation Metrics: As we evaluate a model on more samples, metrics like accuracy, F1 score, or Mean Squared Error will converge to a more consistent value, representing the model’s true performance.
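A minimal simulation of this convergence (the law of large numbers), assuming NumPy: the running mean of i.i.d. draws approaches the true expectation as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=100_000)        # true mean = 2.0

running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 100, 1_000, 100_000):
    print(n, running_mean[n - 1])                   # drifts towards 2.0 as n grows
```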
The distribution of the sum (or average) of a large number of independent, identically distributed random variables approaches a normal (Gaussian) distribution, regardless of the original distribution of the variables.
Given \(X_1, X_2, ...\) independent and identically distributed with mean \(\mu\) and variance \(\sigma^2\)
\(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{n \to \infty} \mathcal{N}(0,1)\)
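A short simulation of this result, assuming NumPy: standardized means of exponential samples (a clearly non-Gaussian distribution) behave approximately like a standard normal; the sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 50                       # Exponential(1): mean 1, sd 1

# 10,000 standardized sample means of size n
samples = rng.exponential(scale=1.0, size=(10_000, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

print(z.mean(), z.std())                          # close to 0 and 1
print(np.mean(np.abs(z) < 1.96))                  # close to 0.95, as for N(0, 1)
```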
Two types of errors in hypothesis testing:
Type I error (false positive): rejecting the null hypothesis \(H_0\) when it is true.
Type II error (false negative): failing to reject \(H_0\) when it is false.
A test statistic is a quantity computed from the sample whose distribution under the null hypothesis is known (at least approximately), and which is used to decide whether to reject \(H_0\).
In the case where the observations are normally distributed but \(\sigma\) is estimated from the sample (by \(S\)), the statistic \(\frac{\bar{X}_n - \mu}{S/\sqrt{n}} \sim \mathcal{Student}(\nu)\).
\(\nu\) is the degrees of freedom (\(n-1\)) \(\to\) shape of the Student distribution.
The p-value measures the evidence against a null hypothesis.
Mathematically, it is the probability of observing a test statistic at least as extreme as the one actually observed, assuming \(H_0\) is true: \(p = \mathbb{P}\left( |T| \geq |t_{\text{obs}}| \mid H_0 \right)\)
General guideline: a p-value below a chosen threshold (commonly 0.05) is taken as evidence against \(H_0\) (reject \(H_0\)); otherwise, we fail to reject \(H_0\).
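As an illustration (a sketch assuming SciPy is available), a one-sample t-test computes a Student test statistic and the associated p-value for the null hypothesis that the mean is zero; the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=30)     # true mean 0.3, H0: mean = 0

t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)
print(t_stat, p_value)                          # reject H0 at 5% if p_value < 0.05

# Same computation "by hand": t ~ Student(n - 1) under H0
t_manual = (x.mean() - 0.0) / (x.std(ddof=1) / np.sqrt(x.size))
p_manual = 2 * stats.t.sf(abs(t_manual), df=x.size - 1)
print(t_manual, p_manual)
```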
Fixed effects: parameter(s) in a model that do not vary across sampling units.
Example: Linear regression with fixed effects
\[ \left\{ \begin{array}{ll} y_{i} = \alpha + \beta x_{i} + \epsilon_{i} \\ \epsilon_{i} \sim \mathcal{N}(0,1) \end{array} \right. \] Where \(\alpha\) and \(\beta\) are fixed effects.
All the sampling variation is absorbed in the error term.
\(\rightarrow\) Mostly used in ML
Random effects: the parameters themselves are random variables.
Example: Linear regression with random effects:
\[ \left\{ \begin{array}{ll} y_{it} = \alpha_i + \beta_i x_{it} + \epsilon_{it} \\ \alpha_i \sim \mathcal{N}(\mu_\alpha, \tau^2_\alpha) \\ \beta_i \sim \mathcal{N}(\mu_\beta, \tau^2_\beta) \\ \epsilon_{it} \sim \mathcal{N}(0, \sigma^2) \end{array} \right. \]
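A minimal NumPy simulation of the random-effects regression above; the hyperparameter values and group sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n_groups, n_obs = 20, 50
mu_a, tau_a, mu_b, tau_b, sigma = 1.0, 0.5, 2.0, 0.3, 1.0   # illustrative values

alpha = rng.normal(mu_a, tau_a, size=n_groups)   # alpha_i ~ N(mu_alpha, tau_alpha^2)
beta = rng.normal(mu_b, tau_b, size=n_groups)    # beta_i  ~ N(mu_beta,  tau_beta^2)

x = rng.uniform(0, 1, size=(n_groups, n_obs))
y = alpha[:, None] + beta[:, None] * x + rng.normal(0, sigma, size=(n_groups, n_obs))
print(y.shape)   # (20, 50): one regression line per group, sharing population-level hyperparameters
```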
A type of mixed-effects model where data is nested within multiple levels of groups.
\[ \left\{ \begin{array}{ll} y_{ijkt} = \alpha_i + \beta_j x_j + \gamma_k z_k + \epsilon_{ijkt} \\ \alpha_i \sim \mathcal{N}(\mu_{\alpha}, \sigma) \\ \beta_j \sim \mathcal{N}(\mu_{\beta}, \sigma) \\ \gamma_k \sim \mathcal{N}(\mu_{\gamma}, \sigma) \\ \epsilon_{ijkt} \sim \mathcal{N}(0, 1) \end{array} \right. \]
The likelihood is defined from the probability of the data under a model \(\mathcal{M}\): \[ \mathbb{P}_\mathcal{M}( y \mid \theta) \]
with: \(y\) the observed data and \(\theta\) the parameters of the model.
Seen as a function of the parameters, it is written \(\mathcal{L}_{\mathcal{M}}(\theta) = \mathbb{P}_\mathcal{M}(y \mid \theta)\), or simply \(\mathcal{L}(\theta)\).
\[\hat{\theta}_{MLE} = \arg \max_{\theta} \mathcal{L}(\theta)\]
Fixed-effect model with normally distributed errors:
\[ \left\{ \begin{array}{ll} y_{i} = f_{\theta}(x_i) + \epsilon_{i} \\ \epsilon_{i} \sim \mathcal{N}(0,\sigma^2) \end{array} \right. \]
From the normal density function: \(\mathbb{P}(\epsilon_i | \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{(\epsilon_i)^2}{2\sigma^2} \right)\)
where \(\epsilon_i = y_i - \hat{y}_i\), with \(\hat{y}_i = f_{\theta}(x_i)\)
Then: \(\mathcal{L}(\theta) = \prod_{i=1}^{n} \mathbb{P}(\epsilon_i \mid \theta)\)
\[ \log(\mathcal{L}(\theta)) = - \frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_i{(y_i - \hat{y}_i)^2} \]
\[ log(\mathcal{L}(\theta)) \propto - \sum_i{(y_i - \hat{y_i})^2} \]
\[ \arg \max_{\theta} \mathcal{L}(\theta) = \arg \min_{\theta} \sum_i{(y_i - \hat{y_i})^2} \]
\(\to\) Least squares is equivalent to MLE for this kind of problem
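A sketch (assuming NumPy and SciPy) checking this equivalence numerically: minimizing the negative Gaussian log-likelihood of a simple linear model recovers essentially the same coefficients as ordinary least squares; the data and true parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = 1.5 + 2.0 * x + rng.normal(0, 0.5, size=n)      # true alpha = 1.5, beta = 2.0

def neg_log_lik(theta, sigma=0.5):
    alpha, beta = theta
    resid = y - (alpha + beta * x)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

mle = minimize(neg_log_lik, x0=[0.0, 0.0]).x        # maximum likelihood estimate

X = np.column_stack([np.ones(n), x])
ols = np.linalg.lstsq(X, y, rcond=None)[0]          # least squares estimate

print(mle, ols)                                     # both close to [1.5, 2.0]
```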
Used to balance fit and complexity. Two common criteria:
\[ AIC = 2k - 2\log(\mathcal{L}) \]
\[ BIC = k \log(n) - 2\log(\mathcal{L}) \]
Where \(\mathcal{L}\) is the likelihood, \(k\) the number of parameters, and \(n\) the sample size; lower values indicate a better fit-complexity trade-off.
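A small sketch (assuming NumPy) that computes AIC and BIC from a Gaussian log-likelihood for a restricted and a full linear model; the simulated data and parameter counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=n)

def gaussian_log_lik(y, y_hat):
    # Gaussian log-likelihood with sigma^2 set to its MLE (mean squared residual)
    sigma2 = np.mean((y - y_hat) ** 2)
    return -0.5 * y.size * (np.log(2 * np.pi * sigma2) + 1)

# Model 0: intercept only (k = 2: intercept + sigma); Model 1: intercept + slope (k = 3)
ll0 = gaussian_log_lik(y, np.full(n, y.mean()))
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
ll1 = gaussian_log_lik(y, X @ b)

for name, ll, k in [("restricted", ll0, 2), ("full", ll1, 3)]:
    print(name, "AIC:", 2 * k - 2 * ll, "BIC:", k * np.log(n) - 2 * ll)
```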
Given:
\(\mathcal{L}_1\): likelihood under the full (or complex) model.
\(\mathcal{L}_0\): likelihood under the restricted (or simpler) model.
Test statistic: \(D = -2(\log(\mathcal{L}_0) - \log(\mathcal{L}_1))\)
Wilks’ Theorem: \(D \xrightarrow{n \to \infty} \chi^2_{df}\), where the degrees of freedom \(df\) are the difference in the number of free parameters between the two models.
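The corresponding likelihood-ratio test can be sketched with SciPy’s \(\chi^2\) distribution; the log-likelihood values below are illustrative placeholders (e.g. the `ll0` and `ll1` from the AIC/BIC sketch above).

```python
from scipy import stats

# Illustrative log-likelihoods for a restricted and a full model
ll0, ll1 = -152.3, -141.8
df = 1                              # difference in number of free parameters

D = -2 * (ll0 - ll1)                # likelihood-ratio statistic
p_value = stats.chi2.sf(D, df)      # Wilks: D ~ chi^2(df) under the null
print(D, p_value)                   # small p-value -> prefer the full model
```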
A general linear problem can be defined as: \[Y = XB + U\]
Where: \(Y\) is the matrix of target values (one row per observation), \(X\) the matrix of features (design matrix), \(B\) the matrix of coefficients to estimate, and \(U\) the matrix of errors.
To find the optimum of the sum of squared residuals \(SSR(B)\), we solve \(\nabla SSR(B) = 0\), i.e. the normal equations:
\[X^TY = X^T X B\]
If \(X^T X\) is invertible, then
\[\hat{B} = (X^T X)^{-1} X^TY\]
For \(X^T X\) to be invertible, \(X\) needs to have full column rank: its columns (the features) must be linearly independent, i.e. no feature is an exact linear combination of the others.
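A minimal NumPy sketch of this closed-form solution on simulated data; in practice `np.linalg.lstsq` (or a pseudo-inverse) is usually preferred, since it also handles ill-conditioned or rank-deficient \(X\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 features
B_true = np.array([1.0, 2.0, -0.5])
Y = X @ B_true + rng.normal(0, 0.3, size=n)

B_hat = np.linalg.inv(X.T @ X) @ X.T @ Y         # (X^T X)^{-1} X^T Y, needs full column rank
B_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]   # numerically safer alternative
print(B_hat, B_lstsq)                            # both close to [1.0, 2.0, -0.5]
```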
import numpy as np
from scipy.linalg import inv, det, norm

def newton_optimization(grad, hess, x0, tol=1e-6, max_it=1000):
    """
    grad: function for computing the gradient vector of shape (n, 1)
    hess: function for computing the Hessian matrix of shape (n, n)
    x0: initial guess of shape (n, 1)
    tol: stopping criterion for the difference between consecutive x values
    max_it: maximum number of iterations allowed
    """
    x = x0
    for it in range(max_it):
        x_grad = grad(x)
        x_hess = hess(x)
        assert det(x_hess) != 0, "Hessian must be invertible"
        step = np.matmul(inv(x_hess), x_grad)   # Newton step: H^{-1} grad
        x = x - step                            # Move towards the stationary point
        if norm(step) < tol:                    # Step = difference between consecutive x values
            return x  # Found the optimum
    raise RuntimeError("Maximum iterations reached without convergence!")
import numpy as np

def nelder_mead(f, s0, coef, max_it=1000, tol=1e-6):
    """
    f: target function to be minimized
    s0: list of n+1 initial guesses (vertices of the initial simplex)
    coef: reflection, expansion, contraction, shrink coefficients
    max_it: maximum number of iterations allowed
    tol: stopping criterion for the difference in function values
    """
    s = [np.asarray(v, dtype=float) for v in s0]
    for it in range(max_it):
        s.sort(key=lambda v: f(v))  # Sorting from low to high
        centroid = sum(s[:-1]) / (len(s) - 1)  # Centroid of all vertices but the worst
        refl_v = centroid + coef[0] * (centroid - s[-1])  # Reflect the worst
        if f(s[0]) <= f(refl_v) < f(s[-2]):  # if good but not best
            s[-1] = refl_v  # replace the worst
        elif f(refl_v) < f(s[0]):  # elif best, try expansion
            expe_v = centroid + coef[1] * (refl_v - centroid)
            s[-1] = expe_v if f(expe_v) < f(refl_v) else refl_v
        else:  # contract the worst or shrink all others towards the best
            cont_v = centroid + coef[2] * (s[-1] - centroid)
            if f(cont_v) < f(s[-1]):
                s[-1] = cont_v
            else:
                s = [s[0]] + [s[0] + coef[3] * (s[i] - s[0]) for i in range(1, len(s))]
        if abs(f(s[-1]) - f(s[0])) < tol:
            return s[0]
    raise RuntimeError("Maximum iterations reached without convergence!")
Source: Wikipedia
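A quick usage sketch of the `nelder_mead` function above on an illustrative 2-D quadratic, with the standard coefficient values \((1, 2, 0.5, 0.5)\):

```python
import numpy as np

f = lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2       # minimum at (1, -2)
s0 = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(nelder_mead(f, s0, coef=(1.0, 2.0, 0.5, 0.5)))      # close to [1, -2]
```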
from scipy.linalg import norm

def gradient_descent(grad, x0, lr, max_it=1000, tol=1e-6):
    """
    grad: function for computing the gradient vector of shape (n, 1)
    x0: initial guess of shape (n, 1)
    lr: learning rate (step size)
    max_it: maximum number of iterations allowed
    tol: stopping criterion for the difference between consecutive x values
    """
    x = x0
    for it in range(max_it):
        x_grad = grad(x)
        step = lr * x_grad              # Gradient step, scaled by the learning rate
        x = x - step
        if norm(step) < tol:            # Alternatively: absolute maximum of the gradient
            return x  # Found the minimum
    raise RuntimeError("Maximum iterations reached without convergence!")
\[ \mathbb{P}(\theta \mid y) = \frac{\mathcal{L}(\theta)\, \mathbb{P}(\theta)}{\mathbb{P}(y)} \propto \mathcal{L}(\theta)\, \mathbb{P}(\theta) \]
The posterior distribution of the parameters is proportional to the likelihood times the prior.
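A small grid-approximation sketch (assuming NumPy and SciPy) of this relation: the posterior over an unknown Gaussian mean is computed as prior times likelihood on a grid and then normalized; the prior and data are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=1.2, scale=1.0, size=20)     # observed data, known sigma = 1

theta = np.linspace(-3, 4, 1001)                # grid over the unknown mean
prior = stats.norm.pdf(theta, loc=0.0, scale=2.0)                       # P(theta)
log_lik = np.array([stats.norm.logpdf(y, loc=t, scale=1.0).sum() for t in theta])
posterior = prior * np.exp(log_lik - log_lik.max())   # ∝ L(theta) * P(theta), rescaled for stability
posterior /= posterior.sum() * (theta[1] - theta[0])  # normalize (≈ divide by P(y))

print(theta[np.argmax(posterior)])              # posterior mode, close to y.mean()
```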
Features: Variables that are collected for each observation. The nature of these variables can be diverse (e.g. measurements, design variables).
X: Often refers to the matrix of features, with rows as observations and columns as features.
Data: Can refer to X alone or to a larger entity that also includes the target values (the variable to predict).
Y: Typically represents the target variable or the output in statistical modeling and machine learning.
Targets: The actual values of Y in the dataset. These are what the model aims to predict or reproduce.
Predicted Values (\(\hat{Y}\)): The values of Y as estimated or predicted by the model based on the features, X.
Transformed Values: The values produced by the model from the features X when there is no target value (unsupervised setting), so prediction in the usual sense does not apply.
M or F stands for the model or function that is being trained or used for predictions/transformations.
Model: Represents the specific algorithm or method being used to learn from data and make predictions. Examples include linear regression, decision tree, neural network, etc.
Parameters: Part of the model that is being trained from the data through optimization to maximize the objective function.
Hyperparameters: Part of the model that is set before the optimization of parameters. Nevertheless, hyperparameters can be learned thanks to nested optimizations using validation techniques.
Predict: The process of using a model to estimate or forecast the output (Y) given data (X). The purpose of training a model is to improve these predictions.
Transform: The process of using a model to transform given data (X) into transformed data (Y) when there is no explicit target, for example in dimensionality reduction.
Fit: The process of training a model on a given dataset.
Infer: The process of drawing conclusions about the parameters or the data-generating process from a fitted model (e.g. estimating effects, quantifying uncertainty, testing hypotheses), as opposed to merely predicting new outputs.
Probability theory is essential
Statistical models at the core of ML
Likelihood function and maximization
From likelihood to Least Squares
Solving through optimization
For fruitful discussions and corrections.