
Statistics in ML: Theoretical Foundations

Everything an ML engineer does every day - regularization, cross-validation, model selection - has a rigorous statistical foundation. Understanding it turns a bag of tricks into a coherent theory that explains why some approaches work and others fail.

Ridge and Lasso in genomics: p >> n (500k SNPs, 5k patients) - without regularization the OLS solution is not unique and fits the training data exactly
Hyperparameter search in AutoML: Bayesian optimization treats cross-validation error as the objective function
Double descent in practice: modern neural networks operate in the interpolation regime, where classical model-complexity guidelines no longer apply directly

Prerequisites

Causal Inference

Bias-Variance Decomposition

The **bias-variance decomposition** of MSE reads, for y = f(x) + eps with E[eps] = 0: E[(f̂(x)-y)^2] = [Bias(f̂(x))]^2 + Var(f̂(x)) + sigma^2, where the expectation is taken over training sets and the noise. Bias = E[f̂(x)] - f(x) is the systematic deviation from the truth (high bias signals underfitting). Variance = E[(f̂(x) - E[f̂(x)])^2] measures sensitivity to the particular training set (high variance signals overfitting). sigma^2 = Var(eps) is irreducible noise. The classic claim: as model complexity grows, bias falls and variance rises, so the two trade off.
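The decomposition can be checked numerically. The sketch below is an illustration under assumed settings, none of which come from the lesson: the true function sin(2*pi*x), the evaluation point x0 = 0.25, the noise level, and polynomial regression as the model family. It repeatedly draws training sets, refits, and compares bias^2 + variance + sigma^2 with the directly measured squared error.

```python
import numpy as np

def true_f(x):
    # Assumed ground-truth function, chosen only for illustration
    return np.sin(2 * np.pi * x)

def bias_variance_at_point(degree, x0=0.25, n_train=30, sigma=0.3,
                           n_repeats=2000, seed=0):
    """Monte Carlo estimate of bias^2, variance, noise and MSE of a polynomial fit at x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_repeats)
    sq_errors = np.empty(n_repeats)
    for r in range(n_repeats):
        # Draw a fresh training set on every repeat
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_f(x) + rng.normal(0.0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x0)
        # Fresh noisy target at x0, independent of the training set
        y0 = true_f(x0) + rng.normal(0.0, sigma)
        sq_errors[r] = (preds[r] - y0) ** 2
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    return bias_sq, variance, sigma ** 2, sq_errors.mean()

for degree in (1, 5):
    b2, var, noise, mse = bias_variance_at_point(degree)
    print(f"degree={degree}: bias^2={b2:.3f} var={var:.3f} noise={noise:.3f} "
          f"sum={b2 + var + noise:.3f} empirical MSE={mse:.3f}")
```

The sum and the empirical MSE agree only up to Monte Carlo error, so small differences between the two columns are expected.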

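**Double descent:** in modern overparameterized models the classic U-shaped bias-variance curve no longer tells the whole story. Test error peaks near the interpolation threshold, where the number of parameters p is just large enough to fit the n training points exactly (p ≈ n for linear models), and then decreases again as p grows further - a second descent. The usual explanation: among the many solutions that interpolate the data, gradient-based training tends to select a low-norm one (for linear least squares, gradient descent started from zero converges to the minimum-norm solution), which acts as implicit regularization. The classic tradeoff holds for a fixed hypothesis class H under ERM; modern neural networks require a more refined theory.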
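The minimum-norm idea is easiest to see in plain linear regression with p > n. The sketch below uses random data as a stand-in and is not a neural network experiment. It builds two weight vectors that both fit the training set exactly and shows that the pseudoinverse solution is the one with the smaller L2 norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # overparameterized: more parameters than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# The pseudoinverse returns the minimum-L2-norm solution among all interpolators
X_pinv = np.linalg.pinv(X)
w_min = X_pinv @ y

# A second interpolator: add a vector from the null space of X
z = rng.normal(size=p)
null_part = z - X_pinv @ (X @ z)    # component of z that X maps to zero
w_other = w_min + null_part

train_residual_min = np.linalg.norm(X @ w_min - y)
train_residual_other = np.linalg.norm(X @ w_other - y)
norm_min = np.linalg.norm(w_min)
norm_other = np.linalg.norm(w_other)

print(f"residuals: {train_residual_min:.2e}, {train_residual_other:.2e}")
print(f"norms:     {norm_min:.3f} (min-norm) vs {norm_other:.3f} (other)")
```

Both vectors have zero training error, so training loss alone cannot distinguish them; the preference for the small-norm one is what the text calls implicit regularization.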

Model A: Bias=0.5, Variance=0.1. Model B: Bias=0.1, Variance=0.6. Noise sigma^2=0.05. Which model has lower MSE?

Regularization as a Bayesian Prior

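The **MAP-regularization equivalence**: MAP estimation under a Gaussian prior is Ridge regression. Formally, with Gaussian noise of variance sigma^2: arg max log P(w|X,y) = arg max [log P(y|X,w) + log P(w)] = arg min [RSS + lambda||w||^2] when w ~ N(0, (sigma^2/lambda) * I); this reduces to N(0, (1/lambda) * I) only when sigma^2 = 1. L2 regularization = Gaussian prior. L1 regularization (Lasso) = Laplace prior: w_j ~ Laplace(0, b) with scale b proportional to 1/lambda (b = 2*sigma^2/lambda for the objective RSS + lambda||w||_1). The Laplace density has a sharp peak at 0, so the MAP estimate sets many coefficients exactly to zero (sparsity).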
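A quick numerical check of the Gaussian case, as a sketch on synthetic data with illustrative values for sigma^2 and lambda: the closed-form Ridge solution should coincide with the posterior mode computed from a Gaussian likelihood with variance sigma^2 and a Gaussian prior with variance tau^2 = sigma^2/lambda.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
sigma2, lam = 2.0, 5.0            # noise variance and ridge penalty (illustrative)
tau2 = sigma2 / lam               # prior variance implied by the penalty

X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ridge: argmin RSS + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mode (equal to the posterior mean here) for
# likelihood y ~ N(Xw, sigma2 * I) and prior w ~ N(0, tau2 * I)
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(p) / tau2, X.T @ y / sigma2)

print(np.max(np.abs(w_ridge - w_map)))
```

Multiplying the second linear system by sigma^2 turns it into the first, which is why the two estimates agree to floating-point precision.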

**Bayesian interpretation of modern deep learning:** L2 weight decay corresponds to a Gaussian prior on the weights. Dropout can be read as approximate Bayesian inference (Gal & Ghahramani 2016). Batch normalization has a regularizing effect through mini-batch noise. Early stopping is approximately equivalent to Ridge regularization in linear networks (Bishop 1995). Many standard DL practices therefore admit a coherent Bayesian interpretation.

Ridge regression with lambda=5 and sigma^2=1. What prior does this correspond to from a Bayesian perspective?

Cross-Validation, Bootstrap, and Double Descent

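**k-fold cross-validation** partitions the data into k folds; each fold serves once as the validation set while the model trains on the remaining k-1 folds, and the k validation errors are averaged. CV estimates the expected generalization error. LOOCV (k=n) is nearly unbiased but costly, since it needs n fits; for linear least-squares models there is a shortcut that recovers every leave-one-out residual from a single fit as e_i / (1 - h_ii), where h_ii is the i-th diagonal entry of the hat matrix. **Bootstrap .632+**: a bias-corrected estimate that combines the optimistic training error with the pessimistic out-of-bag error, weighting them according to how strongly the model overfits. **Double descent**: the test error curve descends twice - once toward the classically 'right' complexity and again beyond the interpolation threshold - due to implicit regularization of interpolating solutions.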
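The single-fit LOOCV shortcut for ordinary least squares can be verified against brute-force refitting. The sketch below uses synthetic data with arbitrary sizes and noise level, and compares the hat-matrix formula e_i / (1 - h_ii) with n explicit leave-one-out fits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + features
y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.5, size=n)

# Single fit: hat matrix H = X (X'X)^{-1} X', residuals e = y - H y
H = X @ np.linalg.solve(X.T @ X, X.T)
residuals = y - H @ y
loo_shortcut = np.mean((residuals / (1.0 - np.diag(H))) ** 2)

# Brute force: refit n times, each time leaving one observation out
errors = []
for i in range(n):
    mask = np.arange(n) != i
    w, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    errors.append((y[i] - X[i] @ w) ** 2)
loo_brute = np.mean(errors)

print(loo_shortcut, loo_brute)
```

The identity is exact for least squares, so the two numbers should match up to rounding; it does not carry over unchanged to models without a linear smoother form.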

**Data leakage in CV:** normalizing on the full dataset before cross-validation lets statistics of the validation folds leak into training, yielding an overly optimistic estimate. All fitted transformations (StandardScaler, PCA, feature selection) must be fit INSIDE each CV fold, on that fold's training part only. In scikit-learn, wrapping the preprocessing and the model in a Pipeline and passing it to cross_val_score handles this automatically. For time series, use TimeSeriesSplit rather than standard k-fold so that the model never trains on data from the future of its validation set.
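A minimal sketch of the leakage-safe pattern with scikit-learn follows. The synthetic data, the Ridge model, and its alpha value are placeholders, not part of the lesson. Because the scaler lives inside the pipeline, cross_val_score refits it on the training part of each fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The scaler is refit on the training part of every fold, so no
# validation-fold statistics reach the model during training.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
cv_mse = -scores.mean()

print(f"5-fold CV MSE: {cv_mse:.3f}")
```

For ordered data, the same pipeline can be passed with cv=TimeSeriesSplit(n_splits=5) from sklearn.model_selection instead of KFold.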

LOOCV produces a nearly unbiased estimate of generalization error but has high variance. Why?

Key Ideas

MSE = Bias^2 + Variance + sigma^2; the optimum balances both terms
Ridge = MAP under a N(0, sigma^2/lambda * I) prior; Lasso = MAP under a Laplace prior with scale proportional to 1/lambda; Lasso induces sparsity
CV estimates expected generalization; leakage produces an overly optimistic estimate
LOOCV: nearly unbiased but high variance; 5-10-fold CV is the practical compromise
Double descent: at p >> n, the minimum-norm solution acts as implicit L2 regularization

Statistics in ML and the Full Course

This lesson ties the entire course together. Bias-variance connects to VC theory. Regularization as a prior connects to Bayesian statistics. CV connects to uncertainty quantification. Double descent explains why classical advice breaks down in deep learning.

Vapnik-Chervonenkis Theory — Bias-variance decomposition is a concrete instance of approximation plus estimation error in the VC framework
Bayesian Statistics — Regularization = prior; MAP = point estimate; full Bayesian inference averages over all models

Questions for Reflection

Minimizing MAE yields the conditional median; minimizing MSE yields the conditional mean. What does this imply for choosing a loss function when predicting quantities with skewed distributions, like income or response time? How does it relate to the bias-variance decomposition for MAE?
Dropout randomly zeroes neurons during training with probability p. Gal and Ghahramani showed this approximates Bayesian inference. What prior over the weights does this correspond to? How does it explain dropout's regularization effect?
Double descent creates a paradox: increasing model complexity first worsens, then improves test error. How should this change practical recommendations for choosing a neural network architecture? Is there always a 'right' model size to find?

Related Lessons

aie-36-fine-tuning
