Statistics in ML: Theoretical Foundations
Much of what an ML engineer does every day - regularization, cross-validation, model selection - rests on a rigorous statistical foundation. Understanding that foundation turns a bag of tricks into a coherent theory that explains why some approaches work and others fail.
- Ridge and Lasso in genomics: p >> n (500k SNPs, 5k patients) - without regularization the OLS solution is not unique and can fit the training data exactly
- Hyperparameter search in AutoML: Bayesian optimization treats cross-validation error as the objective function
- Double descent in practice: modern neural networks often operate in the interpolation regime, where classical model-complexity guidelines no longer apply directly
Prerequisites
Bias-Variance Decomposition
The **bias-variance decomposition** of MSE assumes y = f(x) + eps with E[eps] = 0 and Var(eps) = sigma^2, and reads: E[(f̂(x)-y)^2] = [Bias(f̂(x))]^2 + Var(f̂(x)) + sigma^2, where the expectation is taken over training sets and the noise. Bias = E[f̂(x)] - f(x) is the systematic deviation from the truth (high bias signals underfitting). Variance = E[(f̂(x) - E[f̂(x)])^2] measures sensitivity to the particular training set (high variance signals overfitting). sigma^2 is irreducible noise. The classic claim: as model complexity grows, bias falls and variance rises, so the two trade off.
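The decomposition can be checked numerically. The following is a minimal Monte Carlo sketch, not a reference implementation: the target function `true_f`, the evaluation point `x0`, the noise level, and the polynomial degrees are all illustrative assumptions that do not come from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.3  # assumed noise standard deviation

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def decompose(degree, x0=0.3, n_train=30, n_repeats=2000):
    """Estimate bias^2, variance and MSE of a polynomial fit at the point x0."""
    preds = np.empty(n_repeats)
    sq_errors = np.empty(n_repeats)
    for r in range(n_repeats):
        # Draw a fresh training set and fit a polynomial of the given degree.
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_f(x) + rng.normal(0.0, SIGMA, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x0)
        # Compare the prediction with a fresh noisy observation at x0.
        y0 = true_f(x0) + rng.normal(0.0, SIGMA)
        sq_errors[r] = (preds[r] - y0) ** 2
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    return bias_sq, variance, sq_errors.mean()

results = {degree: decompose(degree) for degree in (1, 5)}
for degree, (bias_sq, variance, mse) in results.items():
    total = bias_sq + variance + SIGMA ** 2
    print(f"degree={degree}: bias^2={bias_sq:.3f} var={variance:.3f} "
          f"sum={total:.3f} mse={mse:.3f}")
```

Up to Monte Carlo error, the `sum` and `mse` columns should agree for each degree, and the linear fit should show a much larger squared bias than the degree-5 fit.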
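**Double descent:** in modern overparameterized models the classic bias-variance tradeoff breaks down. Test error peaks near the interpolation threshold (p close to n, where the model first fits the training data exactly) and then decreases again as the number of parameters p grows beyond n - a second descent. The explanation: beyond the threshold many solutions interpolate the data, and gradient-based training tends to pick a low-norm one (in linear models, gradient descent started at zero converges to the minimum-norm solution), which acts as implicit regularization. The classic tradeoff holds for a fixed hypothesis class H under ERM; modern neural networks require a more refined theory.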
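The minimum-norm claim is easiest to see in a linear model. The sketch below uses synthetic Gaussian data and full-batch gradient descent rather than SGD; both are simplifying assumptions made for illustration. It shows that with p > n many weight vectors interpolate the data, and that gradient descent started at zero lands on the one with the smallest norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # overparameterized: more parameters than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm interpolating solution via the pseudoinverse.
X_pinv = np.linalg.pinv(X)
w_min = X_pinv @ y

# A different interpolating solution: add a vector from the null space of X.
z = rng.normal(size=p)
null_component = z - X_pinv @ (X @ z)   # projection of z onto null(X)
w_other = w_min + null_component

# Full-batch gradient descent on 0.5 * ||Xw - y||^2, started at zero.
w_gd = np.zeros(p)
step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / (largest singular value)^2
for _ in range(2000):
    w_gd -= step * X.T @ (X @ w_gd - y)

print("norm of min-norm solution:", np.linalg.norm(w_min))
print("norm of other interpolator:", np.linalg.norm(w_other))
print("GD matches min-norm solution:", np.allclose(w_gd, w_min))
```

Gradient descent never leaves the row space of X when it starts at zero, which is why it cannot pick up the null-space component that inflates the norm of `w_other`.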
Model A: Bias=0.5, Variance=0.1. Model B: Bias=0.1, Variance=0.6. Noise sigma^2=0.05. Which model has lower MSE?
Regularization as a Bayesian Prior
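The **MAP-regularization equivalence**: MAP estimation under a Gaussian prior is Ridge regression. Formally, with Gaussian noise of variance sigma^2: arg max log P(w|X,y) = arg max [log P(y|X,w) + log P(w)] = arg min [RSS + lambda||w||^2] when w ~ N(0, (sigma^2/lambda) * I); for sigma^2 = 1 this reduces to N(0, 1/lambda * I). L2 regularization = Gaussian prior. L1 regularization (Lasso) = Laplace prior: minimizing RSS + lambda||w||_1 corresponds to w_j ~ Laplace(0, b) with scale b = 2*sigma^2/lambda, so a larger lambda means a narrower prior. The Laplace density has a sharp peak at 0, which lets the MAP estimate set some coefficients exactly to zero (sparsity).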
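The Ridge half of the equivalence can be verified numerically. This is a minimal sketch on synthetic data; the values of `sigma2` and `lam` are arbitrary illustrative choices, and the noise variance is assumed known.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
sigma2 = 0.5          # noise variance (assumed known)
lam = 2.0             # ridge penalty
tau2 = sigma2 / lam   # implied prior variance: w ~ N(0, tau2 * I)

X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ridge estimate: argmin ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP estimate: mode (and mean) of the Gaussian posterior over w
posterior_precision = X.T @ X / sigma2 + np.eye(d) / tau2
w_map = np.linalg.solve(posterior_precision, X.T @ y / sigma2)

def neg_log_posterior(w):
    """Negative log posterior of w, up to an additive constant."""
    return np.sum((y - X @ w) ** 2) / (2 * sigma2) + np.sum(w ** 2) / (2 * tau2)

print("ridge:", w_ridge)
print("MAP:  ", w_map)
```

Multiplying the negative log posterior by 2 * sigma2 gives RSS + (sigma2 / tau2) * ||w||^2, which is the Ridge objective with lambda = sigma2 / tau2.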
**Bayesian interpretation of modern deep learning:** weight decay = Gaussian prior on weights. Dropout can be read as approximate Bayesian inference (Gal & Ghahramani 2016). Batch normalization has a regularizing effect through mini-batch noise. Early stopping is approximately equivalent to Ridge regularization in linear networks with quadratic error (Bishop 1995). Many standard DL practices therefore admit a Bayesian interpretation.
Ridge regression with lambda=5 and sigma^2=1. What prior does this correspond to from a Bayesian perspective?
Cross-Validation, Bootstrap, and Double Descent
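**k-fold cross-validation** partitions the data into k folds; each fold serves once as the validation set while the model trains on the remaining k-1 folds, and the k validation errors are averaged. CV estimates the expected generalization error. LOOCV (k=n) is nearly unbiased but costly, since it needs n fits; for linear least squares there is a shortcut that needs only one fit: the LOOCV error equals the mean of (e_i / (1 - h_ii))^2, where e_i is the i-th residual and h_ii is the i-th diagonal entry of the hat matrix. **Bootstrap .632+**: a bias-corrected estimate that blends the optimistic training error with the pessimistic out-of-bag error, shifting weight toward the out-of-bag error as overfitting increases. **Double descent**: the test error curve descends twice - once toward the classically 'right' complexity and again beyond the interpolation threshold - due to implicit regularization of interpolating solutions.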
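The LOOCV shortcut for ordinary least squares can be compared against the brute-force definition. The data below are synthetic, and the sketch forms the full n x n hat matrix for clarity, which would be wasteful for large n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # intercept + d features
y = X @ rng.normal(size=d + 1) + rng.normal(scale=0.5, size=n)

def loocv_brute_force(X, y):
    """Refit the model n times, each time leaving one observation out."""
    errors = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        w, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errors[i] = (y[i] - X[i] @ w) ** 2
    return errors.mean()

def loocv_shortcut(X, y):
    """Single fit: rescale each residual by 1 - h_ii (hat-matrix diagonal)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X^T X)^{-1} X^T
    residuals = y - H @ y
    return np.mean((residuals / (1.0 - np.diag(H))) ** 2)

brute = loocv_brute_force(X, y)
shortcut = loocv_shortcut(X, y)
print(f"brute force: {brute:.6f}, shortcut: {shortcut:.6f}")
```

The two numbers agree up to floating-point error, so the n refits can be replaced by one fit plus the leverages h_ii.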
**Data leakage in CV:** normalizing on the full dataset before cross-validation lets statistics of the validation folds leak into training, yielding an overly optimistic estimate. All fitted transformations (StandardScaler, PCA, feature selection) must be fit INSIDE each CV fold, on the training part only. In scikit-learn, wrapping the transformations and the model in a Pipeline and passing it to cross_val_score does this automatically. For time series, use TimeSeriesSplit rather than standard k-fold so that the model is never trained on data from the future of its validation set.
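A leakage-free setup along these lines might look as follows. The synthetic dataset, the number of PCA components, and the choice of LogisticRegression are assumptions made only for this sketch; the scikit-learn classes themselves are the ones named above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the example self-contained.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Scaler and PCA are refit on the training part of every fold,
# so no validation-fold statistics reach the model.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, cv=5)
print("fold accuracies:", scores)

# For ordered data: every training index precedes every validation index.
splits = list(TimeSeriesSplit(n_splits=4).split(X))
for train_idx, val_idx in splits:
    print(f"train up to {train_idx.max()}, validate {val_idx.min()}..{val_idx.max()}")
```

The leaky alternative would call `StandardScaler().fit_transform(X)` once on the full dataset and then cross-validate the classifier alone.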
LOOCV produces a nearly unbiased estimate of generalization error but has high variance. Why?
Key Ideas
- MSE = Bias^2 + Variance + sigma^2; the optimum balances both terms
- Ridge = MAP N(0, sigma^2/lambda); Lasso = MAP Laplace(0, 2*sigma^2/lambda); Lasso induces sparsity
- CV estimates expected generalization; leakage produces an overly optimistic estimate
- LOOCV: nearly unbiased but high variance; 5-10-fold CV is the practical compromise
- Double descent: at p >> n, the minimum-norm solution acts as implicit L2 regularization
Statistics in ML and the Full Course
This lesson ties the entire course together. Bias-variance connects to VC theory. Regularization as a prior connects to Bayesian statistics. CV connects to uncertainty quantification. Double descent explains why classical advice breaks down in deep learning.
- Vapnik-Chervonenkis Theory: the bias-variance decomposition parallels the split into approximation error and estimation error in the VC framework
- Bayesian Statistics: regularization = prior; MAP = point estimate; full Bayesian inference averages predictions over all parameter values, weighted by the posterior
Questions for Reflection
- Minimizing MAE yields the conditional median; minimizing MSE yields the conditional mean. What does this imply for choosing a loss function when predicting skewed quantities like income or response time? How does this relate to a bias-variance decomposition for MAE?
- Dropout randomly zeroes neurons during training with probability p. Gal and Ghahramani showed this approximates Bayesian inference. What prior over the weights does this correspond to? How does it explain dropout's regularization effect?
- Double descent creates a paradox: past the classical optimum, increasing model complexity first worsens, then improves test error. How should this change practical recommendations for choosing a neural network architecture? Is there always a 'right' model size to find?