Statistical Learning Theory

Generalization and Regularization Theory

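ResNet-50 (about 25M parameters) trained on ImageNet (1.2M images) reaches roughly 75% top-1 accuracy even without early stopping. With about 20 parameters per training image, the classical bias-variance tradeoff would predict catastrophic overfitting. Double descent theory (Belkin et al., 2019; Hastie et al., 2019) offers an explanation: in linear models, gradient descent from zero initialization finds the min-norm interpolant, and in the overparameterized regime variance falls as the parameter count grows. Large-scale language-model training, such as OpenAI's scaling of GPT, relies on the same empirical pattern: more parameters help, provided the dataset is large enough.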

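**Grokking (Power et al., 2022):** on small algorithmic datasets a network first memorizes the training set (near 100% train accuracy, near chance-level test accuracy), then after much longer training generalizes (near 100% on both). The effect is closely related to epoch-wise double descent.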
**Weight decay in AdamW:** LLM pre-training recipes built on PyTorch and HuggingFace typically use AdamW with decoupled weight decay in the range 0.01-0.1. This is an explicit penalty that reinforces the optimizer's implicit bias toward low-norm weight solutions.
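**Neural Scaling Laws (Hoffmann et al., 2022):** Chinchilla fits loss ≈ E + A/p^α + B/n^β, where p is the parameter count, n the number of training tokens, and E the irreducible loss. Minimizing this fit under a fixed compute budget (roughly 6·p·n FLOPs) gives the compute-optimal point n_tokens ≈ 20×n_params. The ratio follows from the fitted constants, not from double descent.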
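The compute-optimal allocation behind the Chinchilla result can be sketched numerically. The snippet below assumes a parametric loss of the form E + A/p^α + B/n^β (p = parameters, n = training tokens) and the common approximation that training costs about 6·p·n FLOPs. The constants are illustrative placeholders, not the values fitted in the paper, and the helper names are hypothetical.

```python
import numpy as np

def parametric_loss(p, n, E, A, B, alpha, beta):
    """Parametric loss E + A/p^alpha + B/n^beta (p = parameters, n = tokens)."""
    return E + A / p**alpha + B / n**beta

def compute_optimal(C, A, B, alpha, beta):
    """Closed-form minimizer of the parametric loss subject to C = 6 * p * n."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    p_opt = G * (C / 6.0) ** (beta / (alpha + beta))
    n_opt = (C / 6.0) / p_opt
    return p_opt, n_opt

# Illustrative constants only (NOT the fitted values from the paper).
E, A, B, alpha, beta = 1.0, 400.0, 400.0, 0.3, 0.3
C = 1e21  # hypothetical FLOPs budget

p_opt, n_opt = compute_optimal(C, A, B, alpha, beta)

# Numerical cross-check: scan p on a log grid along the constraint n = C / (6 p).
p_grid = np.logspace(6, 14, 4001)
losses = parametric_loss(p_grid, C / (6.0 * p_grid), E, A, B, alpha, beta)
p_grid_best = p_grid[np.argmin(losses)]

print(f"closed form: p = {p_opt:.3e}, n = {n_opt:.3e}, tokens/param = {n_opt / p_opt:.2f}")
print(f"grid search: p = {p_grid_best:.3e}")
```

When the two exponents are equal, p and n both grow like the square root of the budget, so the tokens-per-parameter ratio stays constant across budgets; its actual value depends on the fitted A and B.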

Bias-variance tradeoff: classical vs modern regime

**ResNet-50 on ImageNet (1.2M images, 1000 classes) achieves 75% top-1 accuracy; double descent theory explains why training from scratch does not lead to catastrophic overfitting.** The classical bias-variance tradeoff predicts a U-shaped test-error curve as model complexity grows. Neural networks violate this prediction: past the interpolation threshold, test error drops again.

**Classical vs modern:** as the polynomial degree approaches n_train, an unregularized least-squares fit interpolates the noise and its variance explodes. In neural networks, SGD acts as an implicit regularizer that favors low-norm solutions, which brings variance back down past the threshold.
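To make the classical side of this comparison concrete, here is a minimal Monte Carlo sketch of the bias-variance decomposition for polynomial regression at a single test input. The target function, noise level, sample size, and test point are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth function for the illustration.
    return np.sin(2 * np.pi * x)

n_train, sigma, n_trials = 15, 0.3, 500
x_test = 0.35  # single test input, arbitrary choice

def decompose(degree):
    """Monte Carlo estimate of bias^2, variance, and error of a polynomial fit at x_test."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_f(x) + sigma * rng.normal(size=n_train)
        coeffs = np.polynomial.polynomial.polyfit(x, y, degree)
        preds[t] = np.polynomial.polynomial.polyval(x_test, coeffs)
    bias2 = (preds.mean() - true_f(x_test)) ** 2
    variance = preds.var()
    mse = np.mean((preds - true_f(x_test)) ** 2)  # error against the noiseless target
    return bias2, variance, mse

results = {d: decompose(d) for d in (1, 3, 9)}
for d, (b2, var, mse) in results.items():
    print(f"degree {d}: bias^2={b2:.4f}  variance={var:.4f}  sum={b2 + var:.4f}  mse={mse:.4f}")
```

The error against the noiseless target equals bias² plus variance; adding σ² gives the expected error on a fresh noisy label. One would expect the low-degree fit to be dominated by bias and the high-degree fit by variance, which is the classical U-shape.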

What happens to variance in the overparameterized regime when using SGD?

In overparameterized linear least squares, SGD started from zero converges to the min-norm interpolating solution: it fits the training data exactly while having the smallest norm among all such fits. This implicit regularization reduces variance and produces the second descent of the double descent curve.

Double descent: interpolation threshold and modern ML

**Double descent** (Belkin et al., 2019; Hastie et al., 2019): the test error curve has two descents. It reaches a first minimum in the underparameterized regime, peaks at the interpolation threshold (n_params ≈ n_data), and falls again toward a second minimum in the overparameterized regime. ResNet on CIFAR-10 exhibits this under added label noise.

**Practical consequences:** early stopping acts as regularization under epoch-wise double descent. Network width plays the role of p in the linear model: past the interpolation threshold, increasing width tends to improve generalization.
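The shape of the curve can be reproduced in a linear toy model. The sketch below assumes isotropic Gaussian inputs with D true features, of which the model sees only the first p, and fits min-norm least squares via the pseudoinverse. All sizes and the noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, D, sigma, n_trials = 40, 200, 0.5, 50
beta_true = rng.normal(size=D)
beta_true /= np.linalg.norm(beta_true)  # unit-norm signal spread over all D features

def excess_risk(p):
    """Average excess test risk of the min-norm least-squares fit on the first p features."""
    risks = []
    for _ in range(n_trials):
        X = rng.normal(size=(n, D))
        y = X @ beta_true + sigma * rng.normal(size=n)
        beta_hat = np.zeros(D)
        # pinv gives ordinary least squares for p <= n and the min-norm interpolant for p > n.
        beta_hat[:p] = np.linalg.pinv(X[:, :p]) @ y
        # For isotropic Gaussian inputs the excess risk equals the squared coefficient error.
        risks.append(np.sum((beta_hat - beta_true) ** 2))
    return float(np.mean(risks))

ps = [5, 10, 20, 30, 40, 50, 80, 120, 200]
curve = {p: excess_risk(p) for p in ps}
for p, r in curve.items():
    print(f"p/n = {p / n:4.2f}  excess risk = {r:10.3f}")
```

In this setup the printed risk should spike near p/n = 1 and fall again on both sides of it. The exact height of the peak is very noisy, because the design matrix is nearly singular at the threshold.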

What happens to test error as p/n → ∞ in the overparameterized regime?

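In the overparameterized regime (p >> n) with isotropic features, the min-norm interpolant has norm ≈ sqrt(n/p)·||β*|| when the noise is small. Following Hastie et al. (2019), the excess risk splits into a variance term σ²·n/(p−n), which vanishes as p/n → ∞, and a bias term ||β*||²·(1 − n/p), which approaches ||β*||². Test error therefore falls from its peak at p = n and flattens out at the risk of the null predictor, not at zero. Risk that keeps improving with width requires extra structure, such as a signal concentrated in a few high-variance feature directions; linear theory of this kind is one proposed explanation for why very wide neural networks generalize.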
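A quick numerical check of the p >> n behaviour, as a sketch under the assumption of isotropic Gaussian features, a unit-norm β*, and (by default) noiseless labels. The function name and sizes are hypothetical. In this specific setting the squared norm of the min-norm interpolant should track n/p, while the excess risk measured by the script approaches ||β*||² = 1 because the shrinking estimate recovers less and less of the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trials = 50, 20

def min_norm_stats(p, sigma=0.0):
    """Average squared norm and excess risk of the min-norm interpolant (isotropic Gaussian X)."""
    norms2, risks = [], []
    for _ in range(n_trials):
        beta_star = rng.normal(size=p)
        beta_star /= np.linalg.norm(beta_star)  # ||beta*|| = 1
        X = rng.normal(size=(n, p))
        y = X @ beta_star + sigma * rng.normal(size=n)
        beta_hat = np.linalg.pinv(X) @ y  # min-norm solution of X beta = y
        norms2.append(np.sum(beta_hat ** 2))
        # Excess risk under isotropic inputs is the squared coefficient error.
        risks.append(np.sum((beta_hat - beta_star) ** 2))
    return float(np.mean(norms2)), float(np.mean(risks))

for p in (100, 400, 1600):
    norm2, risk = min_norm_stats(p)
    print(f"p/n = {p / n:5.1f}  ||beta_hat||^2 = {norm2:.3f}  (n/p = {n / p:.3f})  excess risk = {risk:.3f}")
```

Passing a positive `sigma` adds label noise, whose contribution to the risk shrinks as p grows.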

Implicit regularization: SGD as a regularizer

**Implicit regularization**: the optimization algorithm (SGD, gradient flow) selects a solution with specific properties even without an explicit regularization term. For linear least squares, SGD from zero initialization converges to the min-norm solution; for neural networks the analogous bias is less well understood, with theory and experiments pointing toward solutions with small spectral norm of the weight matrices.

**Practical significance:** implicit regularization helps explain why neural networks generalize even without explicit regularization. Weight decay (L2) is an explicit penalty that reinforces this bias, and batch normalization also interacts with it. The learning rate schedule matters too: empirically, a higher learning rate tends to give a stronger implicit bias.

What solution does gradient descent from zero initialization converge to in overparameterized linear regression?

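Gradient descent from zero initialization stays in the row space of X along the entire trajectory, because every gradient X^T(Xβ − y) lies in that space. With a small enough step size, as t → ∞ it converges to the projection of 0 onto {β: Xβ = y}, which is the min-norm solution β = X^+ y (X^+ is the Moore-Penrose pseudoinverse). This is the key result of implicit regularization theory.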
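This convergence claim is easy to check numerically. The sketch below runs plain gradient descent on 0.5·||Xβ − y||² for a random overparameterized problem and compares the result with the pseudoinverse solution. The problem sizes, step size rule, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def gradient_descent(beta0, n_steps=2000):
    """Plain gradient descent on 0.5 * ||X beta - y||^2 with step size 1 / sigma_max(X)^2."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2  # norm(X, 2) is the largest singular value
    beta = beta0.copy()
    for _ in range(n_steps):
        beta -= lr * X.T @ (X @ beta - y)
    return beta

beta_pinv = np.linalg.pinv(X) @ y
beta_zero_init = gradient_descent(np.zeros(p))
beta_rand_init = gradient_descent(rng.normal(size=p))

print("distance to X^+ y (zero init):  ", np.linalg.norm(beta_zero_init - beta_pinv))
print("distance to X^+ y (random init):", np.linalg.norm(beta_rand_init - beta_pinv))
print("norms (zero init, random init): ", np.linalg.norm(beta_zero_init), np.linalg.norm(beta_rand_init))
```

Both runs interpolate the data, but only the zero-initialized one lands on the min-norm solution. The random start keeps its null-space component, which gradient descent never touches, so it ends at a different interpolant with larger norm.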

Key ideas

**Bias-variance:** MSE = Bias² + Variance + σ². Classical optimum is balance. Modern ML: overparameterization reduces variance via implicit regularization.
**Double descent:** risk peaks at p/n=1 (interpolation threshold), then decreases at p/n >> 1. Also occurs as a function of training epochs (epoch-wise double descent).
**Min-norm interpolant:** when p > n there are infinitely many solutions to Xβ=y. SGD from zero finds the one minimizing ||β||₂.
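**Implicit regularization:** the optimizer determines the inductive bias. In the settings studied so far: SGD => L2 (min-norm), exponentiated gradient => L1-like sparsity, deep matrix factorization => low nuclear norm (conjectured in general, shown in special cases).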
**Practice:** weight decay, learning rate schedule, batch size all influence implicit regularization and final generalization.

Related lessons

lt-13-deep-generalization — An exploration of double descent first seen in lesson 13
lt-18-vc-sample-complexity — VC theory explains classical bias-variance but not double descent
lt-17-kernel-methods — NTK explains implicit regularization in deep nets
