Statistical Learning Theory

Generalization and Regularization Theory

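ResNet-50 on ImageNet (1.2M images, 25M parameters) achieves about 75% top-1 accuracy without early stopping, even though it has roughly 20 times more parameters than training images. The classical bias-variance tradeoff would predict catastrophic overfitting at such a ratio. Double descent theory (Belkin et al., 2019; Hastie et al., 2019) offers an explanation: in overparameterized linear models gradient descent finds the min-norm interpolant, and adding parameters beyond the interpolation threshold reduces variance; networks trained with SGD appear to behave similarly. The scaling of GPT models at OpenAI rests on the same observation: more parameters help, provided the dataset is large enough.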

  • **Grokking (Power et al. 2022):** a network first memorizes the training set (about 100% train accuracy, near-chance test accuracy), then after much longer training generalizes (about 100% on both). A striking relative of epoch-wise double descent.
  • **Weight decay in AdamW:** PyTorch and HuggingFace training setups commonly use AdamW with weight decay 0.01-0.1. This explicit penalty pushes toward low-norm weights and so complements the optimizer's implicit bias. Standard in LLM pre-training.
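  • **Neural Scaling Laws (Hoffmann et al. 2022):** Chinchilla fits loss ≈ E + A/p^α + B/n^β, where p is the number of parameters and n the number of training tokens. Minimizing this fit under a fixed compute budget gives the compute-optimal point n_tokens ≈ 20×n_params.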
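The n_tokens ≈ 20×n_params rule can be turned into a quick back-of-the-envelope calculation. The sketch below assumes the common approximation that training costs about 6 FLOPs per parameter per token; that approximation, the function name, and the example budget are illustrative assumptions, not part of the lesson.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a FLOP budget C into (n_params, n_tokens).

    Assumes C ~= 6 * n_params * n_tokens (a rough rule of thumb) and
    n_tokens ~= 20 * n_params, so n_params = sqrt(C / (6 * 20)).
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Hypothetical budget, chosen only for illustration.
budget = 5.76e23
n_params, n_tokens = compute_optimal_split(budget)
print(f"params ~ {n_params:.3g}, tokens ~ {n_tokens:.3g}")
```

Because both quantities scale as the square root of the budget under these assumptions, a 4x larger budget doubles the model size and doubles the token count.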

Bias-variance tradeoff: classical vs modern regime

**ResNet-50 on ImageNet (1.2M images, 1000 classes) achieves 75% top-1 accuracy; double descent theory explains why a network trained from scratch does not overfit catastrophically.** The classical bias-variance tradeoff predicts a U-shaped test error curve as model complexity grows. Neural networks violate this: past the interpolation threshold, test error drops again.

**Classical vs modern:** once the polynomial degree approaches n_train, a classical least-squares fit interpolates the noise and its variance explodes. In neural networks, the implicit regularization of SGD instead selects a low-norm solution (exactly the min-norm one in the linear case), which brings variance back down.
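The classical half of this contrast can be checked numerically. The following is a minimal sketch, assuming a sine target, a fixed grid of 15 training points, and Gaussian noise (all illustrative choices): it estimates bias² and variance of a least-squares polynomial fit by refitting on many noisy copies of the training set.

```python
import numpy as np
from numpy.polynomial import Polynomial

def bias_variance(degree, n_train=15, sigma=0.3, trials=300, seed=0):
    """Monte Carlo estimate of (bias^2, variance) for a least-squares
    polynomial fit, averaged over a grid of test points."""
    rng = np.random.default_rng(seed)
    target = lambda x: np.sin(2 * np.pi * x)   # illustrative ground truth
    x_train = np.linspace(0.0, 1.0, n_train)   # fixed design points
    x_test = np.linspace(0.0, 1.0, 101)
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        # Fresh noise each trial: this is the randomness variance measures.
        y_train = target(x_train) + sigma * rng.standard_normal(n_train)
        preds[t] = Polynomial.fit(x_train, y_train, degree)(x_test)
    bias_sq = np.mean((preds.mean(axis=0) - target(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return float(bias_sq), float(variance)

results = {deg: bias_variance(deg) for deg in (1, 3, 10)}
for deg, (b2, var) in results.items():
    print(f"degree {deg:2d}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

In this setup the low-degree fit should be dominated by bias and the high-degree fit by variance, which is the U-shaped tradeoff the classical theory describes.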

What happens to variance in the overparameterized regime when using SGD?

In overparameterized linear models, SGD started from zero converges to the min-norm interpolating solution: among all parameter vectors that fit the training data exactly, it is the one with the smallest L2 norm. This implicit regularization reduces variance and creates the second descent in double descent.

Double descent: interpolation threshold and modern ML

**Double descent** (Belkin et al., 2019; Hastie et al., 2019): test error descends twice. It first falls in the underparameterized regime, peaks at the interpolation threshold (n_params ≈ n_data), and then falls again in the overparameterized regime. ResNets on CIFAR-10 exhibit this most clearly when label noise is added.

**Practical consequences:** early stopping acts as regularization in epoch-wise double descent. Network width plays the role of p in the linear model, so in the overparameterized regime increasing the width tends to improve generalization.
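The shape of the curve can be reproduced in a small linear experiment. This is a sketch under stated assumptions, not the setup of the cited papers: isotropic Gaussian features, a true signal spread over 200 features, and a model that sees only the first p of them and is fitted with the pseudoinverse (ordinary least squares for p < n, the min-norm interpolant for p > n). All names and sizes are hypothetical.

```python
import numpy as np

def min_norm_test_risk(p, n=40, d_total=200, sigma=0.5, trials=20, seed=0):
    """Average test MSE when only the first p of d_total features are used."""
    rng = np.random.default_rng(seed)
    beta = np.ones(d_total) / np.sqrt(d_total)   # true signal, ||beta||_2 = 1
    risks = []
    for _ in range(trials):
        X = rng.standard_normal((n, d_total))
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.zeros(d_total)
        # pinv gives OLS when p < n and the min-norm interpolant when p > n.
        beta_hat[:p] = np.linalg.pinv(X[:, :p]) @ y
        # For isotropic test inputs, expected test MSE is
        # ||beta_hat - beta||^2 plus the irreducible noise sigma^2.
        risks.append(np.sum((beta_hat - beta) ** 2) + sigma ** 2)
    return float(np.mean(risks))

widths = [5, 20, 40, 80, 200]   # 40 = n is the interpolation threshold
risk_by_width = {p: min_norm_test_risk(p) for p in widths}
for p, risk in risk_by_width.items():
    print(f"p = {p:3d}  (p/n = {p / 40:4.2f})  test risk = {risk:.3f}")
```

The risk should spike at p = n and fall again for larger p. Using the same seed for every width reuses the same data draws, so the curve reflects the change in p rather than sampling noise.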

What happens to test error as p/n → ∞ in the overparameterized regime?

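In the overparameterized regime (p >> n) with isotropic features, the min-norm interpolant has norm ≈ sqrt(n/p)·||β*||, so it shrinks toward zero as p grows. The variance term decays like σ²·n/(p − n) → 0 as p/n → ∞. The bias term, however, behaves like ||β*||²·(1 − n/p) and approaches ||β*||², so in this well-specified linear model the risk tends to that of the zero predictor rather than to 0 (Hastie et al., 2019). Risk keeps improving with p only when the added features also reduce approximation error, as in misspecified models. The vanishing variance is the part of the argument that carries over to very wide neural networks.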
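The norm shrinkage and the decay of the variance term can be checked by simulation. The sketch below assumes isotropic Gaussian features and splits the min-norm interpolant into a signal part X⁺Xβ* and a noise part X⁺ε; for this model the expected squared norms are (n/p)·||β*||² and σ²·n/(p − n − 1). The function name and sizes are illustrative.

```python
import numpy as np

def min_norm_components(n, p, sigma=1.0, trials=50, seed=0):
    """Mean squared norm of the signal part and the noise part of the
    min-norm interpolant beta_hat = pinv(X) @ (X @ beta_star + eps)."""
    rng = np.random.default_rng(seed)
    beta_star = np.zeros(p)
    beta_star[0] = 1.0                       # unit-norm true coefficients
    signal_sq, noise_sq = [], []
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        X_pinv = np.linalg.pinv(X)
        eps = sigma * rng.standard_normal(n)
        signal_sq.append(np.sum((X_pinv @ (X @ beta_star)) ** 2))
        noise_sq.append(np.sum((X_pinv @ eps) ** 2))
    return float(np.mean(signal_sq)), float(np.mean(noise_sq))

n = 50
summary = {p: min_norm_components(n, p) for p in (100, 500, 1000)}
for p, (sig, noi) in summary.items():
    print(f"p = {p:4d}: signal norm^2 = {sig:.3f} (n/p = {n / p:.3f}), "
          f"variance term = {noi:.3f} (n/(p-n-1) = {n / (p - n - 1):.3f})")
```

Both columns should shrink roughly in proportion to n/p as p grows, matching the sqrt(n/p) scaling of the norm.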

Implicit regularization: SGD as a regularizer

**Implicit regularization**: the optimization algorithm (SGD, gradient flow) selects a solution with specific properties even without an explicit regularization term. For linear models, SGD started from zero converges to the min-norm solution; for neural networks it is believed to favor solutions with small weight norms, such as a small spectral norm of the weight matrices.

**Practical significance:** implicit regularization helps explain why neural networks generalize without explicit regularization. Weight decay (L2) and batch normalization act on top of this implicit bias and can strengthen it. The learning rate schedule matters as well: larger learning rates are generally associated with a stronger implicit bias.

What solution does gradient descent from zero initialization converge to in overparameterized linear regression?

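Gradient descent from zero initialization stays in the row space of X along the entire trajectory, because every gradient Xᵀ(Xβ − y) lies in that space. With a small enough step size it converges as t → ∞ to the projection of 0 onto {β: Xβ = y}, which is the min-norm solution β = X⁺y, where X⁺ is the Moore-Penrose pseudoinverse. This is the key result of implicit regularization theory.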
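This result is easy to verify numerically. The sketch below runs plain gradient descent on the squared loss for a random underdetermined system (the sizes, step size, and iteration count are illustrative choices) and compares the limit with the pseudoinverse solution. A second run from a random starting point is included for contrast.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                           # overparameterized: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def gradient_descent(beta0, steps=2000):
    """Minimize 0.5 * ||X beta - y||^2 by gradient descent from beta0."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
    beta = beta0.copy()
    for _ in range(steps):
        beta -= step * X.T @ (X @ beta - y)  # each update lies in row space(X)
    return beta

beta_zero_init = gradient_descent(np.zeros(p))
beta_rand_init = gradient_descent(rng.standard_normal(p))
beta_min_norm = np.linalg.pinv(X) @ y

print("distance to min-norm solution:", np.linalg.norm(beta_zero_init - beta_min_norm))
print("norm from zero init:  ", np.linalg.norm(beta_zero_init))
print("norm from random init:", np.linalg.norm(beta_rand_init))
```

Both runs interpolate the data, but only the zero-initialized one lands on the min-norm solution; the random start keeps its component in the null space of X and therefore ends with a larger norm.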

Key ideas

  • **Bias-variance:** MSE = Bias² + Variance + σ². Classical optimum is balance. Modern ML: overparameterization reduces variance via implicit regularization.
  • **Double descent:** risk peaks at p/n=1 (interpolation threshold), then decreases at p/n >> 1. Also occurs as a function of training epochs (epoch-wise double descent).
  • **Min-norm interpolant:** when p > n there are infinitely many solutions to Xβ=y. SGD from zero finds the one minimizing ||β||₂.
  • **Implicit regularization:** the optimizer determines inductive bias. In linear models SGD => L2 and exponentiated gradient => L1; deep matrix factorization => approximately nuclear norm (low rank).
  • **Practice:** weight decay, learning rate schedule, batch size all influence implicit regularization and final generalization.

Related topics

Generalization theory bridges classical statistics and modern deep learning:

  • Deep generalization paradox — Lesson 13: empirical observations of Zhang 2017
  • VC dimension and PAC — Previous lesson: classical theory

Related lessons

  • lt-13-deep-generalization — An exploration of double descent first seen in lesson 13
  • lt-18-vc-sample-complexity — VC theory explains classical bias-variance but not double descent
  • lt-17-kernel-methods — NTK explains implicit regularization in deep nets