Statistical Learning Theory
Online Learning and Regret Bounds
GPT-4 has infinite VC dimension - classical theory predicts catastrophic overfitting. It does not happen. Online learning explains why: SGD is online gradient descent with regret O(√T), and this guarantees generalization without any assumptions on the data distribution.
- **Google Ads:** FTRL-Proximal powers click-through rate prediction at trillions of examples per day. The sparse L1 solution is critical for efficient inference.
- **AdaGrad:** invented at Google in 2011, became the standard for NLP. It is the foundation of Adam and AdamW used to train GPT and LLaMA.
- **AlphaGo MCTS:** the Hedge algorithm runs inside - move selection is an N-expert problem with exponential weights.
- **Online-to-batch theory:** explains why SGD generalizes in modern deep learning without requiring any stationarity of the data distribution.
Предварительные знания
- lt-12-online-regret
- lt-09-pac-bayes
Online Learning as a Game Against Nature
**GPT-4 has literally infinite VC dimension - yet it generalizes.** Classical theory says that should be impossible. Only online learning explains why SGD on transformers works: it is an algorithm with sublinear regret against a sequence of losses, not a statistical estimator.
Online learning protocol: T rounds, no assumptions on the data distribution. At each round t the algorithm picks a prediction wₜ, receives example (xₜ, yₜ), and incurs loss ℓₜ(wₜ). Goal: minimize **regret** - how much worse the algorithm is than the best fixed decision in hindsight.
**Regret:** R_T = Σₜ₌₁ᵀ ℓₜ(wₜ) − min_{w*} Σₜ₌₁ᵀ ℓₜ(w*) An algorithm is called no-regret if R_T/T → 0 as T → ∞. Optimal bound: R_T = O(√T) for convex losses. Online-to-batch conversion: if R_T = O(√T), then for w_avg = (1/T)Σwₜ: E[ℓ(w_avg)] − min ℓ(w) = O(1/√T) This is the same rate as SGD - because SGD *is* online gradient descent!
SGD on neural networks is online gradient descent applied to a sequence of batch losses. What does R_T/T → 0 imply for generalization?
EWA and Hedge: Optimal Expert Selection
N-expert problem: at each round choose one of N 'advisors' and minimize cumulative loss. The **Exponential Weights Algorithm (EWA)** is the optimal algorithm with regret O(√T·log N). It is the mathematical foundation of boosting, Bayesian model averaging, and attention mechanisms.
**EWA (Hedge algorithm):** Weights: wᵢ^(t) = exp(−η·Σₛ<ₜ ℓᵢ(s)) / Z Prediction: distribution p^(t) = w^(t) / Σwᵢ^(t) Regret bound: R_T ≤ (log N)/η + η·T/8 (for losses in [0,1]) Optimal η = √(8·log N / T): R_T ≤ √(T·log N / 2) Remarkable property: the bound depends on log N, not N. Among 10⁶ experts the optimal one is found for the price of √(T·log 10⁶) ≈ √(14T) - nearly free.
The attention mechanism in transformers is EWA in a single round: softmax(QKᵀ/√d) gives expert weights, V gives their predictions. The resemblance is not accidental - attention optimally aggregates information from N tokens.
EWA regret grows as √(T·log N). Why is the dependence on N only logarithmic? What does this mean in practice with N = 2^100 experts?
FTRL: Follow-the-Regularized-Leader
**Follow-the-Regularized-Leader (FTRL):** at each step choose w minimizing the sum of all past losses plus a regularizer R(w). This is a unified framework that subsumes online gradient descent, Hedge, and AdaGrad in a single theorem.
**FTRL update:** wₜ₊₁ = argmin_w [Σₛ≤ₜ ℓₛ(w) + (1/η)·R(w)] **Special cases:** - R(w) = ||w||²/2 → Online Gradient Descent - R(w) = Σᵢ wᵢ log wᵢ (entropy) → Hedge/EWA - R(w) = ||w||₁ → online L1 (sparse predictions) **General regret bound:** R_T ≤ R(w*)/η + η·Σₜ ||∇ℓₜ||²* where ||·||* is the dual norm. AdaGrad adaptively selects η per coordinate: ηᵢ^(t) = η₀/√(Σₛ≤ₜ (∂ℓₛ/∂wᵢ)²) Frequent features get a smaller step - exactly what sparse NLP features need.
AdaGrad uses a different step size per coordinate: small for frequent features, large for rare ones. Why is this the right behavior for NLP tasks with sparse bag-of-words features?
Итоги
- **Online learning:** T rounds against an adversary, no distributional assumptions. Regret R_T = algorithm cumulative loss minus the best fixed w* in hindsight.
- **Online Gradient Descent:** R_T = O(D·G·√T) with step η = D/(G√T). Online-to-batch: w_avg has generalization error O(1/√T).
- **EWA/Hedge:** N experts, R_T ≤ √(T·log N / 2). Logarithmic dependence - the optimal expert among 10⁶ is found almost for free.
- **FTRL:** unifies OGD (L2 regularization) and Hedge (entropy) via argmin of past losses + R(w).
- **AdaGrad:** FTRL with per-coordinate lr = η/√(Σgrad²) - the real foundation of Adam/AdamW in GPT.
Related Topics
Online learning as a bridge between theory and practice:
- Regret bounds (basic) — Online learning and foundational regret analysis
- Boosting theory — Next lesson: AdaBoost as Hedge on weak classifiers
Связанные уроки
- lt-12-online-regret — Deepens regret analysis from the basic course
- lt-09-pac-bayes — EWA is PAC-Bayes in the online setting
- lt-13-deep-generalization — SGD as an online algorithm with implicit regret bound