Statistical Learning Theory
Boosting: from Weak to Strong Learner
OpenAI in 2024 published 'weak-to-strong generalization': GPT-2 trains GPT-4 better than human annotators. This is literally boosting theory from 1997 applied to alignment. XGBoost wins Kaggle competitions. LightGBM is the industry standard for tabular ML. Boosting is living theory with direct applications.
- **XGBoost/LightGBM:** the standard for fraud detection, click-through rate prediction, and pricing at Uber/Airbnb/Booking. In the 2016 KDDCup, 10 of 10 winning solutions used XGBoost.
- **OpenAI weak-to-strong:** GPT-2 as a 'weak supervisor' for GPT-4 through a boosting-like iterative process. Theoretical justification comes from margin bounds.
- **Gradient Boosted Trees in Search:** Yandex MatrixNet (2009), Google BM25+GBDT ranker - search ranking on gradient boosting for decades.
- **AdaBoost history:** Freund and Schapire received the Gödel Prize in 2003 for the algorithm that directly inspired ensemble methods used everywhere in industry today.
Предварительные знания
- lt-15-online-learning
- lt-11-margin-bounds
Weak Classifier + Hedge = Strong Generalizer
**OpenAI uses weak-to-strong generalization to align GPT-4 via GPT-2.** This is a direct application of boosting theory: a weak teacher (GPT-2) trains a stronger student (GPT-4) through an iterative process. The original AdaBoost from 1997 anticipated this principle.
**Weak classifier:** h: X → {−1, +1} with accuracy 50% + γ (slightly better than random, γ > 0). **Boosting:** combine T weak classifiers into one strong H with accuracy → 1. Schapire (1990) proved it is theoretically possible. Freund-Schapire (1997) found the practical algorithm: AdaBoost.
**AdaBoost protocol:** 1. Initialize: uniform weights D₁(i) = 1/n 2. For t = 1, ..., T: a. Train hₜ on the weighted sample with weights Dₜ b. Error: εₜ = Σᵢ Dₜ(i)·1[hₜ(xᵢ) ≠ yᵢ] c. αₜ = (1/2)·log((1−εₜ)/εₜ) (classifier weight) d. Update: Dₜ₊₁(i) ∝ Dₜ(i)·exp(−αₜ·yᵢ·hₜ(xᵢ)) 3. H(x) = sign(Σₜ αₜ·hₜ(x)) Training error: E_train ≤ exp(−2·Σₜ γₜ²) → decreases exponentially!
AdaBoost increases weights for misclassified examples, which resembles importance sampling. What is the formal connection between AdaBoost and the Hedge algorithm from lesson 15?
Why Boosting Generalizes After Zero Training Error
AdaBoost reaches zero training error in ~20 rounds. Classical VC theory predicts: the VC dimension of the ensemble grows with T, so generalization should worsen. In practice, test accuracy keeps improving for 1000 more rounds. The explanation: **margin theory**.
**Margin-based bound for boosting:** Margin: ρᵢ = yᵢ · H(xᵢ) / ||α||₁ - how 'confidently' H classifies xᵢ Schapire-Singer-Bartlett-Lee (1998): E_test ≤ Pr_train[margin ≤ θ] + O(√(d·log²T / (m·θ²))) Key: the first term. As long as AdaBoost keeps running, the margin of most examples grows. The bound shrinks even though the VC dimension of the ensemble is increasing. **Connection to Hedge:** AdaBoost is Hedge on training samples. Examples = 'experts', rounds = 'time'. Regret → margin increases.
XGBoost and LightGBM are gradient boosting, not AdaBoost. They minimize an arbitrary differentiable loss via functional gradient descent. Gradient boosting wins Kaggle competitions on tabular data - neural networks lose to it on structured tables.
The margin keeps growing after zero training error is reached. What does this mean geometrically? How does it relate to SVM, which also maximizes the margin?
Gradient Boosting: Functional Gradient Descent
AdaBoost minimizes exponential loss. **Gradient Boosting** (Friedman 2001) generalizes: at each step add hₜ that best approximates the **negative gradient** of the current loss with respect to the predictions. Any differentiable loss, any base learner.
**Gradient Boosting algorithm:** F₀(x) = argmin_c Σᵢ L(yᵢ, c) For t = 1, ..., T: 1. Pseudo-residuals: rᵢₜ = −∂L(yᵢ, F(xᵢ))/∂F(xᵢ)|_{F=Fₜ₋₁} 2. Fit a regression tree hₜ to the rᵢₜ 3. γₜ = argmin_γ Σᵢ L(yᵢ, Fₜ₋₁(xᵢ) + γ·hₜ(xᵢ)) 4. Fₜ(x) = Fₜ₋₁(x) + ν·γₜ·hₜ(x) (ν - shrinkage / learning rate) For MSE: rᵢₜ = yᵢ − F(xᵢ) - ordinary residuals! For log-loss: rᵢₜ = yᵢ − pᵢ - logistic regression residuals.
In gradient boosting, shrinkage (learning rate ν < 1) deliberately makes each step smaller. Why? How does this relate to regularization and overfitting?
Итоги
- **AdaBoost:** T weak classifiers with error 50%−γ → training error ≤ exp(−2γ²T). Weights update exponentially.
- **AdaBoost = Hedge:** weights ∝ exp(−αₜ·yᵢ·hₜ(xᵢ)) - this is literally exponential weights on training samples.
- **Margin theory:** test error ≤ Pr[margin ≤ θ] + O(√log T / θ√n). The margin grows after zero training error - the bound keeps shrinking.
- **Gradient Boosting:** hₜ approximates the negative gradient of the loss w.r.t. predictions. MSE → residuals, log-loss → logistic residuals.
- **Shrinkage:** lr < 1 requires more trees but provides better regularization. Depth 3–6 + T = 100–1000 + lr = 0.01–0.1 are standard hyperparameters.
Related Topics
Boosting theory and practice:
- Margin Bounds — Theoretical explanation of why ensembles generalize - margin theory
- Kernel Methods — Next lesson: RKHS, representer theorem, SVM as margin maximizer
Связанные уроки
- lt-11-margin-bounds — Margin theory explains why boosting generalizes
- lt-15-online-learning — AdaBoost is the Hedge algorithm applied to training samples
- lt-04-vc-dimension — VC dimension of the ensemble grows, yet margin bounds remain tight