Statistical Learning Theory

Boosting: weak learner → strong learner

1988. Michael Kearns and Leslie Valiant pose a question that sounds almost like complexity theory: is 'weak' learnability equivalent to 'strong' learnability? If an algorithm is guaranteed to beat random guessing by at least 1%, can a 99%-accurate classifier be assembled from it? In 1990, Robert Schapire answered: yes, constructively. In 1995, Freund and Schapire introduced AdaBoost, which earned them the Gödel Prize in 2003. And the puzzle 'why doesn't AdaBoost overfit?' persisted for years, until margin theory gave the answer.

Applied in practice.

Предварительные знания

(no prerequisites)

VC dimension: measuring the complexity of a hypothesis class

The Kearns-Valiant hypothesis: weak = strong?

1988. Michael Kearns and Leslie Valiant pose a question that sounds almost philosophical: given an algorithm that guesses correctly just 51% of the time, is it possible to build one that is 99% accurate? Or is there an uncrossable gap between 'slightly better than random' and 'nearly perfect'?

Weak learner: algorithm A is a weak PAC-learner for class C if there exists γ > 0 such that for any distribution D and any target c ∈ C, algorithm A with probability ≥ 3/4 returns hypothesis h with error ε(h) ≤ 1/2 - γ. Parameter γ is the 'advantage over random guessing'. Strong learner: same requirement, but ε(h) ≤ ε for arbitrary ε > 0.

Kearns and Valiant conjectured these two notions are equivalent. One direction is straightforward - a strong learner is a weak one (set γ = 1/2 - ε). The hard direction: can a weak learner become strong?

In 1990, Robert Schapire proved: YES. His first boosting algorithm used three calls to the weak learner. The idea: train h1 on the original distribution. Train h2 on a filtered distribution where h1 errs half the time. Train h3 on examples where h1 and h2 disagree. Final classifier: majority vote of h1, h2, h3. The construction is then applied recursively to drive error arbitrarily low.

If there exists a weak PAC-learner for class C with advantage γ > 0, then there exists a strong PAC-learner for C. This is not just an existence proof - it is constructive, with an explicit algorithm.

Why this matters: before 1990, it was unclear whether designing a perfect algorithm from scratch was required, or whether an accurate classifier could be assembled from a collection of almost-random solvers. Schapire showed the latter works. This shifts the entire strategy of machine learning.

A weak learner in the PAC-theoretic sense is an algorithm that:

AdaBoost: reweighting the hard examples

Schapire's 1990 algorithm was theoretically significant but practically awkward: a rigid recursive structure with hard-to-tune sample sizes. In 1995, Yoav Freund and Robert Schapire introduced AdaBoost (EuroCOLT 1995; full version in Journal of Computer and System Sciences, 1997) - an algorithm that runs the weak learner T times, each time adjusting the weights of training examples so that the next learner focuses on the mistakes of the previous ones.

Input: training sample (x1,y1),...,(xm,ym), yi ∈ {-1,+1}, number of rounds T. Initialize: D1(i) = 1/m. For t = 1,...,T: train weak learner on Dt, get ht; compute error εt = Σ_{i: ht(xi)≠yi} Dt(i); compute vote weight αt = 1/2 · ln((1-εt)/εt); update Dt+1(i) = Dt(i) · exp(-αt · yi · ht(xi)) / Zt where Zt is a normalization constant. Final classifier: H(x) = sign(Σt αt ht(x)).

Reading the weight formula: if ht correctly classifies example i (yi · ht(xi) = +1), its weight decreases by factor exp(-αt). If ht misclassifies (yi · ht(xi) = -1), the weight increases. The next learner sees a distribution where the 'hard' examples carry more mass.

Training error theorem: if each weak learner has advantage γt ≥ γ > 0, then after T rounds AdaBoost's training error is at most exp(-2Tγ²). This means with T = O(1/γ² · log(1/δ)) rounds, training error hits zero. But this says nothing about generalization - that is a separate story.

Watch the weight dynamics: at the start, all examples are equal (weight = 1/m). By round 20-30, a handful of 'hard' examples carry 10-100x more than average. The weak learner in each round faces a different problem - AdaBoost effectively solves progressively harder subproblems in sequence.

In AdaBoost, examples that ht misclassified receive:

Margin theory: why AdaBoost does not overfit

An anomaly observed in 1996-1997: AdaBoost keeps improving test accuracy even after training error reaches zero. By classical theory (VC dimension, bias-variance tradeoff) this should not happen - after zero training error, a model should start overfitting. Why does AdaBoost behave differently?

The answer came from Schapire, Singer, Bartlett, and Freund in 1998: what matters is not just whether a prediction is correct, but how confident it is - the margin.

For the final classifier H(x) = sign(Σt αt ht(x)), the margin of example (xi, yi) is: ρi = yi · (Σt αt ht(xi)) / Σt|αt|. This is a normalized confidence score in [-1, +1]. A positive margin means correct classification with some clearance. Minimum margin over the whole sample: ρ = min_i ρi.

The key theorem (Schapire-Singer-Bartlett-Freund 1998): with probability 1-δ over the test sample, for any θ > 0:

Here d is the VC dimension of the weak learners and m is the sample size. Crucially, this bound does not depend directly on the number of rounds T - only on the margin distribution. If AdaBoost continues to increase margins after reaching zero training error, the bound keeps shrinking without overfitting.

The empirically observed (and theoretically grounded) behavior: AdaBoost does not just achieve zero error - it keeps 'pushing' examples away from the decision boundary. Each new round increases the margin for examples with small clearance. This explains why adding rounds beyond zero training error does not hurt - the ensemble becomes more confident, not more complex.

The paradox is resolved: AdaBoost is not a 'more complex' model after adding rounds, in the sense of VC dimension. The VC dimension of the ensemble depends on d and T, but the margin-based bound does not depend on T. If the margin grows, the bound shrinks regardless of T.

Why can AdaBoost keep reducing test error after reaching zero training error?

From AdaBoost to XGBoost: gradient boosting

AdaBoost uses an exponential loss function. Friedman, Hastie, and Tibshirani (Annals of Statistics, 2000) showed that AdaBoost is a special case of a more general principle - gradient boosting, formalized by Friedman in 'Greedy Function Approximation: A Gradient Boosting Machine' (1999 tech report; Annals of Statistics, 2001). Each boosting round adds a function to the ensemble that approximates the negative gradient of the loss in function space.

Instead of AdaBoost's fixed exponential loss, use any differentiable L(y, F(x)). At each step compute pseudo-residuals ri = -∂L/∂F(xi) and train the weak learner on them. This is 'functional gradient descent' - F(x) is updated in the space of functions, not parameters.

XGBoost (Chen and Guestrin, KDD 2016) adds: second-order Taylor expansion for a more accurate loss approximation, L1/L2 regularization on trees, and histogram-based split finding. LightGBM (Microsoft, 2017) tackles the speed problem: instead of sorting in O(n·d) to find splits, it uses histograms in O(n·bins), giving a 20-40x speedup on large datasets.

AdaBoost = gradient boosting with exponential loss L(y, F) = exp(-y·F). The pseudo-residuals for this loss: ri = yi · exp(-yi · F(xi)) - which are exactly the reweighted errors in AdaBoost. Reweighting examples in AdaBoost and computing pseudo-residuals in gradient boosting are the same mechanism in different formulations.

What fundamentally distinguishes gradient boosting from AdaBoost?

Key takeaways

Schapire's theorem 1990: weak and strong PAC-learnability are equivalent - an algorithm with advantage γ > 0 can be 'boosted' to arbitrary accuracy
AdaBoost: T rounds of example reweighting; training error drops exponentially as exp(-2Tγ²)
Margin theory: AdaBoost does not overfit because it keeps increasing margins after zero train error - the generalization bound depends on margin, not T
Gradient boosting = AdaBoost generalized to arbitrary differentiable losses via functional gradient descent; XGBoost and LightGBM are the production implementations

Connections to other concepts

Boosting and SVMs arrived at margin theory independently - both algorithms aim to maximize 'distance from the decision boundary', just by different routes. PAC-learning (lesson lt-01) provides the language for stating the equivalence theorem. VC dimension (lt-04) enters the margin bound through the parameter d of the weak learners. Random Forest is an alternative ensemble method based on bagging, not boosting: trees are trained in parallel and independently.

Related topics — See course

Вопросы для размышления

AdaBoost concentrates large weights on 'hard' examples. In real data, hard examples are often outliers or mislabeled points. How does this affect the algorithm's behavior? Why do XGBoost and LightGBM include regularization by default, which the original AdaBoost does not have?

Связанные уроки

stat-01-sampling