Probability Theory
Concentration of Measure
How do you prove that SGD generalizes? Hoeffding's inequality gives $P(\bar X_n - \mu > \varepsilon) \leq e^{-2n\varepsilon^2/(b-a)^2}$ - the probability of a 'bad' sample mean decays exponentially fast. This is far stronger than the CLT and is the foundation of theoretical machine learning.
- Generalization guarantees in ML: Hoeffding + union bound gives $P(\sup_{h\in\mathcal{H}} |\hat R(h) - R(h)| > \varepsilon) \leq 2|\mathcal{H}| e^{-2n\varepsilon^2}$ - the foundation of PAC learning bounds for finite hypothesis classes.
- Random forests: Azuma's inequality justifies the stability of estimates - a single training point perturbation changes the prediction by at most $c/\sqrt{n}$ with exponential probability.
- Differential privacy: the Gaussian mechanism uses concentration of Gaussian noise to derive privacy guarantees. The $\varepsilon$-$\delta$-DP parameters follow directly from tail bounds.
Цели урока
- Prove and apply Markov, Chebyshev, and Hoeffding inequalities for tail bounds
- State Azuma's inequality for martingales and apply the bounded differences method
- Construct union bounds and explain their role in generalization guarantees
Предварительные знания
- Expectation and variance
- Moment generating functions (MGF)
- Martingales and conditional expectation
Hierarchy of concentration inequalities
Markov's inequality: $P(X \geq t) \leq \mathbb{E}[X]/t$ for $X \geq 0$ - polynomial tail. Chebyshev: $P(|X-\mu| \geq t) \leq \sigma^2/t^2$ - uses variance. Hoeffding: for independent $X_i \in [a_i, b_i]$, $P(\bar X_n - \mu \geq \varepsilon) \leq \exp\left(-2n^2\varepsilon^2/\sum(b_i-a_i)^2\right)$ - exponential tail. Each successive inequality requires more information about the distribution but gives a much sharper bound.
Azuma-Hoeffding inequality for martingales: if $|X_k - X_{k-1}| \leq c_k$ a.s. (bounded differences), then $P(X_n - X_0 \geq t) \leq \exp(-t^2/(2\sum c_k^2))$. Method of bounded differences: for a function $f(X_1,\ldots,X_n)$ of independent variables where changing one variable changes $f$ by at most $c_i$, Azuma gives $P(f - \mathbb{E}[f] \geq t) \leq \exp(-2t^2/\sum c_i^2)$.
Hoeffding's inequality requires bounded variables. For sub-Gaussian variables (unbounded but with light tails), use sub-Gaussian concentration. For heavy tails (Pareto, Cauchy), neither Hoeffding nor Chebyshev with $p=2$ gives exponential guarantees - you need truncated estimators or $L^p$-moment inequalities.
Pafnuty Chebyshev proved his inequality in 1867 as a tool for proving the law of
Pafnuty Chebyshev proved his inequality in 1867 as a tool for proving the law of large numbers. Wassily Hoeffding obtained his exponential inequality in 1963 while working in statistics. Kazuoki Azuma's theorem (1967) generalized the idea to martingales. In the 1980s-2000s these results became central in computational complexity theory and mathematical machine learning.
Baseline Inequalities: Markov and Chebyshev
Pafnuty Chebyshev proved his classical inequality in 1867 as a tool for the law of large numbers. Today A/B testing at Microsoft Bing across 100 million users uses the Markov to Chebyshev to Hoeffding hierarchy to choose the smallest sample size guaranteeing a conversion-rate estimate within 0.1% at 99% confidence.
$X$ is non-negative with $E[X] = 3$. What does Markov give for $P(X \ge 15)$?
Markov: $P(X \ge t) \le E[X]/t = 3/15 = 1/5$.
Hoeffding's Inequality
Wassily Hoeffding proved an exponential bound for sums of bounded independent variables in 1963, which revolutionized mathematical statistics. The PAC framework introduced by Leslie Valiant in 1984 builds on this bound. For a classifier with generalization gap 0.01 at $\delta = 0.05$ only $n \ge 530$ examples suffice, far fewer than Chebyshev requires.
Two-sided Hoeffding: smallest $n$ guaranteeing $P(|\hat p - p| \ge 0.05) \le 0.01$?
$P(|\hat p - p| \ge \varepsilon) \le 2 e^{-2n\varepsilon^2} \le \delta$, so $n \ge \log(2/\delta)/(2\varepsilon^2) = \log(200)/0.005 \approx 1060$.
Azuma's Inequality for Martingales
Kazuoki Azuma extended Hoeffding to dependent variables in 1967, enabling the analysis of randomized algorithms. The QuickSort analysis (Hoare 1961) on arrays of 10000 elements yields $O(n \log n)$ with exponential concentration: probability of deviating from $E[T]$ by more than 10% is bounded by $e^{-c \log^2 n}$.
An $L$-Lipschitz function $f(X_1,\dots,X_n)$ with $X \sim N(0,I_n)$ satisfies which Gaussian concentration bound?
The Gaussian log-Sobolev inequality yields dimension-free concentration of Lipschitz functions.
Comparing tail bounds in a generalization problem
Setting: $n=1000$, $\varepsilon=0.05$. Chebyshev (with $\sigma^2 \leq 1/4$): $P \leq 0.25/(1000 \cdot 0.0025) = 0.1$. Hoeffding (with $X_i \in [0,1]$): $P \leq e^{-2 \cdot 1000 \cdot 0.0025} = e^{-5} \approx 0.007$. A 14x improvement. At $n=10000$: Chebyshev gives 0.01, Hoeffding gives $e^{-50} \approx 10^{-22}$. For ML on large datasets, Hoeffding is what matters.
Итоги
- Markov $\leq \mathbb{E}/t$, Chebyshev $\leq \sigma^2/t^2$, Hoeffding $\leq e^{-2n\varepsilon^2/(b-a)^2}$: a hierarchy from polynomial to exponential decay.
- Azuma extends Hoeffding to martingales with bounded differences.
- Union bound + Hoeffding = baseline generalization guarantees for finite hypothesis classes.
Connections to other topics
PAC learning (prob-26) builds on these inequalities: VC theory uses concentration for uniform bounds over the entire hypothesis class. High dimensions (prob-21): the JL lemma is proved via Hoeffding concentration of random projections.
- Prob 26 — related
- Prob 21 — related
Вопросы для размышления
- The bounded differences method says: a function insensitive to changing one argument concentrates around its mean. How does this connect to the notion of algorithmic stability in ML - and why do stable algorithms (like ridge regression) have better generalization guarantees?