Statistical Learning Theory
Uniform Convergence: One Bound for the Whole Hypothesis Class
ERM picks a hypothesis that minimizes training error on a specific sample. The same sample was used to evaluate quality - it can happen that the most overfit solution achieves zero training error while its true error is 50%. Pointwise concentration (LLN) guarantees $L_S(h) \to L_D(h)$ for each fixed $h$, but a fixed $h$ is not equal to $\hat{h}$ - the latter is chosen FROM the data. Uniform convergence removes this issue in one stroke: with high probability, $L_S$ stays close to $L_D$ SIMULTANEOUSLY for every $h \in H$. Then choosing $\hat{h}$ via $L_S$ automatically yields a good $L_D$ - this is exactly what makes ERM an honest learner.
- **SVM regularization:** the constraint $\|w\|_2 \leq B$ shrinks the hypothesis class just enough for the UC bound to become non-trivial. Without the constraint - infinite VC, no generalization guarantees
- **Neural network capacity control:** weight decay, dropout, early stopping - all these techniques implicitly act as a way to narrow the effective hypothesis class down to a level where UC becomes applicable
- **Modern paradox:** GPT-4 has $\sim 10^{12}$ parameters and vacuous classical UC bounds, yet it generalizes better than any classical model. Open question: what REPLACES UC in this regime (see lt-13, lt-14)
Why uniform convergence is needed
ERM picks $\hat{h}$ that minimizes training error on a specific sample. But the same sample was used to evaluate quality - it can happen that the most overfit $\hat{h}$ achieves zero training error while its true error is 50%. Pointwise concentration (LLN) guarantees $L_S(h) \to L_D(h)$ for each fixed $h$, but a fixed $h$ is not equal to $\hat{h}$ - the latter is chosen FROM the data. Uniform convergence removes this issue in one stroke: with high probability, $L_S$ stays close to $L_D$ SIMULTANEOUSLY for every $h \in H$. Then choosing $\hat{h}$ via $L_S$ automatically yields a good $L_D$ - this is exactly what makes ERM an honest learner.
Pointwise convergence $L_S(h) \to L_D(h)$ is a consequence of the law of large numbers and holds coordinatewise for each $h$. Uniform convergence is a radically stronger statement: $\sup_{h \in H} |L_S(h) - L_D(h)| \to 0$. The distinction becomes critical when $|H|$ is large or infinite.
**Counterexample, pointwise vs uniform.** Let $H = \{h_S : S \subseteq X, |S| < \infty\}$ - the class of indicators of finite sets. For any fixed $h_S$ under continuous $D$, we have $L_D(h_S) = L_S(h_S) = 0$ (set of measure zero). But $\sup_h |L_S(h) - L_D(h)| = 1$ always - one can pick $h$ equal to one exactly on the training sample. Pointwise convergence holds, uniform does not. The class is not learnable despite trivial pointwise convergence. **ML applications inline:** - **SVM regularization:** the constraint $\|w\|_2 \leq B$ shrinks the hypothesis class just enough for the UC bound to become non-trivial - **Neural network capacity control:** weight decay, dropout, early stopping - all these techniques implicitly act as a way to narrow the effective hypothesis class - **Modern paradox:** GPT-4 has $\sim 10^{12}$ parameters and vacuous classical UC bounds, yet generalizes better than any classical model
How does pointwise convergence structurally differ from uniform convergence?
Formal definition and the main theorem
A class $H$ has the uniform convergence property with respect to $D$ if the empirical risk concentrates uniformly around the true risk: $$\Pr_{S \sim D^n}\left[\sup_{h \in H} |L_S(h) - L_D(h)| > \epsilon\right] \to 0 \text{ as } n \to \infty.$$ An equivalent formulation uses $\epsilon$-representativeness: a training set $S$ is $\epsilon$-representative for $H$ if $|L_S(h) - L_D(h)| < \epsilon$ holds for all $h \in H$ at once. UC is the statement that the sample is $\epsilon$-representative with high probability.
**Fundamental Theorem of Statistical Learning (Vapnik-Chervonenkis 1971, Blumer-Ehrenfeucht-Haussler-Warmuth 1989).** For a binary hypothesis class $H$ the following are equivalent: 1. $H$ has the uniform convergence property 2. ERM is a PAC learner for $H$ 3. $H$ is agnostic PAC learnable 4. $\text{VCdim}(H) < \infty$ Thus finite VC dimension is the NECESSARY AND SUFFICIENT condition for learnability via ERM. **Sample complexity derived from uniform convergence:** $$m_H(\epsilon, \delta) \leq O\!\left(\frac{\text{VCdim}(H) \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\right).$$ For the realizable case (a hypothesis with zero risk exists) the exponent $1/\epsilon^2$ improves to $1/\epsilon$. The agnostic case retains $1/\epsilon^2$ - this is a theoretical lower bound.
**ML hook: VC dimension of a neural network is $\Theta(W \log W)$ (Bartlett-Maiorov-Meir 1998), where $W$ is the parameter count.** For GPT-4 with $W \sim 10^{12}$, the classical UC bound gives a sample complexity of $\sim 10^{13}$ examples for non-trivial accuracy - more than the entire internet. The bound is mathematically correct yet practically vacuous. Why GPT generalizes on far smaller datasets is an open question discussed in lt-13 and lt-14.
Roughly what sample complexity does uniform convergence yield for a class with $\text{VCdim}(H) = 10$, $\epsilon = 0.05$, $\delta = 0.01$ in the agnostic case?
Rademacher complexity: a tight UC bound
The VC bound for UC is distribution-free and often overly pessimistic. Rademacher complexity (lt-06) yields a data-dependent version of the same inequality, much tighter in practice. $$\hat{R}_S(H) = \mathbb{E}_\sigma\!\left[\sup_{h \in H} \frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)\right], \quad \sigma_i \in \{-1, +1\} \text{ i.i.d.}$$
**Rademacher uniform convergence bound.** For a loss $\ell$ bounded in $[0, 1]$, with probability at least $1 - \delta$ over $S \sim D^n$, for all $h \in H$ simultaneously: $$|L_S(h) - L_D(h)| \leq 2\hat{R}_S(H) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$ Key property: $\hat{R}_S(H)$ is estimated directly from the training sample - the bound is empirically computable without knowing the true distribution. **Advantages over the VC bound:** - **Data-dependent:** the bound captures the structure of the actual sample $S$, not the worst case across all possible $D$ - **Computable:** $\hat{R}_S(H)$ can be estimated by simulation - randomly label $S$ with signs $\sigma$ and train a model on this random labelling - **Tighter:** for kernel SVM with margin $\gamma$, Rademacher gives a bound $O(BR/(\gamma\sqrt{n}))$ with no dependence on feature space dimension - the VC bound is vacuous in the same setting (infinite VC) - **Compositional:** Rademacher composes cleanly under Lipschitz functions and convex combinations, which matters for analysing neural networks
**ML hook: PyTorch Lightning's effective batch size.** Calibration of batch size during distributed training partly relies on Rademacher-style analysis: the effective batch size has to be such that the concentration term $\sqrt{\log(1/\delta)/n}$ stays small relative to $\hat{R}_S$. In practice this translates into heuristics like the linear scaling rule (Goyal 2017): $\eta \propto B$ for $B \leq 8192$, beyond which warmup strategies are required.
What makes Rademacher bounds strictly stronger than VC bounds for kernel SVM?
When UC fails: the modern frontier
Classical UC theory rests on three assumptions, each of which breaks in modern ML: - **Finite VC dimension.** Deep networks with billions of parameters have formally finite but enormous VC dim. The bound $\sqrt{\text{VCdim}/n}$ blows past one - vacuous. - **Bounded loss.** Cross-entropy on rare tokens (LLM training) gives heavy-tailed losses: $-\log p$ as $p \to 0$ is unbounded. Standard concentration inequalities (Hoeffding, McDiarmid) break. - **i.i.d. sample.** Distribution shift between training and deployment, online learning, RLHF - all are off-i.i.d. scenarios. - **Algorithm-independence.** Classical UC reasons about the entire class $H$. But SGD does not roam the whole class - the optimizer's implicit bias narrows the set of reachable hypotheses dramatically.
**Modern alternatives to UC.** - **PAC-Bayes (lt-09):** bound through a distribution $Q$ over hypotheses and its divergence from a prior $P$. Dziugaite-Roy 2017 obtained the first non-vacuous bound for a CNN on MNIST: $0.165$ versus an actual error of $0.013$ - **Information-theoretic bounds (lt-14):** generalization gap is bounded by mutual information $I(S; \hat{h})$. Xu-Raginsky 2017 proved $\mathbb{E}[\text{gap}] \leq \sqrt{2\sigma^2 I(S; \hat{h})/n}$ - the bound depends on HOW MUCH information about the sample is encoded in the hypothesis - **Implicit regularization analysis:** Soudry-Hoffer-Srebro 2018 proved that SGD on linearly separable data converges to the max-margin solution - an implicit narrowing of the class to min-norm hypotheses - **Neural Tangent Kernel:** in the infinite-width regime, training reduces to a kernel method with a known RKHS for which classical theory yields non-trivial guarantees
**ML hook: the GPT-4 paradox.** The classical UC bound for GPT-4 is vacuous by many orders of magnitude. The model nonetheless shows zero-shot generalization on tasks unseen during training. Candidate explanations: feature learning (representations generalize better than the class as a whole), implicit SGD bias (the optimizer settles on solutions with small effective capacity inside an enormous class), sparse activations (the actual computational complexity of the network is far below its nominal capacity), data overlap (the training distribution is rich enough to cover most of the test distribution). Each is an active research direction. No single answer exists yet.
Why does the classical uniform convergence bound give non-trivial guarantees for logistic regression on ImageNet but not for GPT-4 on the same data volume?
Key takeaways
- **Uniform convergence:** $\sup_{h \in H} |L_S(h) - L_D(h)| \to 0$ - strictly stronger than pointwise concentration; UC is what makes ERM an honest learner rather than a blind one
- **Vapnik-Chervonenkis theorem (statistical learning):** for binary classes, UC $\Leftrightarrow$ ERM is a PAC learner $\Leftrightarrow$ $\text{VCdim}(H) < \infty$. Finite VC is necessary and sufficient for learnability
- **Sample complexity:** $m(\epsilon, \delta) \leq O(\text{VCdim}/\epsilon^2 + \log(1/\delta)/\epsilon^2)$ in the agnostic case; Rademacher gives a data-dependent refinement via $2\hat{R}_S(H) + 3\sqrt{\log(2/\delta)/(2n)}$
- **When UC breaks:** overparameterized networks (vacuous bounds), heavy-tailed losses, distribution shift, implicit algorithmic bias. Replacements include PAC-Bayes (lt-09), information-theoretic bounds (lt-14), and implicit regularization analysis
Where to go next
Uniform convergence is the central theorem of classical generalization theory. Where it leads:
- VC dimension — VC dimension enters the UC bound as the main complexity multiplier; finite VC is a necessary condition for UC
- Rademacher complexity — Data-dependent refinement of the UC bound; the only non-trivial bound for kernel methods with infinite VC
- Information-theoretic bounds — Modern replacement for UC via mutual information $I(S; \hat{h})$ - delivers non-trivial guarantees where UC is vacuous
Вопросы для размышления
- A class with one fixed hypothesis $H = \{h_0\}$ has $\text{VCdim} = 0$ and directly satisfies UC. Why is such a class uninteresting in practice, and how does this relate to the trade-off between approximation error and estimation error?
- Empirical Rademacher simulation for a neural network requires training the model on random labels - exactly the setup of Zhang et al. 2017. Why does this empirically demonstrate that the classical Rademacher bound for modern architectures is vacuous, and which alternatives (PAC-Bayes, info-theoretic) sidestep the issue?
- The Vapnik-Chervonenkis theorem states the equivalence of UC and PAC learnability via ERM. Are there learnable classes for which ERM fails while another algorithm succeeds? Hint: this connects to the distinction between proper and improper learning, and to the fact that for some classes (such as k-DNF) the computational complexity of ERM makes it practically unusable.
Связанные уроки
- lt-04-vc-dimension — VC bounds the UC sample complexity
- lt-06-rademacher — Tighter UC bound via Rademacher
- lt-09-pac-bayes — Bayesian relaxation of UC
- lt-13-deep-generalization — UC fails for deep nets - what replaces it
- lt-14-info-bounds — Information-theoretic UC alternatives
- stat-01-sampling