Probability Theory

PAC Learning

A neural network with a billion parameters trained on a million examples generalizes. Classical PAC learning theory predicts catastrophe: you need exponentially many data points. VC theory and modern norm-based bounds explain what actually works and why deep learning avoids overfitting.

SVMs: generalization is guaranteed through the margin - the algorithm implicitly minimizes VC dimension by maximizing the margin. Margin-based bounds are independent of the feature dimension $d$.
L2 regularization (weight decay): adding $\lambda\|w\|^2$ restricts weight norms, and through PAC-Bayes bounds guarantees generalization proportional to $\|w\|^2/n$.
Double descent: classical theory predicts increasing error when parameters exceed $n$, but interpolating neural networks generalize. New theories (implicit regularization, neural tangent kernel) explain this through alternative bounds.

Цели урока

Formulate the PAC learning model and prove bounds for finite hypothesis classes
Define VC dimension and apply the Vapnik-Chervonenkis theorem
Compute the VC dimension for hyperplanes, decision trees, and neural networks

Предварительные знания

Concentration of measure: Hoeffding's inequality
Fundamentals of statistical learning: empirical risk
Combinatorics: Sauer's lemma

PAC model and bounds for finite classes

PAC (Probably Approximately Correct): algorithm $A$ PAC-learns class $\mathcal{H}$ if for any $\varepsilon, \delta > 0$ there exists $m_0(\varepsilon, \delta)$ such that for $m \geq m_0$ and $S \sim D^m$: with probability $\geq 1-\delta$ over $S$, $R(h_S) - \min_h R(h) \leq \varepsilon$. For finite $|\mathcal{H}| < \infty$: it suffices to have $m \geq \frac{\log(|\mathcal{H}|/\delta)}{2\varepsilon^2}$ (from Hoeffding + union bound). The logarithmic dependence on $|\mathcal{H}|$ is the key result.

VC theorem (Vapnik-Chervonenkis, 1971): for a class $\mathcal{H}$ with VC dimension $d$, with probability $1-\delta$: $\sup_h |\hat R(h) - R(h)| \leq \sqrt{\frac{8d \ln(em/d) + 8\ln(4/\delta)}{m}}$. This is a uniform bound over the entire class - the key difference from simply applying Hoeffding. Sauer's lemma: $|\mathcal{H}|$ restricted to $m$ points $\leq \sum_{k=0}^d \binom{m}{k} \leq (em/d)^d$.

VC theory gives vacuous bounds for neural networks: the VC dimension of a network with $W$ weights is $\Omega(W \log W)$ - at $W = 10^{10}$ and $n = 10^6$ the bound exceeds 1. Modern theories (PAC-Bayes, margin bounds, compression bounds) give meaningful guarantees through other complexity measures.

Vladimir Vapnik and Alexei Chervonenkis developed their theory in Moscow in 1968

Vladimir Vapnik and Alexei Chervonenkis developed their theory in Moscow in 1968-1971, publishing in Soviet journals. It became known in the West only in the 1990s through Vapnik's monograph 'The Nature of Statistical Learning Theory'. Leslie Valiant independently proposed the PAC model in 1984, earning the Turing Award in 2010 for it.

PAC model and finite hypothesis classes

Leslie Valiant published 'A Theory of the Learnable' in Communications of the ACM in 1984 (ACM Turing Award 2010). The PAC model linked learnability to the number of training samples: for a class of $|\mathcal{H}| = 10^6$ hypotheses, $m \approx 1680$ examples suffice to find a hypothesis with error $\le 0.01$ at confidence $0.95$.

A class with $|\mathcal{H}| = 10^6$ hypotheses. For $\varepsilon = 0.01$ and $\delta = 0.05$, the minimum PAC sample size is?

VC dimension and shattering

Vladimir Vapnik and Alexey Chervonenkis at the Institute of Control Sciences (USSR Academy of Sciences) introduced this complexity measure in 1971, independent of the number of hypotheses. A neural net with $W$ weights has VC dimension $O(W \log W)$: for GPT-3 with $1.75 \cdot 10^{11}$ parameters the classical VC bound is vacuous, while generalization works in practice. Double descent explains the gap.

What is the VC dimension of halfspaces in $\mathbb{R}^d$?

VC theorem and the generalization bound

Anton Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred Warmuth proved the fundamental theorem in 1989: $\mathcal{H}$ is PAC learnable if and only if $\mathrm{VC\text{-}dim}(\mathcal{H}) < \infty$. The equivalence shapes modern model design: ResNet-152 on ImageNet, with 60 million parameters, generalizes on 1.28 million examples better than the classical VC bound predicts.

VC dimension of linear classifiers in $\mathbb{R}^{100}$ is 101. For $\varepsilon = 0.05$ and $\delta = 0.05$, the minimum $m$ from the VC theorem is approximately?

VC dimension of the perceptron

Hyperplanes in $\mathbb{R}^d$: VC dimension $= d+1$. Lower bound proof: $d+1$ points in general position can be separated in any way. Upper bound: $d+2$ points cannot be shattered in every configuration. VC theorem consequence: $m \geq C(d \log(d/\varepsilon) + \log(1/\delta))/\varepsilon$ suffices for PAC learning. Note: linear in $d$, not exponential - this makes perceptron training practical.

Итоги

PAC learning: with probability $1-\delta$, error $\leq \varepsilon$ using $m = O(\log(|\mathcal{H}|/\delta)/\varepsilon^2)$ for finite classes.
VC dimension characterizes infinite classes: $m = O(d/\varepsilon^2 \cdot \log(1/\varepsilon\delta))$ suffices.
Modern theory moves beyond VC toward margin bounds, PAC-Bayes, and implicit regularization.

Connections to other topics

Concentration of measure (prob-22) is the main tool: PAC bounds are built from Hoeffding + union bound. Information-theoretic bounds (prob-25): PAC-Bayes bounds use KL divergence between the prior and posterior weight distributions. The Bayesian view on generalization connects margin bounds to regularization.

Prob 22 — related
Prob 25 — related

Вопросы для размышления

An SVM with $d = 10^6$ features has VC dimension $10^6$, but the margin-based bound: $R(h) \leq \hat R(h) + O(\|w\|^2/(\gamma^2 m))$, does not depend on $d$. Why is the margin bound independent of the number of features?
What principle underlies this - and how does weight decay implement it in practice?

Связанные уроки

ml-01-intro