Statistical Learning Theory

Realizable learning and ERM: when the right answer is guaranteed to exist

Number shock: if a hypothesis class contains $2^{1000}$ functions - more than atoms in the universe - around 70,000 examples suffice to guarantee 1% error at 99% confidence. Not exponential blowup. The logarithm tames the class size. This is the central result of realizable PAC.

**SVM with a linear kernel on linearly separable data**: the canonical realizable case. Data is linearly separable - so a linear classifier with zero error exists. ERM in the form of max-margin SVM finds it with PAC guarantees. Sample complexity: $O(d/\varepsilon)$ where $d$ is the input dimension.
**Decision trees without pruning**: if data is realizable (no contradictory examples), a tree of sufficient depth achieves $\hat{R} = 0$. The class is finite for fixed features - PAC applies directly. No pruning equals overfitting: realizability does not mean generalization without sufficient $m$.
**Valiant's 1984 conjunctions**: the original example from the paper. Class of monomial functions over $\{0,1\}^n$: $|H| = 3^n$, sample complexity $O(n \log n / \varepsilon)$. The spam filter of 1984.
**Logistic regression on separable data**: with linear separability, SGD with enough steps finds $\hat{R} = 0$. The PAC theoretical guarantee applies - but without regularization, weights diverge to infinity.
**Double descent paradox**: modern overparameterized networks are formally in the realizable case (they interpolate training data), yet generalization does not collapse. This breaks naive realizable-PAC logic and motivates implicit regularization - the topic of lt-13.

Valiant 1984: algorithm as proof

The paper 'A Theory of the Learnable' (Communications of the ACM, 11 pages) appeared when ML did not exist as a discipline. Statisticians did MLE without thinking about algorithms. Computer scientists studied computational complexity without probabilistic guarantees. Valiant unified the two: a learner is an algorithm whose sample complexity is the measure of computational cost. The realizable case was his starting point - the cleanest scenario for a first theorem. The original example: conjunctions over the Boolean cube. |H| = 3^n (for n variables: each enters positively, negated, or not at all). The ERM algorithm finds the correct conjunction in O(nm) time. Turing Award 2010 - 26 years later. The committee cited PAC as 'one of the most influential results in theoretical computer science'.

Concept 1: realizability and what it buys

The realizable case is the assumption $c \in H$: the true concept lies within the hypothesis class. A strong assumption. Real data is noisy, features are incomplete, nature is nonlinear. But the realizable case gives the first proof and shows PAC mechanics in the cleanest form.

**SVM and the realizable case**: when data is linearly separable, hard-margin SVM is ERM in the realizable case. The max-margin hyperplane minimizes $\hat{R} = 0$ with an additional criterion (maximize margin). PAC bound for SVM in this case: $m = O(d/\varepsilon)$ where $d$ is dimensionality. Margin-based bounds give a sharper picture - covered in lt-11.

Class $H$ consists of all linear classifiers in $\mathbb{R}^2$. Data is linearly separable. What does the realizability assumption guarantee?

Realizability guarantees the existence of h* with zero true error. The PAC bound says: for $m \geq (1/\varepsilon) \ln(|H|/\delta)$, ERM returns a hypothesis with error $\leq \varepsilon$ at probability $\geq 1 - \delta$. ERM does not guarantee $R(h_S)=0$ - only smallness. Max-margin SVM is one valid ERM solution with an extra criterion.

Concept 2: union bound and the sample complexity proof

The proof technique here is used across all of statistical learning theory. The central question: what is the probability ERM fails? Failure means finding a hypothesis with zero training error that generalizes poorly. The answer comes via union bound: sum the risk over all potentially bad hypotheses.

The intuition here: the union bound is pessimistic. Summing over all $|H|$ hypotheses, even though most of them are far from having zero error by chance. This explains the logarithm - the actual dependence on $|H|$ may be tighter. VC-dimension (lt-04) is a sharper measure, replacing $\log|H|$ with the 'effective' complexity of the class.

In the sample complexity proof, the key tool is the union bound. Why does the dependence on $|H|$ end up logarithmic rather than linear?

The union bound introduces linear growth in |H|, but exponential decay in m cancels it. Solving for m, the linear |H| factor moves under the logarithm. A class 10^6 times larger requires only ln(10^6) approx 14 extra examples.

Concept 3: examples of realizable PAC - from conjunctions to neural nets

Abstract theory becomes interesting through concrete hypothesis classes. Three examples - from Valiant's original paper to modern systems - show what the formula actually measures.

Example 1: Valiant's conjunctions (1984)

The original example from the paper

Input space: X = {0,1}^n (binary vectors of length n). Target: a conjunction of some variables and their negations. Example: c(x) = x1 AND NOT x3 AND x5. Hypothesis class H: all such conjunctions. |H| = 3^n (each variable: included, negated, or absent). ERM algorithm: start with the full conjunction x1 AND NOT x1 AND x2 ... For each positive example, drop literals that evaluate to 0. Runtime: O(n*m). Sample complexity: m >= (1/eps) * ln(3^n / delta) = (n * ln 3) / eps + (1/eps) * ln(1/delta) approx (1.1 * n) / eps (dominant term) For n=100, eps=0.05, delta=0.05: m >= 20 * (100 * 1.1 + 3) = 20 * 113 approx 2260 examples. Spam filter 1984: spam is a conjunction of keywords. This was the first PAC-learnable class.

Example 2: linear classifiers in the realizable case

SVM as ERM

Data: m points in R^d, linearly separable. c is the true separating hyperplane (c in H). Hard-margin SVM is ERM with an extra criterion: minimize (1/2)||w||^2 subject to y_i(w^T x_i + b) >= 1 for all i R(h_SVM) = 0 on training data (realizable). PAC bound via VC-dimension (not |H|, since H is infinite): VC-dim of linear classifiers in R^d = d + 1 m >= O((d + 1) / eps) (realizable, via VC) Margin bounds give something sharper: m >= O(R^2 / (gamma^2 * eps)) where gamma is the margin, R is the data radius. Larger margin (well-separated data) means fewer examples needed. Intuition: "the more confident SVM is, the faster it learns".

The third example: modern neural nets. At first glance they are far from the realizable case - infinite classes, astronomical VC-dim, PAC bound useless. But overparameterized networks interpolate training data ($\hat{R} = 0$) - formally realizable. Yet generalization does not collapse. This is the double descent paradox: classical PAC predicts catastrophe for large $|H|$, neural nets refute the prediction. The reason is implicit regularization through architecture and SGD. Details in lt-13.

$\hat{R}(h) = 0$ in the realizable case is a sufficient condition for good generalization

Zero training error is necessary but not sufficient. The PAC bound adds: one needs sufficient $m$ and bounded complexity of $H$.

The class of all functions $X \to \{0,1\}$ is always realizable (any labeling can be memorized). ERM finds $\hat{R} = 0$ for $m < |X|$ by simply memorizing the sample. But $R(h_S) \approx 1/2$. Without restricting $H$, realizability guarantees nothing. This is why $\ln|H|$ appears in the formula - it is the cost of searching over a large class.

Class $H$ has $|H| = 1024 = 2^{10}$ hypotheses, $\varepsilon = 0.1$, $\delta = 0.05$. Minimum $m$ from the PAC formula for the realizable case?

$m \geq (1/0.1) \cdot \ln(1024/0.05) = 10 \cdot \ln(20480) \approx 10 \cdot 9.93 = 99.3$, round up to 100. Note: $|H| = 2^{10} = 1024$ requires only 100 examples - the logarithm tames the combinatorial explosion.

Concept 4: finite H - the limitation and the way forward

Everything above assumed finite $H$. A strong restriction: linear classifiers, neural nets, and trees of arbitrary depth all have $|H| = \infty$. The way forward rests on one key observation: not all hypotheses in $H$ are equally 'informative' on a given sample.

A linear classifier in $\mathbb{R}^2$ has infinitely many parameters, but on $m$ points it can produce at most $O(m^3)$ distinct labelings (growth function theorem). This finite number, not $|H|$, is what matters. Replacing $\log|H|$ with the logarithm of the number of distinguishable hypotheses yields bounds for infinite $H$.

**Agnostic PAC (lt-03)**: drop $c \in H$. Sample complexity grows as $1/\varepsilon^2$. The realistic setting for practice.
**VC-dimension (lt-04)**: replace $\log|H|$ with $d$. Works for infinite classes. Vapnik and Chervonenkis 1971.
**Uniform convergence (lt-07)**: prove that $\hat{R}(h) \approx R(h)$ simultaneously for all $h \in H$ - a strengthening from which realizable PAC follows as a special case.
**Margin bounds (lt-11)**: for SVM and margin-based classifiers. Bound via margin $\gamma$ rather than $d$ - can be tighter than the VC bound.
**Deep generalization (lt-13)**: why modern neural nets with $|H| = \infty$ and formally realizable behavior ($\hat{R} = 0$) generalize despite the bound.

**The realizable case is not realistic.** In production, data is always noisy ($c \notin H$ almost surely), the distribution is non-stationary, and IID is violated. Realizable PAC is valuable not as an applied theorem but as a building block: all further results are built on its proof. Understanding the union bound and the logarithm in $|H|$ is the prerequisite for agnostic PAC, VC, Rademacher, and PAC-Bayes.

Quick check: which best summarizes "Concept 4: finite H - the limitation and the way forward"?

Key takeaways

**Realizability ($c \in H$)**: the idealized scenario - the right hypothesis exists in the class. ERM finds $\hat{R} = 0$ and PAC guarantees small true error.
**Sample complexity**: $m \geq (1/\varepsilon) \ln(|H|/\delta)$. Linear in $1/\varepsilon$, logarithmic in $|H|$. Doubling $|H|$ adds $\ln 2 / \varepsilon$ examples.
**Union bound**: the central proof technique. Sum the risk over bad hypotheses; each is killed exponentially fast as $m$ grows. Hence the logarithm.
**Finite vs infinite H**: finite - bound via $\log|H|$. Infinite - via VC-dimension $d$. Same proof structure.
**SVM, trees, conjunctions**: three concrete classes where realizable PAC applies directly. Double descent in neural nets marks where the theory breaks - see lt-13.
**Realizable vs agnostic**: first is $O(1/\varepsilon)$, second is $O(1/\varepsilon^2)$. The gap explains why knowing $c \in H$ halves the data cost.

Вопросы для размышления

The realizable case assumes $c \in H$. In which real-world settings is this a reasonable assumption, and where is it directly false?
Sample complexity is logarithmic in $|H|$. Why then do LLMs with $|H| \approx 10^{10^{12}}$ require petabytes of data - is not the logarithm of that still small?
The union bound is pessimistic: it sums over all $|H|$ hypotheses, not just $|H_{bad}|$. Where does this pessimism most severely weaken the bound?
SVM in the realizable case is ERM with an extra criterion (maximize margin). Why is this extra criterion needed if PAC already guarantees small error?

What this lesson unlocks

Realizable PAC is the foundation on which the following results are built:

Agnostic PAC — Drop $c \in H$: quadratic dependence on $1/\varepsilon$, the realistic setting
VC dimension — Replace $\log|H|$ with VC-dim: works for infinite classes
Uniform convergence — Strengthening: $\hat{R} \approx R$ for all $h \in H$ simultaneously
Margin bounds — Realizable SVM is a special case; margin gives a sharper bound
Generalization paradox in deep learning — Neural nets violate realizable logic: hat-R=0 does not imply poor generalization

Связанные уроки

lt-03-agnostic-pac — Agnostic - next step: drop the c in H assumption
lt-04-vc-dimension — VC-dim generalizes log|H| to infinite hypothesis classes
lt-11-margin-bounds — Margin bounds explain SVM through the PAC-realizable lens
prob-22-concentration — Union bound and Markov inequality underpin the proof
lt-13-deep-generalization — Double descent breaks realizable logic for neural nets
stat-01-sampling