Statistical Learning Theory

Growth function and Sauer's lemma

1971. Vladimir Vapnik and Alexey Chervonenkis ask: how many distinct ways can a hypothesis class H partition m points into two classes? The answer is not infinity. For a linear classifier in R^d it is exactly a polynomial of degree d. This polynomial nature makes generalization possible - an infinite hypothesis class behaves like a finite one.

**ImageNet training:** 1.28M examples, a model with 10^9 parameters - growth function is astronomically large in theory, but double descent explains generalization where VC-theory goes silent
**Neural scaling laws (Kaplan 2020):** test loss ~ (data/params)^alpha - an empirical curve that reproduces PAC-theory predictions for small models and departs from them for large ones
**Sample complexity:** PAC-bound via growth function gives m >= O(d log(1/eps)/eps) - exactly how many training examples guarantee generalization for VC-dimension d

Предварительные знания

VC-dimension: maximum size of a set shattered by H
Shattering: H shatters S if it realizes all 2^|S| labelings

VC dimension: measuring hypothesis class complexity

Growth function: counting patterns

Let H be a hypothesis class, X the feature space. The **growth function** $\Pi_H(m)$ counts the maximum number of distinct labelings H can produce on m points:

Simpler: take m points in arbitrary position, count how many distinct colorings class H can paint on them. Take the maximum over all possible placements - that is $\Pi_H(m)$. The naive upper bound is $2^m$. The question is how far the actual growth is from this ceiling.

**Three classes - three fates of growth function.** Finite class $|H| = N$: $\Pi_H(m) \leq N$, constant growth. Linear classifiers in $\mathbb{R}^d$: $\Pi_H(m) = O(m^d)$, polynomial growth. Class of all functions: $\Pi_H(m) = 2^m$, exponential growth. Sauer's lemma explains why the second case is the norm for reasonable hypothesis classes.

Class H consists of all threshold functions h_t(x) = sign(x - t) on the real line. What is Pi_H(2)?

Sauer-Shelah lemma: polynomial defeats exponential

1972. Norbert Sauer and independently Saharon Shelah prove one lemma. Twenty years later it becomes the foundation of PAC-learning.

**Sauer-Shelah lemma.** If $\text{VCdim}(H) = d$, then for all $m \geq 1$: $$\Pi_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}$$ For $m \geq d$ the right-hand side equals $O(m^d)$ - polynomial growth of degree d.

What this means in practice: a class with VC-dimension $d = 10$ can realize not $2^{1000}$ but roughly $1000^{10} \approx 10^{30}$ patterns on 1000 points. Large - but finite. The logarithm of growth function ($\log \Pi_H(m) \leq d \log m$) replaces $\log |H|$ in generalization bounds - exactly how an infinite hypothesis class obtains a finite generalization guarantee.

At $m = 100$ the gap between actual growth and the theoretical ceiling is $10^{24}$ times. That distance turns an infinite hypothesis class into something tractable from a generalization standpoint.

Class H has VCdim = 5. Estimate Pi_H(100) via Sauer's lemma. Nearest order of magnitude?

PAC-bound via growth function

The growth function replaces $|H|$ in the union bound. The standard generalization bound for a finite class: $$\mathbb{P}[\sup_{h \in H} |R(h) - \hat{R}(h)| > \epsilon] \leq 2|H| e^{-2m\epsilon^2}$$ For an infinite class H, replace $|H|$ with $\Pi_H(2m)$ - the growth function on $2m$ points (the ghost sample symmetrization needs double size):

Plugging in Sauer's lemma ($\Pi_H(2m) \leq (2m)^d$) and solving for $m$ gives the PAC sample complexity:

**Reading this formula literally:** for a linear classifier in $\mathbb{R}^d$ to generalize with error no worse than $\epsilon$ with probability $1 - \delta$, one needs $\Theta(d / \epsilon^2)$ examples - linear in VC-dimension. For logistic regression in $\mathbb{R}^{1000}$ that is roughly $10^4$ examples at $\epsilon = 0.05$. For a neural network with $d \sim 10^9$ - theory demands $\sim 10^{10}$ examples. ImageNet has 1.28M. Something is off.

Why does the PAC-bound use Pi_H(2m) instead of Pi_H(m)?

Double descent: where VC-theory goes silent

GPT-3 has 175 billion parameters. By VC-theory it needs hundreds of billions of examples. In practice - Common Crawl plus Books3 sufficed (hundreds of billions of tokens, but not labeled data for a specific task). BERT fine-tuning on GLUE works with a few thousand examples. VC-theory has nothing to say.

**Double descent (Belkin et al. 2019, Nakkiran et al. 2020):** when the parameter count exceeds the example count, test error does not grow monotonically - it drops again. The classical U-shaped bias-variance tradeoff exists only in the left portion. Neural scaling laws (Kaplan 2020) describe the right portion:

The growth function and Sauer's lemma remain correct - they are simply not tight enough for the overparameterized regime. VC-dimension bounds for neural networks are vacuously true and practically useless. Rademacher complexity and PAC-Bayes give tighter bounds, but the gap between best theoretical guarantees and empirical results remains orders of magnitude wide.

A model with d=10^9 parameters generalizes to 50,000 test examples better than VC-theory predicts. The most accurate explanation is:

Key takeaways

**Growth function** $\Pi_H(m)$ - maximum number of labelings H produces on m points. Hard ceiling $2^m$, actual growth depends on VC-dimension
**Sauer-Shelah lemma:** $\Pi_H(m) \leq \sum_{i=0}^{d} \binom{m}{i} = O(m^d)$ when VCdim $= d$. An infinite class behaves like a polynomially large finite one
**PAC sample complexity via growth function:** $m \geq O(d \log(1/\epsilon) / \epsilon^2)$ - linear in VC-dim, semi-logarithmic in accuracy
**Double descent:** for overparameterized neural networks VC-bounds are off by 5-10 orders of magnitude. Neural scaling laws (Kaplan 2020) provide an empirical replacement
**Practical summary:** growth function explains generalization of classical methods (SVM, logistic regression). For deep networks - necessary but not sufficient

What comes next

Growth function is one tool in the generalization theory arsenal:

Rademacher complexity — Tighter measure: accounts for data distribution, not only combinatorial complexity
Uniform convergence — The theorem that uses Sauer's lemma as a key ingredient in its proof
PAC-Bayes bounds — Stochastic bounds giving tighter guarantees for neural networks via prior/posterior
Deep generalization — Empirical phenomena beyond VC-theory: double descent, grokking, neural scaling

Вопросы для размышления

A linear classifier in R^100 has VC-dim around 101. PAC-bound requires roughly 2000 examples at eps=0.1. A neural network with 10^7 parameters on the same task works with 500 examples. How do these two facts coexist?
Sauer's bound gives Pi_H(m) = O(m^d). What happens to bounds when d grows with m - for example, d = m/2?
Neural scaling laws say: test_loss ~ (data/params)^0.076. How does this relate to O(d log(1/eps)/eps^2) from PAC-theory for small models?

Связанные уроки

lt-04-vc-dimension — VC-dimension as the threshold where growth switches from polynomial to exponential
lt-06-rademacher — Rademacher complexity: a finer measure replacing growth function in modern bounds
lt-07-uniform-convergence — Uniform convergence theorem is built on top of Sauer's lemma
lt-13-deep-generalization — Double descent and neural scaling laws - empirical answer where VC-theory breaks down
lt-03-agnostic-pac — PAC-bounds via growth function extend the agnostic learning framework
stat-01-sampling