Statistics
Vapnik - Chervonenkis Theory
Why does a neural network with a billion parameters generalize? Why does a linear model with a thousand features not overfit? Vapnik-Chervonenkis theory gives rigorous answers to questions that every ML engineer asks intuitively.
- SVMs: maximizing the margin = minimizing effective VCdim via ||w||; VC theory explains exactly why this works
- Regularization in neural networks: weight decay and dropout reduce effective complexity, improving VC-based bounds
- AutoML: NAS selects architectures with the right bias-variance balance through implicit VCdim control
Предварительные знания
VC Dimension
In 1968 Vapnik and Chervonenkis tied generalisation to a single integer: the VC dimension. The same number explains why GPT-4 with 10^12 parameters can train at all and underlies the fundamental theorem of statistical learning.
**Infinite VCdim is not the end:** some classes with VCdim = infinity still generalize well in practice (1-NN with a good metric, neural networks with implicit regularization). This apparent contradiction with classical VC theory explains the double descent phenomenon: over-parameterized neural networks often generalize better than well-regularized ones. Modern perspective: uniform convergence alone is insufficient, the algorithm's inductive biases (SGD, architecture) matter.
A hypothesis class H has VCdim = 5. Can H shatter some set of 6 points?
Rademacher Complexity and Generalization
The **empirical Rademacher complexity** is R̂n(H) = E_σ[sup_{h in H} (1/n) sum_i σ_i h(x_i)], where σ_i ~ Uniform{±1} are Rademacher random variables. It measures how well H can fit pure random noise. The **generalization bound** states: L(h) <= L̂(h) + 2Rn(H) + sqrt(log(1/δ)/(2n)) with probability at least 1-δ. Connection to VC: Rn(H) = O(sqrt(VCdim(H) · log(n)/n)).
**VC bounds are loose for neural networks:** GPT-4 has ~10^12 parameters, making VC bounds vacuous. Yet neural networks generalize. Explanations include: implicit regularization of SGD, architectural inductive biases, overparameterized interpolation (benign overfitting), and compression-based bounds. Modern PAC-Bayes bounds give tighter guarantees for neural networks.
Rademacher complexity R̂n(H) = E_σ[sup_h (1/n)∑σ_i h(x_i)], σ_i ~ {±1}. What does R̂n(H) ≈ 0 mean?
PAC Learnability and the Fundamental Theorem of ML
**PAC learnability** (Probably Approximately Correct, Valiant 1984): a class H is PAC learnable if there exists an algorithm A and a function m(ε,δ) such that for any distribution D and any ε,δ in (0,1): when n >= m(ε,δ), with probability at least 1-δ the algorithm outputs h with L(h) <= min_{h* in H} L(h*) + ε. The **Fundamental Theorem of ML**: (finite VCdim) ↔ (PAC learnability). Minimum sample size: m(ε,δ) = O((VCdim(H) + log(1/δ))/ε²).
**Bias-variance through the VC lens:** ERM minimizes empirical loss (variance increases when H is complex); a simple H has high approximation error (bias). The Fundamental Theorem suggests choosing H with VCdim ~ O(sqrt(n)/log n) for the optimal tradeoff. This motivates regularization: it reduces effective VCdim and tightens the bound. SVMs minimize ||w||^2, which explicitly shrinks effective complexity.
According to the Fundamental Theorem of ML, what is the necessary and sufficient condition for a class H to be PAC learnable?
Key Ideas
- VCdim(H) = the largest n such that H shatters at least one set of n points
- Sauer's Lemma: Pi_H(m) <= sum C(m,i) - polynomial growth when VCdim is finite
- R̂n(H): the ability of H to correlate with random noise; equals O(sqrt(VCdim·log n/n))
- Generalization bound: L(h) <= L̂(h) + 2Rn(H) + sqrt(log(1/delta)/(2n))
- Fundamental Theorem: PAC learnability <=> VCdim(H) < infinity
VC Theory in Context
VC theory bridges statistics (generalization) and computational learning theory (PAC). It connects to the bias-variance tradeoff through approximation error vs. estimation error. Regularization is precisely the control of effective VCdim.
- Statistics in ML: theoretical foundations — Bias-variance decomposition is a concrete instance of approximation vs. estimation error in the VC framework
- Robust Statistics — Breakdown point and VC complexity are related through robust M-estimators and their effective dimensionality
Вопросы для размышления
- GPT-4 has ~10^12 parameters and a huge VCdim. Classical VC bounds on generalization are vacuous for it. Yet the model generalizes. What does classical VC theory fail to account for? How does this connect to the double descent phenomenon?
- The No Free Lunch theorem says there is no universally best algorithm. Why then do neural networks dominate most real-world tasks? What implicit 'domain knowledge' are they exploiting?
- A linear classifier and an SVM are trained on the same data. The SVM explicitly minimizes ||w||^2, reducing effective VCdim. How does this shift the PAC generalization curve? At what sample size n does the SVM's guarantee theoretically dominate?