Statistical Learning Theory

VC Dimension and Sample Complexity

GPT-4 trains on 45TB of text (~13 trillion tokens). VC theory explains: with VCdim=d one need O(d/ε²) samples for ε-error. For modern LLMs d ~ number of parameters, but Rademacher bounds with norm constraints give finite guarantees. Amazon uses PAC bounds to estimate minimum dataset sizes in AutoML. Google Brain applies Rademacher complexity to choose regularization strength in production models.

  • **AutoML sample size estimation:** Amazon SageMaker Autopilot estimates the minimum dataset size via VC bounds before launching architecture search, saving hours of compute.
  • **SVM in bioinformatics:** SVM with RBF kernel classifies gene expression (20000+ features, 100-1000 samples). VCdim=∞, but Rademacher bounds explain generalization.
  • **Neural scaling laws:** Chinchilla (DeepMind 2022) , n_tokens ≈ 20·n_params is optimal. This aligns with PAC sample complexity O(d/ε²) with d ~ parameters.

VC Dimension: shattering and growth function

**GPT-4 trains on 45TB of text, but VC theory explains: with VC dimension d, one need O(d/ε) samples for ε-accuracy with 95% probability.** Why does a halfplane in ℝ² need only 3 points for shattering, while a linear classifier in ℝ¹⁰⁰⁰⁰ needs 10001? This gap governs how much data ImageNet requires vs MNIST.

**VC dim of common classes:** Threshold functions on ℝ: 1. Halfspaces in ℝᵈ: d+1. Neural net with W parameters: O(W log W). SVM with RBF kernel: ∞ (but Rademacher complexity is finite).

What is the VC dimension of linear classifiers in ℝᵈ?

A linear classifier in ℝᵈ has d+1 parameters (d weights + bias). Any d+1 points in general position can be shattered, but no d+2 points can. Hence VCdim = d+1.

PAC Learning: the fundamental theorem

**Probably Approximately Correct (PAC) learning** (Valiant 1984): algorithm A PAC-learns class H if for any ε, δ > 0, with enough samples it outputs h with err(h) ≤ ε with probability ≥ 1-δ. The fundamental theorem: H is PAC-learnable if and only if its VC dimension is finite.

**No Free Lunch:** without assumptions about the distribution, no algorithm learns better than random. PAC learnability relies on H having finite VCdim , that is the inductive bias.

What is the necessary and sufficient condition for PAC learnability of class H?

By the fundamental theorem of PAC learning (Blumer et al., 1989): H is PAC-learnable iff VCdim(H) < ∞. The class can be infinite (all linear classifiers) yet PAC-learnable if VC dim is finite.

Rademacher Complexity: uniform convergence beyond VC

**Rademacher complexity** measures the expressiveness of a function class through its ability to fit random noise. Unlike VC dim, it is data-dependent and yields tighter bounds for specific distributions. Rademacher bounds underpin modern generalization guarantees for neural networks.

**When Rademacher beats VC:** when weights are norm-constrained (L2 regularization), when data is low-dimensional, when VCdim=∞ (RBF SVM, deep networks). Modern deep learning bounds use Rademacher with spectral norm constraints.

When is the Rademacher bound tighter than the VC bound?

Rademacher complexity of linear classifiers is O(BR/√n), independent of dimension d. VC bound is O(√(d log n / n)), which grows with d. When d is large and weights are bounded, Rademacher is tighter. For RBF SVM VCdim=∞ but Rademacher is finite.

Key ideas

  • **VC dimension:** VCdim(H) = max shatterable set size. Halfplanes in ℝᵈ: d+1. Neural net with W weights: O(W log W).
  • **Sauer-Shelah lemma:** Pi_H(m) ≤ (em/d)^d , polynomial growth function bound makes uniform convergence possible.
  • **PAC sample complexity:** n = O(d·log(1/ε)/ε² + log(1/δ)/ε²) samples suffice for ε-accuracy with probability 1-δ.
  • **Fundamental theorem:** H is PAC-learnable iff VCdim(H) < ∞.
  • **Rademacher complexity:** data-dependent measure, dimension-independent for norm-constrained weights. Tighter than VC for RBF SVM.

Related topics

VC dim and PAC are the foundation of all generalization theory:

  • Rademacher and uniform convergence — Lessons 6-7: proof details
  • Kernel Methods — Previous lesson: RKHS where VCdim=∞ but generalization holds

Связанные уроки

  • lt-04-vc-dimension — Fresh view from PAC and Rademacher perspectives
  • lt-06-rademacher — Rademacher complexity as a modern alternative to VC
  • lt-07-uniform-convergence — Uniform convergence is the core proof tool
VC Dimension and Sample Complexity

0

1

Sign In