Statistics
Empirical Processes and VC Theory
Why does a statistical learning algorithm that performs well on training data generalize to new examples? What is the mathematical reason for generalization?
- **TensorFlow Probability:** Kolmogorov-Smirnov goodness-of-fit test runs in O(n log n) - a direct consequence of Donsker's theorem on the empirical process
- **PAC learning theory:** generalization guarantees for neural networks via Rademacher complexity of the function class
- **Data drift detection:** two-sample tests in production ML systems monitoring shifts in the input distribution
- **Quantile regression:** uniform convergence of empirical quantiles is a consequence of VC theory and empirical processes
Предварительные знания
- Law of large numbers
- Central limit theorem
- Empirical CDF
Empirical Processes and the Glivenko-Cantelli Theorem
The empirical measure P_n = (1/n)∑δ_{X_i} approximates the true distribution P. The Glivenko-Cantelli theorem guarantees uniform convergence sup_x|F̂_n(x) − F(x)| → 0 almost surely - the fundamental theorem of statistics. The empirical process G_n(f) = √n(P_nf − Pf) studies the rate and limiting distribution of this deviation for an entire class of functions F. Donsker's theorem: when the entropy integral is finite, G_n converges weakly to a Gaussian process in the space of bounded functions.
How is the Glivenko-Cantelli theorem stronger than the pointwise law of large numbers?
The pointwise LLN says for each fixed x: F̂_n(x) → F(x) a.s. Glivenko-Cantelli says sup over all x simultaneously tends to 0 a.s. This is uniform convergence, a strictly stronger statement. Both have O(1/√n) rate; the DKW inequality gives P(√n·sup|F̂_n−F| > t) ≤ 2e^{−2t²}.
VC Dimension and Uniform Convergence (Donsker's Theorem)
The VC dimension d_VC(F) is a combinatorial complexity measure: the maximum number of points that F can shatter (correctly classify in all 2^n binary label configurations). The key Vapnik-Chervonenkis result: the uniform LLN holds if and only if d_VC(F) < ∞. This provides the theoretical foundation of PAC-learnability: finite VC dimension is equivalent to F being PAC-learnable.
Why does sin(kx) have infinite VC dimension despite depending on only one parameter k?
For any n points x₁ < ... < xₙ and any binary labeling (y₁,...,yₙ) in {±1}ⁿ, there exists k (with sufficiently high frequency) such that sign(sin(kxᵢ)) = yᵢ for all i. This means {sign(sin(kx)): k > 0} shatters every finite set. Number of parameters does not determine VC dimension - functional form does.
Function Classes and Rademacher Complexity
Rademacher complexity R_n(F) is a finer, data-dependent measure of the flexibility of a function class F. It measures how well F can fit random noise: if F can match any random pattern, R_n is high. Generalization bounds via R_n are sharper than VC bounds: they depend on the specific distribution P rather than the worst case. Dudley's inequality connects R_n to the entropy integral - a metric complexity measure for the function class.
Why are generalization bounds via Rademacher complexity R_n sharper than bounds via VC dimension d_VC?
d_VC(F) is worst-case: maximum shattering over any distribution. R_n(F) measures complexity of F on data from the actual P. For a benign P, F may be far simpler than d_VC predicts. The empirical R̂_n is sharper still: it depends on the specific training sample X₁,...,Xₙ, not just on P.
Empirical processes and machine learning
Empirical process theory provides the theoretical foundation for generalization guarantees in ML.
- PAC learning — Theoretical foundation: PAC-learnability is equivalent to finite VC dimension
- Deep learning — Neural network generalisation bounds via Rademacher complexity and weight norms
- Multiple testing — FWER control through empirical processes and the Dvoretzky-Kiefer-Wolfowitz inequality
Итоги
- Glivenko-Cantelli: sup_x|F̂_n(x)−F(x)| → 0 a.s. - uniform convergence of the empirical CDF
- Empirical process G_n(f) = √n(P_nf − Pf); Donsker: G_n →d G (Gaussian process) when the entropy integral is finite
- VC dimension d_VC(F): max shattering number; uniform LLN holds if and only if d_VC < ∞
- Sauer-Shelah: m_F(n) ≤ (en/d)^d when d_VC = d - polynomial growth of the shatter function
- VC inequality: sup|P_nf−Pf| ≤ C√(d_VC·log n / n) with high probability
- Rademacher complexity R_n(F): distribution-dependent flexibility measure; bound ≤ 2R_n + √(log(1/δ)/(2n))
- Dudley's inequality: R_n(F) ≤ (12/√n)·∫√(log N(ε))dε - connects covering numbers to generalization
- PAC-learnability if and only if d_VC < ∞ (Blumer-Ehrenfeucht theorem)