Statistics

Empirical Processes and VC Theory

Why does a statistical learning algorithm that performs well on training data generalize to new examples? What is the mathematical reason for generalization?

**TensorFlow Probability:** Kolmogorov-Smirnov goodness-of-fit test runs in O(n log n) - a direct consequence of Donsker's theorem on the empirical process
**PAC learning theory:** generalization guarantees for neural networks via Rademacher complexity of the function class
**Data drift detection:** two-sample tests in production ML systems monitoring shifts in the input distribution
**Quantile regression:** uniform convergence of empirical quantiles is a consequence of VC theory and empirical processes

Предварительные знания

Law of large numbers
Central limit theorem
Empirical CDF

Empirical Processes and the Glivenko-Cantelli Theorem

The empirical measure P_n = (1/n)∑δ_{X_i} approximates the true distribution P. The Glivenko-Cantelli theorem guarantees uniform convergence sup_x|F̂_n(x) − F(x)| → 0 almost surely - the fundamental theorem of statistics. The empirical process G_n(f) = √n(P_nf − Pf) studies the rate and limiting distribution of this deviation for an entire class of functions F. Donsker's theorem: when the entropy integral is finite, G_n converges weakly to a Gaussian process in the space of bounded functions.

How is the Glivenko-Cantelli theorem stronger than the pointwise law of large numbers?

The pointwise LLN says for each fixed x: F̂_n(x) → F(x) a.s. Glivenko-Cantelli says sup over all x simultaneously tends to 0 a.s. This is uniform convergence, a strictly stronger statement. Both have O(1/√n) rate; the DKW inequality gives P(√n·sup|F̂_n−F| > t) ≤ 2e^{−2t²}.

VC Dimension and Uniform Convergence (Donsker's Theorem)

The VC dimension d_VC(F) is a combinatorial complexity measure: the maximum number of points that F can shatter (correctly classify in all 2^n binary label configurations). The key Vapnik-Chervonenkis result: the uniform LLN holds if and only if d_VC(F) < ∞. This provides the theoretical foundation of PAC-learnability: finite VC dimension is equivalent to F being PAC-learnable.

Why does sin(kx) have infinite VC dimension despite depending on only one parameter k?

For any n points x₁ < ... < xₙ and any binary labeling (y₁,...,yₙ) in {±1}ⁿ, there exists k (with sufficiently high frequency) such that sign(sin(kxᵢ)) = yᵢ for all i. This means {sign(sin(kx)): k > 0} shatters every finite set. Number of parameters does not determine VC dimension - functional form does.

Function Classes and Rademacher Complexity

Rademacher complexity R_n(F) is a finer, data-dependent measure of the flexibility of a function class F. It measures how well F can fit random noise: if F can match any random pattern, R_n is high. Generalization bounds via R_n are sharper than VC bounds: they depend on the specific distribution P rather than the worst case. Dudley's inequality connects R_n to the entropy integral - a metric complexity measure for the function class.

Why are generalization bounds via Rademacher complexity R_n sharper than bounds via VC dimension d_VC?

d_VC(F) is worst-case: maximum shattering over any distribution. R_n(F) measures complexity of F on data from the actual P. For a benign P, F may be far simpler than d_VC predicts. The empirical R̂_n is sharper still: it depends on the specific training sample X₁,...,Xₙ, not just on P.

Empirical processes and machine learning

Empirical process theory provides the theoretical foundation for generalization guarantees in ML.

PAC learning — Theoretical foundation: PAC-learnability is equivalent to finite VC dimension
Deep learning — Neural network generalisation bounds via Rademacher complexity and weight norms
Multiple testing — FWER control through empirical processes and the Dvoretzky-Kiefer-Wolfowitz inequality

Итоги

Glivenko-Cantelli: sup_x|F̂_n(x)−F(x)| → 0 a.s. - uniform convergence of the empirical CDF
Empirical process G_n(f) = √n(P_nf − Pf); Donsker: G_n →d G (Gaussian process) when the entropy integral is finite
VC dimension d_VC(F): max shattering number; uniform LLN holds if and only if d_VC < ∞
Sauer-Shelah: m_F(n) ≤ (en/d)^d when d_VC = d - polynomial growth of the shatter function
VC inequality: sup|P_nf−Pf| ≤ C√(d_VC·log n / n) with high probability
Rademacher complexity R_n(F): distribution-dependent flexibility measure; bound ≤ 2R_n + √(log(1/δ)/(2n))
Dudley's inequality: R_n(F) ≤ (12/√n)·∫√(log N(ε))dε - connects covering numbers to generalization
PAC-learnability if and only if d_VC < ∞ (Blumer-Ehrenfeucht theorem)

Empirical Processes and the Glivenko-Cantelli Theorem

How is the Glivenko-Cantelli theorem stronger than the pointwise law of large numbers?

VC Dimension and Uniform Convergence (Donsker's Theorem)

Why does sin(kx) have infinite VC dimension despite depending on only one parameter k?

Function Classes and Rademacher Complexity

Why are generalization bounds via Rademacher complexity R_n sharper than bounds via VC dimension d_VC?

Итоги

Glivenko-Cantelli: sup_x|F̂_n(x)−F(x)| → 0 a.s. - uniform convergence of the empirical CDF

Empirical process G_n(f) = √n(P_nf − Pf); Donsker: G_n →d G (Gaussian process) when the entropy integral is finite

VC dimension d_VC(F): max shattering number; uniform LLN holds if and only if d_VC < ∞

Sauer-Shelah: m_F(n) ≤ (en/d)^d when d_VC = d - polynomial growth of the shatter function

VC inequality: sup|P_nf−Pf| ≤ C√(d_VC·log n / n) with high probability

Rademacher complexity R_n(F): distribution-dependent flexibility measure; bound ≤ 2R_n + √(log(1/δ)/(2n))

Dudley's inequality: R_n(F) ≤ (12/√n)·∫√(log N(ε))dε - connects covering numbers to generalization

PAC-learnability if and only if d_VC < ∞ (Blumer-Ehrenfeucht theorem)