Statistical Learning Theory

Rademacher Complexity

The VC dimension of a neural network grows with its parameter count, so for networks with billions of parameters the VC theorem gives no useful bounds. Rademacher complexity can be bounded in terms of weight norms rather than parameter count, and can therefore give meaningful, data-dependent estimates.

**Deep learning:** Norm-based Rademacher bounds for networks such as ResNet depend on weight norms rather than parameter count, which offers one explanation for generalization with billions of parameters
**Regularization:** L2 regularization shrinks the weight norm, which lowers the norm-based bound on R-hat and tightens generalization guarantees of the Bartlett-Mendelson type
**PAC-Bayes:** A Bayesian counterpart to the Rademacher approach for stochastic classifiers; it has been used to analyze networks trained with dropout
**Spectral normalization in GANs:** Bounding the spectral norm of each discriminator layer bounds the discriminator's Lipschitz constant and hence its norm-based Rademacher complexity; in practice this helps stabilize training

Prerequisites

Agnostic PAC learning
Uniform convergence
Union bound and Hoeffding inequality

PAC Learning Extensions

Definition of Rademacher complexity

Bartlett and Mendelson (2002) bounded the Rademacher complexity of neural networks in terms of weight norms rather than the number of parameters. This suggests why a model such as ResNet-50, with about 25 million parameters, can still generalize: if training implicitly keeps the norms small, the Rademacher complexity of the effectively used hypothesis class is small despite the large parameter count.

What does the empirical Rademacher complexity R_n(F) measure?

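R_n(F) = E_sigma[sup_{f ∈ F} (1/n) sum_i sigma_i f(x_i)], where sigma_1, ..., sigma_n are independent random signs, each +1 or -1 with probability 1/2, and x_1, ..., x_n is the observed sample. The more closely a class can correlate with random noise, the higher its complexity. Unlike VC dimension, which is a worst case over all possible samples, this quantity is data-dependent: it is computed on the actual sample and therefore reflects the actual data distribution.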
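The definition can be estimated numerically by averaging over random sign vectors. The sketch below is illustrative only: it assumes the class of linear predictors {x -> <w, x> : ||w||_2 <= B}, for which the supremum has a closed form, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= B} on the fixed sample X (shape n x d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Each row of sigma is one draw of n independent +/-1 signs.
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    # For this class the supremum is attained in closed form (Cauchy-Schwarz):
    # sup_w (1/n) sum_i sigma_i <w, x_i> = (B/n) * ||sum_i sigma_i x_i||_2
    sups = B * np.linalg.norm(sigma @ X, axis=1) / n
    return float(sups.mean())

def linear_class_upper_bound(X, B=1.0):
    """Analytic upper bound via Jensen's inequality: (B/n) * sqrt(sum_i ||x_i||^2)."""
    n = X.shape[0]
    return float(B * np.sqrt((X ** 2).sum()) / n)

# Synthetic sample, assumed purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
estimate = empirical_rademacher_linear(X)
bound = linear_class_upper_bound(X)
print(f"Monte Carlo estimate: {estimate:.4f}, analytic upper bound: {bound:.4f}")
```

For classes without a closed-form supremum, the inner maximization would have to be solved or approximated for every sign draw, which is usually the expensive part.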

Generalisation bounds via Rademacher

What generalisation bound does Rademacher complexity give?

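Mohri, Rostamizadeh, and Talwalkar (2018): for a loss with values in [0, 1], with probability at least 1-delta over a sample of size n, simultaneously for all h ∈ F: R(h) <= R_hat(h) + 2*R_n(F) + sqrt(log(2/delta)/(2n)), where R_n(F) is the expected Rademacher complexity of the loss class induced by F. If R_n(F) is replaced by the empirical complexity computed on the sample, the last term becomes 3*sqrt(log(2/delta)/(2n)). The Rademacher term captures class complexity; the last term is the concentration correction obtained from McDiarmid's inequality.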
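To get a feel for the size of each term, the bound can be evaluated numerically. This is a minimal sketch, assuming a loss bounded in [0, 1]; the function name, its parameters, and the example numbers are hypothetical. The factor 3 on the confidence term is the standard price for plugging in the empirical complexity instead of the expected one.

```python
import math

def risk_upper_bound(empirical_risk, rademacher, n, delta=0.05,
                     empirical_complexity=False):
    """Evaluate R_hat(h) + 2 * R_n(F) + c * sqrt(log(2/delta) / (2n)).

    c = 1 when `rademacher` is the expected complexity,
    c = 3 when it is the empirical complexity measured on the sample.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    factor = 3.0 if empirical_complexity else 1.0
    confidence_term = factor * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return empirical_risk + 2.0 * rademacher + confidence_term

# Illustrative numbers only: 8% training error, complexity 0.02, 10,000 samples.
b_expected = risk_upper_bound(0.08, 0.02, n=10_000)
b_empirical = risk_upper_bound(0.08, 0.02, n=10_000, empirical_complexity=True)
print(f"expected-complexity bound:  {b_expected:.4f}")
print(f"empirical-complexity bound: {b_empirical:.4f}")
```

The confidence term shrinks like 1/sqrt(n) and depends only logarithmically on delta, so for large samples the complexity term usually dominates.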

Computing Rademacher complexity

Rademacher complexity measures a class's capacity to memorize random noise. An analogy: a teacher shows students problems with randomly assigned labels. A student who can perfectly explain random labels has memorized rather than understood. A class with high Rademacher complexity can fit almost any noise pattern, which means it is too weakly constrained for reliable generalization.
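The memorization picture can be checked on a toy example. The sketch below compares two classes on n points of the real line: one-sided threshold classifiers (a heavily constrained class) and a hypothetical "memorizer" class that contains every possible labeling of the sample. Both classes and all names are assumptions made for illustration.

```python
import numpy as np

def sup_correlation_thresholds(sigma):
    """Best correlation with the signs sigma achievable by a threshold rule.

    Points are assumed sorted; rule k labels the first k points -1 and the
    remaining points +1, for k = 0, ..., n.
    """
    n = len(sigma)
    prefix = np.concatenate(([0.0], np.cumsum(sigma)))  # prefix[k] = sum of first k signs
    total = prefix[-1]
    # Correlation of rule k with sigma: (total - prefix[k]) - prefix[k]
    return float(np.max(total - 2.0 * prefix) / n)

def sup_correlation_memorizer(sigma):
    """A class containing all 2^n labelings can pick f = sigma exactly."""
    return float(np.mean(sigma * sigma))

rng = np.random.default_rng(0)
n, num_draws = 400, 1000
draws = rng.choice([-1.0, 1.0], size=(num_draws, n))
thresholds_avg = float(np.mean([sup_correlation_thresholds(s) for s in draws]))
memorizer_avg = float(np.mean([sup_correlation_memorizer(s) for s in draws]))
print(f"thresholds: {thresholds_avg:.3f}, memorizer: {memorizer_avg:.3f}")
```

The memorizer always reaches correlation 1 with the noise, while the threshold class can only exploit chance imbalances in the signs, so its average stays close to zero and shrinks as n grows.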

What bound relates Rademacher complexity to VC dimension?

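Combining the Sauer-Shelah lemma (growth function at most (en/d)^d for n >= d, where d is the VC dimension) with Massart's lemma yields R_n(F) <= sqrt(2*d*log(en/d)/n) = O(sqrt(d * log(n/d) / n)). This is an analogue of the Vapnik-Chervonenkis bounds with an explicit constant. For norm-bounded predictors composed with an L-Lipschitz function, the contraction lemma gives a bound of order O(L/sqrt(n)), where the hidden constant depends on the norm bounds rather than on d.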
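The two steps of this argument can be written out as a short derivation. It assumes a class F of ±1-valued functions with VC dimension d and a sample size n >= d; \Pi_F(n) denotes the growth function and \hat{R}_S(F) the empirical complexity on the sample S (notation introduced here for the sketch).

```latex
% Step 1 (Massart's lemma): restricted to the sample S, F realizes at most
% \Pi_F(n) distinct vectors in \{-1,+1\}^n, each of Euclidean norm \sqrt{n}.
\hat{R}_S(F)
  \;\le\; \frac{\sqrt{n}\,\sqrt{2\log \Pi_F(n)}}{n}
  \;=\; \sqrt{\frac{2\log \Pi_F(n)}{n}}

% Step 2 (Sauer-Shelah lemma): for n \ge d,
\Pi_F(n) \;\le\; \sum_{i=0}^{d}\binom{n}{i} \;\le\; \left(\frac{en}{d}\right)^{d}

% Combining the two steps:
\hat{R}_S(F) \;\le\; \sqrt{\frac{2\,d\,\log(en/d)}{n}}
```

Because the right-hand side does not depend on S, the same bound holds for the expected complexity R_n(F).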

Connections

Rademacher complexity connects classical generalization theory with modern neural network analysis and Bayesian methods.

VC theory
Spectral theory of neural networks
PAC-Bayes
Implicit regularization of SGD

Summary

R-hat_S(F): expected correlation with Rademacher noise; measures the expressiveness of F on the specific sample S
Symmetrization: R(h) at most R_hat(h) + 2*R-hat_S(F) + O(sqrt(log(1/delta)/n))
Contraction lemma: R_n(phi composed with F) at most L_phi times R_n(F) for an L_phi-Lipschitz phi; enables layer-wise analysis of neural networks
Neural networks: R_n(F) can be bounded by a product of layer weight norms rather than by the number of parameters
Connection to VC: R_n(F) at most O(sqrt(VC(F)*log(n/VC(F))/n)), but on specific data it can be much smaller

Why is Rademacher complexity better than VC-dimension for analyzing neural networks?

Rademacher complexity accounts for the specific sample and hypothesis norms. This allows analysis of neural networks with large parameter counts where VC-dim gives weak guarantees.
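The contrast between norms and parameter counts can be illustrated with a small numerical sketch. It computes the product of layer spectral norms, which upper-bounds the Lipschitz constant of a network with 1-Lipschitz activations such as ReLU and is the kind of quantity that norm-based Rademacher bounds depend on. It is a capacity proxy and not itself a generalization bound; the layer shapes and helper names are hypothetical.

```python
import numpy as np

def spectral_norm_product(weights):
    """Product of the largest singular values of the layer weight matrices."""
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

def parameter_count(weights):
    """Total number of weight entries across layers."""
    return sum(W.size for W in weights)

# A toy 3-layer network (shapes assumed for illustration only).
rng = np.random.default_rng(0)
shapes = [(64, 32), (64, 64), (1, 64)]
weights = [rng.normal(size=s) / np.sqrt(s[1]) for s in shapes]

# Same architecture with every layer scaled by 0.5, as strong weight decay might do.
shrunk = [0.5 * W for W in weights]

print("parameters:", parameter_count(weights), "vs", parameter_count(shrunk))
print("norm product:", spectral_norm_product(weights),
      "vs", spectral_norm_product(shrunk))
```

Both networks have identical parameter counts, so a parameter-count-based VC argument treats them the same, while the norm product of the shrunk network is smaller by a factor of 0.5^3.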
