Statistical Learning Theory
Rademacher Complexity
The VC dimension of a neural network grows with its parameter count, so for models with millions or billions of parameters the VC bound is vacuous at realistic sample sizes. Rademacher complexity depends on the weight norms rather than the parameter count, and can therefore give non-trivial estimates.
- **Deep learning:** Norm-based Rademacher bounds depend on weight norms rather than parameter count, which offers one explanation of why networks such as ResNet generalize despite having millions or billions of parameters
- **Regularization:** L2 regularization shrinks the weight norm, which lowers the empirical Rademacher complexity R-hat and tightens Bartlett-Mendelson-style generalization guarantees
- **PAC-Bayes:** A Bayesian counterpart of the Rademacher approach for stochastic classifiers - used, for example, to analyze dropout networks
- **Spectral normalization in GANs:** Bounding the spectral norm of each discriminator layer bounds its Lipschitz constant and hence the discriminator's Rademacher complexity - this helps stabilize training
Prerequisites
- Agnostic PAC learning
- Uniform convergence
- Union bound and Hoeffding inequality
Definition of Rademacher complexity
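Bartlett and Mendelson (2002) showed that, for networks with norm-bounded weights, Rademacher complexity bounds depend on the weight norms rather than on the number of parameters. ResNet-50, with roughly 25 million parameters, generalizes well in practice; one proposed explanation is implicit regularization: training tends to find solutions for which the Rademacher complexity of the effective hypothesis class is small despite the large parameter count.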
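What does the empirical Rademacher complexity R_hat_S(F) measure?
R_hat_S(F) = E_sigma[sup_{f ∈ F} (1/n) sum_{i=1}^{n} sigma_i f(x_i)], where S = (x_1, ..., x_n) is the sample and sigma_1, ..., sigma_n are independent random signs, each +1 or -1 with probability 1/2. Its expectation over samples of size n is the Rademacher complexity R_n(F). The better a class can correlate with random noise, the higher its complexity. Unlike the VC dimension, which is a worst case over all possible samples, this quantity is data-dependent: it is computed on the actual sample and so reflects the actual data distribution.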
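As an illustration of the definition, the sketch below estimates the empirical Rademacher complexity by Monte Carlo for one class where the supremum has a closed form: linear predictors x -> <w, x> with ||w||_2 <= B. For this class the supremum over w equals (B/n) * ||sum_i sigma_i x_i||_2. The class, the synthetic Gaussian sample, and the names `empirical_rademacher_linear` and `jensen_upper_bound` are assumptions made for the example, not part of the original notes.

```python
import math
import random


def empirical_rademacher_linear(xs, B, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    F = {x -> <w, x> : ||w||_2 <= B} on the sample xs (a list of vectors)."""
    rng = random.Random(seed)
    n, dim = len(xs), len(xs[0])
    total = 0.0
    for _ in range(num_draws):
        # Draw independent Rademacher signs sigma_1, ..., sigma_n.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # v = sum_i sigma_i * x_i
        v = [sum(s * x[j] for s, x in zip(sigma, xs)) for j in range(dim)]
        # sup over ||w||_2 <= B of (1/n) <w, v> is attained at w = B * v / ||v||.
        total += B * math.sqrt(sum(c * c for c in v)) / n
    return total / num_draws


def jensen_upper_bound(xs, B):
    """Closed-form upper bound (B/n) * sqrt(sum_i ||x_i||^2), via Jensen's inequality."""
    n = len(xs)
    return B * math.sqrt(sum(c * c for x in xs for c in x)) / n


# Hypothetical sample: 200 points in 5 dimensions with standard normal coordinates.
data_rng = random.Random(1)
sample = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]

estimate = empirical_rademacher_linear(sample, B=1.0)
bound = jensen_upper_bound(sample, B=1.0)
print(f"Monte Carlo estimate: {estimate:.4f}, Jensen bound: {bound:.4f}")
```

The estimate depends on the sample only through the norms of the points and on the class only through B, with no dependence on the number of coordinates of w beyond what the data norms carry.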
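Generalization bounds via Rademacher
What generalization bound does Rademacher complexity give?
Mohri, Rostamizadeh and Talwalkar (2018): let F be the class of loss functions induced by the hypotheses, with values in [0, 1]. With probability at least 1-delta over a sample of size n, simultaneously for all h ∈ F: R(h) <= R_hat(h) + 2*R_n(F) + sqrt(log(1/delta)/(2n)), where R(h) is the expected loss, R_hat(h) is its average over the sample, and R_n(F) is the expected Rademacher complexity. If the empirical complexity computed on the sample is used in place of R_n(F), the last term becomes 3*sqrt(log(2/delta)/(2n)). The Rademacher term captures class complexity; the last term is the concentration correction obtained from McDiarmid's inequality.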
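To get a feel for the size of each term, the following sketch evaluates the right-hand side of the bound. It assumes losses in [0, 1] and uses the two standard forms of the result: slack sqrt(log(1/delta)/(2n)) with the expected complexity, and 3*sqrt(log(2/delta)/(2n)) with the empirical one. The function name and the example numbers are hypothetical.

```python
import math


def rademacher_generalization_bound(empirical_risk, rademacher, n, delta,
                                    empirical_complexity=False):
    """Upper bound on the true risk: empirical risk + 2 * complexity + slack.

    Assumes the loss takes values in [0, 1]. Set empirical_complexity=True
    when `rademacher` was computed on the sample itself."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if n <= 0:
        raise ValueError("n must be positive")
    if empirical_complexity:
        slack = 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    else:
        slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return empirical_risk + 2.0 * rademacher + slack


# Hypothetical numbers: 5% training error, complexity 0.02, 10,000 samples.
bound_expected = rademacher_generalization_bound(0.05, 0.02, n=10_000, delta=0.05)
bound_empirical = rademacher_generalization_bound(0.05, 0.02, n=10_000, delta=0.05,
                                                  empirical_complexity=True)
print(bound_expected, bound_empirical)
```

The version based on the empirical complexity pays a larger concentration term, because the complexity itself must also be estimated from the sample.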
Computing Rademacher complexity
Intuition: Rademacher complexity measures a class's capacity to memorize random noise. Imagine a teacher who gives students problems with randomly assigned answers. A student who can perfectly "explain" random answers has memorized rather than understood. Rademacher complexity measures this memorization capacity: a class with high Rademacher complexity can fit any noise pattern, which means it is too weakly constrained for reliable generalization.
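The memorization picture can be checked exactly on a tiny sample. The sketch below enumerates all sign vectors for n = 8 points and computes the empirical Rademacher complexity of three finite classes, each given by the predictions of its functions on the sample. The three classes (every possible labeling, a single constant function, and threshold functions) are illustrative choices, not taken from the original notes.

```python
from itertools import product


def exact_empirical_rademacher(predictions):
    """Exact empirical Rademacher complexity of a finite class.

    `predictions` holds one vector (f(x_1), ..., f(x_n)) per function f.
    All 2^n sign vectors are enumerated, so this is only feasible for small n."""
    n = len(predictions[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        # Best correlation any function in the class achieves with this noise vector.
        total += max(sum(s * p for s, p in zip(sigma, f)) for f in predictions) / n
    return total / 2 ** n


n = 8
# A "memorizer": the class contains every possible +/-1 labeling of the sample.
memorizer = [list(f) for f in product((-1, 1), repeat=n)]
# A class with a single constant function.
constant = [[1] * n]
# Threshold functions on ordered points: -1 before position t, +1 from t onward.
thresholds = [[1 if i >= t else -1 for i in range(n)] for t in range(n + 1)]

r_memorizer = exact_empirical_rademacher(memorizer)
r_constant = exact_empirical_rademacher(constant)
r_thresholds = exact_empirical_rademacher(thresholds)
print(r_memorizer, r_constant, r_thresholds)
```

The class that contains every labeling matches any noise vector perfectly, the single function cannot track noise at all on average, and the threshold class falls strictly in between.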
What bound relates Rademacher complexity to VC dimension?
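Combining the Sauer-Shelah lemma (for a class of VC dimension d and n >= d, the growth function is at most (en/d)^d) with Massart's lemma yields R_n(F) <= sqrt(2*d*log(en/d)/n) = O(sqrt(d*log(n/d)/n)) for a class F with values in {-1, +1}. This is an analogue of the Vapnik-Chervonenkis bounds with an explicit constant. For norm-bounded linear predictors composed with an L-Lipschitz function, the contraction lemma gives a dimension-free bound of order L*B*X/sqrt(n), where B bounds the weight norm and X bounds the input norm.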
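The chain of inequalities behind the VC-based bound can be written out as follows. This is a sketch under the stated assumptions (a class with values in {-1, +1}, VC dimension d, and n >= d); the notation F_{|S} for the set of prediction vectors on the sample and \Pi_F for the growth function is introduced here for the derivation.

```latex
% Assumptions: F takes values in {-1,+1}, has VC dimension d, and n >= d.
% F_{|S} is the set of vectors (f(x_1), ..., f(x_n)) for f in F.
\begin{align*}
\widehat{\mathfrak{R}}_S(F)
  &= \mathbb{E}_{\sigma}\Big[\max_{a \in F_{|S}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i a_i\Big] \\
  &\le \frac{\max_{a \in F_{|S}} \|a\|_2 \,\sqrt{2\log |F_{|S}|}}{n}
     && \text{(Massart's lemma)} \\
  &= \sqrt{\frac{2\log |F_{|S}|}{n}}
     && \text{(every } a \in \{-1,+1\}^n \text{ has } \|a\|_2 = \sqrt{n}\text{)} \\
  &\le \sqrt{\frac{2 d \log(en/d)}{n}}
     && \text{(Sauer-Shelah: } |F_{|S}| \le \Pi_F(n) \le (en/d)^d\text{)}.
\end{align*}
```

For example, with d = 10 and n = 10,000 the final expression is sqrt(20 * log(1000e) / 10000), which is approximately 0.126.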
Connections
Rademacher complexity connects classical generalization theory with modern neural network analysis and Bayesian methods.
- VC theory — Related topic
- Spectral theory of neural networks — Related topic
- PAC-Bayes — Related topic
- Implicit regularization of SGD — Related topic
Summary
- R-hat_S(F): expected maximal correlation with Rademacher noise, measures the expressiveness of F on the specific sample S
- Symmetrization: with probability at least 1-delta, for all f in F, L(f) is at most L-hat_S(f) + 2*R-hat_S(F) + O(sqrt(log(1/delta)/n))
- Contraction lemma: for an L_phi-Lipschitz function phi, R(phi composed with F) is at most L_phi times R(F) - enables layer-wise analysis of neural networks
- Neural networks: in norm-based bounds, R depends on the product of the layers' weight norms, not on the number of parameters
- Connection to VC: R_n(F) is at most O(sqrt(VC(F)*log(n/VC(F))/n)), but on specific data it can be much smaller
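The point about weight norms versus parameter count can be made concrete with a small sketch. It computes the quantity X * prod_j ||W_j||_F / sqrt(n), the leading scale in norm-based bounds for networks with 1-Lipschitz, positive-homogeneous activations such as ReLU, where X bounds the input norm. The depth-dependent constant in front, which differs between published bounds, is deliberately omitted, so the returned number is a scale for comparison and not itself a valid bound. The helper names and the two toy networks are hypothetical.

```python
import math


def frobenius_norm(W):
    """Frobenius norm of a weight matrix given as a list of rows."""
    return math.sqrt(sum(w * w for row in W for w in row))


def norm_product_scale(weights, input_norm_bound, n):
    """Scale X * prod_j ||W_j||_F / sqrt(n) from norm-based bounds.

    Depth-dependent constants are omitted: this is NOT a complete bound."""
    product_of_norms = 1.0
    for W in weights:
        product_of_norms *= frobenius_norm(W)
    return input_norm_bound * product_of_norms / math.sqrt(n)


def parameter_count(weights):
    return sum(len(W) * len(W[0]) for W in weights)


# Wide network with small weights: two 100x100 layers, every entry 0.01.
wide = [[[0.01] * 100 for _ in range(100)] for _ in range(2)]
# Narrow network with large weights: two 2x2 layers, every entry 2.0.
narrow = [[[2.0] * 2 for _ in range(2)] for _ in range(2)]

wide_scale = norm_product_scale(wide, input_norm_bound=1.0, n=10_000)
narrow_scale = norm_product_scale(narrow, input_norm_bound=1.0, n=10_000)
print(parameter_count(wide), wide_scale)
print(parameter_count(narrow), narrow_scale)
```

In this toy comparison the wide network has thousands of times more parameters than the narrow one, yet its norm product is smaller, so a norm-based bound ranks it as the less complex model.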
Why is Rademacher complexity better than VC-dimension for analyzing neural networks?
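Rademacher complexity depends on the specific sample and on the norms of the hypotheses, whereas the VC dimension is a worst-case, distribution-free quantity. This makes it possible to obtain meaningful bounds for neural networks with large parameter counts, where VC-based guarantees are weak or vacuous.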