Statistical Learning Theory

Rademacher complexity: data-dependent complexity

VC-theory looks at the worst case. Rademacher looks at the actual data, the actual architecture, the actual noise. 2002. Bartlett and Mendelson rewrite generalization theory with one question: how well does a hypothesis class fit random signs on these specific training points? If poorly, the class is simple on this data, regardless of how many VC-dimensions it has.

**SVM with RBF kernel:** VC-dim is infinite, yet the margin bound via Rademacher still yields a non-vacuous guarantee - that is precisely why soft-margin SVM works in practice
**Zhang et al. 2017 'Rethinking generalization':** ResNet-50 perfectly memorizes random labels on CIFAR-10 (zero training error). The Rademacher complexity of the class under such training is maximal - yet on real labels the network generalizes. This gap exposes a hole in classical bounds
**Empirical Rademacher is estimated in one pass:** randomly label the sample with signs $\pm 1$, train the model on this random labeling, measure training accuracy - done
**Norm-based bounds for neural nets** (Bartlett 2017, Neyshabur 2018) express Rademacher through Frobenius norms of weight matrices - the first attempt to obtain non-vacuous bounds for deep networks

Предварительные знания

VC-dimension and shattering as a distribution-free measure of complexity
Empirical risk $\hat{R}(h) = \frac{1}{m}\sum_i \mathbf{1}[h(x_i) \neq y_i]$
Hoeffding inequality for concentration of the mean

Empirical Rademacher: how well H fits noise

Given a sample $S = (x_1, \ldots, x_m)$. Draw $m$ independent random signs $\sigma_1, \ldots, \sigma_m \in \{-1, +1\}$, each with probability $1/2$. These signs are noise, unrelated to the true labels. The question: how well can class $\mathcal{H}$ correlate with random signs on this specific sample?

Inside the supremum: the class searches for the hypothesis most aligned with random signs. Outside the expectation: the average is taken over all ways to draw $\sigma$. The result is a number in $[0, 1]$. Zero means the class cannot adapt to noise on $S$ at all. One means that for any random noise some hypothesis realizes it perfectly.

**Key inversion of perspective.** VC asks: "what is the maximum noise the class can theoretically express?" Rademacher asks: "what noise does it actually express on these $m$ points?" The first is combinatorics, the second is a measurable quantity. A class with $\text{VCdim} = \infty$ may have $\hat{\mathfrak{R}}_S(\mathcal{H}) \to 0$ on structured data - that is the data-dependence.

The same class of linear classifiers gives a different $\hat{\mathfrak{R}}_S$ on different data - exactly the precision VC-theory lacks by construction. On compact data the linear class is simple; on diffuse data, complex.

Class $\mathcal{H}$ contains a single constant function $h(x) \equiv 1$. What is $\hat{\mathfrak{R}}_S(\mathcal{H})$ for any sample $S$ of size $m$?

Expected Rademacher and the generalization bound

The empirical Rademacher $\hat{\mathfrak{R}}_S(\mathcal{H})$ depends on a specific sample. Taking the expectation over the sample yields the **expected Rademacher complexity**:

This quantity depends on the distribution $\mathcal{D}$ - that is the meaning of "data-dependent". The same class on the MNIST cats distribution and on a Gaussian noise distribution will have different $\mathfrak{R}_m$.

**Bartlett-Mendelson theorem 2002.** For a class $\mathcal{H}$ of functions with values in $[0, 1]$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a sample $S$ of size $m$, for all $h \in \mathcal{H}$: $$R(h) \leq \hat{R}(h) + 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}}$$ In the data-dependent variant $\mathfrak{R}_m$ is replaced by $\hat{\mathfrak{R}}_S$ - estimated directly from the training sample.

The bound has the same shape as the VC variant: empirical risk plus complexity penalty plus concentration term. The principal difference is that the penalty $2\mathfrak{R}_m(\mathcal{H})$ depends on $\mathcal{D}$. For distributions on which the class is "actually simple" the penalty is small, even when VC-dim is infinite.

**Margin bounds for SVM (Bartlett-Mendelson 2002, Theorem 21).** Let $\mathcal{H}$ be the class of linear classifiers with $\|w\|_2 \leq B$ on data with $\|x\|_2 \leq R$. Then $\mathfrak{R}_m(\mathcal{H}) \leq BR/\sqrt{m}$. Plugging into the general bound yields the classical margin bound: error $\leq$ fraction of points with margin below $\gamma$ + $O(BR/(\gamma\sqrt{m}))$. **VC-dimension does not appear in this formula at all** - which is critical for kernel SVM where VC is infinite.

Class $\mathcal{H}$ has $\mathfrak{R}_m(\mathcal{H}) = 0.1$ at $m = 1000$. The empirical risk of $\hat{h}$ is $0.05$. What upper bound on true error does Rademacher give at $\delta = 0.05$?

Rademacher for neural nets: vacuous bounds and rethinking

Bartlett 1998 produces the first Rademacher bound for neural networks via the product of Frobenius norms of weight matrices: $\mathfrak{R}_m \leq \frac{c}{\sqrt{m}} \prod_{l=1}^{L} \|W_l\|_F$. For modern architectures this product is enormous - the bound exceeds one. The bound is correct but vacuous: it states "error is at most 100%", which is useless.

Neyshabur et al. 2018 sharpen via spectral norms and path norms - the bound becomes tighter, yet still orders of magnitude above empirical error. Dziugaite-Roy 2017 try PAC-Bayes - the first non-vacuous bound for a CNN on MNIST: a prediction of $0.165$ versus actual error $0.013$. A breakthrough and simultaneously the boundary of what theory can currently express.

Hence in modern work Rademacher complexity is increasingly replaced with margin-Rademacher (Bartlett 2017), data-dependent norm bounds, or PAC-Bayesian analogs. All of them are derivatives of the same idea: measure complexity through correlation with noise, but anchor the measurement to something narrower than the entire class.

Why is the classical Rademacher bound for ResNet-50 on ImageNet vacuous (greater than 1)?

Key takeaways

**Empirical Rademacher** $\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_\sigma[\sup_h \frac{1}{m}\sum \sigma_i h(x_i)]$ - how well the class correlates with random signs on a specific sample
**Expected Rademacher** $\mathfrak{R}_m(\mathcal{H}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{H})]$ - the average over distribution $\mathcal{D}$, the key difference from distribution-free VC
**Generalization bound (Bartlett-Mendelson 2002):** $R(h) \leq \hat{R}(h) + 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\log(1/\delta)/(2m)}$
**Margin bounds for SVM:** $\mathfrak{R}_m \leq BR/\sqrt{m}$ for $\|w\| \leq B$, $\|x\| \leq R$ - VC does not appear, which is critical for kernel methods
**Neural networks:** norm-based bounds via Frobenius norms are vacuous for modern architectures; Zhang et al. 2017 demonstrated that the class fits random labels perfectly
**Modern frontier:** algorithm-dependent bounds, PAC-Bayes, margin-Rademacher - efforts to narrow the class via the data or the optimization process

What comes next

Rademacher complexity is the central complexity measure in modern generalization theory:

Uniform convergence — The principal bound in Vapnik-Chervonenkis theory is expressed directly through $\mathfrak{R}_m(\mathcal{H})$
Margin bounds — SVM, boosting, and neural networks are analyzed through margin-Rademacher complexity
PAC-Bayes — Alternative theory yielding tighter bounds via distributions over hypotheses
Deep generalization — Why classical Rademacher bounds are vacuous for overparameterized networks
Concentration inequalities — McDiarmid and Hoeffding - the technical toolkit for going from Rademacher to generalization

Вопросы для размышления

The class of linear classifiers with $\|w\| \leq 1$ has $\mathfrak{R}_m \sim 1/\sqrt{m}$ regardless of dimension $d$. The VC-bound gives $O(\sqrt{d/m})$. Why does the data-dependent estimate avoid the curse of dimensionality while the distribution-free one does not?
Empirical Rademacher is estimated via simulation: randomly label the sample $\pm 1$, train the model, measure training accuracy. What problems arise when estimating $\hat{\mathfrak{R}}_S$ for complex classes such as neural networks, and how does this connect to the Zhang et al. 2017 experiment?
The margin bound for kernel SVM contains $BR/\sqrt{m}$ without VC-dim. Why is this what allows kernel methods with infinite VC-dim to actually work in real applications?

Связанные уроки

lt-04-vc-dimension — VC-dim is the distribution-free predecessor; Rademacher refines it with a data-dependent estimate
lt-05-growth-sauer — Growth function via Sauer's lemma is a combinatorial ceiling; Rademacher gives a finer estimate
lt-07-uniform-convergence — Uniform convergence bounds are expressed directly through Rademacher complexity
lt-11-margin-bounds — Margin bounds for SVM and neural networks follow directly from Rademacher analysis
lt-13-deep-generalization — Norm-based bounds for neural networks via Frobenius weight norms are direct descendants of Rademacher bounds
prob-22-concentration — McDiarmid and Hoeffding inequalities translate Rademacher into a generalization bound
stat-01-sampling