Statistical Learning Theory
Margin-based bounds: why SVM generalizes
1995: Vapnik and Cortes publish SVM. The classifier operates in an infinite-dimensional feature space - but VC theory predicts catastrophe. Classical learning theory says that as the hypothesis space grows, generalization gets worse. Infinite dimension should mean infinite risk. Yet SVM works. In 1998, Bartlett explained why: the question was being asked wrong. The right parameter is not the dimension of the space, but the distance from the classifier to the data.
- Face recognition: a network operating on millions of pixel dimensions generalizes well because the margin on training data is large - the bound depends on margin, not dimension.
- Spam filters in the early 2000s: SVMs with bag-of-words kernels generalized to new emails without overfitting, a direct consequence of margin-based regularization.
- Modern deep learning: SGD's implicit bias controls an effective margin - spectral norm bounds explain why overparameterized networks generalize despite having more parameters than training examples.
Предварительные знания
- Rademacher complexity - a data-dependent measure of hypothesis class expressiveness
- Linear classifier: h(x) = sign(w·x + b)
- Inner products in Hilbert spaces
Geometric margin: what the classifier actually maximizes
A linear classifier h(x) = sign(w·x + b) cuts feature space with a hyperplane. Every training point sits on one side of that hyperplane at some distance from it. The critical quantity turns out to be the distance to the nearest point - the geometric margin.
The functional margin of point (x_i, y_i) is y_i (w·x_i + b). It is positive when the classifier gets the label right and negative otherwise. The geometric margin normalizes by ||w||: γ_i = y_i (w·x_i + b) / ||w||. For the full training set, the margin is γ = min_i y_i (w·x_i) / ||w||. Hard-margin SVM searches for the hyperplane that maximizes γ.
For a unit-norm weight vector ||w|| = 1: γ = min_i y_i (w·x_i + b) Hard-margin SVM primal: max_{w,b} γ s.t. y_i (w·x_i + b) >= γ for all i Equivalent form (fix γ = 1, minimize ||w||^2): min_{w,b} ||w||^2 s.t. y_i (w·x_i + b) >= 1 for all i Maximizing margin <=> minimizing ||w||^2.
Support vectors are exactly the training points that lie on the margin boundary - at distance 1/||w|| from the hyperplane. All other points are farther away and play no role in the solution. This is not a computational shortcut; it is the geometric structure that drives generalization.
Hard-margin SVM minimizes ||w||^2. What does this imply for the geometric margin γ = 1/||w||?
Bartlett 1998: generalization through margin
In 1998, Peter Bartlett proved a result that reframed how people understood SVM. Before his work, generalization was analyzed through VC dimension: the larger the hypothesis class, the weaker the bound. For SVM in an infinite-dimensional RKHS, VC analysis predicts infinite risk. Bartlett showed that the right quantity is not the dimension of the hypothesis space, but the margin γ.
Margin bound (Bartlett, 1998): R(h) <= R̂_γ(h) + O( ||w|| / (γ · sqrt(m)) ) where: R(h) - true risk (test error) R̂_γ(h) - margin loss on training set = (1/m) #{i : y_i (w·x_i) < γ} (fraction inside the margin) ||w|| - norm of the weight vector γ - margin (distance to the nearest training point) m - training set size Critical: the input dimension d does not appear anywhere in this bound. Normalized form (||w|| = 1): R(h) <= R̂_γ(h) + O( 1 / (γ · sqrt(m)) )
The proof goes through Rademacher complexity. The goal is to bound Rad_m(H_γ) - the Rademacher complexity of the class of classifiers achieving margin at least γ. Via covering numbers (the size of an epsilon-net of H_γ) and Massart's lemma, one gets Rad_m(H_γ) = O(||w|| / (γ sqrt(m))). The final bound follows from the standard Rademacher theorem: R(h) <= R̂(h) + 2 Rad_m(H) + O(sqrt(log(1/δ)/m)).
Now look at what SVM actually does. Hard-margin SVM minimizes ||w||^2 subject to y_i(w·x_i) >= 1 for all i. This is precisely minimizing the numerator ||w|| in the bound while keeping R̂_γ(h) = 0. Soft-margin SVM with parameter C trades off ||w||^2 against the number of margin violations - a direct optimization of the generalization bound.
As C grows, the SVM squeezes training points closer to the hyperplane, margin shrinks, ||w|| grows, and the generalization bound deteriorates. Cross-validation over C works because it approximately searches for the minimum of the real generalization bound.
Two classifiers are trained on the same dataset (m = 1000). Classifier A: ||w|| = 0.5, margin γ = 0.5. Classifier B: ||w|| = 2, margin γ = 0.1. What do the Bartlett bound terms look like?
RKHS, the kernel trick, and spectral norm bounds for deep nets
The kernel trick lifts input vectors into a feature space H through a map φ: R^d -> H, then replaces the inner product with a kernel function: k(x, x') = φ(x)·φ(x'). The space H can be infinite-dimensional - the RBF kernel k(x,x') = exp(-||x-x'||^2 / (2σ^2)) corresponds to a Gaussian feature space with infinitely many dimensions.
Before Bartlett, this caused a theoretical headache: how can a classifier in infinite dimensions generalize? VC analysis gives infinity. The margin bound resolves this cleanly: in RKHS the margin is γ = min_i y_i (w·φ(x_i)) / ||w||_H, and the bound depends only on ||w||_H and γ. The dimension of H is irrelevant.
Mercer's theorem: k(x,x') is a valid kernel iff there exist H and φ: R^d -> H such that k(x,x') = <φ(x), φ(x')>_H. SVM in RKHS: min_{w in H} ||w||_H^2 s.t. y_i <w, φ(x_i)>_H >= 1 for all i Representer theorem: w* = sum_i α_i y_i φ(x_i) (support vectors only, α_i > 0) Prediction: f(x) = <w*, φ(x)>_H = sum_i α_i y_i k(x_i, x) ||w*||_H^2 = sum_{i,j} α_i α_j y_i y_j k(x_i, x_j) Everything is computed through k - φ never needs to be materialized. Margin bound in RKHS: R(h) <= R̂_γ(h) + O( sqrt(sum_{i,j} α_i α_j k(x_i,x_j)) / (γ sqrt(m)) )
Bartlett, Foster, and Telgarsky (2017) pushed margin theory into deep networks. For a network with L layers and weight matrices W_1,...,W_L the generalization error is bounded through the product of spectral norms sigma_max(W_l) and the Frobenius-to-spectral ratio, divided by the training margin. This explains why overparameterized networks with a large margin generalize despite having far more parameters than training examples.
Spectral norm bound (Bartlett-Foster-Telgarsky 2017): R(h) <= O( (prod_l sigma_max(W_l)) * (sum_l ||W_l||_F^2 / sigma_max(W_l)^2)^(1/2) / (gamma * sqrt(m)) ) where: sigma_max(W_l) - largest singular value of layer l weight matrix ||W_l||_F - Frobenius norm (square root of sum of squared weights) gamma - network margin on the training set m - training set size The product of spectral norms plays the role of ||w|| in the SVM bound. SGD with weight decay implicitly controls spectral norms - this is one mechanism behind the generalization of modern deep nets.
Why can an SVM with an RBF kernel (operating in an infinite-dimensional RKHS) achieve finite generalization error without contradicting learning theory?
Key takeaways
- Margin γ = min_i y_i (w·x_i) / ||w|| - the distance from the hyperplane to the nearest training point.
- Bartlett 1998: R(h) <= R̂_γ(h) + O(||w|| / (γ sqrt(m))) - the bound does not depend on input dimension.
- SVM minimizes ||w||^2 subject to correct classification - this is a direct minimization of the generalization bound.
- In RKHS, the bound stays finite through ||w||_H = sqrt(sum_ij α_i α_j k(x_i, x_j)) - the kernel trick is theoretically grounded.
- Spectral norm bounds (Bartlett-Foster-Telgarsky 2017) extend margin theory to deep networks via products of layer-wise spectral norms.
Position in learning theory
Margin bounds bridge classical VC theory and modern generalization analysis. VC theory looks at the size of the hypothesis class; margin bounds look at the geometry of the solution relative to the data. This shift made it possible to analyze SVM, kernel methods, and eventually deep networks within a single framework.
- Rademacher complexity — The margin bound proof goes through Rademacher - we bound Rad_m(H_γ) via covering numbers of the margin class
- VC theory — Margin bounds replace VC bounds when VC dimension is infinite (RKHS, deep networks)
- Boosting and AdaBoost — AdaBoost also maximizes the training margin - this is the mechanism behind its resistance to overfitting
Вопросы для размышления
- If margin γ is cut in half, how does the Bartlett bound change? What does that imply for the number of training examples needed to achieve the same generalization guarantee?
- Soft-margin SVM with parameter C trades off ||w||^2 against margin violations. How does this correspond to minimizing the generalization bound?
- Why is the spectral norm bound for deep networks analogous to the margin bound for SVM? What plays the role of ||w|| in the deep learning case?
- Can a classifier with zero margin (γ = 0) on the training set still generalize well? What does the theory say?