Statistical Learning Theory
Algorithmic Stability
Why does early stopping work? Bousquet and Hardt (2016) gave the answer: SGD with a bounded number of steps is beta-stable, and this stability provides a generalization guarantee without explicit VC analysis.
- **Early stopping:** Formally justified through beta-stability of SGD: fewer steps means smaller beta means a better generalization guarantee
- **Differential privacy:** DP-SGD is formally stable by definition of privacy; this simultaneously bounds the generalization gap
- **Meta-learning:** MAML uses stability for analyzing few-shot generalization: swapping one task should not drastically change meta-parameters
- **AlphaFold 2:** DeepMind analyzes generalization of the model via SGD stability when training on 170,000 protein structures
Предварительные знания
- Rademacher complexity
- Agnostic PAC learning
- Stochastic gradient descent
Algorithmic stability
DeepMind (2021) uses SGD stability in analyzing AlphaFold 2: training on 170,000 proteins with learning rate eta and T steps, the generalization error is bounded by 2*L*eta*T/m where L is the Lipschitz constant of the loss. This formally justifies early stopping.
What does uniform-stability of an algorithm A measure?
beta_n-stable: for all S, z, z': |L(A(S), z) - L(A(S^{i, z'}), z)| <= beta_n. The smaller beta_n, the less the algorithm overfits individual points. Bousquet-Elisseeff (2002) showed stability → generalisation: R(A(S)) - R_hat(A(S)) <= O(beta_n + sqrt(log(1/delta)/n)).
Stability and generalisation
How does beta_n-stability relate to generalisation?
Bousquet-Elisseeff: for a beta_n-stable algorithm E[R(A(S)) - R_hat(A(S))] <= 2 * beta_n. With high probability: |R - R_hat| <= O(beta_n + sqrt(log(1/delta)/n)). In the spirit of VC bounds but without an explicit hypothesis class - complexity is baked into the algorithm.
Examples: SGD and Tikhonov regularisation
Algorithmic stability as a doctor's robustness to replacing one patient - A doctor (algorithm) makes decisions based on a set of patients (training sample). A stable doctor: replacing one patient does not change diagnoses for the rest. An unstable doctor: one new patient reshapes the entire practice. Stability is the formal version of common sense: a good algorithm should not change drastically because of a single new example.
Why is SGD on a convex L-Lipschitz loss a stable algorithm?
Hardt-Recht-Singer (2016): SGD on convex L-Lipschitz functions has uniform stability beta_n = O(T*L²/n). With T = O(n) and eta = 1/sqrt(T): generalisation error = O(L²/sqrt(n)). For non-convex losses (DNNs) the bounds are weaker but the mechanism still works.
Connections
Algorithmic stability connects generalization theory with practical regularization methods and privacy-preserving algorithms.
- Differential privacy — Related topic
- L2 regularization — Related topic
- Uniform stability vs Rademacher complexity — Related topic
- Meta-learning — Related topic
Итоги
- Beta-stability: replacing one example in S changes the loss at any point by at most beta
- Bousquet-Elisseeff bound: L(h) no worse than L-hat_S(h) + beta + O((m*beta + M)*sqrt(log(1/delta)/m))
- SGD: beta = 2L^2*eta*T/m, grows linearly with T, justifying early stopping
- L2 regularization: beta = M^2/(2*lambda*m); increasing lambda improves stability
- Stability is an alternative to VC analysis: guarantees through algorithm properties, not class properties
How does SGD stability change as the number of steps T increases (with fixed eta and m)?
SGD stability: beta = 2L^2*eta*T/m. Linear growth with T means early stopping (small T) improves stability and generalization.