Statistical Learning Theory

Algorithmic Stability

Why does early stopping work? Bousquet and Hardt (2016) gave the answer: SGD with a bounded number of steps is beta-stable, and this stability provides a generalization guarantee without explicit VC analysis.

**Early stopping:** Formally justified through beta-stability of SGD: fewer steps means smaller beta means a better generalization guarantee
**Differential privacy:** DP-SGD is formally stable by definition of privacy; this simultaneously bounds the generalization gap
**Meta-learning:** MAML uses stability for analyzing few-shot generalization: swapping one task should not drastically change meta-parameters
**AlphaFold 2:** DeepMind analyzes generalization of the model via SGD stability when training on 170,000 protein structures

Предварительные знания

Rademacher complexity
Agnostic PAC learning
Stochastic gradient descent

Rademacher Complexity

Algorithmic stability

DeepMind (2021) uses SGD stability in analyzing AlphaFold 2: training on 170,000 proteins with learning rate eta and T steps, the generalization error is bounded by 2*L*eta*T/m where L is the Lipschitz constant of the loss. This formally justifies early stopping.

What does uniform-stability of an algorithm A measure?

beta_n-stable: for all S, z, z': |L(A(S), z) - L(A(S^{i, z'}), z)| <= beta_n. The smaller beta_n, the less the algorithm overfits individual points. Bousquet-Elisseeff (2002) showed stability → generalisation: R(A(S)) - R_hat(A(S)) <= O(beta_n + sqrt(log(1/delta)/n)).

Stability and generalisation

How does beta_n-stability relate to generalisation?

Bousquet-Elisseeff: for a beta_n-stable algorithm E[R(A(S)) - R_hat(A(S))] <= 2 * beta_n. With high probability: |R - R_hat| <= O(beta_n + sqrt(log(1/delta)/n)). In the spirit of VC bounds but without an explicit hypothesis class - complexity is baked into the algorithm.

Examples: SGD and Tikhonov regularisation

Algorithmic stability as a doctor's robustness to replacing one patient - A doctor (algorithm) makes decisions based on a set of patients (training sample). A stable doctor: replacing one patient does not change diagnoses for the rest. An unstable doctor: one new patient reshapes the entire practice. Stability is the formal version of common sense: a good algorithm should not change drastically because of a single new example.

Why is SGD on a convex L-Lipschitz loss a stable algorithm?

Hardt-Recht-Singer (2016): SGD on convex L-Lipschitz functions has uniform stability beta_n = O(T*L²/n). With T = O(n) and eta = 1/sqrt(T): generalisation error = O(L²/sqrt(n)). For non-convex losses (DNNs) the bounds are weaker but the mechanism still works.

Connections

Algorithmic stability connects generalization theory with practical regularization methods and privacy-preserving algorithms.

Differential privacy — Related topic
L2 regularization — Related topic
Uniform stability vs Rademacher complexity — Related topic
Meta-learning — Related topic

Итоги

Beta-stability: replacing one example in S changes the loss at any point by at most beta
Bousquet-Elisseeff bound: L(h) no worse than L-hat_S(h) + beta + O((m*beta + M)*sqrt(log(1/delta)/m))
SGD: beta = 2L^2*eta*T/m, grows linearly with T, justifying early stopping
L2 regularization: beta = M^2/(2*lambda*m); increasing lambda improves stability
Stability is an alternative to VC analysis: guarantees through algorithm properties, not class properties

How does SGD stability change as the number of steps T increases (with fixed eta and m)?

SGD stability: beta = 2L^2*eta*T/m. Linear growth with T means early stopping (small T) improves stability and generalization.

Algorithmic stability

What does uniform-stability of an algorithm A measure?

Stability and generalisation

How does beta_n-stability relate to generalisation?

Examples: SGD and Tikhonov regularisation

Why is SGD on a convex L-Lipschitz loss a stable algorithm?

Итоги

Beta-stability: replacing one example in S changes the loss at any point by at most beta

Bousquet-Elisseeff bound: L(h) no worse than L-hat_S(h) + beta + O((m*beta + M)*sqrt(log(1/delta)/m))

SGD: beta = 2L^2*eta*T/m, grows linearly with T, justifying early stopping

L2 regularization: beta = M^2/(2*lambda*m); increasing lambda improves stability

Stability is an alternative to VC analysis: guarantees through algorithm properties, not class properties