Statistical Learning Theory

Algorithmic Stability

Why does early stopping work? Bousquet and Hardt (2016) gave the answer: SGD with a bounded number of steps is beta-stable, and this stability provides a generalization guarantee without explicit VC analysis.

  • **Early stopping:** Formally justified through beta-stability of SGD: fewer steps means smaller beta means a better generalization guarantee
  • **Differential privacy:** DP-SGD is formally stable by definition of privacy; this simultaneously bounds the generalization gap
  • **Meta-learning:** MAML uses stability for analyzing few-shot generalization: swapping one task should not drastically change meta-parameters
  • **AlphaFold 2:** DeepMind analyzes generalization of the model via SGD stability when training on 170,000 protein structures

Предварительные знания

  • Rademacher complexity
  • Agnostic PAC learning
  • Stochastic gradient descent
  • Rademacher Complexity

Algorithmic stability

DeepMind (2021) uses SGD stability in analyzing AlphaFold 2: training on 170,000 proteins with learning rate eta and T steps, the generalization error is bounded by 2*L*eta*T/m where L is the Lipschitz constant of the loss. This formally justifies early stopping.

What does uniform-stability of an algorithm A measure?

beta_n-stable: for all S, z, z': |L(A(S), z) - L(A(S^{i, z'}), z)| <= beta_n. The smaller beta_n, the less the algorithm overfits individual points. Bousquet-Elisseeff (2002) showed stability → generalisation: R(A(S)) - R_hat(A(S)) <= O(beta_n + sqrt(log(1/delta)/n)).

Stability and generalisation

How does beta_n-stability relate to generalisation?

Bousquet-Elisseeff: for a beta_n-stable algorithm E[R(A(S)) - R_hat(A(S))] <= 2 * beta_n. With high probability: |R - R_hat| <= O(beta_n + sqrt(log(1/delta)/n)). In the spirit of VC bounds but without an explicit hypothesis class - complexity is baked into the algorithm.

Examples: SGD and Tikhonov regularisation

Algorithmic stability as a doctor's robustness to replacing one patient - A doctor (algorithm) makes decisions based on a set of patients (training sample). A stable doctor: replacing one patient does not change diagnoses for the rest. An unstable doctor: one new patient reshapes the entire practice. Stability is the formal version of common sense: a good algorithm should not change drastically because of a single new example.

Why is SGD on a convex L-Lipschitz loss a stable algorithm?

Hardt-Recht-Singer (2016): SGD on convex L-Lipschitz functions has uniform stability beta_n = O(T*L²/n). With T = O(n) and eta = 1/sqrt(T): generalisation error = O(L²/sqrt(n)). For non-convex losses (DNNs) the bounds are weaker but the mechanism still works.

Connections

Algorithmic stability connects generalization theory with practical regularization methods and privacy-preserving algorithms.

  • Differential privacy — Related topic
  • L2 regularization — Related topic
  • Uniform stability vs Rademacher complexity — Related topic
  • Meta-learning — Related topic

Итоги

  • Beta-stability: replacing one example in S changes the loss at any point by at most beta
  • Bousquet-Elisseeff bound: L(h) no worse than L-hat_S(h) + beta + O((m*beta + M)*sqrt(log(1/delta)/m))
  • SGD: beta = 2L^2*eta*T/m, grows linearly with T, justifying early stopping
  • L2 regularization: beta = M^2/(2*lambda*m); increasing lambda improves stability
  • Stability is an alternative to VC analysis: guarantees through algorithm properties, not class properties

How does SGD stability change as the number of steps T increases (with fixed eta and m)?

SGD stability: beta = 2L^2*eta*T/m. Linear growth with T means early stopping (small T) improves stability and generalization.

Algorithmic Stability

0

1

Sign In