Statistical Learning Theory

Information-Theoretic Learning Theory

Why does noisy SGD generalize better than an exact solver with the same training error? The information-theoretic answer: noise reduces I(W;S) - the amount of information about the sample stored in the model weights.

  • **Langevin SGD:** Adding Gaussian noise to gradients bounds I(W;S) - a theoretically grounded reason why noisy SGD generalizes better than exact methods
  • **Dropout in neural networks:** Dropout implicitly reduces I(W;S): random neuron deactivation prevents weights from memorizing specific examples
  • **Differential privacy:** DP-SGD adds noise for privacy; this simultaneously bounds I(W;S) and gives generalization guarantees via the information-theoretic bound
  • **PAC-Bayes for transformers:** McAllester's KL bound applies to analyzing fine-tuning of large language models: small KL(posterior || prior) means good generalization

Предварительные знания

  • Compression-based bounds
  • Mutual information and entropy
  • SGD and stochastic optimization
  • Compression-Based Bounds

Mutual-information bounds

Daniel McAllester in 2017 showed: the generalization gap of algorithm A is bounded by I(W; S), the mutual information between weights W and training set S. For Langevin SGD with step size eta and noise sigma^2: I(W; S) is at most eta^2 m / sigma^2. This explains why noisy SGD generalizes better than exact solvers at the same training error.

What does the MI generalisation bound say?

Xu-Raginsky (2017): |E[R(W) - R_hat(W)]| <= sqrt(2*sigma² * I(W; S) / n), where I(W; S) is the mutual information between trained weights and training sample. If the algorithm "remembers" little about S, generalisation is good. Explains why compression and stochasticity improve generalisation.

KL bounds and information bottleneck

What objective does the Information Bottleneck (IB) optimise for representation learning?

Tishbys IB principle (1999, deep IB 2017): the representation T should be a minimal sufficient statistic for predicting Y. Compress X (low I(X; T)) but preserve predictive power for Y (high I(T; Y)). Tishby-Shwartz argued via IB that DNN training has two phases - fitting (max I(T; Y)) and compression (min I(X; T)).

PAC-Bayes: a Bayesian perspective

I(W;S) as fingerprints left on the model weights - A forensic investigator checks whether traces of specific victims (training examples) remain on the weapon (model weights). Few traces (small I) means the model did not memorize the sample. Clear traces (large I) means overfitting. Small I(W;S) does not mean a bad model - it means the model extracted general patterns rather than memorizing specific examples.

What does the McAllester PAC-Bayes bound state?

McAllester (1998): for prior P and posterior Q over hypotheses, with probability 1-delta: the average risk under Q is bounded by average empirical risk plus sqrt(KL(Q||P)/n). KL replaces "class size" more finely: a posterior Q can concentrate, but the penalty depends on its distance from the prior.

Connections

The information-theoretic approach connects generalization theory with information theory, differential privacy, and PAC-Bayes.

  • PAC-Bayes — Related topic
  • Differential privacy — Related topic
  • Algorithmic stability — Related topic
  • Minimum description length (MDL) — Related topic

Итоги

  • I(W;S) - mutual information between weights W and sample S: smaller means better generalization
  • Xu-Raginsky bound: |gen| at most sqrt(2*sigma^2*I(W;S)/m) - generalization is guaranteed when I is small
  • CMI bound of Steinke-Zakynthinou: sharper via conditional mutual information
  • Langevin SGD: noise sigma reduces I(W;S) - formal explanation why noisy SGD generalizes better
  • PAC-Bayes: KL(Q||P) plays the role of I(W;S) in stochastic classifiers

What does small I(W; S) mean for generalization?

I(W;S) close to 0 means W is statistically independent of S. The algorithm learned general patterns rather than memorizing specific examples.

Information-Theoretic Learning Theory

0

1

Sign In