Statistical Learning Theory

PAC-Bayes generalization bounds

In 2022 a Google Brain team applied PAC-Bayes to transformers with 110M parameters and obtained the first non-trivial theoretical generalization guarantees for large neural networks. McAllester (1999) created a tool that works where classical VC theory gives useless bounds.

Dziugaite et al. (NetSafe 2021) used PAC-Bayes certification for safety guarantees in autonomous systems.
Norm-based PAC-Bayes bounds from Microsoft Research explained why overparameterized networks don't overfit.

McAllester bound and improvements

In 1999 David McAllester proved that for any prior P and posterior Q over hypotheses, with probability >= 1-delta over the training sample of size n: expected risk <= emp. risk + sqrt((KL(Q||P) + log(n/delta)) / (2n)). This is the first bound that can improve with more parameters given the right prior. In 2022 Dziugaite et al. applied PAC-Bayes to ResNet-50 and obtained non-trivial bounds of 0.61 at actual error 0.24.

In the PAC-Bayes bound: if prior P is very wide (large variance), then KL(Q||P)...

PAC-Bayes for neural networks

Dziugaite and Roy (2017) applied PAC-Bayes to neural networks with Gaussian weights. They directly minimized the PAC-Bayes bound via SGD, simultaneously training and certifying generalization. On MNIST they obtained a guaranteed error bound of 16% at actual 2%.

The PAC-Bayes bound is non-trivial (< 1) only when...

Key results

PAC-Bayes: true risk <= emp. risk + sqrt((KL(Q||P) + log(n/delta)) / 2n).
KL(Q||P) is the complexity cost of posterior relative to prior.
For Gaussian weights KL is analytically computable.
Direct PAC-Bayes bound optimization = simultaneous training and certification.