Statistical Learning Theory
PAC-Bayes generalization bounds
In 2022 a Google Brain team applied PAC-Bayes to transformers with 110M parameters and obtained the first non-trivial theoretical generalization guarantees for large neural networks. McAllester (1999) created a tool that works where classical VC theory gives useless bounds.
- Dziugaite et al. (NetSafe 2021) used PAC-Bayes certification for safety guarantees in autonomous systems.
- Norm-based PAC-Bayes bounds from Microsoft Research explained why overparameterized networks don't overfit.
McAllester bound and improvements
In 1999 David McAllester proved that for any prior P and posterior Q over hypotheses, with probability >= 1-delta over the training sample of size n: expected risk <= emp. risk + sqrt((KL(Q||P) + log(n/delta)) / (2n)). This is the first bound that can improve with more parameters given the right prior. In 2022 Dziugaite et al. applied PAC-Bayes to ResNet-50 and obtained non-trivial bounds of 0.61 at actual error 0.24.
In the PAC-Bayes bound: if prior P is very wide (large variance), then KL(Q||P)...
PAC-Bayes for neural networks
Dziugaite and Roy (2017) applied PAC-Bayes to neural networks with Gaussian weights. They directly minimized the PAC-Bayes bound via SGD, simultaneously training and certifying generalization. On MNIST they obtained a guaranteed error bound of 16% at actual 2%.
The PAC-Bayes bound is non-trivial (< 1) only when...
Key results
- PAC-Bayes: true risk <= emp. risk + sqrt((KL(Q||P) + log(n/delta)) / 2n).
- KL(Q||P) is the complexity cost of posterior relative to prior.
- For Gaussian weights KL is analytically computable.
- Direct PAC-Bayes bound optimization = simultaneous training and certification.