Statistical Learning Theory

PAC-Bayes: the first non-vacuous bound for deep networks

ResNet-50 has 25 million parameters. The ImageNet training set has 1.2 million samples. The classical VC generalization bound for a neural network with W parameters is roughly: $$R(h) \leq \hat{R}(h) + O\!\left(\sqrt{\frac{W \ln W}{m}}\right)$$ Plug in the numbers: $W = 25 \times 10^6$, $m = 1.2 \times 10^6$. The bound exceeds 1. A probability cannot exceed 1. The theorem is mathematically airtight and says absolutely nothing. ResNet-50 achieves 4.9% top-1 error. The theory that justifies SVM generalization and PAC-learnability cannot explain why that number is not 95%. In 2017, Gintare Dziugaite and Daniel Roy submitted a paper to NeurIPS. They proved - rigorously, with no hand-waving - that a specific neural network achieving 2% test error on MNIST generalizes, and the true error is at most 16%. First non-vacuous bound for a deep network in history. The key was not a better bound for a single hypothesis. It was changing the question entirely: instead of bounding the error of one network, prove a bound for a distribution over networks.

**SAM optimizer (Foret et al. 2021, used in ViT-G and PaLM fine-tuning):** Sharpness-Aware Minimization explicitly seeks flat minima - the exact regions where the PAC-Bayes bound is tightest. Google uses it in production training runs
**Bayesian deep learning (weight uncertainty in NNs):** treating weights as distributions rather than point estimates is not just a probabilistic convenience - it is the PAC-Bayes posterior Q. Blundell et al. (2015) Bayes by Backprop is PAC-Bayes optimization in disguise
**VAE training objective:** the ELBO = reconstruction - KL(q||p) is structurally identical to the PAC-Bayes bound: empirical loss minus KL complexity penalty. Posterior collapse in VAEs is a PAC-Bayes instability: Q collapses to P, KL = 0, no information passes through
**Model compression and pruning:** a pruned network with fewer effective parameters has a lower-KL posterior when P is the Gaussian prior. PAC-Bayes explains why pruning improves generalization beyond just reducing overfitting

Предварительные знания

VC dimension as a measure of hypothesis class complexity
Generalization bound structure: $R(h) \leq \hat{R}(h) + \text{complexity}(\mathcal{H}, m, \delta)$
KL divergence: $\mathrm{KL}(Q \| P) = \mathbb{E}_{h \sim Q}[\log(Q(h)/P(h))] \geq 0$
Inductive bias as the practical answer to No-Free-Lunch (lt-08)

Concept 1: The vacuous VC bound problem

The standard PAC generalization bound for a finite hypothesis class is $R(h) \leq \hat{R}(h) + \sqrt{\ln(|\mathcal{H}|/\delta)/(2m)}$. For infinite classes - neural networks - VC dimension replaces $\ln|\mathcal{H}|$. Bartlett et al. showed that the VC dimension of a network with $W$ real-valued parameters is at least $W$. The bound becomes:

For ResNet-50: $W = 25 \times 10^6$, $m = 1.2 \times 10^6$, so $W/m \approx 20.8$. The bound exceeds 1. For MNIST with a modest network - say $W = 10^6$, $m = 60{,}000$ - $W/m \approx 16.7$, still vacuous. A bound $> 1$ on a probability is not informative: it says only that the error is at most 100%, which is true without any theory at all.

**Vacuous vs. meaningful bounds.** A bound is vacuous if it cannot exclude the worst case (error = 1). VC theory is tight for worst-case tasks, but real neural networks do not operate at worst-case: the task has structure, SGD has implicit bias, architectures encode priors. VC counts parameters - not their structure. A network with 25M parameters that SGD pushed to a flat minimum is not the same as a network that could implement any of $2^{25M}$ possible functions. PAC-Bayes captures this distinction; VC theory does not.

ResNet-50 achieves 4.9% test error on ImageNet. The VC bound gives a value > 1. What does this mean in practice?

Concept 2: The PAC-Bayes theorem (McAllester 1998)

The PAC-Bayes shift is conceptual: instead of bounding the error of a single hypothesis $h$, bound the error of a distribution over hypotheses. At test time, the classifier draws $h \sim Q$ and predicts $h(x)$. This stochastic classifier has expected error $R(Q) = \mathbb{E}_{h \sim Q}[R(h)]$ and empirical error $\hat{R}(Q) = \mathbb{E}_{h \sim Q}[\hat{R}(h)]$.

Two distributions appear. $P$ is the prior - fixed before seeing any data. $Q$ is the posterior - learned from training. The complexity of $Q$ is measured relative to $P$ via KL divergence. If $Q \approx P$, the data taught the learner nothing new and complexity is low. If $Q$ is far from $P$, the learner extracted a lot from the data and must pay a complexity cost.

**PAC-Bayes theorem (McAllester 1998).** Let $P$ be any distribution over hypotheses fixed before observing data. For any $\delta > 0$, with probability $\geq 1 - \delta$ over the draw of $m$ training samples, simultaneously for all posterior distributions $Q$: $$R(Q) \leq \hat{R}(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \ln(m/\delta)}{2m}}$$ where $R(Q) = \mathbb{E}_{h \sim Q}[R(h)]$ is the true expected error under $Q$, and $\hat{R}(Q) = \mathbb{E}_{h \sim Q}[\hat{R}(h)]$ is its empirical counterpart.

Three features distinguish this from VC bounds. First, $Q$ can be anything - it is data-dependent. The bound holds simultaneously for all $Q$, so choosing $Q$ after seeing data is legitimate. Second, $P$ is chosen before training and encodes the prior over hypothesis space - this is where domain knowledge enters. Third, the complexity term is $\mathrm{KL}(Q \| P)$, not parameter count: two networks with identical architecture but different optimization trajectories can have vastly different bounds.

Both networks have the same architecture and the same 2% training error. The PAC-Bayes bound differs by orders of magnitude. The sharp minimum has KL in the hundreds of thousands - vacuous. The flat minimum has KL around 100 - a bound near 0.05. Same model family, same task, different region of weight space: completely different generalization certificates.

In the PAC-Bayes theorem, KL(Q || P) measures complexity. Why is KL a better complexity measure than parameter count for neural networks?

Concept 3: Non-vacuous bounds for neural nets (Dziugaite-Roy 2017)

Dziugaite and Roy's contribution was not a new theorem - McAllester's result was 19 years old. The contribution was showing the PAC-Bayes bound can be directly optimized. Instead of analyzing a fixed trained network post-hoc, choose $Q$ to minimize the bound itself.

The setup: $P = \mathcal{N}(0, \lambda I)$ - an isotropic Gaussian prior on weight vectors. $Q = \mathcal{N}(w, \mathrm{diag}(\sigma^2))$ - a diagonal Gaussian posterior centered at trained weights $w$ with per-parameter variance $\sigma^2$. Both $w$ and $\sigma$ are optimized jointly to minimize:

For a network with $W$ parameters, the KL between two diagonal Gaussians has a closed form:

Dziugaite-Roy result for MNIST (fully connected network, $m = 60{,}000$): KL $\approx 10^6$ after optimization, and the bound evaluates to approximately $0.16$ (16%). The network achieves 2% test error. The bound is not tight - but it is the first number below 1 ever proven for a deep network. It rules out catastrophic generalization failure.

The stochastic classifier is not just a mathematical device. At test time, drawing $h \sim Q = \mathcal{N}(w, \sigma^2 I)$ means perturbing weights with Gaussian noise and running a forward pass. At a flat minimum, this perturbation barely changes predictions - the classifier is robust, empirical stochastic error is low. At a sharp minimum, the same perturbation flips many predictions - high $\hat{R}(Q)$ and thus a looser bound.

Dziugaite and Roy minimize the PAC-Bayes objective directly rather than first training normally and then computing the bound. Why does this matter?

Concept 4: Flat minima, sharpness, and the geometry of generalization

The KL between the Dziugaite-Roy posterior and prior is:

Small KL requires two things simultaneously: (1) $\|w\|^2$ small - weights close to the origin, and (2) $\sigma^2$ large - the posterior is spread, meaning the loss landscape is flat around $w$. Large $\sigma^2$ means the network tolerates weight perturbations of size $\sigma$ without changing its predictions. This is the definition of a flat minimum: a region where the Hessian trace is small.

**How standard optimizers connect to PAC-Bayes:** - **SGD with large learning rate:** implicit flat-minimum bias (Keskar 2017). Large-batch SGD finds sharper minima, generalizes worse - higher KL in the bound - **SAM (Sharpness-Aware Minimization, Foret et al. 2021):** explicitly perturbs weights and minimizes worst-case loss in a $\rho$-neighborhood. This is PAC-Bayes with a uniform perturbation posterior. Used in ViT-G and PaLM - **Weight decay (L2 regularization):** penalizes $\|w\|^2$ directly - the first KL term. L2 regularization is MAP estimation under a Gaussian prior, which is exactly the PAC-Bayes setup with $P = \mathcal{N}(0, \lambda I)$ - **Dropout:** stochastic masking at train time creates an implicit posterior over sparse networks, reducing effective KL

The connection to VAEs is structural. The VAE ELBO is $\mathbb{E}_{z \sim q_\phi}[\log p_\theta(x|z)] - \mathrm{KL}(q_\phi(z|x) \| p(z))$. Maximizing ELBO is minimizing reconstruction loss plus KL complexity - the exact PAC-Bayes structure: $\hat{R}(Q) + \mathrm{KL}(Q \| P)$. Posterior collapse (where $q_\phi \to p$ and KL $\to 0$) is the degenerate PAC-Bayes solution: zero complexity but zero information - the reconstruction term wins by producing blurry averages.

The optimal $\sigma_q$ is neither zero (infinitely sharp, KL diverges) nor maximal (infinitely flat, stochastic predictions become random). It is the saddle point of the KL-vs-stochastic-risk trade-off. SAM, by seeking the flattest minimum within a $\rho$-ball, targets this region without ever computing the PAC-Bayes bound explicitly. The bound explains why SAM works; SAM makes the bound tractable to optimize.

L2 regularization (weight decay) penalizes $\|w\|^2$. In PAC-Bayes terms, what does this correspond to?

Key takeaways

**Vacuous VC bounds:** for $W > m$, the VC bound exceeds 1 - mathematically correct, practically useless. VC counts parameters; PAC-Bayes measures how far the trained hypothesis is from the prior
**PAC-Bayes theorem (McAllester 1998):** $R(Q) \leq \hat{R}(Q) + \sqrt{(\mathrm{KL}(Q \| P) + \ln(m/\delta)) / (2m)}$ - holds simultaneously for all posteriors $Q$, with $P$ fixed before training
**Dziugaite-Roy 2017:** directly minimize the PAC-Bayes objective with $P = \mathcal{N}(0, \lambda I)$, $Q = \mathcal{N}(w, \sigma^2 I)$. First non-vacuous deep learning bound: 16% for a network with 2% test error
**Flat minima = low KL:** $\mathrm{KL}(\mathcal{N}(w, \sigma^2 I) \| \mathcal{N}(0, \lambda I))$ is small when $\|w\|^2$ is small and $\sigma^2$ is large - the flat minimum condition. This is why flat minima generalize
**SAM = PAC-Bayes optimization:** Sharpness-Aware Minimization seeks the flattest minimum in a $\rho$-ball, directly minimizing the KL term without computing the bound explicitly
**Unified framework:** L2 regularization = Gaussian prior PAC-Bayes. VAE ELBO = PAC-Bayes structure. Dropout = implicit sparse posterior. Posterior collapse = degenerate zero-KL solution

What comes next

PAC-Bayes connects backward to complexity theory and forward to the practical tools that make deep learning work:

Margin bounds — Margin is a concrete inductive bias formalized as a PAC-Bayes posterior concentrated on large-margin classifiers
Deep generalization — Implicit SGD bias toward flat minima is PAC-Bayes complexity control in disguise; double descent revisited through the KL lens
Rademacher complexity — Rademacher bounds are data-dependent bounds for a class; PAC-Bayes replaces the sup over the class with a KL-weighted average - the two are dual perspectives
No-Free-Lunch — NFL demanded a prior over hypotheses; PAC-Bayes provides the formal machinery - P is the prior, KL(Q||P) is the price of deviating from it

Вопросы для размышления

Dziugaite-Roy proved a 16% bound for a network with 2% test error - a gap of 14%. Is this gap a problem with PAC-Bayes theory, or a problem with the choice of prior P = N(0, lambda I)? What prior would tighten the bound, and how would you choose it without touching test data?
SAM minimizes sharpness to tighten the PAC-Bayes bound and consistently improves test accuracy on benchmarks. Does this mean sharpness and generalization are causally linked, or could both be correlated with a third variable - such as the geometry of the data manifold or the signal-to-noise ratio of the task?
The VAE ELBO and the PAC-Bayes bound have identical structure. Does a VAE with a well-trained posterior Q give a generalization certificate for the decoder? What would it mean for a generative model to generalize in the PAC-Bayes sense - and is sample quality the right analog of test error?

Связанные уроки

lt-08-no-free-lunch — NFL showed that inductive bias is mandatory; PAC-Bayes makes that bias explicit as a prior distribution
lt-06-rademacher — Rademacher gives data-dependent bounds for a class; PAC-Bayes is the Bayesian version - data-dependent complexity via KL divergence
lt-11-margin-bounds — Margin bounds are a special case of PAC-Bayes when Q concentrates on a single large-margin hypothesis
lt-13-deep-generalization — Flat minima, SAM, and implicit SGD bias all minimize the PAC-Bayes objective implicitly
lt-07-uniform-convergence
prob-04-bayes