Information Geometry

Applications: Neural Networks and Optimal Experiment Design

The Laplace approximation in Bayesian neural networks (2021, 60M-parameter network) uses Fisher geometry for uncertainty quantification. The Fisher matrix is simultaneously the expected negative Hessian of the log-likelihood and the Riemannian metric on parameter space: two roles, one matrix.

Laplace Redux (Daxberger et al., NeurIPS 2021): scalable Bayesian DL via K-FAC Fisher
D-optimal design in Phase III clinical trials: reduces sample size by 20-40% at the same accuracy
K-FAC optimizer in JAX (Google, 2022): natural gradient for transformers up to 1B parameters

Laplace Approximation in Bayesian Neural Networks

Bayesian inference for a 60M-parameter neural network requires integrating over a 60M-dimensional space, which is analytically intractable. The Laplace approximation finds the MAP point, then approximates the posterior as a Gaussian whose covariance is the inverse of the negative Hessian of the log posterior at that point. The likelihood part of that negative Hessian is, in expectation, the Fisher matrix. One matrix, two roles.

Laplace Redux (NeurIPS 2021) applies this to ResNet-50: K-FAC approximates the roughly 25M x 25M Hessian with Kronecker products of small per-layer factors, on the order of 512 x 512. Result: calibrated uncertainty without Bayesian training from scratch.
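
To make the recipe concrete, here is a minimal sketch on a toy problem rather than on a neural network: Bayesian logistic regression with a Gaussian prior. The data, the prior precision, and the function names are assumptions chosen for illustration. For this model the Fisher matrix and the negative Hessian of the log-likelihood coincide exactly; for deep networks they generally differ, and the Fisher serves as an approximation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posterior_precision(X, theta, prior_precision):
    """Negative Hessian of the log posterior: Fisher + prior precision."""
    p = sigmoid(X @ theta)
    # Fisher information of the Bernoulli likelihood:
    # sum_i p_i (1 - p_i) x_i x_i^T
    fisher = (X * (p * (1.0 - p))[:, None]).T @ X
    return fisher + prior_precision * np.eye(X.shape[1])

def laplace_logistic(X, y, prior_precision=1.0, n_steps=50):
    """Return (theta_map, covariance) of the Gaussian posterior approximation.

    Assumes labels y in {0, 1} and prior theta ~ N(0, I / prior_precision).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # Gradient of the negative log posterior.
        grad = X.T @ (sigmoid(X @ theta) - y) + prior_precision * theta
        # Newton step towards the MAP point.
        H = posterior_precision(X, theta, prior_precision)
        theta = theta - np.linalg.solve(H, grad)
    covariance = np.linalg.inv(posterior_precision(X, theta, prior_precision))
    return theta, covariance

# Hypothetical toy data: 200 points, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
theta_true = np.array([1.5, -1.0])
y = (rng.random(200) < sigmoid(X @ theta_true)).astype(float)

theta_map, cov = laplace_logistic(X, y)
print("MAP estimate:", theta_map)
print("posterior std:", np.sqrt(np.diag(cov)))
```

The dense inverse in the last step is what becomes infeasible at neural-network scale, which is where structured approximations such as K-FAC come in.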

Why does the Laplace approximation use the Fisher matrix instead of the full Hessian of the log posterior?

Optimal Experiment Design via Fisher Information

In Phase III clinical trials the choice of dose levels (experiment design xi) determines estimation accuracy. D-optimal design maximizes det(I(theta,xi)); equivalently, it minimizes the volume of the confidence ellipsoid for theta, which scales as det(I)^{-1/2}. In practice this reduces sample size by 20-40% at the same accuracy.

D-optimal design is equivalent to minimizing the differential entropy of the asymptotically Gaussian estimator: h(theta-hat) = (d/2)(1+log 2pi) + (1/2) log det(I^{-1}). Since log det(I^{-1}) = -log det(I), maximizing det(I) drives this entropy down. This connects optimal design to information theory.
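
The relation between det(I) and estimator entropy can be checked numerically. The sketch below assumes a deliberately simple setting that is not from a clinical trial: a straight-line model y = theta0 + theta1 * x with unit-variance Gaussian noise, four observations, and x restricted to [-1, 1]. In this linear model the information matrix does not depend on theta; in nonlinear dose-response models it does, and designs are then only locally optimal. All names are illustrative.

```python
import numpy as np

def fisher_information(xs, noise_var=1.0):
    """Fisher information for y = theta0 + theta1 * x + Gaussian noise."""
    F = np.column_stack([np.ones_like(xs), xs])  # rows are f(x) = [1, x]
    return F.T @ F / noise_var

def gaussian_entropy(info):
    """Differential entropy (nats) of a Gaussian with covariance info^{-1}."""
    d = info.shape[0]
    _, logdet = np.linalg.slogdet(info)
    # (d/2)(1 + log 2pi) + (1/2) log det(I^{-1})
    return 0.5 * d * (1.0 + np.log(2.0 * np.pi)) - 0.5 * logdet

# Two candidate designs with four observations each.
endpoints = np.array([-1.0, -1.0, 1.0, 1.0])   # mass at the interval ends
equispaced = np.linspace(-1.0, 1.0, 4)          # evenly spread points

I_end = fisher_information(endpoints)
I_eq = fisher_information(equispaced)

det_end = np.linalg.det(I_end)
det_eq = np.linalg.det(I_eq)

# D-efficiency of the equispaced design relative to the endpoint design
# (d = 2 parameters).
d_efficiency = (det_eq / det_end) ** (1.0 / 2)

# A-optimality criterion for comparison: trace of the inverse information.
a_end = np.trace(np.linalg.inv(I_end))
a_eq = np.trace(np.linalg.inv(I_eq))

print("det I:", det_end, det_eq)
print("entropy:", gaussian_entropy(I_end), gaussian_entropy(I_eq))
print("D-efficiency of equispaced design:", d_efficiency)
print("tr(I^-1):", a_end, a_eq)
```

The endpoint design has the larger determinant and the lower entropy. Under the usual reading of D-efficiency, a value near 0.75 means the equispaced design would need about a third more observations to match it.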

D-optimal design maximizes det(I(theta, xi)). What is the geometric meaning?

K-FAC: Practical Kronecker-Factored Fisher Approximation

For a network with one million parameters the full Fisher matrix has 10^12 entries, about 4 TB in float32. K-FAC (Kronecker-Factored Approximate Curvature) approximates it as a block-diagonal matrix with Kronecker structure in each block: F approx direct-sum_l (G_l ⊗ A_l). For a layer with an n x m weight matrix, this reduces storage from O(n^2 m^2) for the full block to O(n^2 + m^2) for the two factors.
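
A small numerical sketch of the per-layer saving follows. It assumes a single linear layer with random stand-in activations and output gradients, and it uses NumPy's row-major flattening, under which the block takes the form G ⊗ A as written above. It also illustrates the identity (G ⊗ A)^{-1} = G^{-1} ⊗ A^{-1}, so a preconditioned gradient can be computed from the two small factors without forming the large block. Sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, batch = 3, 4, 1000

# Stand-ins for quantities collected during backpropagation.
a = rng.normal(size=(batch, n_in))    # layer inputs (activations)
g = rng.normal(size=(batch, n_out))   # gradients w.r.t. layer outputs

A = a.T @ a / batch                   # n_in x n_in factor
G = g.T @ g / batch                   # n_out x n_out factor

V = rng.normal(size=(n_out, n_in))    # some weight gradient to precondition

# Dense route: build the full Kronecker block and solve against it.
F_block = np.kron(G, A)               # (n_out*n_in) x (n_out*n_in)
dense = np.linalg.solve(F_block, V.reshape(-1)).reshape(n_out, n_in)

# Factored route: G^{-1} V A^{-1}, using only the small factors
# (A is symmetric, so no transpose of its inverse is needed).
left = np.linalg.solve(G, V)
factored = np.linalg.solve(A, left.T).T

dense_entries = (n_out * n_in) ** 2
factored_entries = n_out ** 2 + n_in ** 2

print("max difference:", np.max(np.abs(dense - factored)))
print("entries stored:", dense_entries, "vs", factored_entries)
```

Practical implementations typically add damping to the factors before inverting them; that step is left out here to keep the identity visible.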

Laplace Redux (NeurIPS 2021) uses the K-FAC Fisher as the Hessian approximation in the Laplace approximation. This closes the loop: statistical manifolds -> natural gradient -> K-FAC -> Bayesian uncertainty in neural networks.

K-FAC approximates the Fisher block for layer l as G_l ⊗ A_l. Under what condition is this approximation accurate?

Summary

Laplace approx: p(theta|D) approx N(theta_MAP, I(theta_MAP)^{-1}), Fisher approximating the negative Hessian of the log posterior
D-optimality: max_xi log det I(theta,xi), maximizes experimental information
A-optimality: min_xi tr(I(theta,xi)^{-1}), minimizes average parameter variance
K-FAC: F approx direct-sum_l (G_l ⊗ A_l), block-Kronecker approximation of Fisher

Questions for Reflection

Why is the Fisher matrix simultaneously the Fisher-Rao metric and the expected negative Hessian of the log-likelihood?
How does D-optimal experiment design relate to geodesics on the statistical manifold?
Under what conditions is the K-FAC approximation of the Fisher matrix accurate enough for Laplace approximation?

Related Lessons

ig-17-natural-gradient-deep: ig-20 covers production use of K-FAC and Fisher geometry
ig-18: Laplace approximation uses the Fisher matrix as the Hessian
