Information Geometry
Geodesic Flows on Statistical Manifolds
The EM algorithm converges monotonically - known since 1977. The geometric reason was provided by Amari in 1985: each EM step is a geodesic projection, and monotone convergence follows from the KL Pythagorean theorem.
- EM algorithm: E-step is m-projection, M-step is e-projection; KL Pythagorean theorem gives the geometric proof of monotone likelihood increase at every iteration
- Mean-field variational inference: the factored approximation q(z) = prod q_i(z_i) is an m-projection of the true posterior onto the manifold of factored distributions
- LDA topic models: variational EM uses the e/m projection structure to guarantee convergence of the algorithm
- Online learning: natural gradient algorithms are discretizations of geodesic flow on the statistical manifold
Предварительные знания
- Exponential families and natural parameters
- KL-divergence properties
- Affine connections on manifolds
Alpha-Geodesics and the KL Pythagorean Theorem
A statistical manifold carries a one-parameter family of affine connections indexed by alpha in R. At alpha = +1 (exponential connection) geodesics are straight lines in natural-parameter coordinates eta. At alpha = -1 (mixture connection) geodesics are straight lines in mean-parameter coordinates mu. At alpha = 0 one recovers the Levi-Civita connection. The dual pair alpha = +/-1 underlies the KL Pythagorean theorem.
Amari (1985) proved a duality theorem: for any pair of dual connections (nabla, nabla*) with respect to the Fisher metric g, the curvature of nabla vanishes iff the curvature of nabla* vanishes. For exponential families both the +1 and -1 connections are flat. This flatness is what makes the Pythagorean theorem exact, not approximate.
What does the KL Pythagorean theorem state in information geometry?
KL Pythagorean theorem: D_KL(p||r) = D_KL(p||q) + D_KL(q||r) when q is the m-projection of p onto an e-flat submanifold containing r. This is the geometric foundation of EM convergence. KL does not satisfy the triangle inequality in general.
Variational Inference Geometry: Projections and ELBO
Variational inference finds the closest approximation q(z) approx p(z|x) within a tractable family Q. Geometrically this is finding the nearest point on the submanifold Q inside the space of all distributions. The choice of divergence determines which projection is computed: KL(q||p) gives an e-projection, KL(p||q) gives an m-projection. They produce qualitatively different approximations.
Expectation Propagation (EP) minimizes KL(p||q) - the mass-covering direction - by iterative factor updates. EP tends to produce better-calibrated uncertainty estimates than ELBO-based VI for multimodal posteriors, at higher computational cost. The geometric difference is that EP solves for an m-projection while ELBO-VI solves for an e-projection.
Why does standard variational inference (ELBO maximization) tend to underestimate posterior variance?
KL(q||p) penalizes q placing mass where p is small (zero-forcing), not q missing mass where p is large. So the optimizer finds q concentrated on one mode of p, ignoring others. This is an e-projection. The KL(p||q) direction (m-projection, used by EP) forces q to cover all regions where p > 0, giving mass-covering behavior.
Bregman Divergences and the Flat Geometry of Exponential Families
A statistical manifold is e-flat if the curvature of the exponential connection vanishes, and m-flat if the mixture curvature vanishes. Exponential families are simultaneously e-flat and m-flat - dually flat in both connections. This double flatness is what makes the KL Pythagorean theorem exact and enables closed-form MLE, projections, and EM steps.
Bregman k-means generalizes k-means by replacing squared Euclidean distance with any Bregman divergence. For KL as the Bregman divergence, the centroid update becomes the mean of sufficient statistics - matching the M-step of EM for mixture models. The convergence proof uses the Bregman Pythagorean theorem.
What property of exponential families guarantees a unique MLE and makes EM converge monotonically?
Dual flatness means the curvature tensors of both the (+1) and (-1) connections vanish. This implies: (1) e-geodesics are straight in eta, (2) m-geodesics are straight in mu, (3) KL Pythagorean theorem holds exactly, (4) MLE reduces to moment matching, (5) EM has monotone convergence. All closed-form statistical results in exponential families follow from this single geometric property.
Connections to other topics
The dual alpha-connection structure explains fundamental machine learning algorithms through geometry.
- EM algorithm — Related topic
- Mean-field variational inference — Related topic
- Bregman divergences — Related topic
Итоги
- Alpha-connections: one-parameter family of affine structures; alpha = +/-1 is a dual pair, alpha = 0 is Levi-Civita
- e-geodesics (alpha = +1): normalized geometric mixtures, straight lines in natural-parameter space eta
- m-geodesics (alpha = -1): ordinary convex mixtures, straight lines in mean-parameter space mu
- KL Pythagorean theorem: KL(p||r) = KL(p||q) + KL(q||r) at orthogonal intersection of e- and m-geodesics
- EM: E-step = m-projection, M-step = e-projection; monotone convergence follows from the Pythagorean theorem