Information Geometry
Information projections: e and m
MLE and variational inference - different formulas, different communities, different intuitions. Frequentists use log-likelihood. Bayesians use ELBO. But both are information projections. The only difference: the direction of KL divergence. MLE projects the model onto the data via $D_{KL}(q \| p_{data})$ - mode-seeking. Standard VI projects the posterior approximation via $D_{KL}(q \| p_{true})$ - also mode-seeking. Flip the direction and the result is forward KL, mode-covering. Two buttons, two behaviours. Pythagoras unifies them inside one geometry - and from that single theorem, EM, belief propagation and Sinkhorn all follow.
- **MLE and moment matching**: training GPT, BERT, any LLM is an e-projection of the empirical distribution onto the parametric family. Cross-entropy loss = $D_{KL}(p_{data} \| p_\theta)$ up to a constant. 50257 tokens, trillions of pairs - one projection formula
- **VAE and VQ-VAE**: ELBO optimisation is an e-projection of $q_\phi(z|x)$ onto the posterior $p(z|x)$. Stable Diffusion uses a VAE to compress into latent space. Each training step - millions of e-projections of the posterior onto a Gaussian
- **EM in production**: GMM, HMM, LDA (topic modelling), EM for aligned sequences in ASR (inside Whisper), K-Means as degenerate EM - all alternate e/m projections following Csiszar-Tusnady (1984)
- **Belief propagation and factor graphs**: Loopy BP in the Ising model, message passing in LDA, turbo codes (iterative decoding via BP) - sequential m-projections onto factor constraints. Expectation Propagation (EP) replaces them with e-projections for better convergence in cyclic graphs
Предварительные знания
- KL divergence as a Bregman divergence
- Exponential families and natural parameters
- α-connections: e- and m-connections as endpoints of the α-family
e-projection: moment matching and MLE
e-projection: the nearest point in an exponential family
Given an arbitrary distribution $p$ and an exponential submanifold $\mathcal{E}$ - a family of the form $q_\theta \propto \exp(\theta \cdot T(x))$ - the task is to find the $q^* \in \mathcal{E}$ closest to $p$ in KL divergence. The e-projection is:
Minimising $D_{KL}(q \| p) = \mathbb{E}_q[\log q - \log p]$ over $\theta$ is equivalent to maximising $\mathbb{E}_q[\log p]$, which is maximum likelihood. There is a twist: inside an exponential family, $D_{KL}(q_\theta \| p)$ hits its minimum exactly when the **sufficient statistics** match - $\mathbb{E}_{q^*}[T(x)] = \mathbb{E}_p[T(x)]$. That is moment matching.
**Moment matching = e-projection.** If $\mathcal{E}$ is the Gaussian family, the e-projection of $p$ onto $\mathcal{E}$ is the Gaussian with the same mean and variance as $p$. If $\mathcal{E}$ is multinomial, the projection is the empirical distribution. No iterative optimisation: match the moments and the projection is done.
**The M-step of EM is an e-projection.** In EM for GMM or HMM, parameters are updated to maximise the expected log-likelihood - that is, the current posterior is e-projected onto the parametric submanifold $\mathcal{E}$. Csiszar-Tusnady (1984) proved this is precisely what the M-step does, and that it guarantees monotone likelihood increase at every iteration.
The e-projection $q^* = \arg\min_{q \in \mathcal{E}} D_{KL}(q \| p)$ is equivalent to...
m-projection: mean-field and variational inference
m-projection: the nearest point in a mixture submanifold
Now reverse the direction of KL. The submanifold $\mathcal{M}$ is m-flat - a mixture family, for example the set of fully factorised distributions. The m-projection of $p$ onto $\mathcal{M}$:
Minimising $D_{KL}(p \| q) = \mathbb{E}_p[\log p - \log q]$ means maximising $\mathbb{E}_p[\log q]$. This is **mass-covering**: $q$ must cover every region where $p > 0$, otherwise $D_{KL}(p \| q) = \infty$. A single missed mode of $p$ costs an infinite penalty.
**Mean-field variational inference is an m-projection** onto the submanifold of factorised distributions. The VAE minimises $D_{KL}(q_\phi(z|x) \| p(z|x))$ - an e-projection of the posterior onto the parametric family $q_\phi$. The ELBO is a lower bound on the log-evidence, equal to the negative KL up to a constant. VQ-VAE, hierarchical VAEs, diffusion models trained via ELBO - all e-project the posterior onto a tractable family.
**The key difference: mode-seeking vs mode-covering.** e-projection ($D_{KL}(q \| p)$): if $q(x) > 0$ but $p(x) = 0$ - infinite penalty. So $q$ cannot leave the support of $p$, but it can ignore modes. Result: mode-seeking - $q$ picks one mode and concentrates there. m-projection ($D_{KL}(p \| q)$): if $p(x) > 0$ but $q(x) = 0$ - infinite penalty. So $q$ is forced to cover everything where $p > 0$. Result: mode-covering - $q$ spreads across all modes of $p$.
Why does the standard VAE (ELBO = $\mathbb{E}_q[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))$) yield a mode-seeking approximation of the posterior?
Pythagorean theorem in information geometry
Pythagoras without Euclidean space
In Euclidean geometry the Pythagorean theorem is $|AC|^2 = |AB|^2 + |BC|^2$ for a right triangle. In information geometry, Amari (1985) derived an exact analogue for KL divergence - without squares, without Euclidean distances, but valid for all distributions in an exponential family:
Here $q^*$ is the e-projection of $r$ onto the e-flat submanifold passing through $p$ - equivalently, the m-projection of $p$ onto the m-flat submanifold through $r$. The condition: the e-geodesic from $q^*$ to $p$ and the m-geodesic from $q^*$ to $r$ must be **orthogonal** in the Fisher metric. That orthogonality is the statistical right angle.
**What the Pythagorean theorem delivers.** If $q^*$ is the projection of $p$ onto a submanifold $\mathcal{M}$, then for any other $r \in \mathcal{M}$: $D_{KL}(p \| r) \geq D_{KL}(p \| q^*)$. The projection is minimal. This is not a heuristic - it follows from geodesic orthogonality. That is how one proves EM does not diverge: each step decreases KL, and no other point in $\mathcal{M}$ gives a smaller KL than the projection.
Amari's Pythagorean theorem $D_{KL}(p \| r) = D_{KL}(p \| q^*) + D_{KL}(q^* \| r)$ holds when...
EM and belief propagation as projections
EM: two steps - two kinds of projection
Csiszar and Tusnady (1984) - one year before Amari's book - gave EM an exact geometric reading. The algorithm operates on two submanifolds: $\mathcal{E}$ is e-flat (the parametric model $p_\theta$) and $\mathcal{M}$ is m-flat (completed-data distributions, the joint over observations and latent variables).
E-step: m-projection of $p_\theta$ onto the m-flat submanifold $\mathcal{M}$ - computing the posterior $q(z|x, \theta)$. M-step: e-projection of $q$ onto the e-flat $\mathcal{E}$ - moment matching, maximising expected log-likelihood. Two steps, two projections, alternating. The Pythagorean theorem guarantees the log-likelihood never decreases: $D_{KL}(p_{data} \| p_{\theta^{(t+1)}}) \leq D_{KL}(p_{data} \| p_{\theta^{(t)}})$.
**Belief propagation as sequential projections.** In factor graphs (LDA, Ising model, Bayesian networks) message passing is sequential m-projection onto factor constraints. Each message $\mu_{f \to x}$ is an m-projection of the local joint onto the marginal submanifold. Loopy BP in the Ising model consists of iterated m-projections, which can fail to converge in cyclic graphs. Expectation Propagation (Minka, 2001) replaces them with e-projections onto a factorised family - more stable in the presence of cycles.
**Sinkhorn = EM for OT.** The Sinkhorn algorithm for optimal transport is EM on the transport manifold. Row normalisation is an m-projection onto the marginal constraint on $x$. Column normalisation is an m-projection onto the marginal constraint on $y$. The Pythagorean theorem again guarantees monotone convergence. Three different problems (mixture models, graphical models, optimal transport) - one geometric mechanism.
EM converges because each step is globally optimal - the M-step finds the global maximum
EM's monotonicity is a geometric fact (Pythagorean theorem for e/m projections), not a consequence of global optimality of individual steps
EM can get stuck in a local maximum. The theorem guarantees only monotonicity: each step is no worse than the previous one. Where exactly the algorithm converges depends on the starting point, not on the geometry. Geometry accounts for step-wise monotonicity, not for global optimality.
In EM for GMM the E-step is an m-projection and the M-step is an e-projection. What guarantees monotone increase of log-likelihood?
Key takeaways
- **e-projection** $q^* = \arg\min_{q \in \mathcal{E}} D_{KL}(q \| p)$: mode-seeking, moment matching, equivalent to MLE. The EM M-step is an e-projection onto the parametric submanifold
- **m-projection** $q^* = \arg\min_{q \in \mathcal{M}} D_{KL}(p \| q)$: mode-covering, mean-field VI, ELBO. The EM E-step is an m-projection onto the completed-data manifold
- **Pythagorean theorem**: $D_{KL}(p \| r) = D_{KL}(p \| q^*) + D_{KL}(q^* \| r)$ when e/m geodesics are orthogonal. Guarantees minimality of projection and monotone convergence of EM
- **EM = alternating e/m**: E-step - m-projection, M-step - e-projection. Csiszar-Tusnady (1984): monotonicity proven via the Pythagorean theorem, not a heuristic
- **Belief propagation and Sinkhorn** are sequential m-projections onto factor constraints. One mechanism, three problems: mixture models, graphical models, optimal transport
Where this leads
Information projections are the tool. Applications are built on top of them:
- Mirror descent ($\alpha=-1$) — m-projection onto a convex set with a Bregman divergence
- Information geometry in deep learning — VAE, diffusion, normalising flows through e/m projections
Вопросы для размышления
- The VAE minimises $D_{KL}(q_\phi(z|x) \| p(z|x))$ - an e-projection, mode-seeking. If one replaces it with the forward KL $D_{KL}(p(z|x) \| q_\phi(z|x))$, what happens to generation quality, and why is that direction much harder to train?
- Belief propagation in cyclic graphs fails to converge. Expectation Propagation replaces m-projections with e-projections. Why does the direction of KL improve convergence in the presence of cycles - what exactly changes?
- EM for GMM monotonically increases likelihood yet gets stuck in local maxima. K-Means is degenerate EM with hard assignments. What is lost moving from soft (EM) to hard (K-Means) posteriors, and what role does the projection geometry play there?
Связанные уроки
- ig-06-alpha-connections — e- and m-projections are the α=+1 and α=-1 instances of the α-connection family
- ig-07-natural-gradient — Natural gradient is the α=0 flow; projections live at the α=±1 endpoints
- ig-09-mirror-descent — Mirror descent with KL is m-projection onto a convex set - a direct consequence
- ig-10-deep-learning — VAE ELBO = m-projection; EM in GMM/LDA = alternating e/m projections
- ig-04-kl-bregman — KL divergence is the core instrument for both kinds of projection
- stat-01-sampling