Information Geometry
Amari α-connections: a whole family of geodesics
Chernoff, 1952: introduces the α-divergence for asymptotic statistics - hypothesis tests where the 'distance' between distributions must work in the large-sample limit. Amari, 1985: realises that behind it lie two different ways of moving across the multinomial simplex. One is additive (mixture, $p_t = (1-t)p_0 + tp_1$). The other is multiplicative (exponential, $p_t \propto p_0^{1-t} p_1^t$). Between them sits an entire family of connections $\nabla^{(\alpha)}$, each with its own geodesics, and every ML training run picks one implicitly. Plain SGD follows a Euclidean line. K-FAC follows $\alpha=0$. Mirror descent with KL follows $\alpha=-1$. EM alternates $\alpha=\pm 1$. A single parameter $\alpha$ structures half of applied optimisation.
- **Natural gradient / K-FAC** ($\alpha=0$): motion along Levi-Civita geodesics of the Fisher metric. Used in production training of large models at Google Brain - 2-5x speedup over Adam on tasks with clear exp-family structure
- **KL-mirror descent / Hedge / Exponentiated Gradient** ($\alpha=-1$): m-flat flow on the simplex. AdaBoost, portfolio optimisation, online learning - the same multiplicative step in three different fields
- **EM (GMM, HMM, LDA, VAE)**: alternation of $\alpha=+1$ (e-projection, E-step) and $\alpha=-1$ (m-projection, M-step). Csiszar-Tusnady (1984) proved convergence via Amari's Pythagorean theorem - geometry, not a heuristic
- **RLHF / PPO**: $D_{KL}(\pi_\theta \| \pi_{ref})$ - reverse KL ($\alpha=+1$), mode-seeking. This is exactly why a fine-tuned policy concentrates around good behaviour rather than spreading thin
Prerequisites
- KL and Bregman divergence
- Fisher metric and Cramér-Rao
- Dually flat structure: e- and m-connections
Two connections on the simplex and the interpolation between them
Two natural connections on the simplex
On the space of distributions there is no single canonical geodesic. Two coordinate systems - $\theta$ (natural parameters of an exponential family) and $\eta$ (expectation parameters) - live on the same manifold and each gives rise to its own notion of flatness.
The **e-connection** $\nabla^{(e)}$ declares straight those curves that are linear in $\theta$. Between two distributions $p_0$ and $p_1$ the e-geodesic is $\theta_t = (1-t)\theta_0 + t\theta_1$. In density space this becomes log-affine interpolation: $\log p_t = (1-t)\log p_0 + t\log p_1 - \log Z_t$. Multiplicative blending, normalised by $Z_t$.
The **m-connection** $\nabla^{(m)}$ declares straight those curves linear in $\eta$: $\eta_t = (1-t)\eta_0 + t\eta_1$. In density space this is an affine mixture: $p_t = (1-t)p_0 + t p_1$. Additive blending, no normalisation needed - the simplex is already convex.
**Physical intuition**. The e-geodesic is a smooth flow of noise: blending two distributions through the log domain shifts probability mass the way softmax does when its logits change. The m-geodesic is a smooth mixture: with probability $1-t$ a sample is drawn from $p_0$, otherwise from $p_1$. Same pair of endpoints, two different parametrisations of the path between them.
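The two paths can be compared numerically on a three-point simplex. This is a minimal sketch using only the standard library; the function names and the example endpoints are illustrative assumptions, and the e-path requires strictly positive probabilities because it works with logarithms.

```python
import math

def m_geodesic(p0, p1, t):
    """Mixture path: affine interpolation of the probabilities themselves."""
    return [(1 - t) * a + t * b for a, b in zip(p0, p1)]

def e_geodesic(p0, p1, t):
    """Exponential path: affine interpolation of log-probabilities, renormalised by Z_t."""
    unnorm = [math.exp((1 - t) * math.log(a) + t * math.log(b)) for a, b in zip(p0, p1)]
    z = sum(unnorm)  # the normaliser Z_t
    return [u / z for u in unnorm]

# Illustrative endpoints (assumed, not from the lesson).
p0 = [0.7, 0.2, 0.1]
p1 = [0.1, 0.2, 0.7]

mid_m = m_geodesic(p0, p1, 0.5)
mid_e = e_geodesic(p0, p1, 0.5)
print("m-midpoint:", mid_m)
print("e-midpoint:", mid_e)
```

Both paths share the same endpoints, but the midpoints differ: the geometric-mean blend of the e-path moves mass toward the outcome on which $p_0$ and $p_1$ agree.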
α-family: interpolating between e and m
Amari, 1985: between $\nabla^{(e)}$ and $\nabla^{(m)}$ lives a one-parameter family. The parameter $\alpha \in [-1, 1]$ sets the mix:
$$\nabla^{(\alpha)} = \tfrac{1+\alpha}{2}\nabla^{(e)} + \tfrac{1-\alpha}{2}\nabla^{(m)}$$
The endpoints. $\alpha=1$: only $\nabla^{(e)}$ survives - e-connection, e-flat manifold, geodesics linear in $\theta$. $\alpha=-1$: only $\nabla^{(m)}$ survives - m-connection, m-flat manifold, geodesics linear in $\eta$. $\alpha=0$: exactly halfway - the result is the Levi-Civita connection of the Fisher metric. Amari's natural gradient, 1998, follows precisely this connection.
**The catch**. e- and m-connections are both flat (zero curvature), yet they are not the same connection. Intermediate values $\alpha \in (-1, 1)$ are no longer flat - the curvature of $\nabla^{(\alpha)}$ is non-zero whenever $\alpha \neq \pm 1$. The geometry meanders between two flat extremes, and Fisher-Levi-Civita sits at the centre of mass.
Which $\alpha$-connection coincides with the Levi-Civita connection induced by the Fisher metric?
α-divergences and the link to Renyi
α-divergence: one formula, the entire family
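Each $\alpha$-connection has its own divergence. Amari's standard form (for $\alpha \neq \pm 1$):
$$D_\alpha(p\|q) = \frac{4}{1-\alpha^2}\left(1 - \int p^{\frac{1-\alpha}{2}}\, q^{\frac{1+\alpha}{2}}\, dx\right)$$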
Plugging in $\alpha = \pm 1$ directly fails (division by zero), but the limit exists. $\alpha \to 1$: $D_1(p\|q) = D_{KL}(q \| p)$ - reverse KL. $\alpha \to -1$: $D_{-1}(p\|q) = D_{KL}(p \| q)$ - forward KL. Two familiar inhabitants of ML turn out to be endpoints of one continuum.
At $\alpha = 0$ the result is proportional to the squared Hellinger distance $H^2(p,q) = \tfrac{1}{2}\int (\sqrt{p} - \sqrt{q})^2\, dx$ (in Amari's normalisation, $D_0 = 4H^2$) - a symmetric f-divergence sitting precisely in the middle. No coincidence: $\alpha=0$ produces Levi-Civita, the one self-dual member of the family, and the corresponding divergence is symmetric under argument swap.
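The limits and the Hellinger case can be checked numerically on a finite sample space. The sketch below assumes Amari's normalisation $D_\alpha(p\|q) = \tfrac{4}{1-\alpha^2}\bigl(1 - \sum_i p_i^{(1-\alpha)/2} q_i^{(1+\alpha)/2}\bigr)$; the helper names and the two example distributions are illustrative.

```python
import math

def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence on a finite sample space, valid for alpha != +-1."""
    s = sum(a ** ((1 - alpha) / 2) * b ** ((1 + alpha) / 2) for a, b in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive p and q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance with the 1/2 convention."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

# Illustrative distributions (assumed).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

near_minus_one = alpha_divergence(p, q, -1 + 1e-6)  # approaches forward KL(p || q)
near_plus_one = alpha_divergence(p, q, 1 - 1e-6)    # approaches reverse KL(q || p)
at_zero = alpha_divergence(p, q, 0.0)               # equals 4 * H^2

print(near_minus_one, kl(p, q))
print(near_plus_one, kl(q, p))
print(at_zero, 4 * hellinger_sq(p, q))
```

Each printed pair should agree to several decimal places, the first two up to the small offset from the limit and the third up to rounding.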
**Link to Renyi**. The Renyi divergence $R_\beta(p\|q) = \tfrac{1}{\beta-1}\log \int p^\beta q^{1-\beta}\, dx$ at $\beta = (1-\alpha)/2$ is a monotone transform of Amari's α-divergence. Renyi and Amari are one family written in two scales. Renyi $\beta = 1$ recovers KL, $\beta = 1/2$ gives Bhattacharyya / Hellinger, $\beta \to \infty$ delivers the max-divergence used in Differential Privacy.
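The monotone transform can be written out explicitly. A short derivation, assuming Amari's normalisation $D_\alpha(p\|q) = \tfrac{4}{1-\alpha^2}\bigl(1 - \int p^{(1-\alpha)/2} q^{(1+\alpha)/2}\, dx\bigr)$ and $\beta = (1-\alpha)/2 \in (0, 1)$:

$$\int p^{\beta} q^{1-\beta}\, dx = e^{(\beta-1) R_\beta(p\|q)}, \qquad 1-\alpha^2 = 4\beta(1-\beta) \;\Longrightarrow\; D_\alpha(p\|q) = \frac{1 - e^{-(1-\beta)\, R_\beta(p\|q)}}{\beta(1-\beta)}$$

Since $1-\beta > 0$, the right-hand side increases with $R_\beta$, so the two divergences rank pairs of distributions identically.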
Which divergence falls out of Amari's α-family at $\alpha = 0$?
Duality of $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ via the Fisher metric
Duality of $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ with respect to the Fisher metric
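The defining property of the α-family is not interchangeability but mutual complementarity. The connections $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ are dual with respect to the Fisher metric $g$. Formally - the Codazzi duality condition, for any vector fields $X$, $Y$, $Z$:
$$X\, g(Y, Z) = g\bigl(\nabla^{(\alpha)}_X Y,\, Z\bigr) + g\bigl(Y,\, \nabla^{(-\alpha)}_X Z\bigr)$$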
Translation. If $Y$ is parallel-transported using $\nabla^{(\alpha)}$ while $Z$ is transported using $\nabla^{(-\alpha)}$, their inner product under Fisher is preserved. One flow compensates the other. That is the meaning of duality in information geometry: a pair of connections together keep the metric invariant.
Symmetric consequence at the divergence level: $D_\alpha(p\|q) = D_{-\alpha}(q\|p)$. Argument swap = sign flip on $\alpha$. Forward KL ($\alpha=-1$) and reverse KL ($\alpha=+1$) are not two distinct objects but the same divergence viewed from opposite α-poles.
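The symmetry can be read off directly from the formula. A one-line check, assuming Amari's normalisation of $D_\alpha$ for $\alpha \neq \pm 1$: swapping the arguments and flipping the sign of $\alpha$ exchanges the two exponents twice, so the integrand is unchanged.

$$D_{-\alpha}(q\|p) = \frac{4}{1-\alpha^2}\left(1 - \int q^{\frac{1+\alpha}{2}}\, p^{\frac{1-\alpha}{2}}\, dx\right) = D_\alpha(p\|q)$$

The $\alpha = \pm 1$ cases follow by taking limits on both sides.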
**Pythagorean theorem for α-projections**. e-projection (minimising $D_{KL}(q\|p)$ over $q$ on an m-flat submanifold) and m-projection (minimising $D_{KL}(p\|q)$ over $q$ on an e-flat submanifold) are the $\alpha = \pm 1$ instances of α-projections; the opposite flatness of the submanifold is what makes each projection unique. Amari's 1985 Pythagorean theorem: in a dually flat space, when the $\nabla^{(\alpha)}$-geodesic from $p$ to $q$ meets the $\nabla^{(-\alpha)}$-geodesic from $q$ to $r$ orthogonally, $D_\alpha(p\|r) = D_\alpha(p\|q) + D_\alpha(q\|r)$ holds exactly. This guarantees monotone convergence of EM, Sinkhorn and many variational schemes.
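A concrete instance of the Pythagorean relation for $\alpha = -1$ (forward KL): the family of independent (product) distributions on a 2x2 table is e-flat, and the m-projection of a joint distribution $p$ onto it is the product of its marginals. The sketch below checks the identity for one arbitrary member $r$ of that family; the table entries and the choice of $r$ are illustrative assumptions.

```python
import math

def kl(p, q):
    """KL(p || q) for distributions given as flat lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def product(px, py):
    """Joint distribution of two independent variables, flattened row by row."""
    return [a * b for a in px for b in py]

# A correlated joint distribution on a 2x2 table (rows: x, columns: y).
p = [0.4, 0.1, 0.2, 0.3]
px = [p[0] + p[1], p[2] + p[3]]  # marginal of x
py = [p[0] + p[2], p[1] + p[3]]  # marginal of y

q = product(px, py)                  # m-projection of p onto the independence family
r = product([0.3, 0.7], [0.8, 0.2])  # any other member of that family

lhs = kl(p, r)
rhs = kl(p, q) + kl(q, r)
print(lhs, rhs)
```

The two printed numbers should coincide up to rounding, and $D_{KL}(p\|q) \le D_{KL}(p\|r)$ confirms that $q$ is the closest independent distribution to $p$ in forward KL.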
Duality of $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ with respect to the Fisher metric means that...
ML: natural gradient, mirror descent, EM as α-structures
**Natural gradient** (Amari, 1998). The step $\theta \leftarrow \theta - \eta\, \mathcal{F}^{-1} \nabla L$ (here $\eta$ is the step size, not the expectation parameter, and $\mathcal{F}$ is the Fisher information matrix) is, to first order, descent along a Levi-Civita geodesic, that is $\alpha=0$. A neutral flow, not biased toward the e- or m-side. K-FAC, Shampoo, Natural Policy Gradient - all approximations of the same $\alpha=0$ motion on the manifold of an exponential family.
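A one-parameter illustration of the Fisher-preconditioned step: a Bernoulli model in its natural parameter $\theta$ (the logit), fitted to an assumed sample mean of 0.9. The function names, the step size `lr` and the target are hypothetical choices for this sketch, not part of any optimiser library.

```python
import math

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def run(steps, lr, natural, x_bar=0.9, theta=0.0):
    """Minimise the Bernoulli negative log-likelihood in the natural parameter theta."""
    for _ in range(steps):
        mu = sigmoid(theta)
        grad = mu - x_bar         # dL/dtheta for L = -x_bar * theta + log(1 + e^theta)
        fisher = mu * (1.0 - mu)  # Fisher information of a Bernoulli in theta
        direction = grad / fisher if natural else grad
        theta -= lr * direction
    return sigmoid(theta)         # fitted mean after the given number of steps

mu_plain = run(steps=20, lr=0.5, natural=False)
mu_natural = run(steps=20, lr=0.5, natural=True)
print("plain gradient:  ", mu_plain)
print("natural gradient:", mu_natural)
```

With the same step size and step count, the Fisher-rescaled update should land much closer to the target mean, because dividing by $\mu(1-\mu)$ cancels the flattening of the sigmoid near the boundary.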
**Mirror descent** (Beck-Teboulle, 2003). A step with a Bregman divergence generated by a convex $\psi$ is motion along an m-flat structure, $\alpha = -1$. KL-mirror descent on the simplex ($\psi = $ negentropy) admits a closed form known as exponentiated gradient: $p_{t+1, i} \propto p_{t,i} \cdot e^{-\eta g_i}$, with step size $\eta$ and loss gradient $g$. The same Hedge / Multiplicative Weights step powers AdaBoost and portfolio theory - all of it is m-flat flow.
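The exponentiated-gradient update is short enough to write out in full. A minimal sketch with a fixed, hypothetical loss vector `g` (three "experts" with constant losses) and an assumed step size `lr`:

```python
import math

def exponentiated_gradient_step(p, g, lr):
    """One KL-mirror-descent step on the simplex: scale by exp(-lr * g), then renormalise."""
    w = [pi * math.exp(-lr * gi) for pi, gi in zip(p, g)]
    z = sum(w)
    return [wi / z for wi in w]

# Hypothetical per-expert losses; the gradient of the linear loss <g, p> is g itself.
g = [0.9, 0.1, 0.5]
p = [1 / 3, 1 / 3, 1 / 3]

for _ in range(50):
    p = exponentiated_gradient_step(p, g, lr=0.5)

print(p)
```

The iterate stays strictly inside the simplex at every step, with no explicit projection, while its mass concentrates on the lowest-loss coordinate.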
**EM algorithm** (Csiszar-Tusnady, 1984). Each iteration alternates an e- and an m-projection between the data manifold (distributions consistent with the observed $p_{data}$) and the parametric model $p_\theta$. The E-step is an e-projection: computing the posterior over latent variables minimises $D_{KL}(q \| p_\theta)$ over $q$ on the m-flat data manifold. The M-step is an m-projection: maximising the expected log-likelihood minimises the same divergence over $\theta$, i.e. projects onto the e-flat $\theta$-submanifold. Csiszar and Tusnady proved monotone convergence via Amari's Pythagorean theorem - not a heuristic, a geometric fact.
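Monotone convergence is easy to observe on a toy problem. The sketch below runs EM on a two-component 1D Gaussian mixture with unit variances, estimating the two means and the mixing weight; the data set and the initial values are illustrative assumptions.

```python
import math

def normal_pdf(x, mu):
    """Density of N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def log_likelihood(data, w, mu0, mu1):
    return sum(math.log((1 - w) * normal_pdf(x, mu0) + w * normal_pdf(x, mu1)) for x in data)

def em_step(data, w, mu0, mu1):
    # E-step: posterior responsibility of component 1 for each point.
    resp = []
    for x in data:
        a = (1 - w) * normal_pdf(x, mu0)
        b = w * normal_pdf(x, mu1)
        resp.append(b / (a + b))
    # M-step: closed-form maximiser of the expected complete-data log-likelihood.
    n1 = sum(resp)
    n0 = len(data) - n1
    new_mu0 = sum((1 - r) * x for r, x in zip(resp, data)) / n0
    new_mu1 = sum(r * x for r, x in zip(resp, data)) / n1
    return n1 / len(data), new_mu0, new_mu1

# Illustrative data: two clusters around -2 and +2.
data = [-2.1, -1.9, -2.4, -1.6, -2.0, 1.8, 2.2, 2.0, 2.5, 1.5]
w, mu0, mu1 = 0.5, -0.5, 0.5

history = [log_likelihood(data, w, mu0, mu1)]
for _ in range(30):
    w, mu0, mu1 = em_step(data, w, mu0, mu1)
    history.append(log_likelihood(data, w, mu0, mu1))

print(w, mu0, mu1)
print(history[0], history[-1])
```

The recorded log-likelihood should never decrease from one iteration to the next, which is the monotonicity that the projection argument guarantees.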
**Variational inference: forward KL vs reverse KL**. Forward $D_{KL}(p \| q)$ ($\alpha=-1$) is an m-projection - mode-covering: the approximation $q$ stretches to cover every mode of the true $p$. Reverse $D_{KL}(q \| p)$ ($\alpha=+1$) is an e-projection - mode-seeking: $q$ collapses onto a single mode. RLHF and standard VAEs minimise reverse KL precisely because mode-seeking yields sharp, confident distributions, which is what one needs for text generation or sampling. Forward KL appears more rarely - in expectation propagation and in forward-KL VI variants.
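The mode-covering versus mode-seeking contrast shows up in a brute-force experiment: approximate a bimodal target by a single discretised Gaussian and pick the best candidate under each direction of KL. The grid, the target (modes at $\pm 3$) and the candidate set are all assumptions made for this sketch.

```python
import math

xs = [0.2 * i for i in range(-50, 51)]  # grid on [-10, 10]

def discretised_gaussian(mean, sigma):
    """Gaussian weights evaluated on the grid and normalised to sum to one."""
    w = [math.exp(-0.5 * ((x - mean) / sigma) ** 2) for x in xs]
    z = sum(w)
    return [v / z for v in w]

def kl(a, b):
    """KL(a || b) for distributions on the same grid."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# Bimodal target p: equal mixture of two narrow bumps at -3 and +3.
left = discretised_gaussian(-3.0, 0.7)
right = discretised_gaussian(3.0, 0.7)
p = [0.5 * l + 0.5 * r for l, r in zip(left, right)]

# Candidate approximations q: means in [-4, 4], widths in [0.5, 4].
candidates = [(0.5 * i, 0.5 * k) for i in range(-8, 9) for k in range(1, 9)]

forward_best = min(candidates, key=lambda c: kl(p, discretised_gaussian(*c)))  # KL(p || q)
reverse_best = min(candidates, key=lambda c: kl(discretised_gaussian(*c), p))  # KL(q || p)

print("forward KL picks (mean, sigma):", forward_best)
print("reverse KL picks (mean, sigma):", reverse_best)
```

Forward KL should select a wide Gaussian centred between the modes, while reverse KL should select a narrow one sitting on a single mode; which of the two symmetric modes it lands on depends only on floating-point tie-breaking.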
**Bregman divergence as the $\alpha = \pm 1$ case**. Any Bregman divergence $D_\psi$ is the canonical divergence of the dually flat structure generated by its convex $\psi$, and so plays the $\alpha = -1$ role (with $\psi = $ negentropy one recovers forward KL exactly). The symmetric partner at $\alpha = +1$ is reverse Bregman, i.e. Bregman with arguments swapped. The whole Bregman world is therefore two slices of one α-manifold.
Misconception: α-connections are a purely theoretical construction; in practical ML the Fisher metric alone is enough.
In fact, every algorithm implicitly picks an α: SGD - Euclidean flow, natural gradient - α=0, mirror descent with KL - α=-1, EM - alternation of α=±1. The choice of α defines what 'moving straight' means in the model space.
These are not different optimisers running at different speeds. They are different geometries, and each suits its own problem class: α=0 - confident parametric models, α=-1 - optimisation over the simplex, α=+1 - mode-seeking approximations. Mixing them up means paying for the wrong geometry on every training step.
RLHF fine-tuning typically minimises $D_{KL}(\pi_\theta \| \pi_{ref})$ as a regulariser. Which α-projection is this and why?
Key takeaways
- **$\nabla^{(\alpha)} = \tfrac{1+\alpha}{2}\nabla^{(e)} + \tfrac{1-\alpha}{2}\nabla^{(m)}$**: a one-parameter family of connections. $\alpha=1$ is e-flat, $\alpha=-1$ is m-flat, $\alpha=0$ is the Levi-Civita connection of the Fisher metric
- **α-divergence $D_\alpha$**: $D_{-1} = D_{KL}(p\|q)$ (forward), $D_{+1} = D_{KL}(q\|p)$ (reverse), $D_0 \propto H^2$ (Hellinger). Renyi $R_\beta$ is the same family in a different scale
- **Duality $\nabla^{(\alpha)} \leftrightarrow \nabla^{(-\alpha)}$**: the dual pair preserves the Fisher metric via the Codazzi identity. Divergence symmetry: $D_\alpha(p\|q) = D_{-\alpha}(q\|p)$
- **ML translation**: natural gradient = $\alpha=0$ flow, mirror descent with KL = $\alpha=-1$, EM = alternating $\alpha=\pm 1$, RLHF reverse KL = $\alpha=+1$. Bregman divergence is the $\alpha = \pm 1$ case
- **Pythagoras for α-projections**: when dual geodesics meet orthogonally, $D_\alpha(p\|r) = D_\alpha(p\|q) + D_\alpha(q\|r)$ holds exactly. The geometric guarantee behind monotone convergence of EM, Sinkhorn and variational schemes
Where this leads
The α-family is the infrastructure. Concrete algorithms are built on top of it:
- Natural gradient ($\alpha=0$) — Levi-Civita flow on the parametric manifold: K-FAC, Shampoo, NPG
- Information projections — e- and m-projections at $\alpha=\pm 1$ - the workhorse of EM, Sinkhorn, expectation propagation
- Mirror descent ($\alpha=-1$) — Bregman projection as m-flat flow on a convex set
- Information geometry in deep learning — VAE, normalising flows, diffusion - built on α-structures over exp-families
Questions for reflection
- Bregman divergence is the $\alpha = \pm 1$ case of the α-family. Why do intermediate $\alpha \in (-1, 1)$ fall outside the Bregman picture, and what role do they play instead?
- RLHF minimises reverse KL ($\alpha=+1$, mode-seeking). What happens to policy behaviour if this regulariser is replaced by forward KL ($\alpha=-1$, mode-covering), and why does that empirically lead to unstable generation?
- EM converges monotonically via Amari's Pythagorean theorem - a geometric fact. Yet EM still gets stuck in local maxima. Where exactly does geometry stop helping: in step monotonicity or in the choice of starting point?
Related lessons
- ig-02-fisher-metric — α-connections are dual with respect to the Fisher metric - no Fisher, no duality
- ig-04-kl-bregman — α=±1 limits give forward and reverse KL - the canonical Bregman case
- ig-05-dual-flat — e- and m-connections are the endpoints of the α-family
- ig-07-natural-gradient — Natural gradient is the Riemannian flow at α=0 (Levi-Civita)
- ig-09-mirror-descent — Mirror descent with KL is the m-flat (α=-1) Bregman projection
- ig-10-deep-learning — Variational inference and EM in neural nets alternate α=±1 projections
- stat-01-sampling