Information Geometry
KL as Bregman Divergence
Behind PPO, TRPO, natural gradient, mirror descent, and VAE ELBO stands one geometry - Bregman divergences. KL divergence is not a special formula. It is one element of a large family defined by the choice of convex function. Change the function - get a different optimization algorithm with different geometry.
- **PPO / TRPO**: policy step constraint $D_{KL}(\pi_{new} \| \pi_{old}) \leq \delta$ - Bregman projection onto the set of admissible policies
- **VAE ELBO**: term $-D_{KL}(q_\phi \| p)$ in the variational lower bound - Bregman distance from approximation to prior
- **Natural gradient (Amari)**: Fisher metric is the Hessian of the log-partition function in exp-family, the Hessian of the generating $\varphi$
- **Mirror descent / Hedge**: online optimization over the simplex - multiplicative step as Bregman projection with KL
- **Sinkhorn (Cuturi 2013)**: entropic regularization in OT - projection onto transport polytope in KL metric
Предварительные знания
Bregman Divergence
Start with a provocation. Behind PPO, TRPO, natural gradient, mirror descent, and VAE ELBO stands one geometry: Bregman divergences. KL divergence is not a special formula with its own physics. It is one element of a large family generated by a convex function.
Let $\varphi: \mathbb{R}^n \to \mathbb{R}$ be strictly convex and differentiable. The Bregman divergence generated by $\varphi$:
Read directly: the difference between the true value $\varphi(p)$ and the linear approximation of $\varphi$ around $q$, evaluated at $p$. Geometrically - the gap between the function and its tangent hyperplane at $q$. Strict convexity of $\varphi$ guarantees $D_\varphi \geq 0$, with equality if and only if $p = q$.
**Key property**: $D_\varphi$ is asymmetric. $D_\varphi(p \| q) \neq D_\varphi(q \| p)$ in general. This is not a defect - it follows from the geometry of tangent planes built at different base points.
Different choices of $\varphi$ generate different divergences. Squared Euclidean distance comes from $\varphi(x) = \|x\|^2$: then $D_\varphi(p,q) = \|p - q\|^2$. Symmetric - because the Hessian of $\|\cdot\|^2$ is constant. For other $\varphi$, symmetry is absent.
The Bregman divergence $D_\varphi(p \| q) = 0$ if and only if...
KL as a Special Case of Bregman
Choose $\varphi(p) = \sum_i p_i \log p_i$ - negative entropy (negentropy). Substitute into the Bregman formula and simplify (unit terms cancel via normalization $\sum p_i = \sum q_i = 1$):
KL divergence is Bregman with $\varphi = $ negentropy. That is all. No separate mysticism. The asymmetry of KL is the same asymmetry of tangent planes present in every Bregman divergence.
**Why this matters for ML**. VAE trains to minimize ELBO = reconstruction - $D_{KL}(q \| p)$. In PPO the policy step constraint is $D_{KL}(\pi_{new} \| \pi_{old}) \leq \delta$. In both cases - Bregman projection with $\varphi = $ negentropy. One geometry.
Which generating function $\varphi$ yields KL divergence as a Bregman divergence?
Mirror Descent: Steps as Bregman Projections
An ordinary gradient step is a Euclidean projection onto the constraint set. Replace squared Euclidean distance with an arbitrary Bregman divergence $D_\varphi$ - and the result is mirror descent:
With $\varphi = $ negentropy and $\mathcal{C} = $ probability simplex, the closed-form solution is the multiplicative Hedge step:
This is Hedge / Multiplicative Weights - the algorithm behind expert learning, online optimization, and AdaBoost. One algorithm. One geometry. The name 'mirror descent' refers to dual coordinates: the step is taken in the space of $\nabla\varphi$, then mapped back.
**Natural gradient as a special case**. In exponential families the Fisher metric is the Hessian of $\varphi$ (the log-partition function). Amari's natural gradient step is mirror descent with the Bregman divergence generated by the log-partition function. TRPO, Natural Policy Gradient, K-FAC - all use this geometry.
KL(p||q) is just a measure of difference between distributions, like a distance
KL is an asymmetric Bregman divergence. Direction matters: forward KL and reverse KL are geometrically distinct operations
Forward KL $D_{KL}(p \| q)$ (mode-covering) appears in variational inference as the ELBO term. Reverse KL $D_{KL}(q \| p)$ (mode-seeking) appears in other variational schemes. Both are Bregman with the same $\varphi$ but with swapped arguments - projections in opposite directions.
In mirror descent with $\varphi = $ negentropy: $p_{t+1,i} \propto p_{t,i} \cdot e^{-\eta g_i}$. What happens as $\eta \to 0$?
Key ideas
- **Bregman divergence** $D_\varphi(p\|q) = \varphi(p) - \varphi(q) - \langle\nabla\varphi(q), p-q\rangle$ - gap between function and tangent hyperplane
- **KL = Bregman** with $\varphi = $ negentropy $\sum p_i \log p_i$. Asymmetry of KL is a consequence of tangent plane geometry
- **Mirror descent**: replace Euclidean projection with Bregman projection. With $\varphi = $ negentropy - multiplicative Hedge step
- **Natural gradient**: Hessian $\nabla^2\varphi$ in exp-family is the Fisher metric. Natural gradient step = mirror descent
- **PPO, VAE, Sinkhorn** - all three use KL as Bregman to constrain or regularize the optimization step
What comes next
Bregman structure runs through the entire information-geometric landscape:
- Dual flat structure — Dual coordinates in exp-family are Bregman duality
- Mirror descent — Detailed treatment of the algorithm as Bregman projection
- Natural gradient — Gradient descent in Bregman metric on the exp-family manifold
Вопросы для размышления
- Why is squared Euclidean distance a special case of Bregman divergence? Which $\varphi$ generates it and why is the result symmetric?
- PPO constrains the policy step via KL. What does this constraint mean geometrically in terms of Bregman projection?
- Forward KL and reverse KL are minimized by different algorithms and yield different results. How does the direction of Bregman projection explain this?
- Mirror descent with $\varphi = \|\cdot\|^2$ is ordinary gradient descent. What changes algorithmically when $\varphi$ is replaced by negentropy?
Связанные уроки
- it-03 — KL divergence definition and core properties
- ig-07-natural-gradient — Natural gradient is Bregman in exp-family
- ig-09-mirror-descent — Mirror descent is the algorithmic consequence
- cvx-07 — Proximal methods: same geometry of projections
- ot-03-wasserstein — Wasserstein - alternative divergence outside Bregman family
- stat-01-sampling