Information Geometry

KL as Bregman Divergence

Behind PPO, TRPO, natural gradient, mirror descent, and VAE ELBO stands one geometry - Bregman divergences. KL divergence is not a special formula. It is one element of a large family defined by the choice of convex function. Change the function - get a different optimization algorithm with different geometry.

**PPO / TRPO**: policy step constraint $D_{KL}(\pi_{new} \| \pi_{old}) \leq \delta$ - Bregman projection onto the set of admissible policies
**VAE ELBO**: term $-D_{KL}(q_\phi \| p)$ in the variational lower bound - Bregman distance from approximation to prior
**Natural gradient (Amari)**: Fisher metric is the Hessian of the log-partition function in exp-family, the Hessian of the generating $\varphi$
**Mirror descent / Hedge**: online optimization over the simplex - multiplicative step as Bregman projection with KL
**Sinkhorn (Cuturi 2013)**: entropic regularization in OT - projection onto transport polytope in KL metric

Предварительные знания

Bregman Divergence

Start with a provocation. Behind PPO, TRPO, natural gradient, mirror descent, and VAE ELBO stands one geometry: Bregman divergences. KL divergence is not a special formula with its own physics. It is one element of a large family generated by a convex function.

Let $\varphi: \mathbb{R}^n \to \mathbb{R}$ be strictly convex and differentiable. The Bregman divergence generated by $\varphi$:

Read directly: the difference between the true value $\varphi(p)$ and the linear approximation of $\varphi$ around $q$, evaluated at $p$. Geometrically - the gap between the function and its tangent hyperplane at $q$. Strict convexity of $\varphi$ guarantees $D_\varphi \geq 0$, with equality if and only if $p = q$.

**Key property**: $D_\varphi$ is asymmetric. $D_\varphi(p \| q) \neq D_\varphi(q \| p)$ in general. This is not a defect - it follows from the geometry of tangent planes built at different base points.

The Bregman divergence $D_\varphi(p \| q) = 0$ if and only if...

KL as a Special Case of Bregman

Choose $\varphi(p) = \sum_i p_i \log p_i$ - negative entropy (negentropy). Substitute into the Bregman formula and simplify (unit terms cancel via normalization $\sum p_i = \sum q_i = 1$):

KL divergence is Bregman with $\varphi = $ negentropy. That is all. No separate mysticism. The asymmetry of KL is the same asymmetry of tangent planes present in every Bregman divergence.

**Why this matters for ML**. VAE trains to minimize ELBO = reconstruction - $D_{KL}(q \| p)$. In PPO the policy step constraint is $D_{KL}(\pi_{new} \| \pi_{old}) \leq \delta$. In both cases - Bregman projection with $\varphi = $ negentropy. One geometry.

Which generating function $\varphi$ yields KL divergence as a Bregman divergence?

Mirror Descent: Steps as Bregman Projections

An ordinary gradient step is a Euclidean projection onto the constraint set. Replace squared Euclidean distance with an arbitrary Bregman divergence $D_\varphi$ - and the result is mirror descent:

With $\varphi = $ negentropy and $\mathcal{C} = $ probability simplex, the closed-form solution is the multiplicative Hedge step:

This is Hedge / Multiplicative Weights - the algorithm behind expert learning, online optimization, and AdaBoost. One algorithm. One geometry. The name 'mirror descent' refers to dual coordinates: the step is taken in the space of $\nabla\varphi$, then mapped back.

**Natural gradient as a special case**. In exponential families the Fisher metric is the Hessian of $\varphi$ (the log-partition function). Amari's natural gradient step is mirror descent with the Bregman divergence generated by the log-partition function. TRPO, Natural Policy Gradient, K-FAC - all use this geometry.

KL(p||q) is just a measure of difference between distributions, like a distance

KL is an asymmetric Bregman divergence. Direction matters: forward KL and reverse KL are geometrically distinct operations

Forward KL $D_{KL}(p \| q)$ (mode-covering) appears in variational inference as the ELBO term. Reverse KL $D_{KL}(q \| p)$ (mode-seeking) appears in other variational schemes. Both are Bregman with the same $\varphi$ but with swapped arguments - projections in opposite directions.

In mirror descent with $\varphi = $ negentropy: $p_{t+1,i} \propto p_{t,i} \cdot e^{-\eta g_i}$. What happens as $\eta \to 0$?

Key ideas

**Bregman divergence** $D_\varphi(p\|q) = \varphi(p) - \varphi(q) - \langle\nabla\varphi(q), p-q\rangle$ - gap between function and tangent hyperplane
**KL = Bregman** with $\varphi = $ negentropy $\sum p_i \log p_i$. Asymmetry of KL is a consequence of tangent plane geometry
**Mirror descent**: replace Euclidean projection with Bregman projection. With $\varphi = $ negentropy - multiplicative Hedge step
**Natural gradient**: Hessian $\nabla^2\varphi$ in exp-family is the Fisher metric. Natural gradient step = mirror descent
**PPO, VAE, Sinkhorn** - all three use KL as Bregman to constrain or regularize the optimization step

What comes next

Bregman structure runs through the entire information-geometric landscape:

Dual flat structure — Dual coordinates in exp-family are Bregman duality
Mirror descent — Detailed treatment of the algorithm as Bregman projection
Natural gradient — Gradient descent in Bregman metric on the exp-family manifold

Вопросы для размышления

Why is squared Euclidean distance a special case of Bregman divergence? Which $\varphi$ generates it and why is the result symmetric?
PPO constrains the policy step via KL. What does this constraint mean geometrically in terms of Bregman projection?
Forward KL and reverse KL are minimized by different algorithms and yield different results. How does the direction of Bregman projection explain this?
Mirror descent with $\varphi = \|\cdot\|^2$ is ordinary gradient descent. What changes algorithmically when $\varphi$ is replaced by negentropy?

Связанные уроки

it-03 — KL divergence definition and core properties
ig-07-natural-gradient — Natural gradient is Bregman in exp-family
ig-09-mirror-descent — Mirror descent is the algorithmic consequence
cvx-07 — Proximal methods: same geometry of projections
ot-03-wasserstein — Wasserstein - alternative divergence outside Bregman family
stat-01-sampling