Information Geometry

Mirror Descent: Optimization in Dual Space

Standard gradient descent treats the probability simplex like a piece of $\mathbb{R}^n$ - takes a step, gets negative probabilities, then projects back. The projection discards all information about simplex geometry. Nemirovski and Yudin in 1983 proposed a different approach: work in the geometry of the problem from the start, using a mirror function that reflects the structure of the feasible set. The consequences are surprising: Hedge (optimal algorithm for N experts), AdaBoost (via regret bounds) and Adagrad (adaptive step in LLM training) are all Mirror descent with different mirrors.

  • **Adagrad and Adam in LLMs:** training GPT-4, LLaMA is Online Mirror Descent with an adaptive Bregman metric. Adagrad is exact Mirror Descent. Adam is approximate, with counterexamples showing divergence (Reddi et al. 2018). AMSGrad fixes this through monotone $\hat{v}_t$
  • **Natural Policy Gradient in RL:** PPO and TRPO (OpenAI) use natural gradient - Mirror Descent with the Fisher metric as mirror. The KL constraint between old and new policy is a Bregman projection, not Euclidean. This explains PPO's stability relative to step size
  • **Hedge in online advertising:** budget allocation across ad channels in real time is a classic N-experts problem. Google and Meta use Multiplicative Weights variants with regret guarantees for auto-bidding systems
  • **Sinkhorn in optimal transport:** the Sinkhorn algorithm for OT distances (used in WGAN, domain adaptation) is Mirror descent with KL-Bregman on the bistochastic manifold. Each row and column normalization is one Mirror descent step

Предварительные знания

  • Bregman divergence as a generalization of squared distance
  • Convex functions and their subdifferentials
  • KL divergence as a Bregman divergence when psi = negative entropy
  • KL as Bregman divergence
  • Natural gradient
  • e- and m-projections

Mirror map and the Mirror Descent algorithm

Why standard gradient descent breaks on the probability simplex

Standard gradient descent: $x_{t+1} = x_t - \eta \nabla f(x_t)$. On the probability simplex $\Delta_n = \{x \geq 0, \sum x_i = 1\}$ this step immediately leaves the feasible set - components go negative. Projecting back onto $\Delta_n$ costs $O(n \log n)$ and ignores the geometry of the simplex: Euclidean distance penalizes a shift near the boundary and near the center equally, even though geometrically these are very different.

**Mirror descent** (Nemirovski, Yudin 1983) resolves this at the root: instead of a step in the primal space $X$, the step is taken in the **dual space** $X^*$ via the gradient of a mirror function $\nabla \psi$ that reflects the geometry of the problem:

Here $\tilde{x}_{t+1} = (\nabla \psi)^{-1}(\nabla \psi(x_t) - \eta \nabla f(x_t))$ is an unconstrained step in dual space, and the second step is a Bregman projection back onto $X$. The **Bregman divergence** $D_\psi$ is:

**Two key Mirror map instances:** - $\psi(x) = \frac{1}{2}\|x\|^2$ (quadratic) $\Rightarrow$ $D_\psi(x,y) = \frac{1}{2}\|x-y\|^2$, $\nabla \psi = \mathrm{Id}$. Mirror descent = standard GD. - $\psi(x) = \sum_i x_i \log x_i$ (negative entropy on $\Delta_n$) $\Rightarrow$ $D_\psi(x,y) = \mathrm{KL}(x \| y)$. Mirror descent = **Exponentiated Gradient**: $x_{t+1}(i) \propto x_t(i) \cdot \exp(-\eta \nabla f(x_t)_i)$. The simplex is preserved automatically.

Requirements on $\psi$: strictly convex, differentiable, $\nabla \psi: X \to X^*$ is a diffeomorphism. This guarantees the dual-space step is uniquely invertible.

Exponentiated Gradient (Mirror descent with $\psi = \sum x_i \log x_i$) updates weights as $x_{t+1}(i) \propto x_t(i) \cdot \exp(-\eta g_t(i))$. Why does the simplex $\sum x_i = 1$, $x_i \geq 0$ stay feasible automatically - with no separate projection?

Hedge and Multiplicative Weights: online learning on the simplex

N experts, T rounds: how to compete with the best?

Online learning setup: at each round $t$ a distribution $w_t$ over $N$ experts is selected, a loss vector $l_t \in [0,1]^N$ is revealed, and the learner incurs loss $\langle w_t, l_t \rangle$. The goal is to minimize **regret** - the gap between cumulative loss and the loss of the single best expert in hindsight:

**Hedge (Freund, Schapire 1997)** is Exponentiated Gradient on the simplex with linear losses $f_t(w) = \langle w, l_t \rangle$, or equivalently Mirror descent with $\psi = \sum w_i \log w_i$:

Regret bound: with $\eta = \sqrt{2 \ln N / T}$, Hedge guarantees $R_T \leq \sqrt{2T \ln N}$. This is **minimax optimal**: no algorithm can guarantee less. Interpretation: over $T$ rounds among $N$ experts, the price of not knowing the best expert in advance is $O(\sqrt{T \ln N})$.

**Boosting as a special case of Hedge.** AdaBoost is Multiplicative Weights where experts are weak classifiers and the loss is per-example error. The regret bound of AdaBoost is the regret bound of Hedge. Similarly, **Follow the Regularized Leader (FTRL)** with entropy regularization is equivalent to Hedge for linear losses.

Hedge regret bound: $R_T \leq \sqrt{2T \ln N}$. What happens to the per-round regret $R_T / T$ as $T \to \infty$?

Natural gradient, Adagrad and Adam as Mirror descent

Unified picture: different Mirror maps - different optimizers

Mirror descent is a family parameterized by the choice of $\psi$. Different $\psi$ produce classical ML optimizers:

**Natural gradient** (Amari 1998) is Mirror descent with $\psi(\theta) = \frac{1}{2} \theta^\top F(\theta) \theta$, where $F$ is the Fisher information matrix. In practice: $\theta_{t+1} = \theta_t - \eta F(\theta_t)^{-1} \nabla \mathcal{L}$. The Bregman divergence here is a local quadratic approximation to the KL divergence between $p_\theta$ and $p_{\theta_0}$.

**Adagrad** (Duchi et al. 2011): $\psi(x) = \frac{1}{2} \sum_i G_{ii} x_i^2$ where $G_{ii} = \sum_{s \leq t} g_{si}^2$ accumulates squared gradients per coordinate. Bregman divergence: $D_\psi(x,y) = \frac{1}{2}(x-y)^\top G (x-y)$. Adaptive metric: sparse coordinates get a large step, frequent coordinates get a small one.

**Adam as approximate Mirror descent.** Adam uses $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ instead of Adagrad's cumulative sum. This is not an exact Mirror descent with a fixed $\psi$ - the metric changes over time. Reddi et al. 2018 (AMSGrad) proved that standard Adam can diverge because $v_t$ is non-monotone. AMSGrad fixes this: $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$ gives a monotonically increasing Bregman metric - a correct Mirror descent with provable regret bounds. In practice Adam and AMSGrad behave similarly, but AMSGrad has sound theory.

Adam is just SGD with momentum and gradient scaling, with no theoretical grounding

Adam is approximate Online Mirror Descent with an adaptive Bregman metric; its divergence is fixed by AMSGrad, which is a proper Mirror Descent with monotone metric

Reddi et al. 2018 constructed a counterexample where Adam diverges due to non-monotone $v_t$. AMSGrad fixes this: $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$ makes the Bregman metric monotonically increasing - sufficient for online convex optimization regret bounds. In practice the difference is small, but the theory matters for understanding when and why Adam fails.

Adagrad divides each coordinate's step by $\sqrt{\sum_{s \leq t} g_s^2}$. Which Bregman divergence does this correspond to?

Key takeaways

  • **Mirror descent** = step in dual space via $\nabla \psi$, then Bregman projection. Formula: $\nabla \psi(x_{t+1}) = \nabla \psi(x_t) - \eta \nabla f(x_t)$
  • **$\psi = \sum x_i \log x_i$** (negative entropy): Bregman = KL, algorithm = Exponentiated Gradient. Simplex preserved automatically through softmax invertibility of $\nabla \psi$
  • **Hedge / Multiplicative Weights**: EG on the simplex with linear losses. Regret bound $O(\sqrt{T \ln N})$ - minimax optimal. AdaBoost = Hedge with weak classifiers as experts
  • **Natural gradient**: Mirror Descent with Fisher-Bregman. PPO and TRPO use KL policy constraint - a Bregman projection, explaining stability to step size
  • **Adagrad = Online Mirror Descent** with $G_t = \mathrm{diag}(\sum g_s^2)$. Adam is approximate Mirror Descent; AMSGrad (monotone $\hat{v}_t$) is the correct version with provable regret bounds

What comes next

Mirror descent bridges information geometry and practical optimizers:

  • Information geometry in deep learning — VAE, diffusion, LLM training through Mirror Descent with different Bregman divergences
  • e/m-projections and EM — M-step of EM is a Bregman projection - the same mechanism as a Mirror step

Вопросы для размышления

  • Hedge guarantees regret $O(\sqrt{T \ln N})$ with optimal $\eta = \sqrt{2 \ln N / T}$, but $T$ is typically unknown in advance. AdaHedge and Squint adaptively choose $\eta$ on the fly and achieve the same bound without knowing $T$. What property of the Mirror map allows this, and how does $\psi$ change with an adaptive $\eta$?
  • Natural gradient = Mirror Descent with the Fisher information matrix. For large models (LLMs) the $d \times d$ Fisher matrix is intractable. K-FAC approximates it block-diagonally. Does this weaken the theoretical guarantees of Mirror Descent, and why does K-FAC still outperform SGD in practice?
  • Adagrad accumulates squared gradients monotonically - the step size decays to zero. Adam uses exponential moving average - the step does not decay. Which one is a proper Mirror Descent, and why does Adam still outperform Adagrad in practice despite theoretical issues with divergence?

Связанные уроки

  • ig-08-info-projection — m-projection with Bregman divergence is exactly a Mirror descent step onto a convex set
  • ig-07-natural-gradient — Natural gradient = Mirror descent with mirror psi(theta) = KL, connection through Fisher metric
  • ig-04-kl-bregman — Bregman divergence is the foundation of Mirror map; KL is the special case when psi = negative entropy
  • ig-10-deep-learning — Adam and Adagrad are approximate Mirror descent with adaptive Bregman metrics
  • stat-01-sampling
Mirror Descent: Optimization in Dual Space

0

1

Sign In