Optimal Transport
Wasserstein distances
The KL divergence between two Gaussians with no overlap equals infinity. The Wasserstein distance $W_2$ equals the distance between their centers. This is exactly why Arjovsky replaced JS with $W_1$ in 2017 - and training that used to collapse within a few iterations suddenly became stable. Four pages. One number changed in the loss. The end of the mode collapse era.
- **WGAN (Arjovsky et al., 2017)**: replacing JS divergence with $W_1$ stabilized GAN training. A Lipschitz critic with gradient penalty approximates the optimal 1-Lipschitz function via Kantorovich-Rubinstein duality.
- **Flow matching (Lipman, 2022)**: FLUX.1 and Stable Diffusion 3 use OT geodesics for straight-line paths in measure space. The network learns to predict constant velocity along these paths - faster than diffusion.
- **Sinkhorn / GeomLoss**: entropic regularization $\varepsilon H(\gamma)$ makes $W_p$ differentiable in $O(n^2 \log n)$. GeomLoss by Feydy runs this on GPU - Wasserstein as a neural network loss.
- **Domain adaptation**: DANN and related methods minimize $W_p$ between feature distributions of source and target domains. Wasserstein-distance as a measure of domain shift.
- **Point cloud comparison**: PointNet++ and 3D reconstruction use Chamfer distance - an approximation of $W_1$ for point clouds in robotics and LIDAR processing.
Предварительные знания
W_p: infimum over transport plans
The KL divergence between $\mathcal{N}(0,1)$ and $\mathcal{N}(100,1)$ equals $D_{KL} = 5000$. The Wasserstein distance $W_2$ between them equals exactly 100 - the distance between their means. These two numbers are not merely different in magnitude. They describe different geometries. And for training generative models, the right geometry is Wasserstein.
**Wasserstein-p distance** - the minimum transport cost over all valid coupling plans: $$W_p(\mu, \nu) = \left(\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times Y} \|x - y\|^p \, d\gamma(x,y)\right)^{1/p}$$ $\Gamma(\mu,\nu)$ is the set of joint measures with marginals $\mu$ and $\nu$. At $p=1$ this gives the earth mover's distance - literally the cost of moving dirt from one landscape to another. At $p=2$ - a quadratic problem connected to Brenier's theorem.
The earth mover's distance intuition: two distributions act as piles of dirt. $W_1$ is the minimum work to reshape one pile into the other. Each unit of mass travels some distance, and the sum of $\text{mass} \times \text{distance}$ over all moves equals $W_1$. This is why it is sometimes called the earth mover's distance in computer vision literature.
In 1D, computing $W_p$ reduces to sorting: if $F_\mu$ and $F_\nu$ are the CDFs, then $W_p^p = \int_0^1 |F_\mu^{-1}(t) - F_\nu^{-1}(t)|^p dt$. This is $O(n \log n)$ - sort both samples, take the average difference. In higher dimensions the problem becomes a full optimization, but in 1D Wasserstein is just order statistics.
**Sliced Wasserstein** - trick for high-dimensional data: project both distributions onto random 1D directions, compute $W_p$ for each projection (via sorting), then average. Approximates the full $W_p$ in $O(Ln\log n)$ where $L$ is the number of projections. GeomLoss by Feydy implements this on GPU.
What value does $W_1$ give between $\delta_0$ (point mass at zero) and $\delta_5$ (point mass at five)?
Wasserstein vs KL: two different geometries
2017. Arjovsky, Chintala, and Bottou publish WGAN. The central argument across four pages: JS divergence (and KL) between $p_{data}$ and $p_g$ is nearly constant everywhere when supports do not overlap. No signal - no gradient - mode collapse. Replace one number in the loss with $W_1$, and training suddenly stabilizes.
**The key distinction**: KL is not a metric. It is asymmetric ($D_{KL}(p\|q) \neq D_{KL}(q\|p)$), violates the triangle inequality, and equals infinity when $\text{supp}(p) \not\subseteq \text{supp}(q)$. Wasserstein $W_p$ is a genuine metric on the space of probability measures with finite $p$-th moment. It metrizes weak convergence: $W_p(\mu_n, \mu) \to 0$ if and only if $\mu_n \Rightarrow \mu$ and $p$-th moments converge.
Why does KL blow up on disjoint supports? $D_{KL}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} dx$. If $p(x) > 0$ at a point where $q(x) = 0$ - the integral diverges. At the start of GAN training, supports are almost certainly disjoint: the generator produces random noise, the data contains structured images. Precisely at this critical moment, KL fails as a training signal. Wasserstein gives finite distance and nonzero gradient always.
An additional advantage: metrizing weak convergence. The sequence $\delta_{1/n}$ converges weakly to $\delta_0$ as $n \to \infty$. $W_1(\delta_{1/n}, \delta_0) = 1/n \to 0$ - correct. $\text{KL}(\delta_{1/n} \| \delta_0) = +\infty$ for any $n$ - useless. For diffusion models, where the noise chain is analyzed as convergence of measures, Wasserstein is the natural language.
Why does replacing JS divergence with $W_1$ in WGAN stabilize training?
W_2 geodesics and McCann interpolation
$W_2$ is not just a distance. It is a metric on an infinite-dimensional space of probability measures, and it has geodesics - shortest paths between two measures. McCann showed in 1997 that the optimal path from $\mu$ to $\nu$ under $W_2$ has an explicit form. This result became the foundation for flow matching and score-based generation in 2022.
**McCann interpolation**: if $T^* = \nabla \phi$ is the optimal Monge map (gradient of a convex function $\phi$, Brenier's theorem), the geodesic from $\mu$ to $\nu$ is parametrized as: $$\mu_t = \left((1-t) \cdot \text{id} + t \cdot \nabla\phi\right)_\# \mu, \quad t \in [0,1]$$ At $t=0$ one recovers $\mu$, at $t=1$ - $\nu$. Each particle $x$ travels in a straight line from $x$ to $T^*(x)$ at constant speed. This is straight-line motion in the space of measures.
Flow matching (Lipman et al., 2022; Liu et al., 2022) is a direct application of McCann geodesics. Instead of learning the score function $\nabla \log p_t$ as in diffusion models, the network learns to predict the velocity field $v_t(x)$ that describes particle motion along the geodesic. With OT paths (optimal transport conditional flow matching) trajectories are straight and velocity is constant - training converges significantly faster. FLUX.1 and Stable Diffusion 3 implement exactly this.
$W_2$ geodesics also connect to score matching. The Otto-Villani theorem describes the space of measures equipped with $W_2$ as a Riemannian manifold. The «dance» between score-based generation and optimal transport is two languages for one geometry. Langevin diffusion steps describe gradient flow of the KL functional in the $W_2$ metric. The Fokker-Planck equation is gradient descent under $W_2$. This is not a coincidence.
Wasserstein is just another way to compute a distance between distributions, not fundamentally different from KL
Wasserstein and KL are different mathematical objects suited for different tasks. KL measures information divergence (no metric axioms, explodes on disjoint supports). $W_p$ is a geometric distance, a true metric that respects the structure of the underlying space.
On disjoint supports KL is useless for optimization: gradient is zero. $W_p$ provides a meaningful signal always. This matters practically at early training stages for generative models when supports are far apart.
McCann interpolation $\mu_t = ((1-t)\cdot\text{id} + t \cdot \nabla\phi)_\# \mu$ defines a geodesic in which space?
Key ideas
- **$W_p(\mu,\nu) = \left(\inf_{\gamma \in \Gamma(\mu,\nu)} \int \|x-y\|^p d\gamma\right)^{1/p}$** - earth mover's distance. A true metric. Finite where KL = $\infty$.
- **KL vs Wasserstein**: KL explodes on disjoint supports and violates metric axioms. $W_p$ always provides a meaningful gradient - this is why WGAN works where classic GANs fail.
- **McCann geodesics**: the path $\mu_t = ((1-t)\cdot\text{id} + t \cdot \nabla\phi)_\#\mu$ is the shortest in $W_2$. Particles move in straight lines. Flow matching trains a network to approximate these paths.
- **Diffusion as OT**: gradient flow of KL in the $W_2$ metric is the Fokker-Planck equation. Score matching and optimal transport are two languages for one geometry.
Related topics
Wasserstein metrics are the core of this topic. Theory and practice branch out from here:
- Sinkhorn algorithm — Entropic regularization makes W_p computable and differentiable
- Wasserstein GAN — Direct application of W_1 via Kantorovich duality as a loss
- Flow matching — W_2 McCann geodesics as training paths in FLUX and SD3
- Kantorovich duality — Full theory of Lipschitz potentials and Brenier's theorem
- KL divergence and mutual information — Information-theoretic alternative - different geometry on measure space
Вопросы для размышления
- WGAN requires the critic to be 1-Lipschitz. Weight clipping constrains weights to $[-c, c]$. Why does this approximate the Lipschitz constraint? What is the fundamental problem with this approach compared to gradient penalty?
- Sinkhorn adds an entropy term $\varepsilon H(\gamma)$ to transport cost. As $\varepsilon \to 0$ one recovers exact Wasserstein. As $\varepsilon \to \infty$ the plan $\gamma$ converges to the independent product $\mu \otimes \nu$. Why is the maximum-entropy plan under fixed marginals exactly the independent product?
- Flow matching with OT paths trains a network to predict constant velocity $v_t(x) = T^*(x) - x$. Diffusion models predict the score $\nabla \log p_t(x)$. Both describe the same geodesic in measure space - but in different languages. How are the score and velocity field related through the continuity equation?
Связанные уроки
- ot-07-wgan — WGAN replaces JS with W_1 - direct application of the metric
- ot-04-sinkhorn — Sinkhorn makes W_p computable and differentiable
- ot-11-flow-matching — Flow matching uses W_2 geodesics as training trajectories
- ot-05-dual — Kantorovich duality - full theory of Lipschitz potentials
- it-03 — KL divergence - information-theoretic alternative to Wasserstein
- calc-01-sequences