Optimal Transport

Kantorovich Relaxation and the Wasserstein Distance

1942. Leningrad under siege. Kantorovich solves a resource allocation problem - and in doing so invents the tool that fixes GANs 75 years later. Wasserstein GAN, in four pages, resolved the central failure of generative training - mode collapse - by swapping KL divergence for one mathematical primitive. The shock number: KL divergence between two point masses at different locations is infinite. Wasserstein distance between them is simply the distance between the points. Not a technicality. A different geometry of the world.

  • **WGAN (Arjovsky, 2017)**: replacing JS divergence with $W_1$ eliminated mode collapse in GAN training. Stable Diffusion, DALL-E - descendants of this work.
  • **Flow matching (Lipman, 2022)**: FLUX, Stable Diffusion 3 use OT-derived paths from Wasserstein geometry to train generative flows directly without diffusion steps.
  • **Domain adaptation**: Wasserstein measures distribution shift between source and target domains. DANN and related methods minimize $W_p$ to align feature spaces.
  • **Point clouds**: comparing 3D objects in PointNet++ and NeRF via Chamfer distance - a Wasserstein approximation for point clouds. Used in robotics and LIDAR.
  • **Sinkhorn in deep learning**: GeomLoss (Feydy) implements differentiable $W_p$ on GPU via entropy regularization - Wasserstein loss as a drop-in loss function for neural networks.

Предварительные знания

  • Monge's problem: moving a mountain with minimum work

From maps to plans: the coupling idea

1942. Leningrad under siege. Leonid Kantorovich, a 30-year-old mathematician, works on resource allocation under scarcity. Monge in 1781 posed the problem: find a map $T$ that sends each unit of mass to exactly one destination. Rigid. Deterministic. Kantorovich asks: what if splitting is allowed?

Here is where Monge breaks. Two factories $\{A, B\}$ with unit supply, two warehouses $\{C, D\}$ with unit demand. A Monge map must send $A \to C$ or $A \to D$, and $B$ to the other. But what if the optimal solution ships half of $A$ to $C$ and half to $D$? A map cannot do this - it is a function, not a measure.

**Kantorovich's conceptual shift**: instead of a transport map $T: X \to Y$, seek a **transport plan** - a joint probability measure $\gamma$ on $X \times Y$ whose marginals match the given measures: $\gamma(A \times Y) = \mu(A)$ and $\gamma(X \times B) = \nu(B)$. The set of all such measures is denoted $\Gamma(\mu, \nu)$.

A transport plan $\gamma(x, y)$ is a flow matrix: $\gamma(x, y) \, dx \, dy$ encodes how much mass from $x$ goes to $y$. In the discrete case it is simply a matrix $P_{ij} \geq 0$ with row sums $\mu_i$ and column sums $\nu_j$. These are doubly stochastic matrices (up to normalization). In linear programming this is a polytope - a compact convex set. A minimum over a compact set is always attained.

A key structural fact: every Monge map $T$ induces a Kantorovich plan $\gamma_T = (id, T)_\# \mu$ - a deterministic plan with all mass concentrated on the graph of $T$. But not every plan arises from a map. Kantorovich strictly generalizes Monge: $\min_{\text{Monge}} \geq \min_{\text{Kantorovich}}$.

Why does Monge's problem have no solution for two point masses $\mu = \delta_0$ and $\nu = \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$?

The Wasserstein distance: geometry of measures

KL divergence between $\delta_0$ and $\delta_1$ is infinite. The Wasserstein distance between them is 1 - simply the distance between the points. This is not a technicality. It is a fundamental difference in what "closeness" of distributions means.

**Wasserstein-p distance** ($W_p$): the minimum transport cost over all admissible plans: $$W_p(\mu, \nu) = \left(\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times Y} c(x,y)^p \, d\gamma(x,y)\right)^{1/p}$$ where $c(x,y)$ is the unit transport cost (typically $|x - y|$ or $|x - y|^2$). At $p=1$ this is a linear program. At $p=2$ it connects tightly with Brenier's theorem on optimal maps.

Why call it a distance? The three metric axioms hold: $W_p(\mu, \nu) = 0 \Leftrightarrow \mu = \nu$, symmetry, and the triangle inequality. Moreover, $W_p$ metrizes weak convergence of measures - precisely the convergence notion needed in probability theory and generative modeling.

What is the advantage over KL? For two Gaussians $\mathcal{N}(0,1)$ and $\mathcal{N}(100, 1)$, KL = 5000, while $W_2 \approx 100$ - just the distance between centers. When distributions have non-overlapping support (the typical situation during GAN training), KL is useless as a training signal: its gradient is zero almost everywhere. Wasserstein yields a meaningful gradient even for well-separated distributions.

**Sliced Wasserstein**: project both distributions onto random 1D directions, compute $W_p$ for each projection (in 1D this reduces to sorting - $O(n \log n)$), then average. The result approximates $W_p$ in the original space at cost $O(Ln \log n)$, where $L$ is the number of projections. GeomLoss (PyTorch) implements this in a few lines.

Why is the Wasserstein distance preferable to KL divergence as a training loss for generative models when $p_{data}$ and $p_{model}$ have non-overlapping support?

Kantorovich duality: prices as potentials

Kantorovich did not merely relax Monge's problem - he found its dual. The primal problem: find the optimal transport plan. The dual problem: find optimal price potentials. Both optima coincide - strong duality of linear programming.

**Kantorovich duality** for $W_1$ (cost = distance): $$W_1(\mu, \nu) = \sup_{f: \text{1-Lip}} \left( \int f \, d\mu - \int f \, d\nu \right) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)]$$ where the supremum is over all 1-Lipschitz functions ($|f(x) - f(y)| \leq |x - y|$). This is the Kantorovich-Rubinstein theorem from 1958.

The economic interpretation of the dual is elegant. $f(x)$ is the price of a good at location $x$. The 1-Lipschitz condition means prices cannot grow faster than the distance between locations (otherwise it would be profitable to transport the good directly). The optimal $f$ represents arbitrage-free prices. The difference $\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]$ is the maximum resale profit - and it equals the cost of optimal transport.

This dual formulation is the mathematical engine of Wasserstein GAN. In WGAN (Arjovsky et al., 2017), the critic $D_\theta$ is trained to approximate the optimal 1-Lipschitz function $f^*$. Gradient penalty (or weight clipping) enforces the Lipschitz constraint. The generator minimizes $\mathbb{E}_{x \sim p_g}[D_\theta(x)] - \mathbb{E}_{x \sim p_{data}}[D_\theta(x)]$ - literally approximating $W_1(p_g, p_{data})$. Four pages of mathematics. Mode collapse largely disappears.

The dual formulation also opens the door to **differentiable OT**. The Sinkhorn algorithm regularizes the primal problem with an entropy term $\varepsilon H(\gamma)$, yielding a smooth approximation to $W_p$ with gradients through the distribution parameters. GeomLoss (Feydy et al.) implements this on GPU, making Wasserstein loss practical in deep learning pipelines.

The Wasserstein distance is just a more complex version of KL divergence for the same tasks

Wasserstein and KL measure fundamentally different things. KL is an information-theoretic divergence based on density ratios. Wasserstein is a geometric distance that respects the metric structure of the space. On non-overlapping supports, KL = infinity; Wasserstein = the distance between supports.

This distinction is critical for training generative models, domain adaptation, and any setting where distributions with non-overlapping support need to be compared.

In WGAN, the critic is trained with a 1-Lipschitz constraint. According to Kantorovich duality, what does it approximate?

Key ideas

  • **Monge relaxation**: instead of a map $T$, seek a plan $\gamma \in \Gamma(\mu,\nu)$ - a joint measure with fixed marginals. The minimum is always attained (compact polytope).
  • **$W_p(\mu,\nu) = \left(\inf_{\gamma \in \Gamma} \int c^p \, d\gamma\right)^{1/p}$** - the Wasserstein distance. Metrizes weak convergence. Finite where KL = $\infty$.
  • **Kantorovich duality**: $W_1 = \sup_{f: \text{1-Lip}} \mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]$. This formula is exactly the WGAN critic objective.
  • **Nobel Prize in Economics 1975**: Kantorovich (the only mathematician) for linear programming and resource allocation - the same mathematical apparatus.

Related topics

Kantorovich's relaxation is the foundation. Much is built on top:

  • Kantorovich duality — Full theory of potentials and Brenier's theorem
  • Sinkhorn algorithm — Entropy regularization makes Wasserstein differentiable
  • Wasserstein GAN — Direct application of $W_1$ as generative model training loss
  • KL divergence and cross-entropy — Alternative metric on distributions - comparison of approaches

Вопросы для размышления

  • Strong LP duality states that the primal minimum equals the dual maximum. What does this mean in economic terms: when do the shipper and the price-setting middleman agree on the same value?
  • Sinkhorn adds $\varepsilon H(\gamma)$ to the transport cost. As $\varepsilon \to 0$, the exact Wasserstein is recovered. As $\varepsilon \to \infty$, something else emerges. What distribution $\gamma$ maximizes entropy subject to fixed marginals?
  • WGAN requires the critic to be 1-Lipschitz. Weight clipping enforces this by bounding all weights in $[-c, c]$. Why does this approximate the Lipschitz constraint? What is the fundamental problem with this approach, and why does gradient penalty address it better?

Связанные уроки

  • ot-05-dual — Kantorovich duality: Lipschitz functions as price potentials
  • ot-04-sinkhorn — Sinkhorn regularizes Wasserstein, making it differentiable
  • ot-07-wgan — WGAN uses W1 as training loss instead of JS divergence
  • it-03 — KL divergence vs Wasserstein: two fundamentally different ways to measure closeness of measures
  • ot-01-monge — Monge's problem - the rigid version that Kantorovich generalizes
  • calc-01-sequences
Kantorovich Relaxation and the Wasserstein Distance

0

1

Sign In