Optimal Transport

Monge's problem: moving a mountain with minimum work

1781: French mathematician Gaspard Monge poses the problem - how to move earth during fortification construction with minimum work? The problem had no general solution for 200 years. Kantorovich solved it in 1942 and received a Nobel Prize. Today the same equation is the backbone of Wasserstein GAN and Stable Diffusion 3.

WGAN (2017) - first reliably trainable GAN by replacing KL with Wasserstein distance
Flow Matching - backbone of Stable Diffusion 3, Meta MovieGen: 5-10x faster than diffusion
Domain adaptation: OT-DA transfers medical models across hospitals without relabeling
Sinkhorn in CLIP, DINOv2: differentiable soft-assignment between image and text embeddings

Предварительные знания

Functions of Several Variables

Monge's problem: transport map T and total cost

**1781, the memoir 'Sur la théorie des déblais et des remblais'.** Military engineer Gaspard Monge works on a practical problem: the army builds fortifications, earth is dug from quarries (déblais) and piled into ramparts (remblais). Volumes match. The question: which transport plan minimizes total work? Monge frames this as the search for a map T: quarries → ramparts that minimizes the cost integral. 240 years later, this equation runs on GPU clusters to turn photos into Van Gogh paintings.

Monge's problem in one line: given two mass distributions μ (source) and ν (target) of equal total mass. Find a map T: X → Y that moves μ to ν such that total cost **∫ c(x, T(x)) dμ(x)** is minimal. Formally: inf_T ∫ c(x, T(x)) dμ subject to T#μ = ν.

Three key words of the problem. **Distributions**, not points: the problem is about mass, not coordinates. **Map T**, not correlation: each source point is assigned exactly one target point - mass cannot split. **Cost c(x, y)**: the price of hauling a unit mass from x to y, typically ||x - y||² or ||x - y||. The condition T#μ = ν is called the pushforward - mass conservation.

What does the condition T#μ = ν mean in Monge's problem?

Kantorovich relaxation: from map T to plan π

Monge's problem in pure form sometimes has no solution. Simplest counterexample: μ is a point mass at 0, ν is two point masses of 1/2 at -1 and +1. Any map T must send the point 0 somewhere single - either -1 or +1. The pushforward gives a point mass, not the desired two-point distribution. No solution exists. The reason: a map T is a rigid structure, mass cannot split.

**Kantorovich relaxation (1942)**: instead of a map T, seek a joint distribution π on X × Y with marginals μ and ν. Problem: min_π ∫∫ c(x,y) dπ(x,y) subject to marginals matching μ and ν. This is already a **linear program** - always solvable. A map T corresponds to π supported on the graph of T. Kantorovich allows mass to 'spread out'.

For smooth distributions and quadratic cost, **Brenier's theorem** guarantees: the optimal Kantorovich plan is supported on the graph of some map T. Monge and Kantorovich give the same answer - but Kantorovich is always defined. This is why all modern computational OT is built on the Kantorovich formulation.

Why did Kantorovich win a Nobel Prize for relaxing Monge's problem rather than just solving an optimization problem?

Wasserstein distance: metric on the space of distributions

The minimum transport cost from the Kantorovich problem with cost c(x,y) = ||x-y||^p gives the **Wasserstein distance**: W_p(μ, ν) = (inf_π ∫∫ ||x-y||^p dπ(x,y))^(1/p). This is a proper metric on the space of probability measures: symmetric, satisfies the triangle inequality, equals zero if and only if μ = ν.

**Why Wasserstein beats KL for ML**: KL(μ || ν) = +∞ if the support of μ is not contained in the support of ν. In early GAN training, the generator and real distributions barely overlap - KL gives infinity or NaN, gradient vanishes. Wasserstein is always defined and provides a smooth nonzero gradient even for disjoint distributions. That was the key breakthrough of WGAN (Arjovsky 2017).

**OT in production ML**: WGAN (Arjovsky 2017) - stable generative model loss via Wasserstein-1. Flow Matching (Lipman 2022) - the backbone of Stable Diffusion 3 and Meta MovieGen, trains 5-10x faster than classical diffusion. Domain adaptation via OT-DA (Courty-Flamary) - transferring medical models across hospitals, sensors, production lines. Anywhere two distributions need to be compared or aligned: first question should be whether Wasserstein is needed.

KL-divergence between two distributions with disjoint supports equals +∞. How does this affect GAN training?

Summary

Monge's problem: inf_T ∫ c(x, T(x)) dμ subject to T#μ = ν - minimizing transport cost from μ to ν via map T
T#μ = ν (pushforward): if X ~ μ then T(X) ~ ν. This is the 'correctness' condition for a WGAN generator
Monge breaks down: discrete distributions where mass cannot split. A point mass cannot map to a sum of two point masses
Kantorovich relaxation: plan π instead of map T. A linear program - always solvable
Wasserstein W_p: metric on the space of measures. Defined even for disjoint supports - unlike KL
Cost c(x,y) is part of the problem statement: quadratic → Brenier map, Euclidean → Wasserstein-1, perceptual → OT in feature space

Where to go next

Monge posed the question. Next come the solution tools and applications.

Kantorovich relaxation — LP problem, always solvable. Foundation for every computational OT algorithm
Sinkhorn algorithm — Entropic regularization: OT in O(n²) iterations, differentiable end-to-end
Brenier's theorem — For quadratic cost, the optimal T exists, is unique, and is the gradient of a convex function

Вопросы для размышления

Which tasks in the team compare two distributions? What divergence is used, and would Wasserstein be more appropriate?
If loss function choice in a generative model were framed as 'which cost c do I want to minimize', what would change in the architecture?
Which production tasks (drift detection, fairness, A/B) are formally questions of 'how far is μ from ν'?

Связанные уроки

calc-26-divergence-theorem