Optimal Transport
The Sinkhorn Algorithm
Classical OT solver: $O(n^3 \log n)$. Sinkhorn: $O(n^2 / \varepsilon^2)$ and fully parallelizable on GPU. Cuturi 2013 made the problem thousands of times faster. One paper - and optimal transport became a practical tool for machine learning, bioinformatics, and computer vision.
- **Wasserstein GAN (Arjovsky 2017)**: Sinkhorn as a differentiable proxy for Wasserstein distance during generative model training
- **Single-cell genomics (OTT, Scanpy)**: matching cells before and after perturbation - Sinkhorn on matrices of size $10^4 \times 10^4$ in seconds on GPU
- **Domain adaptation**: aligning source and target distributions via Sinkhorn plan for transfer learning
- **JAX/OTT-jax**: differentiable GPU-parallel Sinkhorn with automatic gradient computation through marginals
- **Flow matching**: continuous OT via Sinkhorn as initialization of transport pairs for generative models
Предварительные знания
Entropic Regularization: from LP to Matrices
Classical optimal transport is a linear program: $O(n^3 \log n)$ time, $O(n^2)$ memory. At $n = 1000$ that is seconds on CPU. GPU does not help: LP solvers are sequential by nature.
2013. Marco Cuturi. NeurIPS. A poster in the corner of the hall. The idea fits in two lines: add an entropy penalty to the OT objective - and the problem becomes strictly convex and its solution analytically iterable. Ten years later: 3000+ citations.
The original Kantorovich problem: $\min_{\gamma \in U(a,b)} \langle C, \gamma \rangle$ where $U(a,b)$ is the transport polytope. Add entropic regularization:
As $\varepsilon \to 0$ the solution approaches the original OT. As $\varepsilon \to \infty$ the plan degenerates to the independent product $ab^\top$.
**Why this helps**. The regularized problem is strictly convex with a unique solution. The Lagrangian reveals: optimal $\gamma^*$ has the form $\gamma^*_{ij} = u_i \cdot K_{ij} \cdot v_j$, where $K_{ij} = e^{-C_{ij}/\varepsilon}$ is the kernel matrix and $u, v$ are diagonal scaling vectors. Not an LP - a matrix product.
What happens to the optimal transport plan as $\varepsilon \to 0$ in the entropy-regularized problem?
Sinkhorn Iterations: Alternating Normalization
The optimal solution has the form $\gamma^* = \text{diag}(u) \cdot K \cdot \text{diag}(v)$. Vectors $u, v$ must satisfy the marginal constraints. This is a nonlinear system - solved by a remarkably simple iteration:
Each step - two matrix-vector products and elementwise division. No simplex pivots, no special data structures. The entire algorithm is matmul + div in a loop.
**Geometric interpretation**: each Sinkhorn iteration is a Bregman projection onto one marginal constraint. Alternating projection onto two convex sets converges to their intersection - a classical result in convex optimization.
Complexity: $O(n^2 \cdot T)$ where $T$ is iterations. For fixed accuracy $T = O(1/\varepsilon^2 \cdot \log(1/\varepsilon_\text{err}))$. Classical LP: $O(n^3 \log n)$. The gap is orders of magnitude - especially with GPU parallelization.
Each Sinkhorn iteration geometrically represents...
The Epsilon Tradeoff: Accuracy vs Speed
The parameter $\varepsilon$ is the main control knob. Smaller $\varepsilon$ - more accurate OT approximation, but slower convergence and worse numerical stability. Larger $\varepsilon$ - fast and stable, but blurred plan.
| $\varepsilon$ | Plan quality | Iterations to convergence | Numerical stability |
|---|---|---|---|
| 0.001 | close to true OT | 500-1000 | issues (K nearly binary) |
| 0.01 | good approximation | 50-100 | acceptable |
| 0.05 | slightly blurred plan | 20-30 | good |
| 0.5 | heavily blurred | 5-10 | excellent |
In practice small $\varepsilon$ requires a log-domain implementation: store $\log u, \log v$ instead of $u, v$ and operate in log-space to avoid underflow in $K$. JAX and PyTorch implementations (OTT-jax, geomloss) handle this automatically.
**Wasserstein GAN and Sinkhorn**: in WGAN (Arjovsky 2017) the Wasserstein distance is computed via the dual formulation. Practical implementations use Sinkhorn as a differentiable approximation - it provides gradients with respect to $a$ and $b$ via autodiff. Single-cell genomics: matching cells between conditions - Sinkhorn on matrices of size $10^4 \times 10^4$ on GPU in seconds.
Sinkhorn with small epsilon gives exact optimal transport
Sinkhorn always solves the regularized problem. As epsilon decreases the solution approaches OT, but equals it only in the limit epsilon = 0
The optimal plan of the regularized problem always has full support (all $\gamma_{ij} > 0$), whereas the true OT plan is sparse (support $\leq n + m - 1$ for discrete marginals).
Why is a log-domain implementation of Sinkhorn necessary for small values of $\varepsilon$?
Key ideas
- **Entropic regularization**: $\min \langle C, \gamma \rangle - \varepsilon H(\gamma)$ turns LP into a strictly convex problem with an analytically iterable solution
- **Optimal plan**: $\gamma^* = \text{diag}(u) \cdot K \cdot \text{diag}(v)$, $K_{ij} = e^{-C_{ij}/\varepsilon}$ - everything reduces to matrix products
- **Iterations**: $u \leftarrow a / (Kv)$, $v \leftarrow b / (K^\top u)$ - alternating row and column normalization
- **Bregman interpretation**: each iteration is a projection onto a marginal constraint in KL metric
- **Epsilon tradeoff**: small $\varepsilon$ - more accurate but slower and unstable; log-domain implementation solves underflow
- **Complexity**: $O(n^2 T)$ vs $O(n^3 \log n)$ for classical LP - gap of thousands times on GPU
What comes next
Sinkhorn opens the door to practical OT applications:
- Wasserstein GAN — Sinkhorn as differentiable distance criterion during generator training
- Flow matching — Continuous OT via Sinkhorn pairs as the foundation of modern generative models
- Domain adaptation via OT — Using Sinkhorn to align distributions in transfer learning
Вопросы для размышления
- Why does entropic regularization make the OT problem strictly convex, whereas the original LP formulation may have infinitely many optimal solutions?
- Sinkhorn performs alternating Bregman projections onto two marginal constraints. What guarantees convergence of this procedure?
- As $\varepsilon \to 0$ the regularized plan approaches the OT plan but always has full support. How does this affect applications where a sparse transport plan is required?
- In JAX one can differentiate through Sinkhorn iterations with respect to marginals $a$, $b$, and the cost matrix $C$. How is this used in flow matching?
Связанные уроки
- ot-03-wasserstein — Basic OT formulation without regularization
- ot-07-wgan — WGAN uses Sinkhorn as a differentiable proxy
- ot-11-flow-matching — Flow matching relies on efficient OT via Sinkhorn
- ig-04-kl-bregman — Sinkhorn is a Bregman projection with KL divergence
- it-03 — KL divergence as the entropic regularizer
- calc-01-sequences