Optimal Transport
Regularized OT: Sinkhorn Algorithm
Цели урока
- Understand how the entropic penalty converts LP to a strictly convex problem with an analytic solution form
- Implement the Sinkhorn algorithm and analyze convergence for different ε values
- Use dual potentials for numerically stable computations
Предварительные знания
- Kantorovich optimal transport problem (LP formulation)
- LP duality theory
- KKT optimality conditions
2013, NeurIPS. One poster with the formula e^{-C/ε} triggered a chain reaction: Sinkhorn appeared in diffusion models, WGAN, flow matching, sequence alignment. A trick known in statistical physics for 70 years turned out to be the key to scalable OT.
- Stable Diffusion 3 uses Flow Matching with Sinkhorn-coupled batches: 512 images per pass, geodesic trajectories instead of random pairings
- Google Brain: embedding alignment for model distillation - Sinkhorn finds optimal teacher/student pairs
- Scanpy (single-cell RNA): optimal transport between time points of cell differentiation - 50,000 cells in seconds
- WGAN-GP in medical image synthesis: Sinkhorn as a differentiable Wasserstein loss for training diagnostic GANs
From telephone networks to diffusion models
Richard Sinkhorn and Paul Knopp proved convergence of alternating matrix normalizations in 1967, with no connection to transport - purely as a result about stochastic matrices. The Schrodinger problem of 1931 was posed by a physicist to describe particle diffusion, with no optimization. Marco Cuturi in 2013 unified them in a single NeurIPS paper. Computational complexity dropped from O(n³) to O(n²/ε²). In 10 years the method penetrated every major generative model architecture.
Entropic Regularization: from O(n³) to O(n²)
2013, ENSAE lab. Marco Cuturi published a NeurIPS paper - and optimal transport stopped being computationally out of reach for machine learning. Before: solving OT for 10,000 points took hours on a server. After: seconds on a GPU. One entropy trick changed everything.
Matrix K = exp(-C/ε) is the key object. As ε decreases to 0, K concentrates near optimal pairs (i,j). As ε grows to infinity, K becomes uniform. The parameter ε controls the balance between accuracy and plan smoothness.
The structure P* = diag(u)·K·diag(v) is not a coincidence. It is precisely the form of a Gibbs matrix from statistical physics. The Sinkhorn problem is: normalize a Gibbs matrix to given marginals. Physics and transport turned out to be the same thing.
What happens to the Sinkhorn optimal plan as ε approaches 0?
As ε decreases to 0, the regularized solution converges to the maximum-entropy plan among all optimal OT plans. As ε grows to infinity, the plan converges to the independent product a⊗b.
The Sinkhorn Algorithm: Alternating Normalizations
Solving the nonlinear system u ⊙ (Kv) = a, v ⊙ (K⊤u) = b looks intimidating. But fix v and solve for u - it is just division. Then fix u and solve for v. Alternate. That is the entire Sinkhorn algorithm.
Log-domain Sinkhorn: at small ε, matrix K contains numerical zeros (underflow). The solution is to work in log-domain: log(u), log(v), log(K). Multiplications become additions; exp/log are applied only to small numbers.
Why does entropic regularization make the OT problem strictly convex?
A linear program has its optimum at a vertex of the feasible polytope - usually several optimal plans. Adding the strictly convex penalty -εH(P) creates a unique minimizer inside the polytope.
Duality and Sinkhorn Potentials
Every optimization problem has a dual. Regularized OT is no exception. The dual variables f, g are called Sinkhorn potentials. Their connection to the primal: u = exp(f/ε), v = exp(g/ε). This is more than a substitution - potentials have the physical meaning of prices.
At ε < 0.01 with large C_{ij}, matrix K = exp(-C/ε) may contain zeros due to float64 underflow. Always check min(K) and switch to log-domain Sinkhorn when results look suspicious.
Sinkhorn in POT and GeomLoss
Python Optimal Transport (POT): ot.sinkhorn(a, b, C, reg=0.1) - one line. GeomLoss (GPU): SamplesLoss('sinkhorn', p=2, blur=0.05) for mini-batches. In Stable Diffusion 3, geodesic Flow Matching transport is implemented via GeomLoss with Sinkhorn for batches of 512+ images.
What are Sinkhorn potentials f, g and how do they relate to the algorithm?
Potentials f, g are dual variables of regularized OT. Interpreted as prices: f_i is the cost of sending unit mass from point i. The connection u = exp(f/ε) allows switching between primal and dual representations.
Where this topic leads
Entropic regularization is the foundation for scalable OT. Without it, WGAN, flow matching, and large-scale distillation are impossible. Next steps: the Schrodinger bridge (ot-13) as a continuous version, and flow matching (ot-25) as an application to generative models.
- Optimal Transport — Related topic
Итоги
- Entropic penalty -εH(P) converts LP to a strictly convex problem with unique solution P* = diag(u)·K·diag(v)
- Sinkhorn algorithm: alternate u = a/(Kv) and v = b/(K⊤u) - each iteration O(n²), linear convergence
- Parameter ε controls accuracy: ε to 0 = exact OT, ε to infinity = independent product a⊗b
- Log-domain computation is mandatory at small ε for numerical stability
Вопросы для размышления
- Why is log-domain Sinkhorn numerically more stable than the direct form at ε < 0.01?
- How should ε be chosen in practice: what factors affect the balance between accuracy and convergence speed?
- How does the Sinkhorn divergence W_ε(µ,ν) - 0.5·W_ε(µ,µ) - 0.5·W_ε(ν,ν) eliminate the regularization bias?
Связанные уроки
- ot-04-sinkhorn — baseline Sinkhorn algorithm is the foundation here
- ot-13-schrodinger — Schrodinger bridge - physical interpretation of the same algorithm
- ot-24-entropic-reg — deeper entropic regularization and Schrodinger bridge
- ot-25-flow-matching — Flow Matching builds on regularized OT coupling