Optimal Transport

Discrete OT and the MUSE Algorithm

Цели урока

  • Formulate discrete OT as a linear program and understand the sparsity of extreme point solutions
  • Implement the MUSE alternating OT-Procrustes algorithm for unsupervised cross-lingual alignment
  • Apply discrete OT to visual tasks: color transfer, point cloud matching, and histogram equalization

Предварительные знания

  • Linear programming: feasible sets, extreme points, simplex algorithm
  • Empirical measures and finite support probability distributions
  • Sinkhorn algorithm for approximate OT (for speed-critical applications)
  • SVD and Procrustes problem basics
  • Monge-Kantorovich Problem
  • Sinkhorn Algorithm

Every comparison of two datasets, every style transfer, every cross-lingual NLP model ultimately asks: how far apart are these two point clouds, and how do the points correspond? Discrete optimal transport answers both questions simultaneously - and the answer is used in production at Facebook, Google, and every major NLP lab.

  • MUSE (Facebook AI 2018) uses discrete OT to build bilingual dictionaries from monolingual corpora - no parallel text required
  • Color transfer in Adobe Photoshop and Lightroom uses histogram matching, which is 1D OT in each color channel
  • Single-cell RNA sequencing analysis uses OT to track cell differentiation trajectories over time (Waddington OT)
  • 3D printing quality control: OT distance between designed and scanned point clouds measures manufacturing deviation

From Hitchcock's Transportation Problem to Neural Alignment

The discrete transportation problem was formulated by Frank Hitchcock in 1941 and solved by Dantzig's simplex method in the late 1940s. For decades it remained a logistics problem - shipping goods between warehouses and factories. The NLP revolution came in 2018 when Conneau and Lample realized that word embedding spaces in different languages share the same geometric structure. MUSE uses discrete OT to find the rotation aligning these spaces, achieving state-of-the-art translation without any parallel data. This catalyzed a wave of applications: sentence embedding alignment, code mixing detection, and zero-shot cross-lingual transfer.

Discrete OT as Linear Programming

When source and target distributions are empirical measures supported on finite point clouds, optimal transport reduces to a linear program over a doubly-stochastic matrix. This gives exact solvers with polynomial complexity and a complete combinatorial structure.

The feasible set U(a,b) has at most m+n-1 non-zero entries at LP extreme points (basic feasible solutions). The optimal transport plan is always sparse when solved exactly.

How many non-zero entries does an exact discrete OT solution have at most?

The transport polytope U(a,b) has dimension m*n but only m+n-1 active constraints at a basic feasible solution (LP vertex). The optimal plan is always sparse.

MUSE: Multi-Scale OT for Word Embeddings

MUSE (Multilingual Unsupervised and Supervised Embeddings, Facebook AI 2018) uses discrete OT to align word embedding spaces across languages without any bilingual dictionary. The key insight is that language-agnostic semantic structure can be recovered by finding a rotation that minimizes Wasserstein distance between frequency-weighted word distributions.

MUSE uses the 50k most frequent words weighted by frequency rank. Using all words or uniform weights degrades alignment quality because rare words add noise to the transport plan.

In MUSE's alternating optimization, what problem is solved in the W-update step?

With T fixed, the problem min_{W in O(d)} ||XW - TY||_F is the orthogonal Procrustes problem, solved in closed form via SVD of X^T T Y.

Color Transfer and Point Cloud Matching

Discrete OT is the workhorse of computational visual tasks. Color palette transfer between images, 3D point cloud registration, and histogram matching all reduce to discrete OT problems with different cost matrices.

For point cloud registration (ICP vs. OT-ICP), OT-based alignment is more robust to outliers because the transport plan distributes mass smoothly rather than committing to hard nearest-neighbor assignments.

Why is sliced Wasserstein (SW2) preferred over exact W2 for color transfer on full-resolution images?

Full-resolution images have millions of pixels. Exact W2 is O(n^3 log n) which is infeasible. Sliced W2 projects to 1D where OT has closed form (quantile matching), making the total cost O(n log n) per projection, O(K n log n) for K projections.

The Computational Workhorse of Distribution Comparison

Discrete OT unifies a spectrum of ML tasks under one framework. Any problem that compares two point clouds - cross-lingual alignment, style transfer, domain adaptation, single-cell biology - can be cast as a discrete OT problem. The LP structure gives exact solutions for small n; Sinkhorn and sliced variants scale to millions of points. Understanding discrete OT is understanding how modern ML moves information between distributions.

  • Optimal Transport — Related topic

Итоги

  • Discrete OT is a linear program over transport matrices T in U(a,b); exact solutions are sparse with at most m+n-1 non-zeros
  • Network simplex solves discrete OT exactly in O(n^3 log n); Sinkhorn approximates it in O(n^2/epsilon^2)
  • MUSE uses alternating Sinkhorn (for T) and Procrustes/SVD (for W) to align word embeddings across languages without bilingual supervision
  • Color transfer, point cloud registration, and histogram matching are all special cases of discrete OT with different cost matrices

Вопросы для размышления

  • MUSE uses orthogonal rotations W in O(d) to align embedding spaces. What property of word embeddings justifies the orthogonality constraint, and when might it fail?
  • The sparsity of the exact OT solution (at most m+n-1 non-zeros) implies hard assignments for most points. In MUSE, is hard assignment desirable or does the soft Sinkhorn plan work better? Why?
  • For color transfer, using all pixels as support points makes the LP infeasible to solve exactly. Subsampling to n_colors=500 introduces approximation error. How would one choose n_colors to balance speed and visual quality?

Связанные уроки

  • ot-01-monge — Discrete OT is the finite-dimensional instance of the Monge-Kantorovich problem
  • ot-04-sinkhorn — Sinkhorn is the fastest solver for discrete OT with entropic regularization
  • ot-09-barycenters — Discrete Wasserstein barycenters require solving multiple discrete OT subproblems
  • ot-21 — Entropic regularization converts discrete OT to the Sinkhorn iteration studied here
  • ot-02-kantorovich
  • ot-12-comp-ot
  • ot-08-domain-adaptation
Discrete OT and the MUSE Algorithm

0

1

Sign In