Optimal Transport

Gromov-Wasserstein: OT Between Different Spaces

How do you align word2vec spaces of Russian and English without a translation dictionary? Facebook MUSE (2018) did it with OT + GW: Wasserstein aligned structures and Procrustes found the mapping. This produced an EN-RU dictionary from scratch via the geometry of language spaces. GW is the tool for problems with no shared metric.

  • **Facebook MUSE:** cross-lingual alignment of 30+ languages via GW/OT without bilingual corpora. Open source, 4000+ stars on GitHub.
  • **Word Mover Distance:** semantic search in Elasticsearch, document similarity in production. Slower than tf-idf but much more accurate for semantic retrieval.
  • **Grakel / PyTorch Geometric:** GW for graph kernels, molecular fingerprinting, protein structure alignment in drug discovery.

Gromov-Wasserstein: Match Structures, Not Points

MUSE (Facebook AI, 2018) aligned embeddings of 30 languages without parallel corpora using Gromov-Wasserstein distance. The vector 'dog' (en) is matched with the vector for 'dog' (ru) without a shared space: pairwise-distance structures are aligned. The same idea is used in AlphaFold for comparing 3D protein structures.

Classical OT: match points from X and Y via a cost C(x, y). Requires X and Y to live in the same space, or an explicit cross-space cost. GW: match structures via intra-space distances. Cost: how much distances are distorted by the mapping.

**Gromov-Wasserstein distance:** GW(mu, nu) = min over P in Pi(mu, nu) of sum_{ijkl} L(d_X(x_i, x_k), d_Y(y_j, y_l)) * P_{ij} * P_{kl} where L(a, b) = (a - b)^2 (standard choice). d_X is the metric in X, d_Y is the metric in Y. **Meaning:** P is 'good' if when mapping i -> j and k -> l it preserves d_X(x_i, x_k) approximately d_Y(y_j, y_l). **Non-linear problem:** unlike classical OT, this is quadratic in P. No guarantee of a global minimum. **Fused GW (FGW):** lambda * GW + (1 - lambda) * OT combines structural and content-based comparison.

Why is the Gromov-Wasserstein problem non-linear, while classical OT is linear in P?

In classical OT the objective sum_{ij} C_{ij} P_{ij} is linear in P. In GW the term P_{ij} P_{kl} appears - a product of two plan entries - giving a quadratic problem. Frank-Wolfe linearizes it iteratively.

Word Mover Distance and Aligning Language Spaces

**Word Mover Distance (Kusner 2015):** distance between documents as OT between word distributions with cost = L2 in word2vec/GloVe space. It outperformed tf-idf and LDA on text-classification tasks without training. Used in semantic search and document retrieval.

**Aligning language spaces (MUSE, Facebook 2018):** Given: word embeddings X (English, n x d) and Y (Russian, m x d), trained separately. Goal: find an orthogonal mapping W with W * X approximately Y for translation pairs. MUSE algorithm: 1. OT finds a soft alignment between the most frequent words. 2. Procrustes solution: W = argmin over W^T W = I of ||W X_aligned - Y_aligned||^2. W = U * V^T via SVD(Y_aligned * X_aligned^T) = U * S * V^T. 3. Iterate: OT -> Procrustes -> OT -> ... Result: an EN-RU dictionary without bilingual data, built via GW/OT alignment.

Why is the initialization P_0 critical for the convergence of MUSE (OT + Procrustes)?

OT + Procrustes is a non-linear alternating optimization. With a random P_0 you can converge to a spurious alignment of language spaces (for instance, swapping semantic clusters).

Fused GW: Graph Matching and Molecular Comparison

**Fused Gromov-Wasserstein (FGW)** combines structural (GW) and content-based (classical OT) comparison. Formula: FGW = alpha * GW + (1 - alpha) * OT. For graphs: structure via shortest-path distances on the adjacency matrix, node features via feature vectors. Applications: molecular fingerprinting, graph classification, knowledge-graph alignment.

The POT (Python Optimal Transport) library provides ot.gromov.gromov_wasserstein and ot.gromov.fused_gromov_wasserstein with efficient implementations. Used in PyTorch Geometric for graph-level OT pooling and in chemoinformatics for molecular similarity.

In FGW = alpha * GW + (1 - alpha) * OT, what happens at alpha = 0 and alpha = 1?

alpha = 0: FGW = OT, we compare only node features and graph structure is ignored. alpha = 1: FGW = GW, we compare only intra-graph distances. Intermediate alpha balances the two criteria.

Key ideas

  • **GW distance:** min_P sum_{ijkl} (d_X(i, k) - d_Y(j, l))^2 * P_{ij} * P_{kl}. Preserves intra-space distances under the mapping. Quadratic in P.
  • **Algorithm:** Frank-Wolfe / projected gradient. Each step linearizes GW into a linear OT solved by Sinkhorn. No global-minimum guarantee.
  • **Word Mover Distance:** OT between bag-of-words with cost = L2 word2vec. Semantic distance without training.
  • **MUSE alignment:** OT -> Procrustes -> OT -> ... Aligning language embedding spaces without a dictionary.
  • **Fused GW:** alpha * GW + (1 - alpha) * OT for graphs with attributes. Molecules, knowledge graphs, social networks.

Related topics

Gromov-Wasserstein in the OT theory landscape:

  • Unbalanced OT — Previous lesson: OT when masses do not match
  • OT in NLP and generation — Next lesson: applications in language models and generative AI

Связанные уроки

  • ot-03-wasserstein — GW generalizes Wasserstein to different metric spaces
  • ot-10-gromov — Deeper analysis and applications
  • ot-15-unbalanced — Both relax the constraints of classical OT
Gromov-Wasserstein: OT Between Different Spaces

0

1

Sign In