Optimal Transport

Wasserstein GAN: a metric that works

GAN training in 2014-2017 was a lottery. WGAN turned it into engineering. Wasserstein distance gave the first honest quality metric for a generative model.

**StyleGAN2 (NVIDIA, 2019)** uses path-length regularization - an extension of the WGAN-GP idea. Generates $1024 \times 1024$ FFHQ faces and remains a baseline for face synthesis
**BigGAN (DeepMind, 2019)** is built on spectral normalization - an alternative way to enforce Lipschitz, conceptually close to WGAN. It scales ImageNet synthesis up to $512 \times 512$; the reported FID of 7.4 is for the $128 \times 128$ model
**FID benchmarks 2017-2018**: WGAN and WGAN-GP showed a critic loss that tracks sample quality during training - with classical GAN losses, the training curve told almost nothing about how good the samples were

Prerequisites

Kantorovich-Rubinstein duality and 1-Lipschitz functions (ot-05-dual)
Wasserstein $W_p$ as a metric on distributions (ot-03-wasserstein)
Basic GAN architectures: generator and discriminator

Instability of classical GANs

In 2014, Goodfellow et al. proposed the GAN: a generator $G$ and a discriminator $D$ play a minimax game that, for an optimal discriminator, amounts to minimizing the JS divergence between $p_r$ and $p_g$. The loss looks elegant: $\min_G \max_D \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$. A year later DCGAN generated convincing faces on CelebA, and the industry fell in love with the architecture. Two years later it became apparent: training was a lottery.

**Three symptoms of instability in classical GANs**:
1. **Vanishing gradients**. When $D$ is too strong, it saturates the sigmoid and outputs nearly 0 on fakes. The gradient of $G$ parameters through $\log(1 - D(G(z)))$ becomes exponentially small - the generator stops learning.
2. **Mode collapse**. $G$ finds one mode where $D$ is confused and replicates it. The multimodal distribution $p_r$ is approximated by a single point - visually, all faces look alike.
3. **Convergence chaos**. The JS-loss oscillates without a clear trend. There is no meaningful quality metric - one has to inspect samples by eye every $N$ iterations.
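The first symptom can be checked numerically. A minimal sketch, assuming the discriminator output is $D = \sigma(a)$ for a raw logit $a$ (the helper names are illustrative): the derivative of $\log(1 - \sigma(a))$ with respect to $a$ is $-\sigma(a)$, so it shrinks together with $D(G(z))$.

```python
import math

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def saturating_generator_grad(logit: float) -> float:
    """d/da log(1 - sigmoid(a)) = -sigmoid(a), evaluated at the fake sample's logit."""
    return -sigmoid(logit)

# As the discriminator grows confident that a sample is fake (logit -> -inf),
# the signal passed back to the generator decays like exp(logit).
for logit in (0.0, -5.0, -10.0, -20.0):
    d_fake = sigmoid(logit)
    grad = saturating_generator_grad(logit)
    print(f"logit={logit:6.1f}  D(G(z))={d_fake:.2e}  dLoss/dlogit={grad:.2e}")
```

The gradient that actually reaches $G$'s parameters is this factor multiplied by the rest of the chain rule, so it cannot be larger in magnitude than the rest of the chain allows once this factor collapses.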

The root of the problem lies in the JS divergence itself. If $p_r$ and $p_g$ live on lower-dimensional manifolds (the typical situation for natural images in $\mathbb{R}^{H \times W \times 3}$), their supports are almost surely disjoint. On disjoint supports, $\text{JS}(p_r \| p_g) = \log 2$ - a constant. The derivative of a constant is zero. The generator stalls because the loss function does not distinguish between "barely missed" and "far off".
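The constant value is easy to verify on a toy discrete case. A minimal sketch, assuming both distributions are point masses on a hypothetical 10-cell grid (`kl`, `js`, and `point_mass` are illustrative helpers, not library functions):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def point_mass(index, size=10):
    return [1.0 if i == index else 0.0 for i in range(size)]

real = point_mass(0)
for shift in (1, 5, 9):
    fake = point_mass(shift)
    # The value does not depend on how far the fake mass is from the real one.
    print(shift, js(real, fake), math.log(2))
```

Moving the fake mass closer to the real one changes nothing until the supports coincide, which is exactly why the generator receives no direction to move in.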

**Manifold hypothesis vs JS**: real images live on a manifold of dimension $\sim 50$ inside $\mathbb{R}^{3 \cdot 256 \cdot 256} \approx \mathbb{R}^{200000}$. The probability that two independent samples $G(z_1)$ and $x \sim p_r$ fall into the same $\varepsilon$-ball is exponentially small. So the supports of $p_r$ and $p_g$ almost never overlap at the start of training - and JS becomes useless.

What exactly causes mode collapse in a classical GAN?

Wasserstein loss: critic instead of discriminator

January 2017. Arjovsky, Chintala, and Bottou publish "Wasserstein GAN". The idea is simple and radical: replace JS with $W_1$. Through Kantorovich-Rubinstein duality, $W_1$ can be rewritten as a supremum over 1-Lipschitz functions - and this changes the entire GAN formulation. The discriminator becomes a critic, the sigmoid disappears, training stabilizes.

**Wasserstein-1 distance via duality**: $$W_1(p_r, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$$ where $\|f\|_L \leq 1$ means 1-Lipschitz: $|f(x) - f(y)| \leq \|x - y\|$. The function $f$ in WGAN is called the **critic** - it scores how "real" an input looks, returning a scalar (not a probability). WGAN losses: $$\mathcal{L}_{\text{critic}} = \mathbb{E}[f(G(z))] - \mathbb{E}[f(x)], \quad \mathcal{L}_{\text{gen}} = -\mathbb{E}[f(G(z))]$$ The critic maximizes the gap between real and fake; the generator minimizes it.
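The two losses reduce to batch means of raw critic scores. A minimal sketch, assuming the scores $f(x)$ and $f(G(z))$ have already been computed (the function name and the numbers are hypothetical):

```python
from statistics import fmean

def wgan_losses(real_scores, fake_scores):
    """Return (critic_loss, generator_loss) from raw, unbounded critic scores."""
    critic_loss = fmean(fake_scores) - fmean(real_scores)  # E[f(G(z))] - E[f(x)]
    generator_loss = -fmean(fake_scores)                   # -E[f(G(z))]
    return critic_loss, generator_loss

real_scores = [2.0, 3.0, 1.0]    # hypothetical f(x) on a real batch
fake_scores = [-1.0, 0.0, -2.0]  # hypothetical f(G(z)) on a generated batch
critic_loss, generator_loss = wgan_losses(real_scores, fake_scores)

# The negated critic loss is the batch estimate of the real-vs-fake gap.
gap_estimate = -critic_loss
print(critic_loss, generator_loss, gap_estimate)
```

No logarithm and no sigmoid appear anywhere: the scores enter the loss linearly.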

The principal difference from a discriminator: the output is unbounded (not constrained to $[0, 1]$) and is not a probability. It is just a scalar score. The sigmoid is removed - along with saturation, vanishing gradients, and the log in the loss. On disjoint supports, $W_1$ still measures how far apart the distributions are (for two point masses it is exactly the distance between them), so it varies continuously with the parameters of $G$ and gives a usable gradient almost everywhere instead of a flat constant.
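The behaviour of $W_1$ on disjoint supports can be illustrated in one dimension, where $W_1$ equals the integral of the absolute difference between the two CDFs. A minimal sketch under that assumption (the grid, `w1_on_grid`, and `point_mass` are illustrative; the CDF shortcut holds only in 1-D):

```python
from itertools import accumulate

def w1_on_grid(p, q, spacing=1.0):
    """W_1 between two histograms on the same evenly spaced 1-D grid."""
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return spacing * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))

def point_mass(index, size=10):
    return [1.0 if i == index else 0.0 for i in range(size)]

real = point_mass(0)
distances = [w1_on_grid(real, point_mass(shift)) for shift in (1, 5, 9)]
print(distances)  # [1.0, 5.0, 9.0]
```

The distance grows linearly with the shift, so a generator that moves its mass toward the data sees the loss decrease at every step.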

**Weight clipping** is the original way to enforce Lipschitz in WGAN. After each critic update, all parameters are clipped to $[-c, c]$ (typically $c = 0.01$). Crude but effective: it guarantees $K$-Lipschitz with some $K$ that depends on architecture and $c$. The authors themselves admitted in the paper that this is a deeply imperfect approach - but in 2017 it worked well enough for a breakthrough in training stability.
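A minimal sketch of the clipping step, assuming a hypothetical linear critic $f(x) = w \cdot x$ so that the Lipschitz constant is simply $\|w\|_2$ (real critics are deep networks, where $K$ depends on every layer):

```python
import math

def clip_weights(weights, c=0.01):
    """Clamp every parameter to [-c, c], as done after each critic update."""
    return [min(max(w, -c), c) for w in weights]

def linear_lipschitz_constant(weights):
    """For f(x) = w . x the Lipschitz constant in the L2 norm is ||w||_2."""
    return math.sqrt(sum(w * w for w in weights))

weights = [0.5, -0.003, -0.2, 0.008]      # hypothetical critic parameters
clipped = clip_weights(weights)
bound = 0.01 * math.sqrt(len(weights))    # worst case: every weight sits at +-c

print(clipped)
print(linear_lipschitz_constant(clipped), "<=", bound)
```

Large weights end up exactly at $\pm c$, so the constraint is satisfied, but nothing in the procedure aims for a constant of 1.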

The practical advantage: $-\mathcal{L}_{\text{critic}}$ correlates with sample quality. It estimates $W_1(p_r, p_g)$ up to the critic's Lipschitz constant, so when this estimate falls, the generator is genuinely improving. This is the first GAN objective that gives a meaningful training metric. Before WGAN, sample quality was judged by eye or via separate Inception Score computations; after WGAN, training can be monitored in real time.

How does the WGAN critic fundamentally differ from a classical GAN discriminator?

Gradient penalty: smart Lipschitz

Weight clipping works, but bluntly. It clips all parameters uniformly without regard to network structure, often reducing critic capacity - after clipping, weights cluster at the boundaries $\pm c$ and activations saturate. A few months after the original WGAN, Gulrajani et al. (2017), with Arjovsky among the co-authors, released WGAN-GP with a fundamentally better idea.

**Gradient penalty** replaces weight clipping with a regularizer in the critic loss: $$\mathcal{L}_{\text{critic}}^{\text{GP}} = \mathbb{E}[f(G(z))] - \mathbb{E}[f(x)] + \lambda \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1\right)^2\right]$$ Where $\hat{x} = t \cdot x + (1 - t) \cdot G(z)$ with $t \sim U[0,1]$ - a sample on the line segment between real and fake. Typically $\lambda = 10$. Idea: instead of clipping weights (which enforces Lipschitz indirectly), directly penalize the critic when the gradient norm deviates from 1.
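The penalty term can be sketched without a deep-learning framework. This is an illustration under stated assumptions: the gradient is approximated by central finite differences, whereas real implementations use automatic differentiation, and the two linear critics and the batches are hypothetical.

```python
import math
import random

def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def gradient_penalty(critic, real_batch, fake_batch, lam=10.0, rng=random):
    """lam * mean((||grad f(x_hat)||_2 - 1)^2) over interpolated points x_hat."""
    penalties = []
    for x, g in zip(real_batch, fake_batch):
        t = rng.random()  # t ~ U[0, 1)
        x_hat = [t * xi + (1 - t) * gi for xi, gi in zip(x, g)]
        grad = numerical_gradient(critic, x_hat)
        norm = math.sqrt(sum(d * d for d in grad))
        penalties.append((norm - 1.0) ** 2)
    return lam * sum(penalties) / len(penalties)

def steep_critic(x):
    return 3.0 * x[0] + 4.0 * x[1]  # gradient norm 5 everywhere

def unit_critic(x):
    return 0.6 * x[0] + 0.8 * x[1]  # gradient norm 1 everywhere

rng = random.Random(0)
real_batch = [[1.0, 2.0], [0.5, -1.0]]  # hypothetical real samples
fake_batch = [[0.0, 0.0], [2.0, 1.0]]   # hypothetical generator outputs
gp_steep = gradient_penalty(steep_critic, real_batch, fake_batch, rng=rng)
gp_unit = gradient_penalty(unit_critic, real_batch, fake_batch, rng=rng)
print(gp_steep, gp_unit)
```

The critic with gradient norm 5 pays roughly $\lambda (5 - 1)^2 = 160$, while the unit-norm critic pays almost nothing; the penalty is two-sided, so a gradient norm below 1 is penalized as well.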

Theoretical justification: the optimal critic $f^*$ from Kantorovich-Rubinstein duality has $\|\nabla f^*\| = 1$ almost everywhere along the straight lines connecting points that the optimal transport plan couples. This is a characterization of the optimal Kantorovich potential, and it is why $\hat{x}$ is sampled on segments between real and generated points. The regularizer toward gradient norm = 1 therefore not only enforces Lipschitz but also pushes the critic toward its optimal form - gradient descent moves $f$ toward the dual problem solution.

Alternative ways to enforce Lipschitz appeared later. **Spectral Normalization** (Miyato et al., 2018) divides each weight matrix by its spectral norm - this makes every linear layer 1-Lipschitz, and with 1-Lipschitz activations the whole network, without any regularizer term. Used in BigGAN (Brock et al., 2019) for ImageNet generation at up to $512 \times 512$. Spectral norm is cheaper than GP in compute, but less flexible - GP is a soft penalty, so the critic's effective Lipschitz constant is not pinned to exactly 1.
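The normalization step itself is short. A minimal sketch, assuming a small dense weight matrix stored as nested lists and a nonzero matrix (the helper names are illustrative); the largest singular value is estimated by power iteration, run here to convergence for clarity, whereas training code typically performs only a step or so per update and reuses the vector.

```python
import math

def matvec(matrix, vec):
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def spectral_norm(matrix, iterations=100):
    """Estimate the largest singular value by power iteration on W^T W."""
    w_t = transpose(matrix)
    v = normalize([1.0] * len(matrix[0]))
    for _ in range(iterations):
        u = normalize(matvec(matrix, v))
        v = normalize(matvec(w_t, u))
    return math.sqrt(sum(x * x for x in matvec(matrix, v)))

def spectral_normalize(matrix, iterations=100):
    """Divide the matrix by its spectral norm so the linear map is 1-Lipschitz."""
    sigma = spectral_norm(matrix, iterations)
    return [[a / sigma for a in row] for row in matrix]

weights = [[3.0, 0.0], [0.0, 1.0]]   # hypothetical layer, singular values 3 and 1
normalized = spectral_normalize(weights)
print(spectral_norm(weights), spectral_norm(normalized))
```

After normalization the largest singular value is 1, so the layer cannot stretch any input direction, while smaller singular values are scaled down proportionally.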

**Path-length regularization** in StyleGAN2 (Karras et al., 2019) carries the WGAN-GP idea over from the critic to the generator. Instead of $\|\nabla f\| = 1$ in input space, the regularizer requires that a small fixed-size step in the intermediate latent $w$ produce a change of constant norm in pixel space. This brought improvements in perceptual quality on $1024 \times 1024$ FFHQ faces. NVIDIA still uses this scheme in commercial models.

**Misconception**: WGAN-GP completely solves GAN training. Reality: WGAN-GP is more stable than classical GANs, but not a magic bullet. Modern practice 2023-2026: for high-resolution synthesis, diffusion models (DDPM, Ho et al. 2020; EDM, Karras et al. 2022; Stable Diffusion) and flow matching (FLUX.1) deliver better FID and diversity. WGAN-GP remains important as a baseline and in narrow-domain tasks (medical imaging, low-data regimes), but it is not state-of-the-art for general-purpose generation.

WGAN-GP completely solves all problems of GAN training - mode collapse and instability can be forgotten forever

WGAN-GP significantly improves stability but is not a panacea. Mode collapse becomes rare yet still possible. Hyperparameter tuning is still needed. More importantly, in high-resolution synthesis (2023-2026), diffusion models and flow matching surpass GAN approaches in quality and stability.

Key ideas

Classical GANs suffer from vanishing gradients, mode collapse, and convergence chaos - the root is JS divergence, which is constant on disjoint supports
WGAN replaces JS with $W_1$ via Kantorovich-Rubinstein duality. The discriminator becomes a 1-Lipschitz critic with scalar output, and the loss correlates with sample quality
Original WGAN uses weight clipping to enforce Lipschitz - crude but functional. WGAN-GP replaces clipping with the gradient penalty $\lambda \mathbb{E}[(\|\nabla f\| - 1)^2]$, theoretically guiding the critic toward the optimal Kantorovich potential
Modern alternatives: spectral normalization (BigGAN), path-length regularization (StyleGAN2). Diffusion models and flow matching outperform WGAN on high-resolution tasks in 2023-2026, but WGAN-GP remains a baseline and is important in low-data regimes

Questions for reflection

Why is $W_1$ rather than $W_2$ chosen for the WGAN loss? What role does Kantorovich-Rubinstein duality play in this choice?
Can gradient penalty guarantee a strictly 1-Lipschitz critic, or is it only a soft constraint? What happens if $\|\nabla f\|$ is systematically greater than 1?
Why do diffusion models (DDPM 2020) outperform WGAN on high-resolution tasks? What is the structural advantage of a stochastic noise schedule over adversarial training?

Related lessons

ot-03-wasserstein — WGAN is a direct application of the $W_1$ metric as a loss function
ot-05-dual — Kantorovich-Rubinstein duality justifies the critic as a 1-Lipschitz function
ot-11-flow-matching — Flow matching is a modern alternative to GANs built on the same OT ideas
ig-11-wasserstein-vs-fisher — Comparison of Wasserstein and Fisher geometries in training
prob-01-intro — JS and KL divergences are the foundation of classical GANs
ml-01