Generative AI

Diffusion Models: Theory

2020: GANs dominate image generation - but training is unstable, mode collapse is common, and balancing the generator and discriminator is notoriously tricky. Then Ho et al. publish DDPM and show sample quality that rivals strong GANs, using a simple, stable training objective that avoids those pathologies. The secret lies in the physical process of diffusion, run in reverse.

**Stable Diffusion, DALL-E 3, Midjourney** - all built on diffusion models
**Drug discovery:** AlphaFold 3 generates molecular structures via diffusion
**Video generation:** Sora, Gen-3 - diffusion across space and time
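**Audio:** Stable Audio, Riffusion - diffusion over audio latents or spectrograms (MusicGen, often listed alongside them, is an autoregressive transformer rather than a diffusion model)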

From Thermodynamics to DDPM

The idea of diffusion models was born in 2015: Jascha Sohl-Dickstein and colleagues, in "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", proposed training a generative model by reversing a gradual data-corruption process. The approach drew on nonequilibrium thermodynamics and stayed niche for nearly five years. In 2019 Yang Song and Stefano Ermon independently reached a closely related idea through score-based generative models (estimating the gradient of the data log-density). The turning point was 2020, when Jonathan Ho, Ajay Jain, and Pieter Abbeel published DDPM and showed that diffusion can produce images of a quality comparable to strong GANs; in 2021 Dhariwal and Nichol reported diffusion models surpassing GANs on ImageNet. DDPM is what launched the wave that led to Stable Diffusion and DALL-E.

Prerequisites

Understanding of VAEs and latent spaces
Basic convolutional networks and U-Net
Attention and transformers at a conceptual level

DDPM: Denoising Diffusion Probabilistic Models

Take a photograph and add a tiny amount of Gaussian noise at each iteration. After 1000 iterations, the result is indistinguishable from white noise. DDPM asks: can a neural network reverse this process? It turns out the answer is yes - by training the network to predict the noise that was mixed into the image at a given step.

The **forward process** is a fixed Markov chain: at step t, Gaussian noise with variance beta_t is added to x_{t-1} to produce x_t. A key property makes training efficient: x_t at any step t can be sampled directly from x_0 without iterating through the intermediate steps. With alpha_bar_t = ∏_{i=1}^{t} (1 - beta_i), the reparameterization is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, where eps ~ N(0, I).
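
The one-shot sampling property can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names (`make_linear_alpha_bar`, `q_sample`), the 0-based step index, and the random 8x8 array standing in for an image are all assumptions made for the example. It uses the standard closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.

```python
import numpy as np

def make_linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i) for a linearly increasing beta."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) directly; t is a 0-based step index (assumption)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_linear_alpha_bar()

# Hypothetical stand-in for a tiny image scaled to [-1, 1].
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))

x_early, _ = q_sample(x0, 10, alpha_bar, rng)    # still close to x0
x_late, _ = q_sample(x0, 999, alpha_bar, rng)    # almost entirely noise
```

During training, a random t is drawn per example and `q_sample` is called once, so no 1000-step simulation is needed to produce a noisy input.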

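**Why predict noise instead of x_0?** Ho et al. (2020) found empirically that predicting eps yields better sample quality. The two targets carry the same information, since x_0 can be recovered from x_t and eps, but the noise target leads to a simple mean-squared-error loss. A common intuition: noise is local and high-frequency, which is easier for a network to predict than the global structure of a full image.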
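
To make the link between the two targets concrete, here is a small sketch under stated assumptions: `alpha_bar_t = 0.5` is an arbitrary illustrative value, the helper names are hypothetical, and the "prediction" is simply the true noise, standing in for a perfectly trained network. It shows the mean-squared-error objective on eps and how an eps prediction converts back into an estimate of x_0.

```python
import numpy as np

def simple_loss(eps_true, eps_pred):
    """Mean-squared error between the true and the predicted noise."""
    return np.mean((eps_true - eps_pred) ** 2)

def predict_x0_from_eps(x_t, eps_pred, alpha_bar_t):
    """Invert x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps for x0."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.5                                # illustrative value (assumption)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))         # stand-in for an image
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# An oracle that predicts the true noise has zero loss and recovers x0.
perfect_loss = simple_loss(eps, eps)
x0_recovered = predict_x0_from_eps(x_t, eps, alpha_bar_t)
```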

What does the U-Net in DDPM predict at each denoising step?

Score Matching and SMLD

There is an alternative perspective on diffusion: the **score function**. The score is the gradient of the log-density of the data distribution with respect to x: s(x) = ∇_x log p(x). It points in the direction in which the density grows fastest, so knowing the score at every point allows moving toward higher-density regions - which is exactly what image generation from noise requires.

**Score Matching with Langevin Dynamics (SMLD)** - Yang Song and Stefano Ermon (2019): train a network s_θ(x, σ) ≈ ∇_x log p_σ(x), where p_σ is the data distribution perturbed by noise of scale σ. For generation, run Langevin dynamics iterations - a drift along the score plus small random noise. Song et al. (2021) showed that DDPM and SMLD fit a single framework: each is a discretization of a stochastic differential equation (SDE).
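
A toy version of the Langevin update can be written down when the score is known in closed form. The sketch below assumes a 1D Gaussian target N(mu, sigma^2), whose score is -(x - mu) / sigma^2, in place of a trained network s_θ; the step size, chain count, and names are illustrative choices, not values from the SMLD paper.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Analytic score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def langevin_sample(score_fn, x_init, step, n_steps, rng):
    """Unadjusted Langevin dynamics: drift along the score plus Gaussian noise."""
    x = x_init.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step * score_fn(x) + np.sqrt(step) * noise
    return x

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5                              # toy target (assumption)

# Start 5000 independent chains far from the target distribution.
x_init = 3.0 * rng.standard_normal(5000)
samples = langevin_sample(lambda x: gaussian_score(x, mu, sigma),
                          x_init, step=0.05, n_steps=200, rng=rng)

print(samples.mean(), samples.std())
```

With a finite step size the chain's stationary distribution is only approximately the target, so the sample standard deviation comes out slightly above sigma; a smaller step reduces this bias at the cost of more iterations.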

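**DDPM and score-based models via SDE.** Song et al. (2021) unified both frameworks: the forward noising process can be written as an SDE of the form dx = f(x,t)dt + g(t)dW, where W is a Wiener process. DDPM corresponds to the variance-preserving choice of f and g, SMLD to the variance-exploding one. The reverse of this process is also an SDE, and its drift depends on the score ∇_x log p_t(x), so learning the score is enough to sample. This opened the path to continuous-time diffusion and flow matching.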
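
For reference, the relationships in this framework can be written out explicitly. The block below is a summary in standard notation, offered as a sketch: β(t) and σ(t) denote continuous-time noise schedules, and W̄ a reverse-time Wiener process; these symbols are introduced here for illustration.

```latex
% Forward SDE and its time reversal (t runs from T down to 0 in the second line)
\begin{aligned}
dx &= f(x,t)\,dt + g(t)\,dW, \\
dx &= \bigl[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\bigr]\,dt + g(t)\,d\bar{W}.
\end{aligned}

% Variance-preserving SDE (continuous-time limit of DDPM)
dx = -\tfrac{1}{2}\,\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dW

% Variance-exploding SDE (continuous-time limit of SMLD)
dx = \sqrt{\frac{d\,[\sigma^2(t)]}{dt}}\;dW

% Score of the DDPM perturbation kernel in terms of the added noise
\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}},
\qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
```

The last line is what ties the two views together: a network that predicts eps is, up to the factor -1/sqrt(1 - alpha_bar_t), a score estimator.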

What is the score function in the context of diffusion models?

Noise Schedule: Linear vs Cosine

The noise schedule determines how quickly corruption builds up - specifically, how alpha_bar_t falls from 1 to ~0 across T steps. The choice of schedule affects both training quality and inference efficiency.

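The **linear schedule** (original DDPM) increases beta_t linearly from 0.0001 to 0.02 over T = 1000 steps. The problem: at small resolutions, alpha_bar_t falls too quickly, so the image is close to pure noise well before the final step and the last part of the chain trains on an almost meaningless signal. Nichol & Dhariwal (2021) reported that a model trained with the linear schedule loses little quality when up to about 20% of the reverse process is skipped. Their **cosine schedule** fixes this with a smoother progression that destroys information more gradually.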
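
The difference between the two schedules is easiest to see by computing alpha_bar_t for both. The sketch below follows the cosine formula from Nichol & Dhariwal (2021) with offset s = 0.008; the function names are hypothetical, and the clipping of beta_t that practical implementations apply near t = T is omitted for brevity.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for the linear beta schedule of the original DDPM."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    t = np.arange(1, T + 1) / T
    f_t = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    f_0 = np.cos(s / (1.0 + s) * np.pi / 2.0) ** 2
    return f_t / f_0

lin = linear_alpha_bar()
cos = cosine_alpha_bar()

# Compare the fraction of signal variance left at a few points of the chain.
for frac in (0.25, 0.5, 0.75):
    idx = int(frac * 1000) - 1
    print(f"t/T={frac:.2f}  linear={lin[idx]:.4f}  cosine={cos[idx]:.4f}")
```

At the midpoint the linear schedule has already pushed alpha_bar_t below 0.1, while the cosine schedule is still near 0.5, which is the "smoother progression" in numbers.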

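**Flow Matching (2022-2023)** goes further: straight-line trajectories from noise to data replace curved diffusion paths, and the network learns the velocity along them. Because nearly straight paths are easy to integrate numerically, this can reduce the number of sampling steps from around 1000 to roughly 20-50 with little loss in quality. Stable Diffusion 3 and FLUX both use Flow Matching.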
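
The straight-line idea can be illustrated with a deliberately idealized sketch. It assumes the convention t = 0 for noise and t = 1 for data (papers differ on this), and it replaces the trained network with an oracle that returns the exact velocity for one noise-data pair; a real model learns an averaged velocity field, so its trajectories are only approximately straight. All names are hypothetical.

```python
import numpy as np

def interpolate(x_noise, x_data, t):
    """Point on the straight path between noise (t=0) and data (t=1)."""
    return (1.0 - t) * x_noise + t * x_data

def target_velocity(x_noise, x_data):
    """Constant velocity along the straight path; the regression target."""
    return x_data - x_noise

def euler_sample(x_noise, velocity_fn, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x_noise.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
x_data = rng.uniform(-1.0, 1.0, size=4)           # stand-in for a data point
x_noise = rng.standard_normal(4)

# Oracle velocity for this single pair (stands in for a trained network).
oracle = lambda x, t: target_velocity(x_noise, x_data)
x_gen = euler_sample(x_noise, oracle, n_steps=4)  # lands on x_data
```

When the velocity is constant along the path, Euler integration is exact regardless of the step count, which is the intuition behind the small number of sampling steps.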

Why does the cosine schedule outperform the linear schedule?

Denoising via U-Net

The diffusion math needs a concrete backbone - a network that takes a noisy image x_t and a step index t, returning the predicted noise eps. DDPM uses a **U-Net**: an encoder-bottleneck-decoder architecture with skip connections that preserve spatial detail across scales.

The step index t is injected via a **sinusoidal time embedding** - the same mechanism as positional encoding in transformers. This lets the model adapt: at large t (heavy noise) it focuses on coarse structure; at small t it refines fine details.
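
A minimal sketch of such an embedding, assuming the transformer-style formulation with an even embedding dimension and a maximum period of 10000; the function name and the choice of concatenating sines before cosines are illustrative, and implementations differ in these details.

```python
import numpy as np

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Map step indices t to vectors of sines and cosines at geometric frequencies.

    Assumes dim is even; the first half holds sines, the second half cosines.
    """
    t = np.atleast_1d(np.asarray(t, dtype=np.float64))
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = sinusoidal_embedding([0, 1, 999], dim=128)   # shape (3, 128)
```

In a U-Net this vector is typically passed through a small MLP and added to the feature maps of each residual block, so every layer knows the current noise level.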

**DDIM sampling.** Standard DDPM inference uses as many steps as training, typically 1000. DDIM (Jiaming Song, Meng, and Ermon, 2020) showed that the same trained model can generate in 50-100 deterministic steps - 10-20x faster. Because the sampler is deterministic, each initial noise vector maps to one image, which enables latent interpolation: smooth transitions between images in the noise space.
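
The deterministic DDIM update (the eta = 0 case) can be sketched as follows. The example assumes a linear beta schedule, an evenly spaced subset of 50 out of 1000 steps, and an oracle that returns the true noise in place of a trained U-Net, so the loop recovers the original x_0 exactly; with a real network the result would be a generated sample instead. Names are hypothetical.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from noise level alpha_bar_t to alpha_bar_prev."""
    # Estimate the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (lower) target noise level, reusing eps_pred.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
timesteps = np.linspace(T - 1, 0, 50).round().astype(int)   # 50 steps, descending

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))          # stand-in for an image
eps = rng.standard_normal(x0.shape)
t_start = timesteps[0]
x = np.sqrt(alpha_bar[t_start]) * x0 + np.sqrt(1.0 - alpha_bar[t_start]) * eps

for i, t in enumerate(timesteps):
    # After the last listed step, jump to the noise-free level alpha_bar = 1.
    ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    x = ddim_step(x, eps, alpha_bar[t], ab_prev)  # oracle eps instead of a network
```

Nothing in `ddim_step` draws random numbers, which is what makes the mapping from initial noise to output repeatable and interpolation in noise space meaningful.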

Why does the U-Net in DDPM receive the step index t as an additional input?