Generative AI

Stable Diffusion and DALL-E

August 2022: Stability AI releases Stable Diffusion publicly. Millions download it within a week. For the first time, a production-grade text-to-image model runs on a consumer GPU. Which architectural choices made that possible?

**Stable Diffusion:** latent diffusion + CLIP + U-Net, open weights
**DALL-E 3:** T5-XXL encoder + GPT-4 prompt rewriting, closed OpenAI model
**Midjourney:** proprietary architecture, optimized for aesthetic quality
**Adobe Firefly:** trained exclusively on licensed content for commercial use
**ControlNet:** hundreds of specialized controllers from the open-source community

The Year Text-to-Image Went Mainstream

In January 2021 OpenAI introduced the first DALL-E, a model that drew images from text descriptions using a discrete VAE and an autoregressive transformer. In 2022 DALL-E 2 switched to diffusion and CLIP latents, sharply raising quality. In parallel Google showed Imagen, demonstrating the power of large text encoders. The decisive shift also came in 2022: Robin Rombach and colleagues at LMU Munich and Runway published Latent Diffusion Models, and in August Stability AI backed the release of Stable Diffusion with open weights. For the first time a powerful text-to-image model ran on a consumer GPU, which spawned a vast ecosystem of fine-tunes and tooling.

Prerequisites

Diffusion theory: forward/reverse process, noise prediction
VAEs and latent representations
Cross-attention and text embeddings

Latent Diffusion: Diffusion in Latent Space

DDPM operates directly in pixel space: a 512x512 RGB image has 512 x 512 x 3 = 786,432 dimensions. Diffusion there is slow and memory-intensive. **Latent Diffusion Models (LDM)**, introduced by Rombach et al. (2022), first compress the image with a VAE encoder into a 64x64x4 latent (16,384 values, 48x fewer), then run diffusion there.
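The arithmetic behind the 48x figure can be checked with a few lines of Python. This is only a sketch: the helper names `num_values` and `latent_shape` are made up for illustration, and the downsampling factor of 8 per side is an assumption inferred from the 512 to 64 reduction described above.

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent grid shape, assuming the VAE shrinks each side by `downsample`."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return height // downsample, width // downsample, latent_channels


pixel_dims = num_values(512, 512, 3)               # RGB image
latent_dims = num_values(*latent_shape(512, 512))  # 64 x 64 x 4 latent
print(pixel_dims, latent_dims, pixel_dims // latent_dims)  # 786432 16384 48
```

Every denoising step of the U-Net therefore processes 48x fewer values than it would in pixel space, which is where most of the speed and memory savings come from.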

The VAE is trained separately and frozen. Its encoder E maps an image to a latent z = E(x); its decoder D reconstructs the image x' = D(z). The diffusion model works only with z. This separation is the core architectural idea behind Stable Diffusion.

**Why 4 channels in the latent?** The VAE encodes spatial information into a 4-dimensional vector per spatial patch. These are not RGB channels: they are learned features with no fixed, human-readable meaning. Individual channels tend to correlate loosely with properties such as brightness, color, and fine structure, but the split is not clean.

Why does Stable Diffusion run diffusion in latent space rather than pixel space?

CLIP and Text Conditioning

How does a text prompt guide the U-Net? Through **CLIP** (Contrastive Language-Image Pre-training, OpenAI 2021). CLIP was trained on 400M image-text pairs so that matching images and captions land close together in a shared embedding space. Its text encoder converts a prompt into a sequence of token embeddings (one vector per token) that the diffusion model can condition on.

Inside the U-Net, the text embedding is fed through **cross-attention**: queries Q come from the image latent, keys K and values V come from the text embedding. Each spatial patch of the latent can then "attend to" the text and be shaped by it.
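The cross-attention step described above can be sketched in plain Python. This is a single-head toy version under stated assumptions: the function names, the tiny dimensions, and the identity projection matrices are all illustrative, and a real U-Net uses learned multi-head projections on much larger tensors.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Queries from latent positions; keys and values from text tokens."""
    q = matmul(latent_tokens, w_q)    # (n_latent, d)
    k = matmul(text_tokens, w_k)      # (n_text, d)
    v = matmul(text_tokens, w_v)      # (n_text, d)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # (n_latent, n_text)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy example: 2 latent positions attend over 3 text tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
latent = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = cross_attention(latent, text, identity, identity, identity)
print(weights)  # one row per latent position, one weight per text token
```

Each row of `weights` sums to 1, so every latent position receives a weighted mix of the text tokens' value vectors.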

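**Classifier-Free Guidance (CFG)** amplifies text adherence. At each denoising step the model produces two noise predictions: one with the text embedding and one with an empty prompt. The final noise prediction is: empty_pred + guidance_scale * (text_pred - empty_pred). At guidance_scale=1 this reduces to the plain text-conditioned prediction. At a commonly used value of 7.5 the image follows the prompt strongly; at 15+ adherence increases further, but at the cost of diversity and often with oversaturated or artifact-prone results.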
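The guidance formula is simple enough to write out directly. The sketch below treats each noise prediction as a flat list of numbers; the function name `classifier_free_guidance` and the toy values are illustrative assumptions, not any library's API.

```python
def classifier_free_guidance(empty_pred, text_pred, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    Implements empty_pred + guidance_scale * (text_pred - empty_pred)
    element by element.
    """
    return [e + guidance_scale * (t - e) for e, t in zip(empty_pred, text_pred)]


empty_pred = [0.0, 1.0]  # prediction with an empty prompt (toy values)
text_pred = [1.0, 3.0]   # prediction with the text embedding (toy values)

print(classifier_free_guidance(empty_pred, text_pred, 0.0))  # [0.0, 1.0]
print(classifier_free_guidance(empty_pred, text_pred, 1.0))  # [1.0, 3.0]
print(classifier_free_guidance(empty_pred, text_pred, 7.5))  # [7.5, 16.0]
```

A scale above 1 extrapolates past the text-conditioned prediction, away from the unconditional one. That is why large values push the output well outside what either prediction alone would give.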

**DALL-E 3 vs Stable Diffusion:** OpenAI replaced CLIP with T5-XXL (a text-only transformer) as the text encoder, and added GPT-4 prompt rewriting before generation. This significantly improved adherence to long, detailed prompts.

How does a text prompt influence generation in Stable Diffusion?

ControlNet: Structured Generation

A text prompt is an imprecise controller. Describing a person's exact pose, a scene's depth layout, or an object's silhouette in words is difficult. **ControlNet** (Zhang et al., 2023) adds structural control: the model additionally takes a depth map, Canny edges, OpenPose keypoints, or a sketch as input.

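The ControlNet architecture is a trainable copy of the U-Net's encoder that accepts an additional conditioning input. Its outputs are added to the corresponding U-Net layers via **zero-convolutions**: convolutional layers whose weights and biases are initialized to zero. At the start of training they output zero, so the ControlNet branch leaves the base model's behavior unchanged and its influence grows gradually. Only the ControlNet is trained; the original model's weights stay frozen.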
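The effect of zero initialization can be seen in a small sketch. It models a 1x1 convolution as a per-position linear map over channels; the helper names (`conv1x1`, `zero_conv_params`, `add_control`) and the toy feature values are assumptions made for illustration only.

```python
def conv1x1(features, weight, bias):
    """1x1 convolution: a linear map over channels at each spatial position.

    features: list of positions, each a list of in_channels values.
    weight:   out_channels rows of in_channels values.
    """
    return [
        [sum(w * x for w, x in zip(w_row, pos)) + b for w_row, b in zip(weight, bias)]
        for pos in features
    ]


def zero_conv_params(in_channels, out_channels):
    """Weights and biases initialized to zero, as in a zero-convolution."""
    weight = [[0.0] * in_channels for _ in range(out_channels)]
    bias = [0.0] * out_channels
    return weight, bias


def add_control(base_features, control_features, weight, bias):
    """Add the zero-conv output of the control branch to the base features."""
    residual = conv1x1(control_features, weight, bias)
    return [[b + r for b, r in zip(bp, rp)] for bp, rp in zip(base_features, residual)]


base = [[0.5, -1.0], [2.0, 0.25]]           # frozen U-Net features (toy values)
control = [[3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]  # ControlNet branch features
weight, bias = zero_conv_params(in_channels=3, out_channels=2)

print(add_control(base, control, weight, bias) == base)  # True at initialization

weight[0][0] = 0.1  # stand-in for a weight that training has moved away from zero
print(add_control(base, control, weight, bias) == base)  # False once weights change
```

Because the added residual is exactly zero at first, training starts from the unmodified base model rather than from a randomly perturbed one.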

**IP-Adapter** is a similar idea for images: it adds image-based conditioning via a separate image encoder and cross-attention. This enables "redraw this face/object in a different style" without fine-tuning the base model.

What are zero-convolutions in the ControlNet architecture?

img2img and Inpainting

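**img2img** starts generation from a noised version of a real image rather than from pure Gaussian noise. Instead of starting at x_T ~ N(0,I), the process starts at x_{t_start}, the input image noised to step t_start. Low t_start (little noise) keeps the image nearly unchanged; high t_start produces strong transformation. Toolkits commonly expose this choice as a strength parameter between 0 and 1: the fraction of the noise schedule that is applied to the input and then denoised.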
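A minimal sketch of how a strength value might map to a starting step, and how the input is noised to that level. Both helpers are hypothetical: the truncating step rule is one common convention rather than a definitive one, and `noise_to_level` assumes the standard DDPM forward process, where `alpha_bar` is the cumulative signal-retention coefficient at the chosen step.

```python
import math
import random


def steps_to_run(num_inference_steps, strength):
    """Number of denoising steps actually executed for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


def noise_to_level(x0, alpha_bar, rng):
    """Forward-noise x0: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


print(steps_to_run(50, 0.3))  # 15: only the last 15 of 50 steps are run
print(steps_to_run(50, 1.0))  # 50: equivalent to starting from pure noise

rng = random.Random(0)
latent = [0.2, -0.7, 1.1]
print(noise_to_level(latent, alpha_bar=0.9, rng=rng))  # mostly signal, a little noise
```

With strength=0.3 the input is only lightly noised and most of its structure survives; as strength approaches 1, less and less of the original remains.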

**Inpainting** regenerates only masked regions while preserving the rest. A mask specifies which pixels should be replaced. At each denoising step, the unmasked region is overwritten with the original image noised to the current step, while the masked region keeps the model's prediction. The model therefore generates the masked content conditioned on the surrounding context.
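The per-step blending can be sketched as a masked mix of two tensors. This is an illustrative simplification: `blend_step` is a made-up helper, tensors are flattened to lists, and dedicated inpainting models may instead pass the mask to the U-Net as extra input channels.

```python
def blend_step(model_latent, original_latent_at_t, mask):
    """Keep the model's output where mask == 1, the original where mask == 0.

    original_latent_at_t is assumed to be the original image already
    noised to the current step, so both inputs share one noise level.
    """
    return [
        m * g + (1.0 - m) * o
        for g, o, m in zip(model_latent, original_latent_at_t, mask)
    ]


generated = [9.0, 9.0, 9.0, 9.0]  # model prediction at this step (toy values)
original = [1.0, 2.0, 3.0, 4.0]   # original content at the same noise level
mask = [1.0, 0.0, 1.0, 0.0]       # 1 = regenerate, 0 = preserve

print(blend_step(generated, original, mask))  # [9.0, 2.0, 9.0, 4.0]
```

Soft mask values between 0 and 1 would blend the two sources, which is one way to feather the boundary of the inpainted region.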

**Outpainting** extends the image beyond its borders. Technically it is the same as inpainting: the original is placed in the center of a larger canvas, the surrounding area is masked, and the model generates continuation conditioned on the existing content.
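Setting up outpainting as an inpainting problem amounts to building a larger canvas and a matching mask. The sketch below uses nested lists for a single-channel image; `make_outpaint_canvas`, the uniform padding, and the fill value are assumptions chosen for illustration.

```python
def make_outpaint_canvas(image, pad, fill=0.0):
    """Center `image` on a larger canvas and mask the new border.

    Returns (canvas, mask), where mask is 1 for pixels to generate
    and 0 for pixels copied from the original image.
    """
    height, width = len(image), len(image[0])
    canvas = [[fill] * (width + 2 * pad) for _ in range(height + 2 * pad)]
    mask = [[1] * (width + 2 * pad) for _ in range(height + 2 * pad)]
    for y in range(height):
        for x in range(width):
            canvas[y + pad][x + pad] = image[y][x]
            mask[y + pad][x + pad] = 0
    return canvas, mask


image = [[0.1, 0.2], [0.3, 0.4]]
canvas, mask = make_outpaint_canvas(image, pad=1)
for row in mask:
    print(row)  # border rows and columns are 1, the 2x2 center is 0
```

The resulting canvas and mask can then go through the same inpainting procedure, with the border as the region to generate.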

What does the parameter strength=0.3 mean in img2img?