Generative AI
Stable Diffusion and DALL-E
August 2022: Stability AI releases Stable Diffusion publicly. Millions download it within a week. For the first time, a production-grade text-to-image model runs on a consumer GPU. Which architectural choices made that possible?
- **Stable Diffusion:** latent diffusion + CLIP + U-Net, open weights
- **DALL-E 3:** T5-XXL encoder + GPT-4 prompt rewriting, closed OpenAI model
- **Midjourney:** proprietary architecture, optimized for aesthetic quality
- **Adobe Firefly:** trained exclusively on licensed content for commercial use
- **ControlNet:** hundreds of specialized controllers from the open-source community
The Year Text-to-Image Went Mainstream
In January 2021 OpenAI introduced the first DALL-E, which generated images from text descriptions using a discrete VAE and an autoregressive transformer. In 2022 DALL-E 2 switched to diffusion over CLIP latents, sharply raising quality. In parallel Google presented Imagen, demonstrating the power of large text encoders. The decisive shift came the same year: Robin Rombach and colleagues at LMU Munich and Runway published Latent Diffusion Models, and Stability AI released Stable Diffusion with open weights. For the first time a powerful text-to-image model ran on a consumer GPU, which spawned a vast ecosystem of fine-tunes and tooling.
Prerequisites
- Diffusion theory: forward/reverse process, noise prediction
- VAEs and latent representations
- Cross-attention and text embeddings
Latent Diffusion: Diffusion in Latent Space
DDPM operates directly in pixel space. For a 512x512 RGB image that is a 512 x 512 x 3 = 786,432-dimensional space, so diffusion there is slow and memory-intensive. **Latent Diffusion Models (LDM)**, introduced by Rombach et al. (2022), first compress the image with a VAE encoder into a 64x64x4 latent (16,384 values, 48x fewer), then run diffusion there.
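The 48x figure follows directly from the tensor shapes. A minimal sketch of the arithmetic, assuming an RGB input (3 channels); the helper name `num_values` is illustrative and not part of any library:

```python
def num_values(height: int, width: int, channels: int) -> int:
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


pixel_dims = num_values(512, 512, 3)   # RGB image in pixel space
latent_dims = num_values(64, 64, 4)    # VAE latent seen by the diffusion model

compression = pixel_dims / latent_dims
print(pixel_dims, latent_dims, compression)  # 786432 16384 48.0
```

Each latent position therefore summarizes an 8x8 block of pixels, which is where most of the compute saving comes from.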
The VAE is trained separately and frozen. Its encoder E maps an image to a latent z = E(x); its decoder D reconstructs the image x' = D(z). The diffusion model works only with z. This separation is the core architectural idea behind Stable Diffusion.
**Why 4 channels in the latent?** The VAE encodes each 8x8 pixel patch into a 4-dimensional vector, so a 512x512 image becomes a 64x64 grid of such vectors. These are not RGB channels - they are learned features. Individual channels carry no guaranteed meaning; informally, some tend to correlate with luminance and texture and others with color and shape detail.
Why does Stable Diffusion run diffusion in latent space rather than pixel space?
CLIP and Text Conditioning
How does a text prompt guide the U-Net? Through **CLIP** (Contrastive Language-Image Pre-training, OpenAI 2021). CLIP was trained on 400M image-text pairs so that matching images and captions land close together in a shared embedding space. Its text encoder converts a prompt into a sequence of token embeddings (77 tokens of 768 dimensions each in Stable Diffusion 1.x) that the diffusion model can condition on.
Inside the U-Net, the text embedding is fed through **cross-attention**: queries Q come from the image latent, keys K and values V come from the text embedding. Each spatial patch of the latent can then "attend to" the text and be shaped by it.
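The query/key/value flow can be sketched in a few lines. This is a simplified single-head version under stated assumptions: the learned projection matrices that a real U-Net applies to produce Q, K, and V are omitted, and all numbers are made-up toy data.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head cross-attention without learned projections.

    queries: one vector per latent patch (image side)
    keys, values: one vector per text token (text side)
    Returns (outputs, weights): one output vector and one weight row per patch.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        row = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(row, values))
                 for d in range(len(values[0]))]
        weights.append(row)
        outputs.append(mixed)
    return outputs, weights


# Toy data (made up): 3 latent patches, 2 text tokens, dimension 2.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_keys = [[4.0, 0.0], [0.0, 4.0]]
token_values = [[1.0, 0.0], [0.0, 1.0]]

out, attn = cross_attention(patches, token_keys, token_values)
```

Each row of `attn` sums to 1 and says how strongly one patch attends to each token. Here the first patch attends mostly to the first token, the second patch to the second, and the third splits its attention evenly.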
**Classifier-Free Guidance (CFG)** amplifies text adherence. At every denoising step the model runs twice: once with the text embedding and once with an empty prompt. The final noise prediction is: empty_pred + guidance_scale * (text_pred - empty_pred). A guidance_scale of 1 reproduces the text-conditioned prediction unchanged. At guidance_scale=7.5 the image follows the prompt strongly; at 15+ adherence rises further, but at the cost of diversity.
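The guidance formula is a one-line extrapolation. A minimal sketch, assuming the two noise predictions arrive as equally sized lists of floats; the function name and the toy numbers are illustrative only:

```python
def classifier_free_guidance(empty_pred, text_pred, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    Implements empty_pred + guidance_scale * (text_pred - empty_pred),
    applied elementwise.
    """
    return [e + guidance_scale * (t - e) for e, t in zip(empty_pred, text_pred)]


# Toy noise predictions (made-up numbers, two values instead of a full latent).
empty_pred = [0.0, 1.0]
text_pred = [1.0, 3.0]

unguided = classifier_free_guidance(empty_pred, text_pred, 1.0)  # equals text_pred
guided = classifier_free_guidance(empty_pred, text_pred, 7.5)    # pushed past it
print(unguided, guided)  # [1.0, 3.0] [7.5, 16.0]
```

With a scale above 1 the result moves beyond the text-conditioned prediction, along the direction that separates it from the unconditional one.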
**DALL-E 3 vs Stable Diffusion:** OpenAI replaced CLIP with T5-XXL (a text-only transformer) as the text encoder, and added GPT-4 prompt rewriting before generation. This significantly improved adherence to long, detailed prompts.
How does a text prompt influence generation in Stable Diffusion?
ControlNet: Structured Generation
A text prompt is an imprecise controller. Describing a person's exact pose, a scene's depth layout, or an object's silhouette in words is difficult. **ControlNet** (Zhang et al., 2023) adds structural control: the model additionally takes a depth map, Canny edges, OpenPose keypoints, or a sketch as input.
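ControlNet is a trainable copy of the U-Net's encoder that accepts an additional conditioning input. Its outputs are added to the corresponding U-Net layers through **zero-convolutions** - 1x1 convolutional layers whose weights and biases are initialized to zero. At the start of training these layers output zeros, so the combined model behaves exactly like the original, and the control signal is learned gradually from there. This allows training only the ControlNet while the original model's weights stay frozen.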
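The effect of zero initialization can be shown without a deep learning framework. The sketch below assumes a 1x1 convolution, which is a per-position linear map over channels, and represents a feature map as a list of positions; the class and function names are hypothetical, not taken from the ControlNet code.

```python
class ZeroConv1x1:
    """A 1x1 convolution whose weights and bias start at zero.

    A feature map is a list of spatial positions, each a list of
    channel values.
    """

    def __init__(self, in_channels: int, out_channels: int):
        self.weight = [[0.0] * in_channels for _ in range(out_channels)]
        self.bias = [0.0] * out_channels

    def __call__(self, feature_map):
        return [
            [sum(w * x for w, x in zip(row, position)) + b
             for row, b in zip(self.weight, self.bias)]
            for position in feature_map
        ]


def add_control(unet_features, control_features, zero_conv):
    """Add the control branch's output to the frozen U-Net's features."""
    residual = zero_conv(control_features)
    return [[u + r for u, r in zip(u_pos, r_pos)]
            for u_pos, r_pos in zip(unet_features, residual)]


# Toy feature maps (made up): 2 spatial positions, 3 channels each.
unet_features = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
control_features = [[3.0, 1.0, -2.0], [0.0, 4.0, 1.0]]

conv = ZeroConv1x1(in_channels=3, out_channels=3)
at_init = add_control(unet_features, control_features, conv)

conv.weight[0][0] = 0.1  # stand-in for the effect of one training update
after_update = add_control(unet_features, control_features, conv)
```

At initialization `at_init` equals `unet_features`, so the base model's behavior is preserved. The layer can still learn, because the gradient with respect to its weights depends on the control features rather than on the weights themselves.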
**IP-Adapter** is a similar idea for images: it adds image-based conditioning via a separate image encoder and cross-attention. This enables "redraw this face/object in a different style" without fine-tuning the base model.
What are zero-convolutions in the ControlNet architecture?
img2img and Inpainting
**img2img** starts generation from a noised version of a real image rather than from pure Gaussian noise. Instead of starting at x_T ~ N(0,I), the process starts at x_{t_start} = noised input image at step t_start. Low t_start (little noise) keeps the image nearly unchanged; high t_start produces strong transformation.
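How much of the input survives depends on where along the noise schedule the process starts. The sketch below assumes the standard DDPM forward step x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps and a linear beta schedule from 1e-4 to 0.02 over 1000 steps; the schedule values, function names, and toy image are assumptions for illustration.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_to_step(image, t_start, alpha_bars, rng):
    """Forward-noise a clean image (flat list of floats) to step t_start."""
    signal = math.sqrt(alpha_bars[t_start])
    noise = math.sqrt(1.0 - alpha_bars[t_start])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in image]


alpha_bars = alpha_bar_schedule()
rng = random.Random(0)
image = [0.8, -0.2, 0.5, 0.1]  # toy "image" with made-up values

lightly_noised = noise_to_step(image, 100, alpha_bars, rng)  # input still dominates
heavily_noised = noise_to_step(image, 900, alpha_bars, rng)  # noise dominates

# Weight of the original image in the noised sample at each starting step.
signal_low = math.sqrt(alpha_bars[100])
signal_high = math.sqrt(alpha_bars[900])
```

Denoising then proceeds from `t_start` down to 0 as usual. Under this schedule the input's weight is above 0.9 at step 100 and below 0.05 at step 900, which is why a late start gives the model far more freedom to change the image.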
**Inpainting** regenerates only masked regions while preserving the rest. A mask specifies which pixels should be replaced. During denoising, unmasked pixels are taken from the original; masked pixels are generated by the model conditioned on the surrounding context.
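The per-step blending described above reduces to a masked mix of two arrays. A minimal sketch, assuming a binary mask where 1 marks pixels to regenerate and flat lists of floats as toy images; in an actual sampler the original would typically be noised to the current step's noise level before blending, which is omitted here.

```python
def blend_with_mask(generated, original, mask):
    """Keep original values where mask is 0, generated values where mask is 1."""
    return [m * g + (1 - m) * o for g, o, m in zip(generated, original, mask)]


# Toy 1D "image" with made-up values; 1 marks pixels to regenerate.
original = [0.2, 0.4, 0.6, 0.8]
generated = [0.9, 0.1, 0.3, 0.7]
mask = [0, 1, 1, 0]

result = blend_with_mask(generated, original, mask)
print(result)  # [0.2, 0.1, 0.3, 0.8]
```

Applying this after every denoising step keeps the unmasked region pinned to the original while the model fills the masked region consistently with its surroundings.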
**Outpainting** extends the image beyond its borders. Technically it is the same as inpainting: the original is placed in the center of a larger canvas, the surrounding area is masked, and the model generates continuation conditioned on the existing content.
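Preparing outpainting inputs amounts to building a larger canvas and the matching mask. A minimal sketch, assuming a 2D grayscale image stored as nested lists and equal padding on every side; the function name and fill value are illustrative choices.

```python
def make_outpaint_inputs(image, pad, fill=0.0):
    """Center a 2D image on a larger canvas and build the matching mask.

    Returns (canvas, mask). mask is 1 where the model should generate
    new content (the border) and 0 where the original image sits.
    """
    height, width = len(image), len(image[0])
    big_h, big_w = height + 2 * pad, width + 2 * pad
    canvas = [[fill] * big_w for _ in range(big_h)]
    mask = [[1] * big_w for _ in range(big_h)]
    for row in range(height):
        for col in range(width):
            canvas[row + pad][col + pad] = image[row][col]
            mask[row + pad][col + pad] = 0
    return canvas, mask


# Toy 2x2 grayscale image with made-up values, extended by 1 pixel per side.
image = [[0.1, 0.2],
         [0.3, 0.4]]
canvas, mask = make_outpaint_inputs(image, pad=1)
```

The resulting `canvas` and `mask` can be handed to the same inpainting procedure, which is why the two tasks share one implementation.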
What does the parameter strength=0.3 mean in img2img?
Summary: Stable Diffusion and DALL-E
- Latent Diffusion: VAE compresses images 48x; diffusion runs in the smaller latent space
- CLIP encodes text into embeddings; the U-Net accesses them through cross-attention
- CFG (Classifier-Free Guidance): double forward pass to strengthen text adherence
- ControlNet: structural control (edges, depth, pose) via a trainable copy of the U-Net encoder connected through zero-convolutions
- img2img: denoising starts from a noised real image rather than pure noise
- Inpainting / outpainting: regenerate masked regions conditioned on surrounding context
Related Topics
Stable Diffusion combines several architectures - understanding each gives the complete picture.
- Diffusion Models: Theory — Mathematical foundation of Stable Diffusion
- VAE: Variational Autoencoders — The latent space where diffusion operates
- CLIP and Multimodal Models — Text encoder for conditioning
Questions for Reflection
- Stability AI released weights openly; OpenAI keeps DALL-E closed. What are the trade-offs of each approach?
- How could ControlNet be applied to domains other than image generation?
- What tasks remain difficult for diffusion models even with all the control mechanisms covered here?
Related Lessons
- gai-09 — Latent diffusion applies the DDPM theory directly
- gai-11 — Base pipeline that advanced generation extends
- aie-26-image-generation — Production image generation built on these systems
- cv-13 — CLIP conditioning links vision and language representations
- ml-32-autoencoders — The VAE compresses images into the latent space
- dl-01