Generative AI

Image Generation: Advanced

In 2023, SDXL-Turbo generated a high-quality image in about 150 ms - roughly the time it takes to blink. A year later, Sora produced one-minute videos. Within two years, diffusion sampling went from hundreds of denoising steps to 1-4, a speedup on the order of 300x.

  • **Consistency Models in mobile apps:** Lensa and similar apps use 4-step generation directly on a smartphone without a server call
  • **Stable Video Diffusion in advertising:** generating product showcase videos from a single product photo - agencies replace video shoots with AI generation
  • **Shap-E in e-commerce:** automatic 3D model generation for AR try-on without 3D artists

The Race for Speed and Control

By 2023 diffusion had moved from research curiosity to industry. In February 2023 Lvmin Zhang, Anyi Rao, and Maneesh Agrawala introduced ControlNet, adding structural control to Stable Diffusion through depth maps, edges, and pose. In the summer of 2023 Stability AI released SDXL, a model with markedly higher quality and resolution. OpenAI answered with DALL-E 3, tightly integrating generation into ChatGPT. In parallel, distillation (Latent Consistency Models, SDXL-Turbo) cut the step count from dozens to between one and four, bringing generation close to real time.

Prerequisites

  • Basic diffusion theory and DDPM
  • Stable Diffusion: latent diffusion and conditioning
  • ControlNet and img2img at a conceptual level
  • Diffusion Models: Theory
  • Stable Diffusion and DALL-E

Consistency Models

Diffusion models generate images in 50-1000 denoising steps - beautiful but slow. Consistency Models attack this directly: a single function f(x_t, t) = x_0 must return the same clean image x_0 for every point x_t on one denoising trajectory. This self-consistency property allows generating an image in 1-4 steps instead of hundreds.
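The self-consistency property is easiest to see on a toy problem where f is known in closed form. The sketch below assumes 1-D Gaussian data and the noising rule x_t = x_0 + t * noise; the constants SIGMA_DATA, T_MIN and T_MAX and all function names are illustrative choices, not part of any library. A real consistency model would learn f with a neural network instead of using a formula.

```python
import math

SIGMA_DATA = 0.5             # std of the toy 1-D Gaussian "dataset" (assumption)
T_MIN, T_MAX = 0.002, 80.0   # smallest and largest noise level (illustrative)

def ode_velocity(x, t):
    """dx/dt of the probability-flow ODE when x_t = x_0 + t * noise."""
    # For Gaussian data the score is -x / (SIGMA_DATA^2 + t^2), so
    # dx/dt = -t * score = t * x / (SIGMA_DATA^2 + t^2).
    return t * x / (SIGMA_DATA ** 2 + t ** 2)

def consistency_fn(x, t):
    """Closed-form f(x_t, t): jump from any trajectory point to its start."""
    return x * math.sqrt((SIGMA_DATA ** 2 + T_MIN ** 2) / (SIGMA_DATA ** 2 + t ** 2))

def euler_solve(x, t_start, t_end, steps):
    """Multi-step baseline: Euler integration on a log-spaced time grid."""
    ratio = (t_end / t_start) ** (1.0 / steps)
    t = t_start
    for _ in range(steps):
        t_next = t * ratio
        x += (t_next - t) * ode_velocity(x, t)
        t = t_next
    return x

x_T = 40.0                                      # a noisy sample at t = T_MAX
x_mid = euler_solve(x_T, T_MAX, 1.0, 500)       # same trajectory, reached at t = 1.0
one_step_from_T = consistency_fn(x_T, T_MAX)    # one function call
one_step_from_mid = consistency_fn(x_mid, 1.0)  # one call from a different point
many_steps = euler_solve(x_T, T_MAX, T_MIN, 1000)
print(one_step_from_T, one_step_from_mid, many_steps)
```

All three printed values should agree up to the Euler discretization error: two different points on one trajectory map to the same output, and a single call to f replaces the 1000-step solve.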

Consistency Training trains the model directly from scratch. Consistency Distillation uses an existing diffusion model as a teacher. Distillation is faster and more stable - it is what most production systems use.

What key property of Consistency Models enables single-step generation?

Turbo Distillation

SDXL-Turbo uses Adversarial Diffusion Distillation (ADD) - a hybrid of distillation and GAN training; SD3 Turbo uses its latent-space successor, Latent Adversarial Diffusion Distillation (LADD). The student optimizes two objectives at once: it minimizes a distillation loss against the diffusion teacher and tries to fool a discriminator. The result is a model that samples in 1-4 steps, and at 4 steps its quality is reported as on par with the 50-step teacher. Latency drops from roughly 3 seconds to roughly 150 ms on a single A100.
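The two-part student objective can be written down in a few lines. This is a minimal sketch under stated assumptions: images are flat lists of floats, the discriminator score is passed in as a number, and the function name add_student_loss and the default distill_weight are hypothetical, illustrative choices rather than values from the ADD training code.

```python
def add_student_loss(student_img, teacher_img, disc_score, distill_weight=1.0):
    """Combined student objective: adversarial term plus weighted distillation term."""
    # Adversarial term: the student wants a high discriminator score,
    # so it minimizes the negated score (hinge-style generator loss).
    adversarial = -disc_score
    # Distillation term: mean squared error to the teacher's denoised image.
    distill = sum((s - t) ** 2 for s, t in zip(student_img, teacher_img)) / len(student_img)
    return adversarial + distill_weight * distill

# Student matches the teacher exactly: only the adversarial term remains.
loss_match = add_student_loss([1.0, 2.0], [1.0, 2.0], disc_score=0.3)
# Student is far from the teacher: the distillation term dominates.
loss_far = add_student_loss([0.0, 0.0], [1.0, 1.0], disc_score=0.0, distill_weight=2.0)
print(loss_match, loss_far)
```

The weight trades sharpness (pushed by the discriminator) against fidelity to the teacher; in practice it is a tuned hyperparameter.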

SDXL-Turbo is built for 1-4 sampling steps and produces acceptable quality even at 1 step. LCM-LoRA (Latent Consistency Model LoRA) packages the speedup as a LoRA adapter, so it can be applied to any fine-tuned SD variant without full retraining.
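Part of the speedup comes from dropping classifier-free guidance. A regular pipeline calls the model twice per step (with and without the prompt) and extrapolates between the two predictions; SDXL-Turbo was trained without guidance, which is why its usage instructions set guidance_scale to 0.0. The sketch below is a toy: predict_noise and toy_model are hypothetical names, and the rule "a scale of 1.0 or less disables guidance" is an assumption modeled on how common pipelines treat the parameter.

```python
def predict_noise(model, latent, prompt_emb, empty_emb, guidance_scale):
    """Return (noise_prediction, number_of_model_calls) for one denoising step."""
    cond = model(latent, prompt_emb)
    if guidance_scale <= 1.0:
        # Guidance disabled: a single conditional forward pass.
        return cond, 1
    # Classifier-free guidance: a second, unconditional pass, then extrapolation.
    uncond = model(latent, empty_emb)
    return uncond + guidance_scale * (cond - uncond), 2

def toy_model(latent, emb):
    # Hypothetical stand-in for a UNet: any deterministic function works here.
    return 0.1 * latent + emb

turbo_pred, turbo_calls = predict_noise(toy_model, 1.0, 2.0, 0.0, guidance_scale=0.0)
cfg_pred, cfg_calls = predict_noise(toy_model, 1.0, 2.0, 0.0, guidance_scale=7.5)
print(turbo_calls, cfg_calls)
```

With guidance off, each step costs one model call instead of two, on top of the reduction in step count.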

Why must guidance_scale be set to 0.0 when using SDXL-Turbo?

Video Generation

Video generation is image generation plus temporal consistency. Sora, Stable Video Diffusion, and Runway Gen-3 take different architectural routes: a DiT (Diffusion Transformer) with 3D attention over space and time, or 2D diffusion extended with temporal attention layers. The main challenge is memory: for 16 frames at 1024x576 in float16, intermediate activations alone take roughly 3 GB.
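A back-of-the-envelope calculation shows where the memory goes. The sketch does not derive the 3 GB figure; it only compares the sizes of individual tensors. The 8x VAE downsampling, 4 latent channels and 320 feature channels are Stable-Diffusion-style assumptions, and the attention sizes are for naive attention that stores the full score matrix for one head.

```python
BYTES_FP16 = 2

def tensor_bytes(*dims, bytes_per_elem=BYTES_FP16):
    """Size in bytes of a dense tensor with the given dimensions."""
    total = bytes_per_elem
    for d in dims:
        total *= d
    return total

frames, height, width = 16, 576, 1024
raw_rgb = tensor_bytes(frames, 3, height, width)

# Assumed latent layout: 8x spatial downsampling, 4 channels.
lat_h, lat_w = height // 8, width // 8
latents = tensor_bytes(frames, 4, lat_h, lat_w)
# One intermediate feature map with an assumed 320 channels.
feature_map = tensor_bytes(frames, 320, lat_h, lat_w)

# Full 3D attention: every latent position attends to every other one.
tokens = frames * lat_h * lat_w
full_attention = tensor_bytes(tokens, tokens)
# Factorized: spatial attention inside each frame, temporal attention per position.
spatial_attention = tensor_bytes(frames, lat_h * lat_w, lat_h * lat_w)
temporal_attention = tensor_bytes(lat_h * lat_w, frames, frames)

for name, size in [("raw RGB frames", raw_rgb), ("latents", latents),
                   ("one feature map", feature_map),
                   ("full 3D attention", full_attention),
                   ("spatial attention", spatial_attention),
                   ("temporal attention", temporal_attention)]:
    print(f"{name:>20}: {size / 1e6:,.1f} MB")
```

Under these assumptions the frames and latents are small, a few dozen feature maps already reach gigabytes, and naive full 3D attention is far larger than the factorized spatial-plus-temporal variant, which is one reason both architectural routes exist.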

Stable Video Diffusion generates 14 frames (SVD) or 25 frames (SVD-XT) from a single image. CogVideoX and Open-Sora are open-source alternatives with public weights. Sora from OpenAI uses spacetime patches - the 3D analogue of image patches, spanning both space and time.
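Spacetime patching can be illustrated on a tiny single-channel video. This is a sketch only: Sora's actual patch sizes and latent layout are not public, so the 2x2x2 patch shape and the function name spacetime_patches are assumptions made for illustration.

```python
def spacetime_patches(video, pt, ph, pw):
    """Split video[t][y][x] into flattened (pt x ph x pw) spacetime patches."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide evenly"
    patches = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                # Each patch spans pt frames and a ph x pw window in every one of them.
                patch = [video[t][y][x]
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)]
                patches.append(patch)
    return patches

# Toy "video": 4 frames of 4x4 pixels; each value encodes its (t, y, x) position.
video = [[[100 * t + 10 * y + x for x in range(4)] for y in range(4)] for t in range(4)]
tokens = spacetime_patches(video, pt=2, ph=2, pw=2)
print(len(tokens), tokens[0])  # 8 [0, 1, 10, 11, 100, 101, 110, 111]
```

Each flattened patch becomes one token for the transformer, so a token carries a small window of pixels from several consecutive frames rather than from a single image.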

What are spacetime patches in Sora's architecture?

3D Synthesis

3D generation is being reshaped by Neural Radiance Fields (NeRF) and 3D Gaussian Splatting. DreamFusion trains a NeRF via Score Distillation Sampling (SDS) - the diffusion model acts as a critic scoring renders from different viewpoints. Shap-E and Point-E from OpenAI generate 3D assets directly, in seconds to a couple of minutes on a single GPU. Gaussian Splatting renders at least an order of magnitude faster than NeRF at comparable quality.

Score Distillation Sampling is the key trick: instead of generating a 3D object directly, a diffusion model is used as a source of gradients. The NeRF is optimized so that renders from any angle look plausible according to the 2D diffusion model.
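The gradient-source idea fits in a toy 1-D example. The sketch makes strong simplifications, all of them assumptions: the "3D scene" is a single number theta, the renderer is the identity, the diffusion prior is a Gaussian N(MU, SIGMA^2) whose optimal noise predictor is known exactly, and the weighting w(t) is set to 1. DreamFusion itself uses a NeRF renderer and a text-conditioned image diffusion model.

```python
import math
import random

MU, SIGMA = 1.0, 0.1   # toy "diffusion prior": data ~ N(MU, SIGMA^2) (assumption)

def optimal_noise_pred(x_t, alpha_bar):
    """Exact noise predictor for Gaussian data, x_t = sqrt(a) * x + sqrt(1 - a) * eps."""
    var_t = alpha_bar * SIGMA ** 2 + (1.0 - alpha_bar)    # variance of x_t
    score = -(x_t - math.sqrt(alpha_bar) * MU) / var_t    # d/dx log p_t(x_t)
    return -math.sqrt(1.0 - alpha_bar) * score

def sds_optimize(theta, steps=2000, lr=0.05, seed=0):
    """Gradient descent on theta using only noise-prediction residuals."""
    rng = random.Random(seed)
    for _ in range(steps):
        alpha_bar = rng.uniform(0.02, 0.98)   # random noise level
        eps = rng.gauss(0.0, 1.0)
        x = theta                             # "render": identity, so dx/dtheta = 1
        x_t = math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * eps
        # SDS gradient: w(t) * (eps_hat - eps) * dx/dtheta, with w(t) = 1 here.
        grad = (optimal_noise_pred(x_t, alpha_bar) - eps) * 1.0
        theta -= lr * grad
    return theta

theta_final = sds_optimize(theta=3.0)
print(theta_final)
```

The parameter starts at 3.0 and should settle near MU, the mode of the prior, even though the diffusion model never generates a sample: it is only asked how the noised render should change.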

Misconception: 3D Gaussian Splatting is just 'fast NeRF'

Gaussian Splatting is a fundamentally different representation: the scene is defined by millions of Gaussians with position, shape, color and opacity - not by a neural network mapping coordinates to density and color.

NeRF is an implicit representation (a neural network); Gaussian Splatting is explicit (parametric primitives). Splatting rasterizes Gaussians directly on the GPU without ray marching, achieving 100+ FPS rendering vs. seconds per frame for NeRF.
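The explicit representation can be sketched as plain data plus a compositing loop. This is a simplified illustration, not the reference rasterizer: the Splat class and function names are hypothetical, splats are assumed to be already projected to screen space, they are isotropic (real splats carry a full covariance), and color is a single grey channel.

```python
import math
from dataclasses import dataclass

@dataclass
class Splat:
    x: float        # screen-space center
    y: float
    radius: float   # isotropic std in pixels (simplification)
    color: float    # single grey channel for brevity
    opacity: float
    depth: float

def splat_alpha(s, px, py):
    """Opacity contributed by splat s at pixel (px, py): Gaussian falloff."""
    d2 = (px - s.x) ** 2 + (py - s.y) ** 2
    return s.opacity * math.exp(-0.5 * d2 / s.radius ** 2)

def render_pixel(splats, px, py):
    """Front-to-back alpha compositing of depth-sorted splats at one pixel."""
    color, transmittance = 0.0, 1.0
    for s in sorted(splats, key=lambda s: s.depth):
        a = splat_alpha(s, px, py)
        color += transmittance * a * s.color
        transmittance *= 1.0 - a
    return color, transmittance

scene = [
    Splat(x=0.0, y=0.0, radius=1.0, color=0.0, opacity=0.5, depth=2.0),  # far, black
    Splat(x=0.0, y=0.0, radius=1.0, color=1.0, opacity=0.5, depth=1.0),  # near, white
]
print(render_pixel(scene, 0.0, 0.0))  # (0.5, 0.25)
```

No network is queried anywhere: rendering is a sort and a weighted sum over stored primitives, which is what makes rasterization on the GPU fast.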

What is Score Distillation Sampling in the context of 3D generation?

Key Ideas

  • **Consistency Models:** self-consistency f(x_t, t) = x_0 enables generation in 1-4 steps instead of 50-1000
  • **ADD distillation:** SDXL-Turbo combines distillation and GAN training - 1-4 steps at quality close to full SDXL
  • **Video and 3D:** DiT with 3D attention for video, SDS + NeRF/Gaussian Splatting for 3D - diffusion has moved beyond static images

Related Topics

Advanced generation techniques build on the basics of diffusion models:

  • Diffusion Models — Consistency Models and Turbo are acceleration methods for classical diffusion via distillation
  • ControlNet and Fine-tuning — LCM-LoRA applies the speedup to fine-tuned models without full retraining

Questions for Reflection

  • Consistency Distillation requires an existing diffusion model as a teacher. What limitations does this impose - can a CD model surpass its teacher in quality?
  • Sora generates videos up to one minute long. What technical barriers prevent generating full-length films - and are they fundamental or temporary?
  • Score Distillation Sampling uses a 2D diffusion model to optimize 3D. What artifacts arise from the model having seen only 2D images rather than real 3D objects?

Related Lessons

  • gai-10 — Builds directly on Stable Diffusion pipelines
  • gai-14 — Image techniques generalize to video generation
  • aie-27-video-audio-generation — Production view of advanced video and audio synthesis
  • cv-16 — 3D and video synthesis sit in the vision domain
  • gai-19 — Turbo distillation is inference-speed optimization
  • dl-01