Generative AI

Video Generation: Sora and Alternatives

On 15 February 2024, Twitter exploded with reactions: OpenAI showed a one-minute video of a Japanese woman walking through neon-lit Tokyo streets. Shadows move correctly, reflections in puddles stay consistent, and the texture of her jacket is preserved as her head turns. Experts called it 'the GPT moment for video'. Over the following year or so Google answered with Veo 3, Runway with Gen-3, and Pika and Luma shipped their own analogs. The breakthrough came less from training a bigger model than from spacetime patches - the idea of processing video holistically rather than frame-by-frame.

  • **OpenAI Sora (2024)** - first model to generate up to one minute of 1080p video with realistic physics; internally a Diffusion Transformer on spacetime patches
  • **Runway Gen-3 / Pika 2.0** - production video in marketing; ads are fully AI-generated without a film crew
  • **Google Veo 3 (2025)** - 2-minute clips with camera and style control, integrated into YouTube Studio for creators

The Road to Sora

Video generation long lagged behind images because of the need for temporal consistency. In 2022 Meta introduced Make-A-Video and Google introduced Imagen Video, showing the first diffusion-based text-to-video results. In 2023 Runway released Gen-2, making short video generation available to a broad audience. The culmination came in February 2024, when OpenAI announced Sora, a model that generates one minute of consistent 1080p video with realistic physics through a Diffusion Transformer over spacetime patches. The moment was called the "GPT moment for video", and within a year Google (Veo), Runway (Gen-3), Pika, and Luma shipped their own alternatives.

Prerequisites

  • Diffusion theory and the Diffusion Transformer
  • Stable Diffusion: latent diffusion and conditioning
  • Attention and patches (ViT) at a conceptual level
  • Diffusion Models: Theory
  • Stable Diffusion and DALL-E

Video Diffusion: 3D U-Net and Spacetime Patches

February 2024: OpenAI publishes Sora clips - a minute of video with realistic physics, shadows, and camera motion. Before that, the best results were capped at about four seconds of 720p. What changed? Sora stopped treating video as a sequence of independent frames and started processing spacetime patches - small cubes of the video that span one temporal axis and two spatial axes, for example 4x16x16 (frames x height x width). The model sees the video as a whole, not frame-by-frame.

Architecture: Diffusion Transformer (DiT), but the patches are three-dimensional. The input is a tensor of shape [T, H, W, C], sliced into cubes, each turned into a token. Self-attention runs over all tokens simultaneously - spatial and temporal alike. Training uses the standard diffusion process: add noise to latents, predict it back. Data volume: millions of hours of video. Compute: thousands of GPU-days.
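The slicing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Sora's actual code: the function name `patchify`, the toy clip size, and the use of raw values instead of learned latents are all assumptions, and the linear projection that would turn each flattened cube into a token embedding is omitted.

```python
import numpy as np

def patchify(video, pt=4, ph=16, pw=16):
    """Slice a [T, H, W, C] video into flattened spacetime patches (one row per cube)."""
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    nt, nh, nw = T // pt, H // ph, W // pw
    # Split every axis into (number of patches, patch size).
    x = video.reshape(nt, pt, nh, ph, nw, pw, C)
    # Move the three patch-index axes to the front and the patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten: [num_tokens, pt * ph * pw * C].
    return x.reshape(nt * nh * nw, pt * ph * pw * C)

# Toy clip (hypothetical size): 16 frames of 64x64 RGB.
video = np.random.default_rng(0).standard_normal((16, 64, 64, 3)).astype(np.float32)
tokens = patchify(video)
print(tokens.shape)  # (64, 3072): 4*4*4 cubes, each holding 4*16*16*3 values
```

Every row mixes values from four consecutive frames, which is why self-attention over these tokens is spatial and temporal at the same time.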

How do spacetime patches differ from per-frame video processing?

Temporal Consistency

Older text-to-video models (CogVideo, Make-A-Video, both 2022) generated clips of about 4 seconds, but a ball would randomly change color between frames and a face would morph smoothly into a different face. This is called **temporal flickering** - lack of consistency across time. Today's Sora, Veo, and Gen-3 solve the problem with three tricks: 3D-attention (the model sees every frame at once), a long context window (tens of seconds of video together), and special training data with physics annotations.

Evaluation metrics: FVD (Fréchet Video Distance) is the video analog of FID; it measures distributional similarity between generated and real clips, so lower is better. CLIPSIM is the CLIP similarity between each frame and the prompt, averaged over frames. Subject Consistency is the cosine similarity of DINO embeddings of the same object across frames: the more stable the embedding, the less flickering.
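To make the Subject Consistency idea concrete, here is a small sketch in plain Python. It assumes the per-frame embeddings are already available (in practice they would come from a DINO encoder, which is not shown), and it assumes the score is the mean cosine similarity between consecutive frames; benchmarks may aggregate differently. The two-dimensional vectors are toy stand-ins, not real features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def subject_consistency(frame_embeddings):
    """Mean cosine similarity between embeddings of consecutive frames."""
    pairs = zip(frame_embeddings, frame_embeddings[1:])
    sims = [cosine(u, v) for u, v in pairs]
    return sum(sims) / len(sims)

# Toy embeddings (hypothetical): one drifts slowly, the other flips every frame.
stable = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1]]
flicker = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

print(subject_consistency(stable) > subject_consistency(flicker))  # True
```

A clip whose subject keeps its appearance scores close to 1, while a subject that changes identity between frames pulls the score down.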

Why does 3D-attention handle flickering better than per-frame generation conditioned on the previous frame?

Long Videos and Hierarchical Generation

Sora showed one minute of video. Naively scaling the window to ten minutes is impractical: attention cost grows quadratically with the number of tokens, so 10x the duration means roughly 100x the attention compute. The solution is **hierarchical generation**: first a low-resolution storyboard is produced (10 keyframes per minute), then the full video is filled in by interpolating between each pair of adjacent keyframes. This is how Veo 3 and Sora-2 build clips up to 5 minutes with a stable plot.
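The quadratic argument can be checked with back-of-the-envelope arithmetic. The sketch below counts only the token pairs scored by attention and ignores everything else (MLP layers, number of diffusion steps). The constants `TOKENS_PER_FRAME` and `LATENT_FPS` are hypothetical values chosen for illustration, and the hierarchical cost model (one storyboard pass plus one short window per segment) is a simplifying assumption, not a description of any specific product.

```python
TOKENS_PER_FRAME = 256      # hypothetical: tokens in one latent frame
LATENT_FPS = 6              # hypothetical: latent frames per second after temporal patching
KEYFRAMES_PER_MINUTE = 10   # storyboard density mentioned in the text

def full_attention_cost(seconds):
    """Token pairs scored when a single attention window covers the whole clip."""
    tokens = seconds * LATENT_FPS * TOKENS_PER_FRAME
    return tokens * tokens

def hierarchical_cost(seconds):
    """Storyboard pass over the keyframes plus one short window per segment."""
    keyframes = seconds * KEYFRAMES_PER_MINUTE // 60
    segment_seconds = 60 // KEYFRAMES_PER_MINUTE
    storyboard = (keyframes * TOKENS_PER_FRAME) ** 2
    segments = keyframes * full_attention_cost(segment_seconds)
    return storyboard + segments

# 10x the duration costs 100x the attention pairs with one global window.
print(full_attention_cost(600) // full_attention_cost(60))  # 100

# Under these assumptions the hierarchical plan is far cheaper for the same 10 minutes.
print(hierarchical_cost(600) < full_attention_cost(600))  # True
```

The per-segment term grows linearly with duration, which is what makes the hierarchical plan tractable for long clips.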

Plot vs physics. Short-term consistency (1-4 sec) is handled by local 3D-attention. Long-term consistency (1-5 min, e.g. 'a character walks left, then right, then left again') requires an explicit storyboard plus memory tokens. Modern models are trained on (long_video, storyboard_keyframes) pairs, which forces them to plan structure first and synthesize intermediate frames afterwards.
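The "plan first, fill in afterwards" structure can be sketched as a scheduling problem. The helper names `keyframe_times` and `segments` are hypothetical, the even spacing and the inclusion of both endpoints are assumptions, and the generative models that would produce the keyframes and the in-between frames are deliberately left out.

```python
def keyframe_times(duration_s, keyframes_per_minute=10):
    """Evenly spaced storyboard timestamps in seconds, including both endpoints."""
    step = 60.0 / keyframes_per_minute
    count = int(round(duration_s / step))
    return [i * step for i in range(count + 1)]

def segments(times):
    """Pairs of adjacent keyframes; each pair can be interpolated independently."""
    return list(zip(times, times[1:]))

times = keyframe_times(300)   # storyboard for a 5-minute clip
jobs = segments(times)        # interpolation jobs between neighbouring keyframes

print(len(times), len(jobs), jobs[0])  # 51 50 (0.0, 6.0)
```

Each job only needs its two bounding keyframes, so long-range structure lives in the storyboard while short-range motion is handled inside a six-second window.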

Why does the naive 'just enlarge the context window for long video' approach fail?

Editing: Inpainting, Outpainting, and Camera Control

Sora and Runway Gen-3 support not just from-scratch generation but also editing: video-to-video editing (recolor the hero's jacket to red across all frames), outpainting (extend the frame upward and downward), motion brush (set an object's motion direction). Technically this is inpainting in spacetime: mask the region that must change, run diffusion only inside that mask, and leave the rest untouched.
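The masking rule at the heart of spacetime inpainting can be sketched with NumPy. This is a simplified illustration under stated assumptions: `masked_update` is a hypothetical helper, the random `proposal` stands in for the output of one denoising step, and a real latent-diffusion pipeline would also re-noise the untouched region to the current noise level at every step.

```python
import numpy as np

def masked_update(current, proposal, mask):
    """Accept the model's proposal only where mask == 1; keep everything else."""
    return mask * proposal + (1.0 - mask) * current

rng = np.random.default_rng(0)
video = rng.random((8, 32, 32, 3))      # toy source clip, [T, H, W, C]

# Spacetime mask: edit a 16x16 region in frames 2..5; the last axis broadcasts over channels.
mask = np.zeros((8, 32, 32, 1))
mask[2:6, 8:24, 8:24] = 1.0

# Stand-in for one denoising step (a real model would produce this tensor).
proposal = rng.random(video.shape)

edited = masked_update(video, proposal, mask)
print(np.array_equal(edited[0], video[0]))  # True: frames outside the mask are untouched
```

Because the mask has a time axis as well as spatial axes, the same rule covers recoloring an object across frames, outpainting a border, or editing only part of a clip.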

Camera control via ControlNet-style adapters: the input is a real video recording with a specific camera move (pan, dolly-in), and the model applies that motion to the generated scene. This is part of the VFX pipeline in cinema - a director can ship a reference 'camera move from a Bond film' and apply the same motion to a synthetic world.

Misconception: video diffusion is just image diffusion applied to each frame independently.

Reality: video diffusion requires joint modeling of space and time - 3D-attention, spacetime patches, and hierarchy for long clips. A per-frame approach causes flickering and unusable quality.

Why: per-frame processing loses temporal links, so an object can change color, shape, or position between frames. Spacetime modeling is the fundamental difference between modern video generation and the naive approaches of 2022.

Why is video inpainting harder than image inpainting?

Key Ideas

  • **Spacetime patches** - 4x16x16 cubes from the video tensor that bundle pixels in time and space; the foundation of Sora's architecture
  • **3D-attention** - the model sees the full temporal context simultaneously, removing flickering and ensuring consistency
  • **Hierarchical generation** - storyboard keyframes plus interpolation; sidesteps the quadratic blowup of attention for long video
  • **Spacetime inpainting/outpainting** - enables editing without losing temporal consistency; the basis of production VFX pipelines

Related Topics

Back to motivation: video generation is not a standalone phenomenon but a continuation of image diffusion ideas. Links to earlier lessons:

  • Music and Audio Generation — Similar temporal-consistency problem, but in 1D audio; video spacetime patches generalize audio spectrogram patches
  • Diffusion Models — Video diffusion extends base diffusion to 3D data; the principles of the noise process remain the same
  • Vision Transformer (ViT) — The patch-embedding idea from ViT generalizes directly to spacetime patches; the DiT architecture inherits this approach

Questions for Reflection

  • If spacetime patches solve flickering thanks to 3D-attention, why can the model not simply keep expanding the temporal window and generate hour-long videos?
  • Which industries (film, advertising, gamedev, education) will change first once 5-minute AI videos become routine, and why?
  • Back to Sora: the Japanese-woman clip looks realistic, yet on close inspection there are artifacts in the shadows. Which limitations of spacetime diffusion does that reveal?

Related Lessons

  • gai-09 — Video diffusion extends image diffusion over time
  • gai-13 — Audio generation is a component of the video pipeline
  • gai-11 — Advanced image models are the foundation of video generation
  • gai-19 — Inference optimization is critical for video due to size
  • ds-01-intro — Temporal consistency mirrors consistency in distributed systems
  • nlp-10 — Temporal attention is the same mechanism as sequence attention in NLP
  • dl-01