AI Engineering

The Future: World Models - AI That Understands Physics, Causality, and Time

Lesson Objectives

Understand the limitations of language models in understanding the physical world
Grasp the approaches to video prediction: Sora (generative) vs Genie (interactive)
Learn the idea behind JEPA - prediction in representation space rather than pixel space
Distinguish correlation from causation and understand why it's critical for AI

Genie 2 (DeepMind, December 2024) takes a single photograph - any photograph - and within seconds unfolds it into an interactive 3D world: press "forward" and the camera moves, parallax is correct, press "jump" and gravity kicks in. No physics engine written by hand. The model learned the laws of the world from video alone. This is no longer a language model - it is a simulator of reality.

Genie 2 (DeepMind): one photograph becomes an interactive world - keypresses, physics, parallax. A potential replacement for game engines without writing a single line of physics code
Sora (OpenAI): 60-second photorealistic video from a text prompt via Diffusion Transformer - a revolution in advertising and filmmaking
Tesla FSD: the neural network drives via an internal world model - predicting pedestrian and vehicle behavior on a ~3-second horizon
V-JEPA (Meta): state-of-the-art action recognition without a single text annotation - learned purely from observation of video sequences

The world-models research line

In March 2018 David Ha and Jürgen Schmidhuber published "World Models", showing an agent could learn a compressed internal model of its environment (a VAE plus a recurrent network) and then plan inside that imagined world. In 2022 Yann LeCun argued in "A Path Towards Autonomous Machine Intelligence" that prediction should happen in abstract latent space rather than over raw pixels, introducing JEPA, later realized as I-JEPA and V-JEPA at Meta. In February 2024 DeepMind's Genie generated playable 2D worlds from unlabeled video, and Genie 2 (December 2024) extended this to interactive 3D scenes. The shared bet: real understanding comes from modeling how the world changes, not just predicting the next token.

Prerequisites

Multimodal AI: Vision, Audio, Documents - One API for Everything

From Language Models to World Models

LLMs are trained on text and understand the world *through language*. But language is a compressed, symbolic description of reality. The sentence "the ball falls to the floor" contains no trajectory, no velocity, no elasticity. GPT-4 knows the right answer from billions of texts - but doesn't compute it. A **world model** is an AI system that builds an internal representation of the physical world and predicts the next state, the way Sora predicts the next video frame or Tesla FSD predicts pedestrian movement 3 seconds ahead.
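To make "predicts the next state" concrete, here is a minimal sketch of a toy world model for the falling-ball example. It is not how Sora or Tesla FSD work: the 1-D setup, the frame interval `DT`, and all function names are assumptions made for illustration. The model never sees the formula for gravity; it estimates the acceleration from observed heights and rolls that estimate forward one step.

```python
# Toy "world model": infer the dynamics of a falling ball from observations,
# then predict the next state. All names and constants are illustrative.

DT = 0.1  # assumed time between observed "frames", in seconds


def observe_fall(h0, steps, g=9.81, dt=DT):
    """Ground-truth heights of a dropped ball; stands in for video frames."""
    return [h0 - 0.5 * g * (i * dt) ** 2 for i in range(steps)]


def fit_acceleration(heights, dt=DT):
    """Estimate a constant acceleration from second finite differences."""
    second_diffs = [
        (heights[i + 1] - 2 * heights[i] + heights[i - 1]) / dt ** 2
        for i in range(1, len(heights) - 1)
    ]
    return sum(second_diffs) / len(second_diffs)


def predict_next(h_prev, h_curr, accel, dt=DT):
    """Next-state prediction under constant acceleration."""
    return 2 * h_curr - h_prev + accel * dt ** 2


frames = observe_fall(h0=10.0, steps=6)        # what the model "watched"
accel = fit_acceleration(frames)               # learned dynamics, close to -9.81
predicted = predict_next(frames[-2], frames[-1], accel)
actual = observe_fall(h0=10.0, steps=7)[-1]    # the frame the model never saw

print(f"learned acceleration: {accel:.2f} m/s^2")
print(f"prediction error: {abs(predicted - actual):.2e} m")
```

A text-only model can state that the ball falls; this sketch computes where it will be, which is the distinction the paragraph above draws.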

Yann LeCun (Chief AI Scientist, Meta) is the leading advocate for world models. His position: **LLMs alone will never achieve human-level AI** because text contains a negligible fraction of information about the world. By his estimate, a child receives on the order of `10^15` bytes of sensory data through vision by age 4 - more than the text corpora used to train the largest LLMs. This is precisely why Meta invests in V-JEPA and multimodal architectures rather than simply scaling LLaMA.
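The `10^15` figure is a back-of-envelope estimate, and the arithmetic is easy to reproduce. In the sketch below, all three inputs (waking hours, number of optic nerve fibers, bytes per fiber per second) are assumptions chosen to show how such an order of magnitude can arise; they are not measurements and do not come from this lesson.

```python
# Back-of-envelope estimate of visual data received by age 4.
# All three inputs are assumptions, used only to show the arithmetic.
WAKING_HOURS_BY_AGE_4 = 16_000     # assumed
OPTIC_NERVE_FIBERS = 2_000_000     # assumed, both eyes combined
BYTES_PER_FIBER_PER_SECOND = 10    # assumed

seconds_awake = WAKING_HOURS_BY_AGE_4 * 3600
visual_bytes = seconds_awake * OPTIC_NERVE_FIBERS * BYTES_PER_FIBER_PER_SECOND

print(f"{visual_bytes:.2e} bytes")  # prints 1.15e+15 bytes
```

Changing the per-fiber rate to 1 byte per second moves the result to roughly `10^14`, so the estimate is only good to about an order of magnitude.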

A large language model trained on all the text ever written will still understand less about the physical world than a house cat.

| Aspect | Language Model | World Model |
| --- | --- | --- |
| Training data | Text (terabytes) | Video, sensory data (petabytes) |
| Physics understanding | Through textual descriptions | Through observation and prediction |
| Causality | Correlations in text | Cause-and-effect relationships |
| Planning | In text space | In action and state space |
| Examples | GPT-4, Claude, Gemini | Sora, Genie, JEPA (research) |

**Counterargument:** proponents of the LLM approach (including Ilya Sutskever, former Chief Scientist at OpenAI) argue that a sufficiently large language model will *inevitably* learn the physics of the world from text - because text describes physics. This debate is one of the central ones in AI research.

What is the fundamental limitation of language models according to Yann LeCun?

Video Prediction: Sora, Genie, and Simulating Reality

Video generation is an intermediate step toward world models. The argument: to predict the next frame well enough, a model needs something like an internal physics model, because objects don't pass through walls, water flows downhill, and shadows follow the light source. This is the core logic behind Sora: train on millions of hours of video and let physics emerge. The open question is whether what emerges is real physics or just very convincing correlations.

**OpenAI Sora** (announced February 2024, public release December 2024) is a generative video model trained on a very large video corpus (often described as millions of hours; OpenAI has not published the figure). The architecture is a Diffusion Transformer (DiT): a diffusion process (as in DALL-E 2 and DALL-E 3) in which the denoising network is a Transformer rather than a U-Net.

**Google Genie** (February 2024) and **Genie 2** (December 2024, DeepMind) take a fundamentally different approach. Sora generates a fixed video - a film to watch. Genie creates **interactive worlds**: a keypress changes the state of the environment, and the model predicts the consequences. In effect, a game engine learned from video. One screenshot of a room - and it becomes possible to walk around it, open a drawer, knock something off a table.
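The difference between the two approaches shows up in the interface. The sketch below is a toy illustration, not either model's real API: `State`, `generate_clip`, `step`, and the hand-written "gravity" rule are all hypothetical. In Genie the transition function is learned from video; here it is hard-coded to show what an action-conditioned step looks like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    x: int  # horizontal position of the agent
    y: int  # height above the floor (0 means standing on the floor)


def generate_clip(start: State, n_frames: int) -> list[State]:
    """Sora-style interface: returns a fixed sequence the viewer cannot steer."""
    frames = [start]
    for _ in range(n_frames - 1):
        prev = frames[-1]
        frames.append(State(prev.x + 1, prev.y))  # scripted drift to the right
    return frames


def step(state: State, action: str) -> State:
    """Genie-style interface: the next state depends on the user's action."""
    x, y = state.x, state.y
    if action == "forward":
        x += 1
    if action == "jump" and y == 0:
        y = 2        # leave the floor
    elif y > 0:
        y -= 1       # toy gravity: fall one unit per step while airborne
    return State(x, y)


# Interactive rollout: the user decides what happens at every step.
state = State(0, 0)
trajectory = [state]
for action in ["forward", "jump", "forward", "forward", "forward"]:
    state = step(state, action)
    trajectory.append(state)

print(trajectory[-1])  # State(x=4, y=0)
```

`generate_clip` takes no input after the first frame, while `step` takes an action at every frame. That action-conditioning is what makes the second interface usable as an environment for a player or an agent.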

**Is Sora a world model?** This is the subject of active debate. Tim Brooks (one of Sora's creators) claims yes - the model learned a "simulator of the physical world." Critics point to systematic physics violations: objects appear from nowhere, liquids behave incorrectly. Most likely, Sora learned powerful visual priors, but not a true physical model.

For AI engineers, video generation opens concrete applications right now. The most valuable: synthetic training data for robotics - instead of thousands of costly real-world experiments with a physical manipulator, the needed scenarios are generated as video. Boston Dynamics and Figure AI already use simulation for pre-training. Other applications: operator training simulations, architectural project previews, fast UI/UX prototyping without live shooting.

What is the fundamental difference between Google Genie and OpenAI Sora?

JEPA: Meta AI's Architecture of the Future

**JEPA (Joint Embedding Predictive Architecture)** is an architecture proposed by Yann LeCun as an alternative to both LLMs and generative models. The key idea: instead of predicting pixels (like Sora) or tokens (like GPT), predict **abstract representations** of future states.

**V-JEPA** (Video Joint Embedding Predictive Architecture, Meta, February 2024) is the first JEPA implementation for video. The model learns to predict embeddings of masked video segments - like BERT with masked tokens, but with both the prediction and the loss in representation space rather than over raw pixels. Result: V-JEPA achieves state-of-the-art results on action recognition with an encoder pretrained without a single text annotation - purely from observation of frame sequences. Labels are used only afterwards, to train a small classifier on top of the frozen encoder.

JEPA solves a problem that plagues generative models: **prediction in pixel space is inefficient**. When Sora predicts the next frame, roughly 90% of compute goes to background, textures, lighting - details that don't affect understanding. The human brain doesn't predict every photon - it builds abstract models of objects and their interactions. V-JEPA does the same: it learns to think in terms of "ball moving down-right" rather than "pixels [123, 45, 67] shift by vector (2, 3)".
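A toy comparison makes the pixel-versus-representation argument tangible. This is a sketch under strong assumptions, not V-JEPA: the "encoder" below is a hand-written function that keeps only the ball's position, whereas in JEPA both the encoder and the predictor are learned networks. The frame layout, `WIDTH`, and the constant-velocity predictor are likewise invented for illustration.

```python
import random

WIDTH = 32  # assumed width of a one-dimensional "frame"


def render(ball_pos, rng):
    """Frame = unpredictable background texture plus one bright ball pixel."""
    frame = [rng.uniform(0.0, 0.2) for _ in range(WIDTH)]
    frame[ball_pos] = 1.0
    return frame


def encode(frame):
    """Stand-in for a learned encoder: keep only the ball position."""
    return max(range(len(frame)), key=frame.__getitem__)


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
velocity = 2
frame_t = render(5, rng)
frame_t1 = render(5 + velocity, rng)

# Pixel-space prediction: every pixel must be guessed, including background
# noise that cannot be predicted and does not matter for understanding.
pixel_guess = [0.1] * WIDTH                       # mean of the background noise
pixel_guess[encode(frame_t) + velocity] = 1.0     # ball placed correctly
pixel_loss = mse(pixel_guess, frame_t1)

# Representation-space prediction: only the abstract state is predicted.
latent_pred = encode(frame_t) + velocity
latent_loss = (latent_pred - encode(frame_t1)) ** 2

print(f"pixel-space loss:  {pixel_loss:.5f}")
print(f"latent-space loss: {latent_loss}")
```

Both predictors place the ball correctly, yet the pixel-space loss stays above zero because of the background, while the latent-space loss is exactly zero. The pixel-space model is penalized for detail it has no way to predict.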

| Approach | Predicts | Pros | Cons |
| --- | --- | --- | --- |
| LLM (GPT) | Next token | Scales well, universal | Text only, no physics |
| Generative (Sora) | Next pixels | Visually realistic | Inefficient, shallow understanding |
| JEPA (V-JEPA) | Abstract representations | Efficient, semantic understanding | Early research stage, doesn't generate content |
| Hybrid (future) | Representations + generation | Best of both worlds | Doesn't exist yet |

**JEPA status (2025):** still a research project, not a product. V-JEPA achieves state-of-the-art on video understanding tasks but cannot generate content. LeCun positions JEPA as a foundation for AGI - a 5-10 year path. Skeptics point out that there are no concrete products built on JEPA yet.

What is the key difference between JEPA and generative models like Sora?

Causal Reasoning: From Correlation to Understanding Causes

Current AI systems - including LLMs and most world models - primarily work with **correlations**: "X often appears together with Y." But true understanding requires **causality**: "X *causes* Y." A model that has only seen that ice cream sales and drownings are correlated cannot, without a causal graph, tell whether one causes the other or whether both follow from a common cause, hot weather. GPT-4 can recite the hot-weather explanation because it appears in its training text, not because it inferred the causal structure. For recommendation systems this produces systematic errors: the model recommends a "treatment" that merely accompanies recovery without causing it.

**Judea Pearl** (Turing Award 2011) formalized the theory of causality in AI through the "Ladder of Causation." Current LLMs - GPT-4, Claude, Gemini - operate at Level 1: **associations** ("What usually happens?"). AGI requires Level 2 - **interventions** ("What happens if this drug is administered?") and Level 3 - **counterfactuals** ("Would the patient have survived without surgery?"). The distinction is critical for medical AI, financial decision-making, and autonomous agents.
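The gap between Level 1 and Level 2 can be shown with a small simulation of the ice cream example. The structural model and every probability below are invented for illustration: hot weather raises both ice cream sales and drownings, and ice cream has no effect on drowning. Level 1 conditions on what was observed; Level 2 forces the variable to a value (Pearl's `do` operator) and watches what changes.

```python
import random


def sample_day(rng, do_ice_cream=None):
    """One day from a toy causal model: hot -> ice_cream, hot -> drowning."""
    hot = rng.random() < 0.5
    if do_ice_cream is None:
        # Observational regime: sales depend on the weather.
        ice_cream = rng.random() < (0.8 if hot else 0.2)
    else:
        # Intervention do(ice_cream): the value is forced, weather is ignored.
        ice_cream = do_ice_cream
    drowning = rng.random() < (0.3 if hot else 0.05)  # depends on weather only
    return ice_cream, drowning


def drowning_rate(days, ice_cream_value):
    selected = [drowned for ice, drowned in days if ice == ice_cream_value]
    return sum(selected) / len(selected)


rng = random.Random(42)
N = 100_000

# Level 1, association: compare days that happened to have high or low sales.
observed = [sample_day(rng) for _ in range(N)]
assoc_gap = drowning_rate(observed, True) - drowning_rate(observed, False)

# Level 2, intervention: force sales on or off and compare.
forced_on = [sample_day(rng, do_ice_cream=True) for _ in range(N)]
forced_off = [sample_day(rng, do_ice_cream=False) for _ in range(N)]
causal_gap = drowning_rate(forced_on, True) - drowning_rate(forced_off, False)

print(f"association gap:  {assoc_gap:.3f}")
print(f"intervention gap: {causal_gap:.3f}")
```

With these made-up probabilities the association gap comes out near 0.15 and the intervention gap near zero. A system that sees only the first number would wrongly conclude that banning ice cream prevents drownings.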

The practical consequences for AI engineers are already tangible. Recommendation systems confuse correlation with causation: a user bought an umbrella after checking the weather forecast - the model concludes that forecast-checking *causes* umbrella purchases and starts recommending umbrellas to everyone who checks the weather. Medical AI draws flawed conclusions from observational data. Business analytics rest on spurious correlations - a poorly randomized A/B test shows metric improvement, but the real driver is an external factor, not the tested change.

For robotics, causal understanding is a necessity, not an abstract requirement. A robot with pure correlations ("press button - door opens") fails the moment the button breaks. DeepMind Gato - a multitask agent trained on hundreds of tasks simultaneously - demonstrates exactly this fragility: transfer between tasks breaks down when the causal structure of the environment changes. A robot with a causal model ("button → signal → motor → door") diagnoses the broken link and finds an alternative path.
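The button-to-door chain can be written down as an explicit causal model, which is what makes diagnosis possible. The sketch below is hypothetical: the link names come from the example above, while `run_chain`, `diagnose`, and `next_intervention_point` are illustrative helpers, not part of any robotics framework.

```python
# Hypothetical causal chain from the text: button -> signal -> motor -> door.
CHAIN = ["button", "signal", "motor", "door"]


def run_chain(healthy):
    """Simulate the environment: a link fires only if it is healthy and
    every link upstream of it fired."""
    fired = {}
    upstream_fired = True
    for link in CHAIN:
        fired[link] = upstream_fired and healthy[link]
        upstream_fired = fired[link]
    return fired


def diagnose(fired):
    """Return the first link that failed to fire, or None if the door opened."""
    for link in CHAIN:
        if not fired[link]:
            return link
    return None


def next_intervention_point(broken_link):
    """Suggest acting directly on the link downstream of the broken one."""
    idx = CHAIN.index(broken_link)
    return CHAIN[idx + 1] if idx + 1 < len(CHAIN) else None


healthy = {"button": False, "signal": True, "motor": True, "door": True}
observed = run_chain(healthy)
broken = diagnose(observed)
workaround = next_intervention_point(broken)

print(broken, workaround)  # button signal
```

A purely correlational policy has only the pair "press button, door opens" and nothing to fall back on when that pair stops holding. The explicit chain lets the robot localize the failure and try triggering the signal some other way.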

**Forecast:** by 2027-2030, expect convergence of LLMs, world models, and causal reasoning. Models will be trained on text *and* video *and* interactive environments, building causal graphs of the world. Early signs are already visible: Gemini 2.0 combines text, video, and code; Meta is working on JEPA for robotics; DeepMind is integrating causal inference into its research projects.

Why is the distinction between correlation and causation critical for AI applications?

Key Ideas

LLMs work with symbols - text about physics, not physics itself. A world model builds an internal simulation: it sees a glass on the edge of a table and computes the trajectory of the fall
Sora (DiT) generates photorealistic video, but physics sometimes breaks - these are powerful visual priors, not a simulator. Genie 2 - an interactive world from a single photo, physics baked into the model
JEPA (Meta) predicts in latent space, not pixel space: ~90% of Sora's compute goes to background and shadows - JEPA ignores that and focuses on semantically meaningful changes
Causality per Judea Pearl - three levels: association (LLMs today), intervention (what will happen if...), counterfactual (what would have happened if...). AGI requires at least level 2
Convergence is already happening: Gemini 2.0 unifies text and video, Meta is building JEPA for robotics, DeepMind integrates causal inference - convergence horizon 2027-2030

What's Next

World models are yet another trajectory toward AGI. The next lesson examines the AGI question itself: scaling laws, emergence, and alignment.

The Path to AGI — World models + reasoning = key components on the path to AGI
Reasoning Models — Reasoning in text (o1/o3) complements reasoning about the physical world
Multimodal Models — World models are built on multimodal architectures

Related Lessons

aie-25-multimodal — World models extend multimodal understanding to video
aie-53-future-reasoning — Physical reasoning complements language reasoning
aie-26-image-generation — Video prediction builds on generative image models
stat-20-causal — Causal understanding uses causal inference foundations
ml-30-rnn-lstm — Sequence prediction over frames mirrors temporal models
ml-11