Predictive Processing and Active Inference
Lesson objectives
- Understand the brain as a hierarchical prediction machine, not a passive receiver
- Know the Free Energy Principle: F = complexity - accuracy, two ways to minimize
- Explain Active Inference: epistemic value (curiosity) vs pragmatic value (reward)
- See the role of Precision in attention and psychopathology
- Recognize the parallel between PP and LLMs with tool use
Prerequisites
- POMDPs and partially observable environments (lesson 05)
- Hierarchical models (lesson 06)
- Basic understanding of Bayesian belief updating
80% of visual pathway fibers go top-down. This is not an evolutionary bug - the brain generates a "film" of the world and checks it against reality, rather than constructing an image from pixels.
- Why familiar objects go unnoticed - prediction error is near zero
- Hallucinations as predictions without correction (prior precision too high)
- Anxiety as hypersensitivity to prediction errors (sensory precision too high)
- LLMs and transformers as a literal implementation of predictive processing
- Claude Code as an Active Inference agent: predict -> act -> observe -> correct
From Helmholtz to Friston
In the 1860s, Helmholtz called perception "unconscious inference": the brain interprets rather than photographs. Roughly 140 years later, Karl Friston formalized this idea in the Free Energy Principle (2006), unifying neuroscience, statistical physics, and machine learning into a single mathematical framework.
The Brain as a Prediction Machine
**GPT-4 predicts the next token. The brain predicts the next sensory input. On the predictive processing view this is more than a metaphor: Karl Friston argued in 2005 that cortical processing can be described as prediction-error minimization, the same kind of objective used to train a language model.** The classical view: the brain reacts to stimuli. The new view: the brain continuously generates hypotheses about the world and updates them only on errors. This reverses the usual picture of perception: 80% of visual pathway fibers go top-down, not bottom-up.
| Model | What the brain does | Role of sensors |
|---|---|---|
| Reactive (classical) | Waits for inputs, then processes | Source of information |
| Predictive (PP) | Continuously generates forecasts | Source of correction errors |
| Consequence | Perception is interpretation, not capture | Sensors report only delta |
The hierarchy flows in two directions. **Top-down**: higher levels send predictions downward - "I expect to see a face". **Bottom-up**: lower levels send only errors upward - "the nose is slightly different". When the prediction is accurate, the error is zero - no signal at all. This is why familiar objects go unnoticed: prediction error is near zero.
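The two-way flow described above can be sketched as a single-level loop. This is a minimal illustration, not a model of cortex: the function name `settle_belief`, the scalar belief, and the `learning_rate` value are all assumptions made for the example.

```python
def settle_belief(observation, prior_belief, learning_rate=0.1, steps=50):
    """Toy predictive-coding loop: adjust a belief until prediction error fades."""
    belief = prior_belief
    errors = []
    for _ in range(steps):
        prediction = belief                # top-down: the belief is sent down as a prediction
        error = observation - prediction   # bottom-up: only the mismatch travels upward
        belief += learning_rate * error    # the higher level corrects itself by the error
        errors.append(abs(error))
    return belief, errors


# Unfamiliar input: the belief starts far from the observation, so errors are large at first.
novel_belief, novel_errors = settle_belief(observation=1.0, prior_belief=0.0)

# Familiar input: the prediction already matches, so no error signal is produced at all.
familiar_belief, familiar_errors = settle_belief(observation=1.0, prior_belief=1.0)
```

In the familiar case every entry of `familiar_errors` is zero, which is the toy counterpart of a familiar object going unnoticed.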
Neuroscience fact: descending connections in the human visual cortex heavily outnumber ascending ones (an 80% share means about four descending fibers for every ascending one). The brain generates a "film" and compares it against reality - it does not build an image from pixels.
Misconception: The brain first sees the world, then builds a model of it.
Correction: The brain builds a model continuously and perceives only the deviations from it.
Why: 80% of visual fibers are descending. This is not an architectural curiosity - it shows that top-down predictions are the primary process, while sensory data merely corrects it.
Why does a familiar object receive almost no conscious processing?
Free Energy Principle
**Karl Friston proposed a single principle in 2006 that unifies learning, perception, and action - the Free Energy Principle: all living systems minimize free energy F.** The word "energy" comes from physics, but here it is an information-theoretic quantity - an upper bound on "surprise". Minimizing F means minimizing the gap between expectations and reality.
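The "upper bound on surprise" claim can be checked numerically on a tiny discrete model. The sketch below assumes two hidden states and a single observation; the probabilities and the names `free_energy`, `prior`, and `likelihood_of_obs` are illustrative choices, not values from the lesson. It uses the decomposition F = complexity - accuracy from the lesson objectives.

```python
import math


def free_energy(q, prior, likelihood_of_obs):
    """Variational free energy F = complexity - accuracy for a discrete hidden state.

    q                 -- the agent's current belief over hidden states
    prior             -- p(state) before seeing the observation
    likelihood_of_obs -- p(observation | state) for the one observation received
    """
    # Complexity: how far the belief has moved away from the prior (a KL divergence).
    complexity = sum(qs * math.log(qs / ps) for qs, ps in zip(q, prior) if qs > 0)
    # Accuracy: how well the belief explains the observation on average.
    accuracy = sum(qs * math.log(ls) for qs, ls in zip(q, likelihood_of_obs) if qs > 0)
    return complexity - accuracy


prior = [0.5, 0.5]              # assumed prior over two hidden states
likelihood_of_obs = [0.9, 0.2]  # assumed p(observation | state)

evidence = sum(p * l for p, l in zip(prior, likelihood_of_obs))
surprise = -math.log(evidence)
exact_posterior = [p * l / evidence for p, l in zip(prior, likelihood_of_obs)]

f_before_update = free_energy(prior, prior, likelihood_of_obs)
f_after_update = free_energy(exact_posterior, prior, likelihood_of_obs)
```

Here `f_before_update` is larger than `surprise`, while `f_after_update` equals it up to floating-point error: updating the belief toward the exact posterior closes the gap, which is the perceptual route to lowering F.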
Key insight: there are **two ways** to reduce F, that is, to reduce the gap between model and reality.
| Method | What changes | Example |
|---|---|---|
| Perceptual update | Model is fitted to the world | Saw there is no milk - updated the belief |
| Active action | World is fitted to the model | Went to the store - world now matches prediction |
| Combination | Partially both | Bayesian weighting by precision |
Precision is inverse variance: Precision = 1 / Variance. High precision on the model means the agent trusts its predictions and will act to make the world match them. High precision on sensors means the agent trusts observations and will update the model.
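For Gaussian beliefs, this trade-off has a simple closed form: the updated belief is an average of prediction and observation, weighted by their precisions. The sketch below assumes a scalar belief; the function name and the numbers are illustrative.

```python
def precision_weighted_update(prior_mean, prior_precision, observation, sensory_precision):
    """Combine a Gaussian prior and a Gaussian observation, weighting each by its precision."""
    posterior_precision = prior_precision + sensory_precision
    posterior_mean = (
        prior_precision * prior_mean + sensory_precision * observation
    ) / posterior_precision
    return posterior_mean, posterior_precision


# The model predicts 0.0, the senses report 1.0.
# High prior precision: the belief barely moves toward the observation.
confident_model_mean, _ = precision_weighted_update(0.0, 100.0, 1.0, 1.0)

# High sensory precision: the belief moves almost all the way to the observation.
trusting_senses_mean, _ = precision_weighted_update(0.0, 1.0, 1.0, 100.0)
```

The same observation produces opposite outcomes depending only on which side is given more precision.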
Misconception: The Free Energy Principle is thermodynamics applied to the brain.
Correction: The term is borrowed, but it refers to an information-theoretic quantity: surprise plus the KL divergence between the agent's approximate beliefs and the exact posterior.
Why: Friston deliberately used physics terminology to connect with the principle of minimal energy. In practice F = complexity - accuracy, where both terms are information quantities, not joules.
According to the Free Energy Principle, what happens when prior precision is very high (the model is very confident)?
Active Inference and Precision
**Active Inference is when an agent does not just passively update its model but also actively changes the world to match its predictions.** Action becomes a self-fulfilling prophecy. The agent chooses the action with the lowest Expected Free Energy (G), which balances curiosity (epistemic value - learn something new) and reward (pragmatic value - achieve the goal).
| Component of G | Question | Agent behavior |
|---|---|---|
| Epistemic value | What can be learned? | Exploration under high uncertainty |
| Pragmatic value | Is the goal being reached? | Exploitation with a known model |
| Balance | Exploration vs exploitation? | Automatic based on uncertainty level |
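The balance in the table can be made concrete with a one-step sketch. It assumes a common discrete formulation in which G is the negative sum of expected information gain (epistemic value) and expected log-preference (pragmatic value). The two actions, the four outcomes, and every probability below are hypothetical values chosen for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0)


def expected_free_energy(belief, likelihood, log_preferences):
    """G for one action; likelihood[s] is the outcome distribution p(o | state s, action)."""
    n_outcomes = len(log_preferences)
    predicted = [
        sum(belief[s] * likelihood[s][o] for s in range(len(belief)))
        for o in range(n_outcomes)
    ]
    # Pragmatic value: how much the predicted outcomes are preferred.
    pragmatic = sum(p * lp for p, lp in zip(predicted, log_preferences))
    # Epistemic value: expected information gain about the hidden state.
    epistemic = entropy(predicted) - sum(b * entropy(row) for b, row in zip(belief, likelihood))
    return -(epistemic + pragmatic)


# Outcomes: reward, no_reward, hint_for_state_0, hint_for_state_1 (all assumed).
LOG_PREFERENCES = [math.log(p) for p in (0.55, 0.05, 0.20, 0.20)]
ACTIONS = {
    "probe": [[0.0, 0.0, 0.95, 0.05], [0.0, 0.0, 0.05, 0.95]],  # yields a hint, never a reward
    "exploit": [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]],    # bets that the state is 0
}


def choose_action(belief):
    """Pick the action with the lowest expected free energy."""
    return min(
        ACTIONS,
        key=lambda name: expected_free_energy(belief, ACTIONS[name], LOG_PREFERENCES),
    )


uncertain_choice = choose_action([0.5, 0.5])    # high uncertainty about the state
confident_choice = choose_action([0.99, 0.01])  # the state is almost known
```

With these numbers the agent probes when its belief is uniform and exploits once the belief is sharp. No separate exploration parameter is involved; the switch comes from the epistemic term shrinking as uncertainty falls.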
**Precision Weighting** is the mechanism for controlling attention. Precision = 1/Variance: high precision on a signal means "trust this", low means "ignore". The brain dynamically adjusts precision at each level of the hierarchy.
| Precision imbalance | Result | Clinical |
|---|---|---|
| Too high sensory precision | Every input is alarming | Anxiety, hypervigilance |
| Too high prior precision | Model overrides reality | Delusions, hallucinations |
| Unstable precision | Context processing difficulties | Autism spectrum conditions |
Attention in transformers (query-key-value) is functionally analogous to Precision Weighting: both mechanisms dynamically weight which signals matter for the current computation. This is not coincidental - Friston actively investigates this parallel.
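As a loose illustration of the analogy (not a claim that the two mechanisms are identical), the sketch below puts standard scaled dot-product attention weights next to normalized precisions. Both produce non-negative weights that sum to one and favor the most relevant or most reliable signal. The vectors and variances are made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over several keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def normalized_precisions(variances):
    """Relative trust in each signal, using precision = 1 / variance."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    return [p / total for p in precisions]


# The first key matches the query, the second is unrelated, the third points the opposite way.
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# The first signal is the least noisy, so it receives the most trust.
trust = normalized_precisions([0.1, 1.0, 10.0])
```

One difference worth keeping in mind: attention weights are computed from content similarity, whereas precisions express estimated reliability.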
An agent enters an unfamiliar environment with high uncertainty. What does Active Inference predict?
Predictive Processing and LLMs
**GPT-4 is trained by predicting the next token. Cross-entropy loss is the prediction error. The transformer minimizes "surprise" on a text corpus - exactly as the brain does under Friston's framework.** This is not a metaphor: the mathematics is literally the same. The difference is that LLMs predict tokens while the brain predicts world states; LLMs without tool use cannot change the world, but the brain can.
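The link between cross-entropy and surprise is easy to see in code: the loss for one token is the negative log-probability the model assigned to the token that actually occurred. The tiny vocabulary and probabilities below are invented for the example.

```python
import math


def surprisal(predicted_probs, actual_token):
    """Surprise of one observation: -log of the probability the model gave it."""
    return -math.log(predicted_probs[actual_token])


def cross_entropy(predictions, tokens):
    """Average surprisal over a sequence, i.e. the usual next-token training loss."""
    return sum(surprisal(p, t) for p, t in zip(predictions, tokens)) / len(tokens)


# Two hypothetical models predicting the word after "the cat sat on the".
good_model = {"mat": 0.8, "dog": 0.1, "sky": 0.1}
poor_model = {"mat": 0.1, "dog": 0.6, "sky": 0.3}

low_surprise = surprisal(good_model, "mat")
high_surprise = surprisal(poor_model, "mat")
sequence_loss = cross_entropy([good_model, poor_model], ["mat", "mat"])
```

Training lowers `sequence_loss` by shifting probability toward what actually comes next, so a well-predicted token contributes almost nothing to the loss.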
Claude Code is an example of an Active Inference LLM: it generates a prediction of the needed code, compares it with the goal, calls tools (bash, edit), observes the result, and corrects its approach. The cycle continues until the prediction error reaches zero (task solved).
- Brain (PP) — Predicts world states. Hierarchy of timescales (ms to years). Active inference through muscles. Precision via dopamine/noradrenaline.
- LLM (Transformer) — Predicts tokens. Single timescale (forward pass). Active inference through tool use. Precision via attention weights.
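The predict, act, observe, correct cycle can be sketched as a toy loop in which the agent changes the world rather than its belief. This is not how Claude Code is implemented: `run_tool`, the dictionary standing in for the world, and the 0.5 step factor are all hypothetical.

```python
def run_tool(world, action):
    """Hypothetical tool call: applies a change to the toy world and returns the new observation."""
    world["value"] += action
    return world["value"]


def agent_loop(world, goal, max_steps=20, tolerance=1e-6):
    """Act on the world until the observed state matches the predicted (goal) state."""
    observed = world["value"]
    remaining_errors = []
    for _ in range(max_steps):
        error = goal - observed             # predict: compare the goal with what is observed
        if abs(error) <= tolerance:
            break                           # prediction error is negligible, task solved
        action = 0.5 * error                # correct: choose a cautious step toward the goal
        observed = run_tool(world, action)  # act, then observe the result
        remaining_errors.append(abs(goal - observed))
    return observed, remaining_errors


world = {"value": 0.0}
final_observation, remaining_errors = agent_loop(world, goal=10.0)
```

The belief (the goal) never changes here; only the world does. That is the "world is fitted to the model" route, in contrast to a perceptual update, and it is the route that tool use opens up for an LLM.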
Connections to Other Topics
Predictive Processing unifies several concepts from this course.
- Global Workspace (lesson 11): PP explains how beliefs are updated; GWT explains what is consciously broadcast. Large prediction errors win the competition for the workspace.
- Self-Models (lesson 09): The self-model is a predictive model of the self. Interoception is the prediction of bodily states, and interoceptive prediction errors are experienced as emotions.
- POMDP (lesson 05): PP extends the Bayesian belief updating used in POMDPs across the full hierarchy of perception and action.
Misconception: LLMs are just statistical machines over word frequencies, unrelated to the brain.
Correction: LLMs implement predictive processing - the same mathematics that Friston formalized for the brain in 2006.
Why: Cross-entropy loss is the average surprise (negative log-probability) of the observed tokens, and surprise is the quantity that free energy bounds from above. This is not a metaphor - the training objective and the Free Energy Principle minimize the same information-theoretic quantity. The difference lies in the substrate and in the presence of active inference through action.
How does an LLM with tool use fundamentally differ from one without it, from an Active Inference perspective?
Questions for reflection
- If only prediction errors (surprises) reach consciousness, what does this imply about the nature of routine and habit? How would you change a habit through the lens of PP?