Automata and Cognition
Goal Hierarchy - Options Framework
Цели урока
- Understand why flat RL fails on long-horizon tasks
- Know the Options formalism: initiation set, internal policy, termination condition
- Understand Semi-MDP and how discounting works for options of variable length
- See how Feudal RL and World Models solve hierarchical planning
Предварительные знания
- MDP: states, actions, policy, value function
- POMDP: partial observability and belief state
- Concept of discount factor gamma in RL
Flat RL with a 1000-step horizon has 10^1000 possible trajectories - more than atoms in the universe. The only way out is hierarchy: split the task into options, each solving its own piece.
- **AlphaGo (2016)** - 1202 GPUs, 40 days of training on 30 million games. Even then, the 10^170 state space required smart heuristics
- **Claude Code** - hierarchy: user goal -> options (explore/fix/implement) -> tools (Glob/Read/Edit/Bash)
- **Robotics (Boston Dynamics)** - hierarchy: task (fetch object) -> locomotion (how to walk) -> motor commands
- **Video games (AlphaStar, OpenAI Five)** - thousands of steps per match, only hierarchical methods succeeded
- **DeepMind Feudal Networks (2017)** - 3x faster than flat A3C on sparse-reward tasks
Options Framework - Sutton, Precup, Singh 1999
The paper 'Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning' formalized the idea of macro-actions. Hierarchical approaches existed before, but without rigorous math. Options provided a unified language - initiation set, policy, termination - and proved that any MDP can be solved via options without loss of optimality. Sutton is also the author of the main RL textbook (Sutton & Barto), freely available online.
Why flat RL breaks on long-horizon tasks
**AlphaGo (2016) defeated Lee Sedol, but required 1202 GPUs and 40 days of training on 30 million games.** Go averages 250 moves with a state space of 10^170. When the planning horizon reaches thousands of steps, the number of possible trajectories becomes 10^1000 - more than atoms in the observable universe. Flat RL cannot handle this. Hierarchy is the only way out.
**The curse of horizon:** with horizon T steps and action alphabet size |A|, the number of possible trajectories equals |A|^T. For T=1000 and |A|=10, that is 10^1000 - no amount of compute can search this space.
What breaks in flat MDP
| Problem | Why it happens | Example |
|---|---|---|
| Sparse rewards | Win only at the end - thousands of steps without signal | Collect resources, build base, win - one reward at the end |
| State space explosion | Position x resources x buildings x enemies = millions | Game agent in Minecraft |
| No skill transfer | Every task trained from scratch | Skill 'open door' cannot be reused |
| Slow training | Agent has no intermediate goals | AlphaGo: 40 days on 1202 GPUs |
The solution is to split the task into levels. The upper level plans abstractly ('open the door'), the lower level executes primitive actions ('flex bicep, rotate arm'). Each level works with a short horizon.
Adding more compute will solve the long-horizon problem
10^1000 exceeds the number of atoms in the universe - no hardware helps without changing the approach
Exponential explosion cannot be defeated by linear resource scaling. The only solution is to reduce the effective horizon through hierarchy, compressing 1000 steps into 10 options.
Why does the number of trajectories equal 10^1000 when T=1000 and |A|=10?
Options Framework - the formalism for macro-actions
**Richard Sutton's 1999 paper 'Between MDPs and semi-MDPs' gave agents the concept of a 'skill': a sequence of primitive actions invocable as a single macro-action.** Instead of planning 1000 steps, plan 10 options. This formalism became the foundation of all Hierarchical RL.
**Option** is a triple (I, pi, beta): initiation set I (states where the option can start), internal policy pi (what to do inside the option), termination condition beta (probability of stopping at each state).
Option components with an example
| Component | What it is | Example: option 'fix bug' |
|---|---|---|
| I - initiation set | Condition to start the option | User reported an error |
| pi - internal policy | What to do step by step | Read file -> Edit code -> Bash test |
| beta - termination | P(stop | current state) | beta=1.0 when tests pass, beta=0.0 otherwise |
Option hierarchy in Claude Code
Options are just functions broken into subfunctions - standard code decomposition
Options are a formal model of temporal abstraction with their own policy and termination condition
Regular code decomposition is static. An option is dynamic: it decides when to continue and when to stop (via beta). This lets the upper level plan without knowing the lower level's details.
Why does an option need the termination component (beta)?
Semi-MDP and planning in imagination
**A regular MDP assumes every action takes exactly one step. Options break this assumption.** Option 'explore codebase' takes 5-20 steps, option 'fix bug' takes 3-50. When actions have variable duration, MDP becomes Semi-MDP (SMDP) - and the math changes accordingly.
**Semi-MDP (SMDP)** - MDP extension where actions (options) take a random number of steps k. Q-value of option omega in state s: Q(s, omega) = E[sum(gamma^t * r_t, t=0..k-1) + gamma^k * Q(s_k, pi(s_k))]. Key insight: the discount gamma^k accumulates over the entire duration of the option.
| Parameter | Meaning | Effect on learning |
|---|---|---|
| k - option length | Random number of steps | Longer option means stronger discounting of its reward |
| gamma^k - accumulated discount | Future rewards lose value | At gamma=0.9 and k=10: gamma^10 = 0.35 |
| r_t - intermediate rewards | Each step inside the option yields r_t | Option accumulates all intermediate rewards |
| s_k - final state | State after option terminates | The next option starts from s_k |
World Model - planning without real actions
When an agent has a world model (a function predicting the next state and reward), it can 'simulate' options in imagination before choosing the best one. This is model-based RL - and it is how strategic planning works in humans.
| Approach | How decisions are made | Pros / Cons |
|---|---|---|
| Model-Free RL | State -> policy -> action (directly) | Fast, but does not plan ahead |
| Model-Based RL | State -> simulate N options -> pick best -> action | Plans ahead, but requires an accurate world model |
| Dyna-Q (Sutton) | Real experience + simulated experience from model | Balance: planning + real-world exploration |
Option A: 10 steps, total reward +50. Option B: 3 steps, total reward +20. gamma=0.9. Reward comes at the end. Which is better?
Feudal RL and option discovery
**DeepMind's 2017 Feudal Networks paper showed that a Manager-Worker hierarchy learns 3x faster than flat A3C on sparse-reward tasks.** The Manager sets abstract goals on a long horizon; the Worker executes primitive actions to reach each goal. This separation of concerns makes each level simpler to train.
Three ways to create options
| Approach | How options are obtained | When to use |
|---|---|---|
| Hand-crafted | Human defines options explicitly | When task structure is known - Claude Code, robotics |
| Option Discovery (subgoal) | Find 'bottleneck' states in the state graph | When structure is unknown - games, navigation |
| Feudal Networks (FuN) | Manager implicitly learns options via goals | When hand-crafting is infeasible |
**Claude Code = hand-crafted hierarchy:** user goal (level 3) -> options: explore/fix/implement (level 2) -> tools: Glob/Read/Edit/Bash (level 1). Options are defined by engineers explicitly - this lets the agent use correct skills immediately without training from scratch.
Hierarchy in the course ecosystem
Connections to other topics
- MDP — Options build on top of MDP, turning it into Semi-MDP
- POMDP — In partial observability, options allow acting at a more abstract level
- Attention and Memory — World Models require memory of past states - the topic of the next lesson
The World Model must be a perfect copy of reality for planning to work
Even an approximate model is useful - it helps rule out directly bad options before acting
A perfect world model is impossible and unnecessary. The goal is to reduce the search space by discarding bad options upfront. The human brain also works with a 'blurry' model of reality and still makes good decisions.
Why does the Worker in Feudal RL receive intrinsic reward instead of external reward?
Key Takeaways
- **Flat RL** degrades on long-horizon tasks - state space grows exponentially, reward signal is sparse
- **Options Framework** - the formalism for macro-actions: option = policy + termination condition + initiation set
- **Semi-MDP** - temporal abstraction where the agent acts at the level of options with variable duration, not primitive steps
- **World model** in imagination enables planning through options without real environment interaction
- **Feudal RL** - manager/worker hierarchy: manager sets subgoals, worker achieves them; options are discovered automatically
- **Temporal abstraction** - the key principle: agents at different hierarchy levels operate at different time scales
Вопросы для размышления
- What implicit 'options' does a developer use when writing code every day - and how does this affect speed compared to a developer who works only at the level of primitive steps?