Automata and Cognition

Goal Hierarchy - Options Framework

Цели урока

Understand why flat RL fails on long-horizon tasks
Know the Options formalism: initiation set, internal policy, termination condition
Understand Semi-MDP and how discounting works for options of variable length
See how Feudal RL and World Models solve hierarchical planning

Предварительные знания

MDP: states, actions, policy, value function
POMDP: partial observability and belief state
Concept of discount factor gamma in RL

Flat RL with a 1000-step horizon has 10^1000 possible trajectories - more than atoms in the universe. The only way out is hierarchy: split the task into options, each solving its own piece.

**AlphaGo (2016)** - 1202 GPUs, 40 days of training on 30 million games. Even then, the 10^170 state space required smart heuristics
**Claude Code** - hierarchy: user goal -> options (explore/fix/implement) -> tools (Glob/Read/Edit/Bash)
**Robotics (Boston Dynamics)** - hierarchy: task (fetch object) -> locomotion (how to walk) -> motor commands
**Video games (AlphaStar, OpenAI Five)** - thousands of steps per match, only hierarchical methods succeeded
**DeepMind Feudal Networks (2017)** - 3x faster than flat A3C on sparse-reward tasks

Options Framework - Sutton, Precup, Singh 1999

The paper 'Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning' formalized the idea of macro-actions. Hierarchical approaches existed before, but without rigorous math. Options provided a unified language - initiation set, policy, termination - and proved that any MDP can be solved via options without loss of optimality. Sutton is also the author of the main RL textbook (Sutton & Barto), freely available online.

Why flat RL breaks on long-horizon tasks

**AlphaGo (2016) defeated Lee Sedol, but required 1202 GPUs and 40 days of training on 30 million games.** Go averages 250 moves with a state space of 10^170. When the planning horizon reaches thousands of steps, the number of possible trajectories becomes 10^1000 - more than atoms in the observable universe. Flat RL cannot handle this. Hierarchy is the only way out.

**The curse of horizon:** with horizon T steps and action alphabet size |A|, the number of possible trajectories equals |A|^T. For T=1000 and |A|=10, that is 10^1000 - no amount of compute can search this space.

What breaks in flat MDP

Problem	Why it happens	Example
Sparse rewards	Win only at the end - thousands of steps without signal	Collect resources, build base, win - one reward at the end
State space explosion	Position x resources x buildings x enemies = millions	Game agent in Minecraft
No skill transfer	Every task trained from scratch	Skill 'open door' cannot be reused
Slow training	Agent has no intermediate goals	AlphaGo: 40 days on 1202 GPUs

The solution is to split the task into levels. The upper level plans abstractly ('open the door'), the lower level executes primitive actions ('flex bicep, rotate arm'). Each level works with a short horizon.

Adding more compute will solve the long-horizon problem

10^1000 exceeds the number of atoms in the universe - no hardware helps without changing the approach

Exponential explosion cannot be defeated by linear resource scaling. The only solution is to reduce the effective horizon through hierarchy, compressing 1000 steps into 10 options.

Why does the number of trajectories equal 10^1000 when T=1000 and |A|=10?

Options Framework - the formalism for macro-actions

**Richard Sutton's 1999 paper 'Between MDPs and semi-MDPs' gave agents the concept of a 'skill': a sequence of primitive actions invocable as a single macro-action.** Instead of planning 1000 steps, plan 10 options. This formalism became the foundation of all Hierarchical RL.

**Option** is a triple (I, pi, beta): initiation set I (states where the option can start), internal policy pi (what to do inside the option), termination condition beta (probability of stopping at each state).

Option components with an example

Component	What it is	Example: option 'fix bug'
I - initiation set	Condition to start the option	User reported an error
pi - internal policy	What to do step by step	Read file -> Edit code -> Bash test
beta - termination	P(stop \| current state)	beta=1.0 when tests pass, beta=0.0 otherwise

Option hierarchy in Claude Code

Options are just functions broken into subfunctions - standard code decomposition

Options are a formal model of temporal abstraction with their own policy and termination condition

Regular code decomposition is static. An option is dynamic: it decides when to continue and when to stop (via beta). This lets the upper level plan without knowing the lower level's details.

Why does an option need the termination component (beta)?

Semi-MDP and planning in imagination

**A regular MDP assumes every action takes exactly one step. Options break this assumption.** Option 'explore codebase' takes 5-20 steps, option 'fix bug' takes 3-50. When actions have variable duration, MDP becomes Semi-MDP (SMDP) - and the math changes accordingly.

**Semi-MDP (SMDP)** - MDP extension where actions (options) take a random number of steps k. Q-value of option omega in state s: Q(s, omega) = E[sum(gamma^t * r_t, t=0..k-1) + gamma^k * Q(s_k, pi(s_k))]. Key insight: the discount gamma^k accumulates over the entire duration of the option.

Parameter	Meaning	Effect on learning
k - option length	Random number of steps	Longer option means stronger discounting of its reward
gamma^k - accumulated discount	Future rewards lose value	At gamma=0.9 and k=10: gamma^10 = 0.35
r_t - intermediate rewards	Each step inside the option yields r_t	Option accumulates all intermediate rewards
s_k - final state	State after option terminates	The next option starts from s_k

World Model - planning without real actions

When an agent has a world model (a function predicting the next state and reward), it can 'simulate' options in imagination before choosing the best one. This is model-based RL - and it is how strategic planning works in humans.

Approach	How decisions are made	Pros / Cons
Model-Free RL	State -> policy -> action (directly)	Fast, but does not plan ahead
Model-Based RL	State -> simulate N options -> pick best -> action	Plans ahead, but requires an accurate world model
Dyna-Q (Sutton)	Real experience + simulated experience from model	Balance: planning + real-world exploration

Option A: 10 steps, total reward +50. Option B: 3 steps, total reward +20. gamma=0.9. Reward comes at the end. Which is better?

Feudal RL and option discovery

**DeepMind's 2017 Feudal Networks paper showed that a Manager-Worker hierarchy learns 3x faster than flat A3C on sparse-reward tasks.** The Manager sets abstract goals on a long horizon; the Worker executes primitive actions to reach each goal. This separation of concerns makes each level simpler to train.

Three ways to create options

Approach	How options are obtained	When to use
Hand-crafted	Human defines options explicitly	When task structure is known - Claude Code, robotics
Option Discovery (subgoal)	Find 'bottleneck' states in the state graph	When structure is unknown - games, navigation
Feudal Networks (FuN)	Manager implicitly learns options via goals	When hand-crafting is infeasible

**Claude Code = hand-crafted hierarchy:** user goal (level 3) -> options: explore/fix/implement (level 2) -> tools: Glob/Read/Edit/Bash (level 1). Options are defined by engineers explicitly - this lets the agent use correct skills immediately without training from scratch.

Hierarchy in the course ecosystem

Connections to other topics

MDP — Options build on top of MDP, turning it into Semi-MDP
POMDP — In partial observability, options allow acting at a more abstract level
Attention and Memory — World Models require memory of past states - the topic of the next lesson

The World Model must be a perfect copy of reality for planning to work

Even an approximate model is useful - it helps rule out directly bad options before acting

A perfect world model is impossible and unnecessary. The goal is to reduce the search space by discarding bad options upfront. The human brain also works with a 'blurry' model of reality and still makes good decisions.

Why does the Worker in Feudal RL receive intrinsic reward instead of external reward?

Key Takeaways

**Flat RL** degrades on long-horizon tasks - state space grows exponentially, reward signal is sparse
**Options Framework** - the formalism for macro-actions: option = policy + termination condition + initiation set
**Semi-MDP** - temporal abstraction where the agent acts at the level of options with variable duration, not primitive steps
**World model** in imagination enables planning through options without real environment interaction
**Feudal RL** - manager/worker hierarchy: manager sets subgoals, worker achieves them; options are discovered automatically
**Temporal abstraction** - the key principle: agents at different hierarchy levels operate at different time scales

Вопросы для размышления

What implicit 'options' does a developer use when writing code every day - and how does this affect speed compared to a developer who works only at the level of primitive steps?

Связанные уроки

ml-01-intro