Automata and Cognition

Goal Hierarchy - Options Framework

Цели урока

  • Understand why flat RL fails on long-horizon tasks
  • Know the Options formalism: initiation set, internal policy, termination condition
  • Understand Semi-MDP and how discounting works for options of variable length
  • See how Feudal RL and World Models solve hierarchical planning

Предварительные знания

  • MDP: states, actions, policy, value function
  • POMDP: partial observability and belief state
  • Concept of discount factor gamma in RL

Flat RL with a 1000-step horizon has 10^1000 possible trajectories - more than atoms in the universe. The only way out is hierarchy: split the task into options, each solving its own piece.

  • **AlphaGo (2016)** - 1202 GPUs, 40 days of training on 30 million games. Even then, the 10^170 state space required smart heuristics
  • **Claude Code** - hierarchy: user goal -> options (explore/fix/implement) -> tools (Glob/Read/Edit/Bash)
  • **Robotics (Boston Dynamics)** - hierarchy: task (fetch object) -> locomotion (how to walk) -> motor commands
  • **Video games (AlphaStar, OpenAI Five)** - thousands of steps per match, only hierarchical methods succeeded
  • **DeepMind Feudal Networks (2017)** - 3x faster than flat A3C on sparse-reward tasks

Options Framework - Sutton, Precup, Singh 1999

The paper 'Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning' formalized the idea of macro-actions. Hierarchical approaches existed before, but without rigorous math. Options provided a unified language - initiation set, policy, termination - and proved that any MDP can be solved via options without loss of optimality. Sutton is also the author of the main RL textbook (Sutton & Barto), freely available online.

Why flat RL breaks on long-horizon tasks

**AlphaGo (2016) defeated Lee Sedol, but required 1202 GPUs and 40 days of training on 30 million games.** Go averages 250 moves with a state space of 10^170. When the planning horizon reaches thousands of steps, the number of possible trajectories becomes 10^1000 - more than atoms in the observable universe. Flat RL cannot handle this. Hierarchy is the only way out.

**The curse of horizon:** with horizon T steps and action alphabet size |A|, the number of possible trajectories equals |A|^T. For T=1000 and |A|=10, that is 10^1000 - no amount of compute can search this space.

What breaks in flat MDP

ProblemWhy it happensExample
Sparse rewardsWin only at the end - thousands of steps without signalCollect resources, build base, win - one reward at the end
State space explosionPosition x resources x buildings x enemies = millionsGame agent in Minecraft
No skill transferEvery task trained from scratchSkill 'open door' cannot be reused
Slow trainingAgent has no intermediate goalsAlphaGo: 40 days on 1202 GPUs

The solution is to split the task into levels. The upper level plans abstractly ('open the door'), the lower level executes primitive actions ('flex bicep, rotate arm'). Each level works with a short horizon.

Adding more compute will solve the long-horizon problem

10^1000 exceeds the number of atoms in the universe - no hardware helps without changing the approach

Exponential explosion cannot be defeated by linear resource scaling. The only solution is to reduce the effective horizon through hierarchy, compressing 1000 steps into 10 options.

Why does the number of trajectories equal 10^1000 when T=1000 and |A|=10?

Options Framework - the formalism for macro-actions

**Richard Sutton's 1999 paper 'Between MDPs and semi-MDPs' gave agents the concept of a 'skill': a sequence of primitive actions invocable as a single macro-action.** Instead of planning 1000 steps, plan 10 options. This formalism became the foundation of all Hierarchical RL.

**Option** is a triple (I, pi, beta): initiation set I (states where the option can start), internal policy pi (what to do inside the option), termination condition beta (probability of stopping at each state).

Option components with an example

ComponentWhat it isExample: option 'fix bug'
I - initiation setCondition to start the optionUser reported an error
pi - internal policyWhat to do step by stepRead file -> Edit code -> Bash test
beta - terminationP(stop | current state)beta=1.0 when tests pass, beta=0.0 otherwise

Option hierarchy in Claude Code

Options are just functions broken into subfunctions - standard code decomposition

Options are a formal model of temporal abstraction with their own policy and termination condition

Regular code decomposition is static. An option is dynamic: it decides when to continue and when to stop (via beta). This lets the upper level plan without knowing the lower level's details.

Why does an option need the termination component (beta)?

Semi-MDP and planning in imagination

**A regular MDP assumes every action takes exactly one step. Options break this assumption.** Option 'explore codebase' takes 5-20 steps, option 'fix bug' takes 3-50. When actions have variable duration, MDP becomes Semi-MDP (SMDP) - and the math changes accordingly.

**Semi-MDP (SMDP)** - MDP extension where actions (options) take a random number of steps k. Q-value of option omega in state s: Q(s, omega) = E[sum(gamma^t * r_t, t=0..k-1) + gamma^k * Q(s_k, pi(s_k))]. Key insight: the discount gamma^k accumulates over the entire duration of the option.

ParameterMeaningEffect on learning
k - option lengthRandom number of stepsLonger option means stronger discounting of its reward
gamma^k - accumulated discountFuture rewards lose valueAt gamma=0.9 and k=10: gamma^10 = 0.35
r_t - intermediate rewardsEach step inside the option yields r_tOption accumulates all intermediate rewards
s_k - final stateState after option terminatesThe next option starts from s_k

World Model - planning without real actions

When an agent has a world model (a function predicting the next state and reward), it can 'simulate' options in imagination before choosing the best one. This is model-based RL - and it is how strategic planning works in humans.

ApproachHow decisions are madePros / Cons
Model-Free RLState -> policy -> action (directly)Fast, but does not plan ahead
Model-Based RLState -> simulate N options -> pick best -> actionPlans ahead, but requires an accurate world model
Dyna-Q (Sutton)Real experience + simulated experience from modelBalance: planning + real-world exploration

Option A: 10 steps, total reward +50. Option B: 3 steps, total reward +20. gamma=0.9. Reward comes at the end. Which is better?

Feudal RL and option discovery

**DeepMind's 2017 Feudal Networks paper showed that a Manager-Worker hierarchy learns 3x faster than flat A3C on sparse-reward tasks.** The Manager sets abstract goals on a long horizon; the Worker executes primitive actions to reach each goal. This separation of concerns makes each level simpler to train.

Three ways to create options

ApproachHow options are obtainedWhen to use
Hand-craftedHuman defines options explicitlyWhen task structure is known - Claude Code, robotics
Option Discovery (subgoal)Find 'bottleneck' states in the state graphWhen structure is unknown - games, navigation
Feudal Networks (FuN)Manager implicitly learns options via goalsWhen hand-crafting is infeasible

**Claude Code = hand-crafted hierarchy:** user goal (level 3) -> options: explore/fix/implement (level 2) -> tools: Glob/Read/Edit/Bash (level 1). Options are defined by engineers explicitly - this lets the agent use correct skills immediately without training from scratch.

Hierarchy in the course ecosystem

Connections to other topics

  • MDP — Options build on top of MDP, turning it into Semi-MDP
  • POMDP — In partial observability, options allow acting at a more abstract level
  • Attention and Memory — World Models require memory of past states - the topic of the next lesson

The World Model must be a perfect copy of reality for planning to work

Even an approximate model is useful - it helps rule out directly bad options before acting

A perfect world model is impossible and unnecessary. The goal is to reduce the search space by discarding bad options upfront. The human brain also works with a 'blurry' model of reality and still makes good decisions.

Why does the Worker in Feudal RL receive intrinsic reward instead of external reward?

Key Takeaways

  • **Flat RL** degrades on long-horizon tasks - state space grows exponentially, reward signal is sparse
  • **Options Framework** - the formalism for macro-actions: option = policy + termination condition + initiation set
  • **Semi-MDP** - temporal abstraction where the agent acts at the level of options with variable duration, not primitive steps
  • **World model** in imagination enables planning through options without real environment interaction
  • **Feudal RL** - manager/worker hierarchy: manager sets subgoals, worker achieves them; options are discovered automatically
  • **Temporal abstraction** - the key principle: agents at different hierarchy levels operate at different time scales

Вопросы для размышления

  • What implicit 'options' does a developer use when writing code every day - and how does this affect speed compared to a developer who works only at the level of primitive steps?

Связанные уроки

  • ml-01-intro
Goal Hierarchy - Options Framework

0

1

Sign In