Reinforcement Learning

Deep Q-Network (DQN)

2013. A neural network sees only raw pixels - no hand-coded features, no domain knowledge. Seven Atari games. It beats every earlier learning method on six of them and a human expert on three. Two years later: 49 games, human-level or better on 29. The algorithm: DQN. Three engineering tricks - experience replay, target network, double DQN - turned an unstable learning signal into a state-of-the-art system. Variants of those same tricks still appear throughout off-policy deep RL today.

  • **DeepMind Atari (2013-2015)** - DQN reached human-level or better performance on 29 of 49 Atari games using only raw pixels and the game score as reward; the result was published in Nature in 2015 and triggered the deep RL wave
  • **OpenAI Five (Dota 2, 2019)** - a team of five RL agents that defeated the world champions OG (2-0); it was trained with PPO, a policy-gradient method, rather than DQN, so it shows how far the deep RL wave spread beyond Q-learning
  • **Google Recommendation Systems** - YouTube, Play Store and Search all use DQN-derived algorithms for slate recommendation; Google reported 30-50% engagement improvements versus supervised baselines

DQN: deep learning meets Q-learning

In 2013 Volodymyr Mnih and colleagues at DeepMind presented 'Playing Atari with Deep Reinforcement Learning' at a NIPS workshop: a convolutional network learning to play Atari games from raw pixels and the score alone. In 2015 the full version appeared in Nature as 'Human-level control through deep reinforcement learning', reaching human-level or better play on 29 of 49 Atari games with one architecture and one set of hyperparameters. Experience replay and a separate target network were the two stabilizing tricks that made training a neural Q-function practical. The paper opened the deep reinforcement learning era and led directly to AlphaGo.

Prerequisites

  • Q-learning and the Bellman update for Q(s,a)
  • Neural networks and backpropagation
  • Exploration vs exploitation and epsilon-greedy
  • Stochastic gradient descent and loss functions
  • TD Learning and Q-Learning
  • Neural networks

Experience Replay

2015. DeepMind publishes its Nature paper: one neural network, raw pixels as input, 49 Atari games. No handcrafted features. Result: human-level or better performance on 29 of them. The core trick was not the architecture - it was **experience replay**. Training a neural network on consecutive game frames breaks it. Frame t and frame t+1 are nearly identical, so the gradients are correlated, the weights oscillate, and the network can diverge. Experience replay breaks that correlation by storing transitions in a buffer and sampling random mini-batches.

**Experience Replay Buffer** stores tuples (state, action, reward, next_state, done). During training the network samples a random mini-batch from this buffer instead of using the most recent transition. This decorrelates the training data and makes gradient updates more stable. The same mechanism reappears in most off-policy deep RL methods that followed DQN.

DeepMind used a buffer of **1 million transitions** for Atari DQN. The warmup phase (random actions before learning) fills the buffer with diverse experience, preventing early overfitting to whatever the agent happens to do first.
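The buffer described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not DeepMind's implementation: the names `Transition` and `ReplayBuffer` are made up for this example, the capacity is a toy value, and integer "states" stand in for image frames.

```python
import random
from collections import deque, namedtuple

# Field names follow the (state, action, reward, next_state, done) tuple in the text.
Transition = namedtuple("Transition", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-capacity buffer with uniform random sampling (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition once full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal ordering.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=1000)  # toy size; the text cites 1 million for Atari
for t in range(1500):
    # Integers stand in for frames; 1500 pushes overflow the 1000-slot buffer.
    buffer.push(state=t, action=t % 4, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(batch_size=32)
```

After the loop the buffer holds only the 1000 most recent transitions, and `batch` contains 32 of them in random order rather than 32 consecutive frames.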

Why does training a Q-network on consecutive transitions (without experience replay) cause instability?

Target Network

Even with a replay buffer, early DQN attempts diverged. The culprit: the Q-network updates toward a target that **itself changes every step**. It is like shooting at a target that moves every time you fire. The Bellman target for Q-learning is: `y = r + γ · max_a' Q(s', a')`. If Q is the same network being updated, every gradient step shifts both the prediction and the target simultaneously - a feedback loop that can spiral out of control.

**Target Network** is a frozen copy of the online Q-network. Targets are computed using this frozen copy, which is updated to match the online network only every C steps (DeepMind used C = 10,000 for Atari). This breaks the feedback loop: the target stays fixed long enough for the online network to chase a stable signal.
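As a rough sketch of how the frozen copy is used, the snippet below computes the Bellman target from a separate set of target parameters and syncs them every C steps. The toy linear Q-function, the helper names (`q_values`, `td_target`, `maybe_sync`), and the discount factor of 0.99 are assumptions made for illustration; returning the bare reward on terminal transitions is the usual convention, not something the text spells out.

```python
import copy

GAMMA = 0.99         # discount factor (assumed value)
SYNC_EVERY = 10_000  # C from the text


def q_values(params, state):
    """Toy linear Q-function: one weight vector per action (stand-in for a deep net)."""
    return [sum(w * x for w, x in zip(weights, state)) for weights in params]


def td_target(target_params, reward, next_state, done):
    """y = r + gamma * max_a' Q_target(s', a'); no bootstrapping on terminal steps."""
    if done:
        return reward
    return reward + GAMMA * max(q_values(target_params, next_state))


def maybe_sync(step, online_params, target_params):
    """Hard update: copy the online weights into the target every SYNC_EVERY steps."""
    if step % SYNC_EVERY == 0:
        return copy.deepcopy(online_params)
    return target_params


online = [[0.5, -0.2], [0.1, 0.3]]  # 2 actions, 2 state features
target = copy.deepcopy(online)

online[0][0] = 5.0  # pretend a gradient step changed the online network
y = td_target(target, reward=1.0, next_state=[1.0, 2.0], done=False)

unchanged = maybe_sync(step=1, online_params=online, target_params=target)
synced = maybe_sync(step=SYNC_EVERY, online_params=online, target_params=target)
```

Here `y` is still computed from the old weights (1.0 + 0.99 · 0.7), even though the online network has already moved. The target only catches up when `maybe_sync` fires.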

**Soft updates** (Polyak averaging) are an alternative to hard resets: `θ_target = τ·θ_online + (1-τ)·θ_target` with a small τ (0.005 is a common choice). They are standard in actor-critic methods for continuous control (DDPG, SAC, TD3), where the target drifts smoothly toward the online network at every step instead of jumping once every C steps.
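The soft update rule is a one-line element-wise blend. The sketch below applies it to flat lists of floats; a real implementation would loop over the parameter tensors of both networks. The function name `soft_update` is chosen for this example.

```python
def soft_update(target_params, online_params, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, element-wise."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online_params, target_params)]


one_step = soft_update([0.0], [1.0])  # a single update moves 0.5% of the way

target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(1000):
    target = soft_update(target, online)  # online weights held fixed for clarity
```

With the online weights held fixed, the gap between the two networks shrinks by a factor of (1 - τ) per update. At τ = 0.005 that is roughly 140 updates to halve the gap, so after 1000 updates the target is within about 1% of the online weights.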

What problem does the target network solve in DQN?

Double DQN

Standard DQN systematically **overestimates** Q-values. The max operator in the Bellman target `max_a' Q(s', a')` picks the largest estimate, and Q-value estimates carry noise, so the action that looks best is often one whose value happened to be overestimated: the expected maximum of noisy estimates is at least the maximum of the true values. Because each target bootstraps from earlier estimates, this upward bias propagates over millions of updates. In 2015 Hado van Hasselt and colleagues at DeepMind showed that the problem is not just theoretical: on several Atari games, DQN's value estimates were substantially higher than the returns the agent actually obtained.
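A small simulation makes the bias concrete. It assumes every action has a true value of exactly 0 and that each estimate is the true value plus independent Gaussian noise, a deliberately simplified setup and not the experiment from the Double DQN paper. Any positive average of the max is pure overestimation.

```python
import random


def mean_max_estimate(num_actions, noise_std, trials, seed=0):
    """Average of max_a Q_hat(a) when every true Q(a) is 0 and Q_hat = Q + noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy_q = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(noisy_q)
    return total / trials


# The true max is 0 in both cases, so these averages measure the bias directly.
bias_4 = mean_max_estimate(num_actions=4, noise_std=1.0, trials=20_000)
bias_18 = mean_max_estimate(num_actions=18, noise_std=1.0, trials=20_000)
```

With unit noise the average max comes out around 1 for four actions and noticeably higher for 18 (the size of the full Atari action set): the more actions the max ranges over, the larger the upward bias.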

**Double DQN** decouples action selection from action evaluation. The online network selects which action is best (`a* = argmax_a' Q_online(s', a')`). The target network evaluates that action's value (`Q_target(s', a*)`). An action is now overvalued only if both networks overestimate it, so the upward bias is greatly reduced. It does not vanish entirely, because the target network is a delayed copy of the online network rather than an independent estimator.

| Method | Action selection | Action evaluation | Overestimation |
|---|---|---|---|
| DQN | target_net | target_net | High |
| Double DQN | online_net | target_net | Low |
| Double Q-learning (original) | Q1 | Q2 (separate) | Very low |
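The difference between the two targets is a single line. In the sketch below, `q_online_next` and `q_target_next` are hypothetical Q-value vectors for the next state s' as produced by the online and target networks. The numbers and the discount factor are made up for illustration.

```python
GAMMA = 0.99  # discount factor (assumed value)


def dqn_target(reward, q_target_next, done):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + GAMMA * max(q_target_next)


def double_dqn_target(reward, q_online_next, q_target_next, done):
    """Double DQN: the online network selects a*, the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + GAMMA * q_target_next[best_action]


q_online_next = [1.0, 2.5, 2.0]  # online network prefers action 1
q_target_next = [1.2, 1.8, 3.0]  # target network's largest estimate is action 2

y_dqn = dqn_target(0.0, q_target_next, done=False)
y_ddqn = double_dqn_target(0.0, q_online_next, q_target_next, done=False)
```

Here standard DQN bootstraps from 3.0, the target network's largest and possibly inflated estimate, while Double DQN bootstraps from 1.8, the target network's value for the action the online network chose. For the same inputs the Double DQN target can never exceed the DQN target, since `Q_target(s', a*)` is at most `max_a' Q_target(s', a')`.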

How does Double DQN reduce Q-value overestimation?

Dueling Network Architecture

In many states, the choice of action barely matters. Driving along an empty stretch of road in Atari Enduro: steer left or right? Both are equally fine while no other car is close. Standard DQN must still learn a Q-value for each action separately. **Dueling DQN** (Wang et al., 2016) decomposes Q into two components: **V(s)** (how good this state is regardless of action) and **A(s,a)** (the advantage of action a over the average action). The network shares a feature extractor but splits into two heads.

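**Q(s,a) = V(s) + A(s,a) - mean_a'[A(s,a')]**

Subtracting the mean advantage makes the decomposition unique (identifiability). Without it, adding a constant to V(s) and subtracting the same constant from every A(s,a) would give identical Q-values, so the network could not tell the two streams apart. With it, the advantages are centered around zero: better-than-average actions have positive advantage, worse-than-average actions have negative. The value stream learns which states are inherently dangerous or rewarding, which is useful even when the action doesn't matter.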
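The aggregation step can be checked by hand for a single state. The sketch below takes the outputs of the two heads as plain numbers. `dueling_q` is a name chosen for this example, and the values are arbitrary.

```python
def dueling_q(value, advantages):
    """Q(s,a) = V(s) + A(s,a) - mean_a'[A(s,a')] for a single state."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# Same value stream, advantage stream shifted by a constant of 10.
q1 = dueling_q(value=2.0, advantages=[1.0, 0.0, -1.0])
q2 = dueling_q(value=2.0, advantages=[11.0, 10.0, 9.0])

mean_q = sum(q1) / len(q1)
```

Both calls return `[3.0, 2.0, 1.0]`: a constant shift in the advantage head is absorbed by the mean subtraction. The mean of the resulting Q-values equals V(s), which is what pins down the value stream.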

**Rainbow DQN** (DeepMind, 2017) combines six improvements: Double DQN, Dueling, Prioritized Replay, Multi-step returns, Distributional RL, and Noisy Networks. On the 57-game Atari benchmark, Rainbow matches DQN's final performance after about 7 million frames, where DQN itself was trained for 200 million. Each component contributes; in the paper's ablations, removing Prioritized Replay or Multi-step returns hurts the most.

**Misconception:** DQN improvements (Double, Dueling, Prioritized Replay) are incremental tweaks with marginal gains.

**Reality:** Each major DQN improvement addresses a distinct failure mode and provides significant gains; combined in Rainbow, they reach DQN's final Atari performance with a small fraction of the training frames.

Experience replay fixes data correlation. Target network fixes moving targets. Double DQN fixes overestimation bias. Dueling fixes inefficient learning in state-dominated scenarios. Prioritized Replay fixes uniform sampling waste. These are orthogonal problems with orthogonal fixes - combining them compounds the benefit.

What is the purpose of subtracting mean(A) in the dueling formula Q = V + A - mean(A)?

Key ideas

  • **Experience Replay** stores (s, a, r, s', done) tuples in a circular buffer; training samples random mini-batches, breaking temporal correlation that destabilizes neural network updates
  • **Target Network** is a frozen copy of the online network, synced every C steps (hard update) or blended continuously (soft update); it prevents the Bellman target from chasing itself
  • **Double DQN** fixes Q-value overestimation by separating action selection (online network) from action evaluation (target network) in the Bellman backup
  • **Dueling Architecture** splits Q(s,a) into V(s) + A(s,a) - mean(A); the value stream learns state quality independently of actions, accelerating learning in states where action choice matters little

Related topics

DQN is the bridge between classical Q-learning and modern deep RL:

  • Q-Learning and Temporal Difference — DQN is Q-learning with a neural network as function approximator; all the classical TD theory applies
  • Policy Gradient: REINFORCE — An alternative family that directly optimizes the policy rather than learning Q-values; avoids the max operator overestimation issue
  • Proximal Policy Optimization (PPO) — The dominant modern algorithm combining actor-critic with a stability trick analogous to the target network concept

Questions for reflection

  • Experience replay breaks the temporal ordering of data. In what scenarios might this be harmful - where does the i.i.d. assumption of replay actually fail?
  • The target network is updated every C = 10,000 steps in the original DQN. What happens if C is too small? Too large? How does soft updating (Polyak averaging) trade off between these extremes?
  • Rainbow combines six improvements. If compute is limited and only two can be chosen, which two provide the most benefit - and why might the answer differ between sparse-reward and dense-reward environments?

Related lessons

  • rl-06 — Q-Learning is the algorithm DQN approximates with a network
  • rl-08 — Policy gradient is the alternative deep RL family
  • ml-25-neural-networks — Neural networks approximate the Q-function
  • ml-49-q-learning — DQN scales tabular Q-learning to high-dimensional states
  • ml-28-optimizers — Gradient optimizers train the Q-network stably
  • dl-01