Reinforcement Learning
Actor-Critic: A2C, A3C
REINFORCE estimates gradients from complete episodes: the signal is unbiased but noisy. In its basic form, every action is praised or blamed for the outcome of the whole episode, including actions that contributed nothing to it. Actor-Critic introduces a Critic network that evaluates each state, so the Actor gets per-step feedback instead of episode-level noise. The result is typically faster and more stable learning.
- **OpenAI Five** (Dota 2) was trained with PPO, an actor-critic method that uses GAE, across thousands of parallel game instances
- **A2C** is a standard baseline in modern RL libraries such as Stable-Baselines3
- **GAE** is the usual advantage estimator in PPO, TRPO, and most on-policy actor-critic implementations
- **Entropy bonus** is a common component of policy gradient methods
From Barto-Sutton-Anderson to A3C
The actor-critic idea dates to 1983, when Andrew Barto, Richard Sutton, and Charles Anderson published "Neuronlike adaptive elements that can solve difficult learning control problems", training the pole-balancing cart with a separate action unit (actor) and an adaptive critic. The architecture stayed mostly academic until 2016, when Volodymyr Mnih and colleagues at DeepMind published "Asynchronous Methods for Deep Reinforcement Learning", introducing A3C - many actor-learners running in parallel environments to decorrelate experience without a replay buffer. A2C is the synchronous variant that waits for all workers before a single batched update, and it usually matches A3C while being simpler and more GPU-friendly.
Prerequisites
Advantage Function: why raw reward is not enough
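In its simplest form, REINFORCE weights every gradient term in an episode by the same total return G. Every action shares that weight, regardless of whether that particular step was clever or accidental. Using the return-to-go G_t removes the influence of earlier rewards, but the weight still lumps together everything that happens after step t. The signal says 'good episode' but not 'good action'.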
The **advantage function** fixes this: instead of the absolute return, it measures how much better an action was than the policy's average expectation for that state: A(s, a) = Q(s, a) - V(s). If the agent expected +5 and received +7, the advantage is +2. That number is far more informative than +7 alone.
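As a small numeric illustration of the definition above, the sketch below computes advantages for a single state. The action names, Q-values, and policy probabilities are made up for the example; V(s) is taken as the policy-weighted average of Q(s, a).

```python
# Hypothetical numbers for one state s: action-values and policy probabilities.
q_values = {"left": 3.0, "stay": 5.0, "right": 7.0}
policy = {"left": 0.25, "stay": 0.50, "right": 0.25}

# V(s) is the expected Q-value under the current policy.
v = sum(policy[a] * q_values[a] for a in q_values)

# A(s, a) = Q(s, a) - V(s): positive means better than expected.
advantages = {a: q_values[a] - v for a in q_values}

print(v)           # 5.0
print(advantages)  # {'left': -2.0, 'stay': 0.0, 'right': 2.0}
```

Here "right" reproduces the +5 expected, +7 received case from the text. Weighted by the policy, the advantages average to zero, which is why the advantage is a centred learning signal.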
**Baseline trick:** V(s) acts as a baseline. Subtracting it does not bias the gradient estimator (any baseline that depends only on the state, not on the action, leaves the expectation unchanged) but can dramatically reduce variance. V(s) is not the exact variance-minimising baseline, but it works well in practice and is the standard choice because the Critic already estimates it.
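The claim can be checked exactly on a tiny example. The sketch below assumes a softmax policy over three actions in a single state, with illustrative probabilities and Q-values, and enumerates all actions instead of sampling. It computes the exact mean and total variance of the estimator grad log pi(a) * (Q(a) - b) for b = 0 and for b = V(s). The helper names are hypothetical.

```python
# Softmax policy over 3 actions in one state; the numbers are illustrative.
probs = [0.5, 0.3, 0.2]
q_values = [10.0, 11.0, 12.0]
n = len(probs)

def grad_log_pi(action):
    """Gradient of log pi(action) w.r.t. the softmax logits: one_hot(action) - probs."""
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(n)]

def mean_and_variance(baseline):
    """Exact mean and total variance of grad_log_pi(a) * (Q(a) - baseline) under the policy."""
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        g = [x * (q_values[a] - baseline) for x in grad_log_pi(a)]
        for k in range(n):
            mean[k] += probs[a] * g[k]
        second_moment += probs[a] * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

v = sum(p * q for p, q in zip(probs, q_values))  # V(s) = E_pi[Q(s, a)]
mean_raw, var_raw = mean_and_variance(baseline=0.0)
mean_adv, var_adv = mean_and_variance(baseline=v)

print(mean_raw, var_raw)
print(mean_adv, var_adv)
```

Both means agree up to floating-point rounding, while the variance falls from roughly 75 to roughly 0.25 for these particular numbers. The size of the reduction depends on how large the Q-values are relative to their spread.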
Advantage A(s,a) = Q(s,a) - V(s). When A > 0, what does this mean?
Parallel training in A3C
A central problem in RL is **strong correlation between consecutive transitions**. An agent training on a single environment collects highly similar experiences, which causes the network to overfit to local patterns and lose generalisation.
**A3C (Asynchronous Advantage Actor-Critic)**, introduced by DeepMind in 2016, runs N independent workers, each in its own environment copy. Every worker collects experience and computes gradients asynchronously, updating a shared **global** network. Correlation breaks naturally - different environments yield diverse experiences.
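The asynchronous update pattern can be sketched with plain threads. This is a toy under loud assumptions: the "global network" is a single number, each worker's "environment" is its own random stream, and the gradient comes from a 1-D quadratic rather than an actor-critic loss. The lock is a simplification chosen for this sketch, not a statement about how A3C is implemented.

```python
import random
import threading

# Shared "global" parameter, standing in for the global network's weights.
global_params = {"theta": 0.0, "updates": 0}
lock = threading.Lock()

def worker(seed, n_steps, lr=0.05, target=3.0):
    """One worker: reads the shared parameter, computes a local gradient, pushes an update."""
    rng = random.Random(seed)  # each worker sees different experience
    for _ in range(n_steps):
        theta = global_params["theta"]         # possibly stale copy of the parameters
        sample = target + rng.gauss(0.0, 1.0)  # noisy local observation
        grad = 2.0 * (theta - sample)          # gradient of (theta - sample)^2
        with lock:                             # apply the update to the shared parameter
            global_params["theta"] -= lr * grad
            global_params["updates"] += 1

threads = [threading.Thread(target=worker, args=(seed, 200)) for seed in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(global_params["updates"])  # 800
print(global_params["theta"])    # near 3.0; the exact value depends on thread scheduling
```

No worker waits for the others: each one reads whatever the shared parameter currently is, which may already be out of date by the time its gradient is applied. That staleness is the price of asynchrony.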
**A2C** is the synchronous variant: all workers collect one batch simultaneously, then the model updates once. Simpler to implement, no race conditions, GPU-friendly. In practice it often matches A3C in final performance.
| Property | A2C (synchronous) | A3C (asynchronous) |
|---|---|---|
| Workers | Wait for each other | Update independently |
| GPU efficiency | High (single batch) | Lower (many small updates) |
| Implementation | Simpler | Complex (race conditions) |
| Sample throughput | Limited by slowest env | Higher on CPU-bound envs |
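The synchronous column of the table can be illustrated with a minimal lockstep rollout. `ToyEnv`, `toy_policy`, and `collect_batch` are hypothetical names invented for this sketch. A real setup would use a vectorised environment wrapper and a neural network policy.

```python
import random

class ToyEnv:
    """Hypothetical 1-D random-walk environment, used only to illustrate batching."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = 0.0

    def step(self, action):
        # Move by the action plus noise; reward is higher closer to the origin.
        self.state += action + self.rng.gauss(0.0, 0.1)
        return self.state, -abs(self.state)

def toy_policy(state):
    """Placeholder policy: push back toward the origin."""
    return -0.5 * state

def collect_batch(envs, policy, n_steps):
    """Step all envs in lockstep; results are indexed as [step][env]."""
    states, rewards = [], []
    for _ in range(n_steps):
        actions = [policy(env.state) for env in envs]             # one batched "forward pass"
        results = [env.step(a) for env, a in zip(envs, actions)]  # every env must finish its step
        states.append([s for s, _ in results])
        rewards.append([r for _, r in results])
    return states, rewards

envs = [ToyEnv(seed) for seed in range(8)]
states, rewards = collect_batch(envs, toy_policy, n_steps=5)

print(len(states), len(states[0]))  # 5 8
```

The whole n_steps x n_envs batch would then feed a single gradient update, which is what makes A2C GPU-friendly. The same loop also shows the cost: each step takes as long as the slowest environment.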
What is the main reason to use multiple parallel environments in A3C/A2C?
GAE: Generalized Advantage Estimation
Estimating the advantage involves a bias-variance tradeoff. A 1-step TD estimate has low variance but is biased whenever the Critic is wrong. A Monte Carlo estimate is unbiased but has high variance. **GAE** (Schulman et al., 2015) interpolates between the two: it is a geometrically weighted sum of TD errors, A_t = sum over l >= 0 of (gamma * lambda)^l * delta_{t+l}, where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), controlled by a single parameter lambda in [0, 1].
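A minimal sketch of GAE for a single trajectory follows. It assumes the usual TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and uses the backward recursion A_t = delta_t + gamma * lambda * A_{t+1}. The function name and the sample numbers are illustrative, and episode termination inside the trajectory is deliberately not handled.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory with no early termination.

    rewards[t] and values[t] = V(s_t) for t = 0..T-1; last_value = V(s_T).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # 1-step TD error
        gae = delta + gamma * lam * gae                      # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = gae
        next_value = values[t]
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5]

td_only = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
monte_carlo = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)

print(td_only)      # [1.5, 0.5, 0.5]
print(monte_carlo)  # [2.5, 1.0, 0.5]
```

The two extremes bracket the tradeoff: with lam=0.0 each advantage is just the local TD error, and with lam=1.0 it equals the full return minus V(s_t). Values in between shift weight gradually from the Critic's estimate to the observed rewards.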
What is GAE with lambda=0 equivalent to?
Entropy Bonus for exploration
Actor-Critic without additional measures tends toward **premature convergence**: a reasonably good solution is found and the policy collapses to near-deterministic, cutting off exploration. The entropy bonus is a simple and effective counter-measure.
Policy entropy H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s) is maximal for a uniform distribution and zero for a deterministic one. Subtracting it from the loss (so that minimising the loss maximises entropy) penalises overconfidence and keeps the policy exploring.
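The two extremes are easy to verify numerically. The sketch below evaluates the entropy formula on three hand-picked four-action distributions; the distributions themselves are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p log p in nats; terms with p == 0 contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
deterministic = [1.0, 0.0, 0.0, 0.0]

print(entropy(uniform))        # log(4), about 1.386
print(entropy(peaked))         # small but positive
print(entropy(deterministic))  # 0.0
```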
**Tuning entropy_coef:** if it is too large, the policy stays nearly uniform and the agent never converges. If it is too small, the policy collapses to a deterministic one. Starting at 0.01 is a reasonable default. In PPO, entropy_coef is sometimes annealed toward zero as training matures.
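To see where entropy_coef enters, here is a scalar sketch of the combined loss L_actor + c_v * L_critic - c_e * H(pi) on a hand-made batch of two transitions. The function name, the batch values, and the value_coef=0.5 default are assumptions of this example. In practice these quantities would be tensors in an autodiff framework, with the advantages treated as constants.

```python
import math

def a2c_loss(log_probs, advantages, values, returns, action_probs,
             value_coef=0.5, entropy_coef=0.01):
    """Scalar A2C-style loss for a batch: L_actor + c_v * L_critic - c_e * H(pi).

    log_probs[i]    log pi(a_i | s_i) of the action actually taken
    advantages[i]   advantage estimate, treated as a constant
    values[i]       critic prediction V(s_i)
    returns[i]      regression target for the critic
    action_probs[i] full action distribution pi(. | s_i)
    """
    n = len(log_probs)
    actor_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    critic_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n
    entropy = sum(
        sum(-p * math.log(p) for p in dist if p > 0.0) for dist in action_probs
    ) / n
    return actor_loss + value_coef * critic_loss - entropy_coef * entropy

batch = dict(
    log_probs=[math.log(0.7), math.log(0.2)],
    advantages=[1.0, -0.5],
    values=[2.0, 1.0],
    returns=[3.0, 0.5],
    action_probs=[[0.7, 0.3], [0.2, 0.8]],
)

print(a2c_loss(**batch, entropy_coef=0.0))
print(a2c_loss(**batch, entropy_coef=0.01))  # lower: entropy is rewarded
```

Raising entropy_coef makes the entropy term dominate the other two, which is the mechanism behind the "nearly uniform policy" failure mode described above.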
With a very large entropy_coef, what is the training outcome?
Summary: Actor-Critic (A2C, A3C)
- Advantage A(s,a) = Q(s,a) - V(s) reduces gradient variance without introducing bias
- A3C: N async workers update a global model, breaking transition correlation
- A2C: synchronous variant, simpler and more GPU-efficient
- GAE's lambda controls the bias-variance tradeoff between 1-step TD (lambda=0) and Monte Carlo (lambda=1)
- Entropy bonus -H(pi) maintains exploration and prevents policy collapse
- Full A2C loss = L_actor + c_v * L_critic - c_e * H(pi)
Related topics
Actor-Critic is the foundation for modern RL algorithms. PPO, SAC, and TD3 all build on these ideas.
- Policy Gradient and REINFORCE - the base method whose gradient variance Actor-Critic reduces
- PPO: Proximal Policy Optimization - the next step: stabilising Actor-Critic with a clipped objective
- Deep Q-Networks - an alternative approach: a value network only, with no explicit Actor
Questions for reflection
- Why does subtracting the baseline V(s) not bias the gradient estimate while reducing variance?
- What is the fundamental conceptual difference between A2C and A3C, as opposed to the implementation-level one?
- How does GAE with lambda=0.95 behave in very long episodes (1000+ steps)?
Related lessons
- rl-08 — Actor-Critic improves on the REINFORCE estimator
- rl-10 — PPO is a clipped, stabilized actor-critic
- ml-50-policy-gradient — Same policy gradient theory in the ML track
- prob-08-variance — The critic baseline reduces gradient variance
- ml-25-neural-networks — Actor and critic are two neural networks
- prob-01-intro