Reinforcement Learning

Actor-Critic: A2C, A3C

REINFORCE estimates gradients from complete episodes: the signal is unbiased but noisy. Every action, even an accidental one, is praised or blamed for the outcome of the whole episode. Actor-Critic adds a Critic network that estimates the value of each state, so the Actor gets per-step feedback instead of episode-level noise. The result is usually faster, more stable learning.

**OpenAI Five** (Dota 2) was trained with a scaled-up version of PPO, an actor-critic method, running thousands of games in parallel
**A2C** is a standard baseline in modern RL libraries such as Stable-Baselines3
**GAE** is used in PPO, TRPO and most modern actor-critic algorithms
**Entropy bonus** is a common component of policy gradient methods such as A2C, A3C and PPO

From Barto-Sutton-Anderson to A3C

The actor-critic idea dates to 1983, when Andrew Barto, Richard Sutton, and Charles Anderson published "Neuronlike adaptive elements that can solve difficult learning control problems", training the pole-balancing cart with a separate action unit (actor) and an adaptive critic. The architecture stayed mostly academic until 2016, when Volodymyr Mnih and colleagues at DeepMind published "Asynchronous Methods for Deep Reinforcement Learning", introducing A3C - many actor-learners running in parallel environments to decorrelate experience without a replay buffer. A2C is the synchronous variant that waits for all workers before a single batched update, and it usually matches A3C while being simpler and more GPU-friendly.

Prerequisites

Advantage Function: why raw reward is not enough

REINFORCE weights the gradient at step t by the return G_t, the discounted sum of all rewards that follow. Each action is credited with everything that happened afterwards, regardless of whether that particular step was clever or accidental. The signal says 'good episode' but not 'good action'.

The **advantage function** fixes this: instead of the absolute return, it measures how much better an action was than the average expectation for that state. A(s, a) = Q(s, a) - V(s). If the agent expected a return of +5 from the state and the chosen action yielded +7, the advantage is +2. That number is far more informative than +7 alone.

**Baseline trick:** V(s) acts as a baseline. Subtracting it does not bias the gradient estimator (any baseline independent of the action leaves the expectation unchanged) but can reduce variance dramatically. V(s) is not the exact variance-minimising baseline, but it is a close and convenient choice, and the Critic learns it anyway.
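The baseline subtraction can be illustrated with a minimal sketch, assuming a finished episode and a Monte Carlo estimate A_t = G_t - V(s_t). The function names and the toy numbers are hypothetical, chosen only for illustration; a real Critic would supply the values.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return-to-go G_t = r_t + gamma * G_{t+1} for one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages_with_baseline(rewards: List[float], values: List[float],
                             gamma: float = 0.99) -> List[float]:
    """Monte Carlo advantage estimate: A_t = G_t - V(s_t)."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


# Toy episode: three steps, reward only at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.25, 0.75]   # made-up Critic predictions V(s_t)
adv = advantages_with_baseline(rewards, values, gamma=1.0)
print(adv)  # [0.5, 0.75, 0.25]
```

Every step sees the same return of 1.0, yet the advantages differ: the step where the Critic expected the least (0.25) receives the largest positive signal.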


Parallel training in A3C

A central problem in RL is **strong correlation between consecutive transitions**. An agent training on a single environment collects highly similar experiences, which causes the network to overfit to local patterns and lose generalisation.

**A3C (Asynchronous Advantage Actor-Critic)**, introduced by DeepMind in 2016, runs N independent workers, each in its own environment copy. Every worker collects experience and computes gradients asynchronously, updating a shared **global** network. Correlation breaks naturally - different environments yield diverse experiences.

**A2C** is the synchronous variant: all workers collect one batch simultaneously, then the model updates once. Simpler to implement, no race conditions, GPU-friendly. In practice it often matches A3C in final performance.

| Property | A2C (synchronous) | A3C (asynchronous) |
| --- | --- | --- |
| Workers | Wait for each other | Update independently |
| GPU efficiency | High (single batch) | Lower (many small updates) |
| Implementation | Simpler | Complex (race conditions) |
| Sample throughput | Limited by slowest env | Higher on CPU-bound envs |
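The synchronous collection pattern can be sketched as follows. This is a toy illustration, not a training loop: `ToyEnv` is a hypothetical random-walk environment, and a random choice stands in for the Actor. The point is only the lock-step structure, where every environment copy advances one step before any of them moves on, and a single update would then use the whole batch.

```python
import random
from typing import List, Tuple


class ToyEnv:
    """Hypothetical 1-D random-walk environment, used only for illustration."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action: int) -> Tuple[int, float]:
        # action is -1 or +1; per-env noise makes each copy follow its own path.
        self.state += action + self.rng.choice([-1, 0, 1])
        reward = 1.0 if self.state > 0 else 0.0
        return self.state, reward


def collect_sync_batch(envs: List[ToyEnv], n_steps: int,
                       policy_rng: random.Random) -> list:
    """A2C-style collection: all envs take step t before any takes step t+1."""
    batch = []  # batch[t][i] = (state, action, reward) for env i at step t
    for _ in range(n_steps):
        step_data = []
        for env in envs:
            action = policy_rng.choice([-1, 1])  # placeholder for the Actor
            state, reward = env.step(action)
            step_data.append((state, action, reward))
        batch.append(step_data)
    return batch


envs = [ToyEnv(seed=i) for i in range(4)]
batch = collect_sync_batch(envs, n_steps=5, policy_rng=random.Random(0))
# A single gradient update would now use all 5 * 4 transitions at once.
print(len(batch), len(batch[0]))  # 5 4
```

An asynchronous (A3C-style) version would instead let each worker run this loop on its own and push gradients to the shared network whenever it finishes, which is where the race conditions in the table come from.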


GAE: Generalized Advantage Estimation

Estimating advantage involves a bias-variance tradeoff. 1-step TD has low variance but biased estimates (the Critic can be wrong). Monte Carlo is unbiased but has high variance. **GAE** (Schulman et al., 2015) interpolates between the two: it is an exponentially weighted sum of TD errors over future steps, and a single parameter lambda in [0, 1] controls how quickly the weights decay, and therefore where the estimate sits between 1-step TD and Monte Carlo.
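In the document's notation, the GAE estimate is A_t = sum over l >= 0 of (gamma * lambda)^l * delta_{t+l}, where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) is the TD error. A minimal sketch of the usual backward recursion follows. It assumes a single rollout segment with no episode boundary inside it (real implementations also mask terminal steps), and the function name and toy numbers are hypothetical.

```python
from typing import List


def compute_gae(rewards: List[float], values: List[float], last_value: float,
                gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """GAE over one rollout segment with no episode boundary inside it.

    rewards[t] and values[t] cover steps 0..T-1; last_value is V(s_T),
    the Critic's bootstrap estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae                      # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = gae
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]   # made-up Critic predictions
td_like = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
mc_like = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
print(td_like)  # [1.0, 1.0, 0.5]
print(mc_like)  # [2.5, 1.5, 0.5]
```

With lam=0.0 each advantage is just the 1-step TD error; with lam=1.0 the result equals the Monte Carlo return minus V(s_t). Values in between trade bias against variance.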


Entropy Bonus for exploration

Actor-Critic without additional measures tends toward **premature convergence**: a reasonably good solution is found and the policy collapses to near-deterministic, cutting off exploration. The entropy bonus is a simple and effective counter-measure.

Policy entropy H(pi(s)) = -sum pi(a|s) log pi(a|s) is maximal for a uniform distribution and zero for a deterministic one. Adding it to the loss with a negative sign (maximising entropy) penalises overconfidence and keeps the policy exploring.

**Tuning entropy_coef:** too large a value keeps the policy nearly uniform, so the agent never converges. Too small a value lets the policy collapse to deterministic. Starting at 0.01 is a reasonable default. In PPO, entropy_coef is sometimes annealed toward zero as training matures.
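A minimal sketch of how the entropy term enters the combined loss, for a discrete action distribution. The function names and the `value_coef` default of 0.5 are assumptions made for illustration; only `entropy_coef` comes from the text above.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; actions with probability 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_critic_loss(policy_loss: float, value_loss: float,
                      probs: Sequence[float],
                      value_coef: float = 0.5,
                      entropy_coef: float = 0.01) -> float:
    """Total loss to minimise; the entropy enters with a negative sign."""
    return policy_loss + value_coef * value_loss - entropy_coef * entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximal entropy, log(4) nats
peaked = [1.0, 0.0, 0.0, 0.0]        # deterministic policy, zero entropy

loss_uniform = actor_critic_loss(1.0, 1.0, uniform)
loss_peaked = actor_critic_loss(1.0, 1.0, peaked)
print(loss_uniform < loss_peaked)  # True
```

With the other terms held equal, the uniform policy gets the lower loss, which is exactly the pressure against premature collapse. Raising `entropy_coef` strengthens that pressure relative to the policy and value terms.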
