Causal Calculus
Causal Reinforcement Learning
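A bandit that learns from logged, observational data cannot distinguish 'ads work' from 'wealthy users see ads and buy anyway'. A causal bandit targets the interventional quantity $Q^*(a) = E[R|\mathrm{do}(A=a)]$ instead of the observational $E[R|A=a]$, using do-calculus to identify it. When the effect is identifiable, this removes confounding bias; with a known causal structure of dimension $d$, it can also reduce regret from $O(\sqrt{|\mathcal{A}|T\log T})$ to $O(\sqrt{dT\log T})$.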
- Ad personalization: causal effect of showing an ad, not correlation with a click
- Adaptive clinical trials: selecting the next treatment based on do-estimates
- Recommendation systems: deconfounding popularity from personal preference
- Autonomous agents: planning with knowledge of the environment's causal structure
- A/B testing under interference: network effects violate SUTVA
Lesson objectives
- Formalize the causal Q-function via the do-operator and distinguish it from standard Q-learning
- Understand importance-sampling (IS) correction for offline policy evaluation
- Compare regret $O(\sqrt{dT\log T})$ of causal bandits with $O(\sqrt{|\mathcal{A}|T\log T})$ of standard UCB
Prerequisites
- Do-calculus and the backdoor criterion
- Multi-armed bandits: UCB, Thompson Sampling, regret
- Q-learning and the Bellman equation
Causal Q-function
Standard Q-learning estimates $Q(s,a) = E[R|S=s, A=a]$ - the observational correlation. Under confounding (unobserved variables influencing both policy and reward), this is biased. Causal Q: $Q^*(s,a) = E[R|S=s, \mathrm{do}(A=a)]$ - the true intervention effect. They coincide when there is no confounding.
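To see where the two quantities part ways, a short worked comparison may help. It assumes, purely for illustration, a single discrete confounder $U$ (a symbol not used elsewhere in this lesson) that, together with the pre-action state $S$, blocks every backdoor path from $A$ to $R$:

$$
\begin{aligned}
E[R \mid S=s, A=a] &= \sum_u E[R \mid S=s, A=a, U=u]\, P(U=u \mid S=s, A=a), \\
E[R \mid S=s, \mathrm{do}(A=a)] &= \sum_u E[R \mid S=s, A=a, U=u]\, P(U=u \mid S=s).
\end{aligned}
$$

Under that assumption the two sums differ only in the weights placed on $u$. If the behavior policy chooses $A$ independently of $U$ given $S$, then $P(U=u \mid S=s, A=a) = P(U=u \mid S=s)$ and the two Q-values are equal.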
Importance-sampling correction
Offline policy evaluation of $\pi$ using data from behavioral policy $\mu$: $V^\pi = E_{\mu}\left[\frac{\pi(A|S)}{\mu(A|S)} R\right]$. IS weights $w = \pi/\mu$ correct the distribution. When $\pi$ deviates strongly from $\mu$, variance grows; the doubly-robust estimator reduces it via an additive outcome model correction.
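The two estimators above can be sketched on a toy one-step (contextual bandit) problem. Everything concrete here is an assumption made for illustration: the policies `mu` and `pi`, the reward model `mean_reward`, the sample sizes, and the choice to fit a tabular outcome model `q_hat` on a separate batch of logs. The doubly-robust form used is the common "model baseline plus IS-weighted residual" variant.

```python
import random

ACTIONS = (0, 1)

def mu(a, s):
    """Behavior policy mu(a|s); hypothetical numbers."""
    p1 = 0.8 if s == 1 else 0.3
    return p1 if a == 1 else 1.0 - p1

def pi(a, s):
    """Target policy pi(a|s); hypothetical numbers."""
    p1 = 0.4 if s == 1 else 0.6
    return p1 if a == 1 else 1.0 - p1

def mean_reward(s, a):
    """True E[R | S=s, A=a] of the toy environment (unknown to the estimators)."""
    return 0.2 + 0.5 * a + 0.2 * s

def collect(n, seed):
    """Log n (state, action, reward) triples under the behavior policy mu."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        s = 1 if rng.random() < 0.5 else 0
        a = 1 if rng.random() < mu(1, s) else 0
        r = 1.0 if rng.random() < mean_reward(s, a) else 0.0
        logs.append((s, a, r))
    return logs

def is_estimate(logs):
    """Importance-sampling estimate: mean of (pi/mu) * R over the logged data."""
    return sum(pi(a, s) / mu(a, s) * r for s, a, r in logs) / len(logs)

def fit_outcome_model(logs):
    """Tabular outcome model: empirical mean reward for each (s, a) cell."""
    sums, counts = {}, {}
    for s, a, r in logs:
        sums[(s, a)] = sums.get((s, a), 0.0) + r
        counts[(s, a)] = counts.get((s, a), 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

def dr_estimate(logs, q_hat):
    """Doubly-robust estimate: model baseline plus IS-weighted residual."""
    total = 0.0
    for s, a, r in logs:
        baseline = sum(pi(b, s) * q_hat[(s, b)] for b in ACTIONS)
        w = pi(a, s) / mu(a, s)
        total += baseline + w * (r - q_hat[(s, a)])
    return total / len(logs)

# Ground truth for the toy problem: P(S=0) = P(S=1) = 0.5.
true_value = sum(0.5 * pi(a, s) * mean_reward(s, a) for s in (0, 1) for a in ACTIONS)

q_hat = fit_outcome_model(collect(5_000, seed=1))  # outcome model from a separate batch
logs = collect(50_000, seed=2)                     # evaluation batch
v_is = is_estimate(logs)
v_dr = dr_estimate(logs, q_hat)
print("true:", true_value, "IS:", v_is, "DR:", v_dr)
```

In this toy setup the weights never exceed 3, so both estimates land close to the true value; the gap between IS and DR becomes more visible as `pi` is moved further from `mu`.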
Causal bandit regret
Standard UCB bandit with $|\mathcal{A}|$ actions: regret $O(\sqrt{|\mathcal{A}|T\log T})$. With a known causal structure (DAG) with $d$ parent variables of the action, causal UCB achieves $O(\sqrt{dT\log T})$. When $d \ll |\mathcal{A}|$, the bound improves by a factor of about $\sqrt{|\mathcal{A}|/d}$: structural knowledge shrinks the search space.
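For reference, the baseline in this comparison can be sketched as plain UCB1 on Bernoulli arms. This is the standard, structure-agnostic algorithm only, not the causal variant; the arm means, horizon, and seed below are made-up values for illustration.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return pull counts and cumulative pseudo-regret."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total observed reward per arm
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # initialization: pull every arm once
        else:
            # Pick the arm with the largest upper confidence bound.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]  # expected loss of this pull
    return counts, regret

counts, regret = ucb1([0.2, 0.5, 0.8], horizon=5_000)
print("pulls per arm:", counts, "pseudo-regret:", regret)
```

Because UCB1 treats every arm as unrelated, each of the $|\mathcal{A}|$ arms needs its own exploration budget; that is the cost a causal variant tries to avoid by sharing information through the graph.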
Causal Q-function and Do-Calculus in MDP
Lu et al. (2021) showed that if the environment is described by an SCM, the causal Q-function $Q^{\mathrm{causal}}(s,a) = E\left[\sum_t \gamma^t R_t \mid S=s, \mathrm{do}(A=a)\right]$ removes the bias that the standard $Q(s,a) = E[R|S=s, A=a]$ inherits from confounders influencing action selection. The difference is critical in off-policy learning: biased value estimates lead to a suboptimal policy.
How does the causal Q-function differ from the standard Q(s,a)?
Causal Agent Regret and Environment SCM
Lattimore et al. (2016) showed that an agent with known causal structure achieves regret $O(\sqrt{dT\log T})$, where $d$ is the DAG dimension, versus $O(\sqrt{|\mathcal{A}|T\log T})$ for a naive bandit. With $|\mathcal{A}| = 100$ and $d = 5$, this is a $\sqrt{20} \approx 4.5\times$ reduction in the bound. The agent builds an SCM of the environment and chooses interventions on parents of the target variable rather than directly.
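The arithmetic behind that factor, written out under the assumption that both bounds share the same constant and the same $\log T$ factor (which $O$-notation alone does not guarantee):

$$
\frac{\sqrt{|\mathcal{A}|\,T\log T}}{\sqrt{d\,T\log T}} = \sqrt{\frac{|\mathcal{A}|}{d}} = \sqrt{\frac{100}{5}} = \sqrt{20} \approx 4.47
$$

The horizon $T$ cancels, so the ratio depends only on how much smaller $d$ is than $|\mathcal{A}|$.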
What regret does an agent with known causal environment structure of DAG dimension d achieve?
Causal bandit: two arms with a confounder
Confounder $Z$ affects both the action $A$ (ads are shown to wealthy users) and the reward $R$ (wealthy users buy more). The observational estimate then satisfies $E[R|A=1] > E[R|A=0]$ even if ads do not work. If $Z$ is observed and satisfies the backdoor criterion, the adjustment formula gives $E[R|\mathrm{do}(A=a)] = \sum_z E[R|A=a, Z=z]P(Z=z)$: averaging over the marginal distribution of $Z$ removes the bias.
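A small simulation can make the ad example concrete. The sketch below assumes a binary, observed confounder and uses made-up probabilities; by construction the ad has no causal effect on purchases, so any gap in the naive comparison is pure confounding. With these numbers the population-level naive gap works out to about 0.21, while the adjusted gap is zero.

```python
import random

def simulate(n=100_000, seed=0):
    """Draw (z, a, r) triples in which the ad has NO causal effect on purchase."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = 1 if rng.random() < 0.3 else 0                   # Z: wealthy user?
        a = 1 if rng.random() < (0.8 if z else 0.2) else 0   # A: ads target wealthy users
        r = 1 if rng.random() < (0.5 if z else 0.1) else 0   # R: purchase depends on Z only
        data.append((z, a, r))
    return data

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def observational(data, a):
    """Naive estimate E[R | A=a]: plain conditional mean."""
    return mean(r for _, act, r in data if act == a)

def backdoor(data, a):
    """Adjusted estimate: sum over z of E[R | A=a, Z=z] * P(Z=z)."""
    n = len(data)
    total = 0.0
    for z_val in (0, 1):
        p_z = sum(1 for z, _, _ in data if z == z_val) / n
        cond_mean = mean(r for z, act, r in data if z == z_val and act == a)
        total += cond_mean * p_z
    return total

data = simulate()
naive_gap = observational(data, 1) - observational(data, 0)
adjusted_gap = backdoor(data, 1) - backdoor(data, 0)
print("naive gap:", naive_gap, "adjusted gap:", adjusted_gap)
```

The adjustment only works here because $Z$ is recorded in the logs; with an unobserved confounder the stratified means cannot be computed and a different identification strategy would be needed.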
Summary
- Causal Q $= E[R|\mathrm{do}(A=a)]$ differs from observational $E[R|A=a]$ under policy confounding
- IS correction $w = \pi/\mu$ provides unbiased offline policy evaluation when $\mu$ is known and covers $\pi$; the DR estimator reduces variance
- Causal UCB with DAG structure: regret $O(\sqrt{dT\log T})$ vs $O(\sqrt{|\mathcal{A}|T\log T})$ when $d \ll |\mathcal{A}|$
Connections to other topics
Causal RL unifies structural causal models with reinforcement learning theory. Related directions: offline RL with deconfounding, world models based on SCMs for planning, causal inference in recommendation systems for popularity-bias elimination.
Questions for reflection
- When does the causal Q coincide with the standard Q? What structural condition guarantees this?
- IS weights can be very large when policies diverge strongly. How does weight clipping affect bias and variance?
- If the DAG is unknown, can you simultaneously discover structure and minimize regret? What trade-off arises?