Causal Calculus

Causal Reinforcement Learning

A standard bandit cannot distinguish 'ads work' from 'wealthy users see ads and buy anyway'. A causal bandit targets the interventional quantity $Q^*(a) = E[R|\mathrm{do}(A=a)]$ instead of the observational $E[R|A=a]$, which removes confounding bias when the causal effect is identifiable. When the causal structure is known, it can also reduce regret from $O(\sqrt{|\mathcal{A}|T\log T})$ to $O(\sqrt{dT\log T})$.

Ad personalization: causal effect of showing an ad, not correlation with a click
Adaptive clinical trials: selecting the next treatment based on do-estimates
Recommendation systems: deconfounding popularity from personal preference
Autonomous agents: planning with knowledge of the environment's causal structure
A/B testing under interference: network effects violate SUTVA

Lesson objectives

Formalize the causal Q-function via the do-operator and distinguish it from standard Q-learning
Understand importance-sampling (IS) correction for offline policy evaluation
Compare the $O(\sqrt{dT\log T})$ regret of causal bandits with the $O(\sqrt{|\mathcal{A}|T\log T})$ regret of standard UCB

Prerequisites

Do-calculus and the backdoor criterion
Multi-armed bandits: UCB, Thompson Sampling, regret
Q-learning and the Bellman equation

Causal Q-function

Standard Q-learning estimates $Q(s,a) = E[R|S=s, A=a]$, an observational quantity. Under confounding (unobserved variables that influence both the logged action choice and the reward), this is a biased estimate of the effect of acting. The causal Q-function $Q^*(s,a) = E[R|S=s, \mathrm{do}(A=a)]$ is the true effect of the intervention. The two coincide when there is no confounding.
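
To make the gap between the two quantities concrete, suppose for illustration that there is a single confounder $Z$ (a symbol assumed here for the example) that satisfies the backdoor criterion given $S$. Both Q-functions then average the same conditional means, but with different weights on $Z$:

$$
\begin{aligned}
Q(s,a) &= \sum_z E[R \mid S=s, A=a, Z=z]\, P(Z=z \mid S=s, A=a),\\
Q^*(s,a) &= \sum_z E[R \mid S=s, A=a, Z=z]\, P(Z=z \mid S=s).
\end{aligned}
$$

The weights agree, and hence $Q = Q^*$, whenever $Z$ is independent of $A$ given $S$, that is, when the action choice does not depend on the confounder.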

Importance-sampling correction

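Offline policy evaluation of a target policy $\pi$ using data logged under a behavior policy $\mu$: $V^\pi = E_{\mu}\left[\frac{\pi(A|S)}{\mu(A|S)} R\right]$, which requires $\mu(a|s) > 0$ wherever $\pi(a|s) > 0$. The IS weights $w = \pi/\mu$ reweight samples drawn under $\mu$ to the distribution induced by $\pi$. When $\pi$ deviates strongly from $\mu$, the weights become large and the variance grows. The doubly-robust (DR) estimator reduces this variance by combining the weights with an outcome model: the model supplies a baseline prediction, and the weights correct only its residual.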
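
The following is a minimal sketch of the IS and DR estimators in a one-step (contextual bandit) setting. The reward table `Q_TRUE`, the policies `mu` and `pi`, and the deliberately biased outcome model `q_hat` are all hypothetical values chosen for illustration; they are not taken from any dataset or library.

```python
import random

# Hypothetical true mean rewards for (state, action); illustration only.
Q_TRUE = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.3}
ACTIONS = (0, 1)

def mu(a, s):
    """Behavior policy (assumed): uniform over the two actions."""
    return 0.5

def pi(a, s):
    """Target policy (assumed): picks action 1 with probability 0.9."""
    return 0.9 if a == 1 else 0.1

def q_hat(s, a):
    """Outcome model, deliberately biased by +0.1 to mimic model error."""
    return Q_TRUE[(s, a)] + 0.1

def log_data(n, seed=0):
    """Sample (s, a, r) tuples under the behavior policy mu."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = rng.randrange(2)
        a = 1 if rng.random() < mu(1, s) else 0
        r = 1.0 if rng.random() < Q_TRUE[(s, a)] else 0.0
        data.append((s, a, r))
    return data

def is_estimate(data):
    """Importance-sampling estimate of V^pi: mean of w * r with w = pi / mu."""
    return sum(pi(a, s) / mu(a, s) * r for s, a, r in data) / len(data)

def dr_estimate(data):
    """Doubly-robust estimate: model baseline plus IS-weighted residual."""
    total = 0.0
    for s, a, r in data:
        baseline = sum(pi(b, s) * q_hat(s, b) for b in ACTIONS)
        w = pi(a, s) / mu(a, s)
        total += baseline + w * (r - q_hat(s, a))
    return total / len(data)

# Ground truth for this toy problem (states are uniform over {0, 1}).
true_value = 0.5 * sum(pi(a, s) * Q_TRUE[(s, a)] for s in (0, 1) for a in ACTIONS)

data = log_data(50_000)
v_is = is_estimate(data)
v_dr = dr_estimate(data)
```

In this sketch both estimates should land close to `true_value`, even though `q_hat` is biased, because the weights are computed from the known behavior policy. With an estimated $\mu$ or clipped weights, that guarantee no longer holds in general.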

Causal bandit regret

A standard UCB bandit with $|\mathcal{A}|$ actions has regret $O(\sqrt{|\mathcal{A}|T\log T})$. Given a known causal structure (DAG) with $d$ parent variables of the action, causal UCB achieves $O(\sqrt{dT\log T})$. When $d \ll |\mathcal{A}|$, the gain is large: structural knowledge shrinks the search space.
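
For reference, the standard baseline can be sketched as follows. This is plain UCB1 on Bernoulli arms, not a causal UCB variant; the arm means, horizon, and seed are assumptions chosen for illustration.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return (pull counts, cumulative pseudo-regret)."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total observed reward per arm
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull every arm once to initialise the estimates
        else:
            # Empirical mean plus the exploration bonus sqrt(2 ln t / n_i).
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]   # pseudo-regret uses the true means
    return counts, regret

counts, regret = ucb1([0.2, 0.5, 0.8], horizon=5000)
```

Every arm must be explored separately here, which is where the dependence on $|\mathcal{A}|$ comes from. A causal variant would instead share information across arms through the known graph.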

Causal Q-function and Do-Calculus in MDP

Lu et al. (2021) showed that if the environment is described by a structural causal model (SCM), the causal Q-function $Q^{\mathrm{causal}}(s,a) = E\left[\sum_t \gamma^t R_t \mid S=s, \mathrm{do}(A=a)\right]$ removes the bias that the standard $Q(s,a) = E[R|S=s, A=a]$ incurs from confounders influencing action selection. The difference is critical in off-policy learning, where biased value estimates can lead to a suboptimal policy.

How does the causal Q-function differ from the standard Q(s,a)?

Causal Agent Regret and Environment SCM

Lattimore et al. (2016) showed that an agent with known causal structure achieves regret $O(\sqrt{dT\log T})$, where $d$ is the DAG dimension, versus $O(\sqrt{|\mathcal{A}|T\log T})$ for a naive bandit. With $|\mathcal{A}| = 100$ and $d = 5$, the bound shrinks by a factor of $\sqrt{100/5} = \sqrt{20} \approx 4.5$. The agent builds an SCM of the environment and chooses interventions on the parents of the target variable rather than on the target variable directly.
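
The arithmetic behind that factor can be checked with a short sketch. It compares only the leading terms of the two bounds and ignores the constants hidden in the $O(\cdot)$ notation; the horizon value is an arbitrary assumption and cancels in the ratio.

```python
import math

def regret_bounds(num_actions, d, horizon):
    """Leading terms of the two regret bounds, with hidden constants ignored."""
    log_t = math.log(horizon)
    standard = math.sqrt(num_actions * horizon * log_t)  # sqrt(|A| T log T)
    causal = math.sqrt(d * horizon * log_t)              # sqrt(d T log T)
    return standard, causal

standard, causal = regret_bounds(num_actions=100, d=5, horizon=10_000)
ratio = standard / causal  # equals sqrt(|A| / d), independent of the horizon
```

Because $T$ and $\log T$ appear in both terms, the ratio depends only on $|\mathcal{A}|/d$.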

What regret does an agent with known causal environment structure of DAG dimension d achieve?

Causal bandit: two arms with a confounder

A confounder $Z$ affects both the action $A$ (ads are shown to wealthy users) and the reward $R$ (wealthy users buy more). The observational estimate then satisfies $E[R|A=1] > E[R|A=0]$ even if ads do not work. When $Z$ is observed and satisfies the backdoor criterion, the backdoor adjustment gives $E[R|\mathrm{do}(A=a)] = \sum_z E[R|A=a, Z=z]P(Z=z)$: averaging over the marginal distribution of $Z$ removes the bias.
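
The two-arm example can be simulated with a small sketch. All probabilities below are hypothetical values chosen so that the ad has no causal effect at all: the purchase probability depends only on $Z$. The naive difference in means still comes out clearly positive, while the backdoor-adjusted difference should be close to zero.

```python
import random

def simulate(n=100_000, seed=0):
    """Draw (z, a, r) triples from a toy SCM in which the ad has no effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.3 else 0   # Z = 1: wealthy user (assumed rate)
        p_ad = 0.8 if z else 0.2             # wealthy users see the ad more often
        a = 1 if rng.random() < p_ad else 0
        p_buy = 0.5 if z else 0.1            # purchase depends on Z only, not on A
        r = 1 if rng.random() < p_buy else 0
        rows.append((z, a, r))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def naive_effect(rows):
    """Observational contrast E[R|A=1] - E[R|A=0]."""
    treated = mean(r for _, a, r in rows if a == 1)
    control = mean(r for _, a, r in rows if a == 0)
    return treated - control

def backdoor_value(rows, a):
    """Backdoor adjustment: sum_z E[R|A=a, Z=z] * P(Z=z)."""
    n = len(rows)
    total = 0.0
    for z in (0, 1):
        p_z = sum(1 for zz, _, _ in rows if zz == z) / n
        cond_mean = mean(r for zz, aa, r in rows if zz == z and aa == a)
        total += cond_mean * p_z
    return total

rows = simulate()
naive = naive_effect(rows)
adjusted = backdoor_value(rows, 1) - backdoor_value(rows, 0)
```

The adjustment works here only because $Z$ is recorded in the data. With an unobserved confounder, the stratified means in `backdoor_value` could not be computed.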

Summary

Causal Q $= E[R|\mathrm{do}(A=a)]$ differs from observational $E[R|A=a]$ under policy confounding
IS correction $w = \pi/\mu$ provides unbiased offline policy evaluation; DR estimator reduces variance
Causal UCB with DAG structure: regret $O(\sqrt{dT\log T})$ vs $O(\sqrt{|\mathcal{A}|T\log T})$ when $d \ll |\mathcal{A}|$

Connections to other topics

Causal RL unifies structural causal models with reinforcement learning theory. Related directions: offline RL with deconfounding, world models based on SCMs for planning, causal inference in recommendation systems for popularity-bias elimination.

Questions for reflection

When does the causal Q coincide with the standard Q? What structural condition guarantees this?
IS weights can be very large when policies diverge strongly. How does weight clipping affect bias and variance?
If the DAG is unknown, can you simultaneously discover structure and minimize regret? What trade-off arises?