Statistics

Instrumental Variables

Lesson objectives

Understand the nature of endogeneity and its three sources
Know the three conditions of a valid instrument and how to defend them
Compute the IV estimate via covariances and via 2SLS
Distinguish LATE from ATE and recognize the role of compliers
Identify weak instruments and apply the Anderson-Rubin test

Prerequisites

Basic notions of causality and DAGs
Potential outcomes and SUTVA
Linear regression and ordinary least squares

Does education raise wages, or do able people both study more and earn more? An instrument provides the answer.

**Labor economics**: estimating the return to education via quarter of birth, distance to college, scholarship lotteries
**Epidemiology**: Mendelian randomization - genetic variants as instruments for cholesterol levels when estimating their effect on heart-attack risk
**ML systems**: position as an instrument for clicks in search engines and recommendation feeds
**Advertising**: random assignment to exposure groups as an instrument for actual exposure when measuring campaign lift
**Policy**: proximity to a state border with a different law as an instrument for the policy effect on employment and crime

From Wright to the Angrist Nobel

Philip Wright is credited with the first use of IV, in his 1928 study of supply and demand for flaxseed, though authorship was long attributed to his son Sewall. The modern language of structural econometrics was forged in the 1940s and 1950s by Haavelmo and Koopmans at the Cowles Commission. The credibility revolution arrived in the 1990s: Angrist and Krueger (1991) used quarter of birth, Card (1995) used proximity to college, Levitt (1997) used mayoral election cycles to instrument police hiring. In 2021 Angrist, Imbens, and Card shared the Nobel Prize in Economics for advances in causal inference from observational data.

The endogeneity problem

College graduates earn roughly 60 percent more than people with only a high-school diploma. The conclusion seems straightforward: study and the paycheck grows. But the people who attend college are not a random slice of the population - they tend to be more able, more motivated, and come from wealthier families. Their higher pay reflects not only the degree itself but also these hidden traits. A plain regression of earnings on schooling mixes the true effect of education with the effect of ability, and what comes out is a tangled correlation, not a causal answer.

Endogeneity is the situation in which an explanatory variable correlates with the error term. Three classic sources: omitted variables, reverse causation, and measurement error.

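In this case ordinary least squares stays biased even in the limit: as the sample grows, the OLS slope converges to β + Cov(X, ε) / Var(X) rather than to the true coefficient β.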

Three classic traps

Where endogeneity breaks regression

Return to education: ability affects schooling and wage - upward bias. Price and demand: price reacts to demand, demand reacts to price - simultaneity. Hospital quality: sicker patients go to better hospitals, so naive comparisons make the best hospitals look worst.

Reading "users who watch more videos" as proof that the feed got better is circular. Views are produced by recommendations, and recommendations are trained on views. Without an instrument (a randomized position, for instance) the system measures its own echo, not real quality.

The cure is a source of variation in X that is unrelated to ε. If such a source exists in the data, it is called an instrument, and it rescues the regression from bias.

Why does OLS produce a biased estimate of the return to education?

This is the textbook omitted-variable example: unobserved ability affects both schooling and wages, creating a nonzero covariance between X and ε.
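The omitted-variable story can be checked on simulated data. The sketch below is illustrative only: the wage model, the coefficients (0.10 for the true return, 2 and 0.5 for the ability effects), and the names `simulate_ols_bias` and `bias_term` are assumptions made for this example, not estimates from any real study.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

def simulate_ols_bias(n=20000, beta=0.10, seed=0):
    """Hypothetical world: unobserved ability raises both schooling and wages."""
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n)]
    schooling = [12 + 2 * a + rng.gauss(0, 1) for a in ability]
    # The structural error contains ability, so it is correlated with schooling.
    error = [0.5 * a + rng.gauss(0, 1) for a in ability]
    wage = [beta * s + e for s, e in zip(schooling, error)]
    ols_slope = cov(schooling, wage) / cov(schooling, schooling)
    bias_term = cov(schooling, error) / cov(schooling, schooling)
    return ols_slope, bias_term

ols_slope, bias_term = simulate_ols_bias()
print(f"true beta = 0.10, OLS slope = {ols_slope:.3f}, Cov(X, e)/Var(X) = {bias_term:.3f}")
```

In this setup the population value of Cov(X, ε) / Var(X) is 1 / 5 = 0.2, so the OLS slope lands near 0.30 instead of 0.10, and collecting more data does not shrink the gap.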

Instrument and three conditions

In their 1991 study Joshua Angrist and Alan Krueger spotted something strange: quarter of birth affects schooling. Children born in the first quarter start school at an older age than children born in the fourth quarter, so they reach the legal dropout age having completed less schooling. Quarter of birth is essentially random, yet it generates a small difference in years of schooling, and that difference is plausibly not tied to ability. A nearly perfect instrument - and part of the foundation of the 2021 Nobel Prize awarded to Angrist.

An instrument Z must satisfy three conditions: relevance, exclusion, and independence. Breaking any one of them breaks the causal claim.
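In a linear model the three conditions lead directly to the covariance form of the IV estimator. The derivation below is a sketch under an assumed constant-effect structural equation; the intercept α and the linear form are introduced here for illustration, and exclusion and independence are collapsed into the single requirement that Z be uncorrelated with ε.

```latex
\begin{aligned}
&\text{Structural model:} && Y = \alpha + \beta X + \varepsilon, \qquad \operatorname{Cov}(X, \varepsilon) \neq 0 \\
&\text{Relevance:} && \operatorname{Cov}(Z, X) \neq 0 \\
&\text{Exclusion and independence:} && \operatorname{Cov}(Z, \varepsilon) = 0 \\
&\text{Take covariances with } Z\text{:} && \operatorname{Cov}(Z, Y) = \beta \operatorname{Cov}(Z, X) + \operatorname{Cov}(Z, \varepsilon) = \beta \operatorname{Cov}(Z, X) \\
&\text{Solve for } \beta\text{:} && \beta_{IV} = \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Cov}(Z, X)}
\end{aligned}
```

Relevance is what allows the division in the last line, and exclusion together with independence is what makes the Cov(Z, ε) term vanish in the line before it.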

Natural experiments as instruments

Where randomness comes from

Vietnam draft lottery (Angrist 1990) - instrument for military service. Distance to college (Card 1995) - instrument for schooling. Weather - instrument for crop output and prices. State-border discontinuities - instruments for policy. School lottery slots - instruments for school quality.

Quarter of birth might affect health through prenatal seasonality. If so, the channel "quarter - health - earnings" violates exclusion. A good IV paper always spends most of its space defending exactly this assumption.

In search results and recommendation feeds the position of an item strongly affects clicks, and when positions are randomized it is unrelated to the user's preferences. Position then serves as an instrument for clicks, separating content quality from attention bias.

Which of the three conditions for a valid instrument cannot be checked with a statistical test?

Relevance is verified by the strength of the first stage (F-test). Independence is partially testable. Exclusion is an argument that Z does not affect Y through any channel besides X, and it must be defended by theory rather than data.

Two-stage least squares

The idea of 2SLS is elegant: first clean the treatment of its endogenous part, then regress the outcome on the cleaned version. The first stage extracts from X the variation explained by the instrument, which is exogenous as long as the instrument is valid. The second stage uses that clean variation to estimate the causal effect.
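The two stages can be written out by hand for the simplest case of one instrument and one treatment. This is a minimal sketch on simulated data, not a production estimator: the data-generating process, the coefficients, and the helper names (`ols`, `two_stage_least_squares`) are assumptions made for the example.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

def ols(x, y):
    """Simple regression of y on x; returns (intercept, slope)."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope

def two_stage_least_squares(z, x, y):
    # Stage 1: regress X on Z and keep the fitted values X_hat.
    intercept, slope = ols(z, x)
    x_hat = [intercept + slope * zi for zi in z]
    # Stage 2: regress Y on X_hat; the slope is the 2SLS estimate.
    return ols(x_hat, y)[1]

# Simulated data (hypothetical): u is an unobserved confounder, z is the instrument.
rng = random.Random(1)
n, true_beta = 20000, 0.10
z = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
x = [12 + zi + 2 * ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_beta * xi + 0.5 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

beta_ols = ols(x, y)[1]
beta_2sls = two_stage_least_squares(z, x, y)
beta_cov = cov(z, y) / cov(z, x)  # covariance-ratio form of the same estimator
print(f"OLS = {beta_ols:.3f}, 2SLS = {beta_2sls:.3f}, Cov(Z,Y)/Cov(Z,X) = {beta_cov:.3f}")
```

With a single instrument the 2SLS slope and the covariance ratio coincide up to floating-point error. The standard errors from a hand-run second stage are not valid, because they ignore that X_hat was itself estimated; dedicated IV routines correct for this.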

First-stage strength is measured by an F-statistic. The old rule of thumb is F greater than 10. Lee, McCrary, Moreira, and Porter (2022) showed that for the conventional 5 percent t-test on the 2SLS coefficient to have correct size with a single instrument, the first-stage F must exceed 104.7.
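With one instrument, the first-stage F-statistic is simply the squared t-statistic on Z in the regression of X on Z. The sketch below computes it under homoskedastic errors on simulated data; the function names and the first-stage coefficients 0.5 and 0.02 are illustrative assumptions.

```python
import random

def first_stage_f(z, x):
    """F-statistic for H0: slope of X on Z is zero (one instrument, so F = t^2)."""
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    slope = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x)) / szz
    rss = sum((xi - mx - slope * (zi - mz)) ** 2 for zi, xi in zip(z, x))
    sigma2 = rss / (n - 2)  # homoskedastic residual variance
    return slope ** 2 * szz / sigma2

def simulate_first_stage(pi, n=2000, seed=0):
    """Draw X = pi * Z + noise and return the first-stage F."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [pi * zi + rng.gauss(0, 1) for zi in z]
    return first_stage_f(z, x)

f_strong = simulate_first_stage(pi=0.5)
f_weak = simulate_first_stage(pi=0.02)
print(f"strong instrument F = {f_strong:.1f}, weak instrument F = {f_weak:.1f}")
```

In applied work a heteroskedasticity-robust version of this statistic is usually preferred; the plain version here is only meant to show what the number measures.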

When Cov(Z, X) is small, the denominator of the IV formula approaches zero. Estimates become erratic and drift toward the biased OLS value, conventional standard errors understate the uncertainty, and confidence intervals become fiction. A weak instrument can be worse than no instrument, because it provides false confidence.
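A small Monte Carlo experiment makes the instability visible. The sketch below redraws a hypothetical data set 200 times for a strong and a weak first stage and compares the interquartile range of the IV estimates; all coefficients and the name `iv_sampling_spread` are assumptions for illustration.

```python
import random

def iv_estimate(z, x, y):
    """Covariance-ratio IV estimate Cov(Z, Y) / Cov(Z, X)."""
    n = len(z)
    mz, mx, my = sum(z) / n, sum(x) / n, sum(y) / n
    czy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    czx = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x))
    return czy / czx

def iv_sampling_spread(pi, reps=200, n=500, beta=0.10, seed=0):
    """Interquartile range of IV estimates across simulated samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        z = [rng.gauss(0, 1) for _ in range(n)]
        u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
        x = [pi * zi + 2 * ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
        y = [beta * xi + 0.5 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
        estimates.append(iv_estimate(z, x, y))
    estimates.sort()
    return estimates[3 * reps // 4] - estimates[reps // 4]

strong_iqr = iv_sampling_spread(pi=1.0)
weak_iqr = iv_sampling_spread(pi=0.02)
print(f"IQR of IV estimates: strong = {strong_iqr:.3f}, weak = {weak_iqr:.3f}")
```

The interquartile range is used instead of the standard deviation because weak-instrument estimates have very heavy tails, so a handful of extreme draws would dominate a variance-based summary.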

The Anderson-Rubin test

Robust to weak instruments

Standard 2SLS confidence intervals fail when the instrument is weak. The Anderson-Rubin test (1949) inverts a hypothesis test on β = β0 and remains valid even under extremely weak relevance. Packages such as ivreg2 (Stata) and ivmodel (R) report this test.
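The inversion can be sketched for the single-instrument case: for each candidate β0, regress Y - β0·X on Z and keep β0 if Z has no detectable predictive power. The code below is a simplified illustration that assumes homoskedastic errors and uses the large-sample chi-square(1) critical value 3.841 for a 5 percent test; the simulated data and function names are hypothetical, and this is not the implementation used by any particular package.

```python
import random

def ar_statistic(z, x, y, beta0):
    """Squared t-statistic on Z in a regression of (Y - beta0 * X) on Z."""
    n = len(z)
    r = [yi - beta0 * xi for xi, yi in zip(x, y)]
    mz, mr = sum(z) / n, sum(r) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    slope = sum((zi - mz) * (ri - mr) for zi, ri in zip(z, r)) / szz
    rss = sum((ri - mr - slope * (zi - mz)) ** 2 for zi, ri in zip(z, r))
    return slope ** 2 * szz / (rss / (n - 2))

def ar_confidence_set(z, x, y, grid, critical_value=3.841):
    """All grid values beta0 that the AR test does not reject at the 5% level."""
    return [b0 for b0 in grid if ar_statistic(z, x, y, b0) <= critical_value]

def cov_sum(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b))

# Simulated data (hypothetical): u is an unobserved confounder.
rng = random.Random(2)
n, true_beta = 2000, 0.10
z = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
x = [0.5 * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_beta * xi + 0.5 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

grid = [i / 100 for i in range(-50, 71)]  # candidate values from -0.50 to 0.70
accepted = ar_confidence_set(z, x, y, grid)
beta_iv = cov_sum(z, y) / cov_sum(z, x)
print(f"IV estimate = {beta_iv:.3f}, AR set = [{min(accepted):.2f}, {max(accepted):.2f}]")
```

The AR statistic is exactly zero at the IV point estimate, so the accepted set always surrounds it. With a weak instrument the set can become very wide or even unbounded, which is the honest answer in that situation; a finite grid like the one here would then be accepted from end to end.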

Researchers at Facebook and YouTube use 2SLS with position as the instrument for clicks. This gives a consistent estimate of the recommendation effect on consumption, separating it from the attention effect. The same approach measures ad effectiveness: random assignment to exposure groups serves as the instrument for actual exposure.

Approach	What it measures	When to apply
OLS	Correlation	Only without confounders
2SLS with strong instrument	LATE on compliers	When valid Z exists
2SLS with weak instrument	Noise disguised as effect	Never without correction
Anderson-Rubin test	Robust interval	Whenever strength is in doubt

What does the first stage of 2SLS do?

The first stage regresses X on Z and produces X_hat - the projection of treatment onto the instrument space, the exogenous component of X.

LATE versus ATE: the effect on compliers

The central insight of modern IV theory: the method does not estimate the average effect for the whole population but only the effect on so-called compliers - those who change their behavior in response to the instrument. This sharply limits generalization and made IV the subject of a fierce debate between Imbens and Heckman in the 1990s.

In response to instrument Z people fall into four types: always-takers (always take the treatment), never-takers (never), compliers (follow the instrument), defiers (do the opposite). IV sees only compliers.

Type	Z=0	Z=1	Visible effect?
Always-taker	X=1	X=1	No
Never-taker	X=0	X=0	No
Complier	X=0	X=1	Yes - contributes to LATE
Defier	X=1	X=0	Breaks monotonicity

For LATE to exist an extra assumption is needed: monotonicity, meaning no defiers. Under this condition the IV estimate equals the average causal effect among compliers exactly. Imbens and Angrist proved this in 1994, and it is precisely this theory that earned them the Nobel Prize alongside Card.
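The result can be illustrated with a simulation in which the compliance types are known by construction. Everything in the sketch below is assumed for the example: the type shares, the type-specific effects and baselines, and the name `simulate_wald`. There are no defiers, so monotonicity holds.

```python
import random

# Hypothetical population: shares, treatment effects, and baseline outcomes by type.
SHARES = {"always": 0.2, "never": 0.3, "complier": 0.5}
EFFECT = {"always": 3.0, "never": 0.5, "complier": 1.0}
BASELINE = {"always": 2.0, "never": 0.0, "complier": 1.0}

def simulate_wald(n=50000, seed=0):
    """Wald (IV) estimate with a randomly assigned binary instrument Z."""
    rng = random.Random(seed)
    types = rng.choices(list(SHARES), weights=list(SHARES.values()), k=n)
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}  # per arm: sum of Y, sum of X, count
    for t in types:
        z = rng.randint(0, 1)
        # Always-takers and never-takers ignore Z; compliers follow it.
        x = 1 if t == "always" else 0 if t == "never" else z
        y = BASELINE[t] + EFFECT[t] * x + rng.gauss(0, 1)
        arm = sums[z]
        arm[0] += y
        arm[1] += x
        arm[2] += 1
    mean_y = {arm: s[0] / s[2] for arm, s in sums.items()}
    mean_x = {arm: s[1] / s[2] for arm, s in sums.items()}
    return (mean_y[1] - mean_y[0]) / (mean_x[1] - mean_x[0])

wald = simulate_wald()
ate = sum(SHARES[t] * EFFECT[t] for t in SHARES)
print(f"Wald estimate = {wald:.2f}, complier effect = {EFFECT['complier']}, ATE = {ate:.2f}")
```

The Wald estimate tracks the complier effect of 1.0 rather than the population average of 1.25, because always-takers and never-takers do not change treatment with Z and therefore cancel out of both the numerator and the denominator.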

Charter school lottery

Who the compliers are

A lottery assigns charter-school seats at random. Families who would get their child into the charter school regardless of the draw are always-takers. Families who would stay in the regular school under any outcome are never-takers. Families who attend the charter school only when they win the lottery are compliers. The charter-school effect from IV is the effect only on this marginal group and need not match the effect on the whole population.

If treatment effects are heterogeneous (and in practice they almost always are), LATE can differ sharply from the average effect across the population. Distance to college as an instrument yields a LATE for families to whom geography matters - usually lower-income households, for whom the return to education may be either higher or lower than average.

In mobile apps not every user reacts to push notifications, banners, or invitations to try a new feature. ITT (intent-to-treat) - the simple difference of means between assignment groups - gives the practical effect of a rollout. LATE from IV (assignment as the instrument for actual use) gives the effect on those who actually engaged. These are different numbers, and a product manager needs both.
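The two numbers are linked by simple arithmetic: with a binary assignment, LATE is the ITT divided by the difference in actual usage rates between the assignment groups. The helper below is a sketch; the function name `late_from_itt` and the figures in the usage line (an ITT of 0.4 and 25 percent uptake) are hypothetical.

```python
def late_from_itt(itt, uptake_assigned, uptake_control=0.0):
    """Scale the intent-to-treat effect by the first stage (difference in uptake rates)."""
    first_stage = uptake_assigned - uptake_control
    if first_stage <= 0:
        raise ValueError("assignment must raise uptake for the ratio to be meaningful")
    return itt / first_stage

# Hypothetical rollout: assignment lifts the metric by 0.4 per user,
# but only 25% of assigned users open the feature and nobody in control can.
late = late_from_itt(itt=0.4, uptake_assigned=0.25)
print(f"ITT = 0.4 per assigned user, LATE = {late:.1f} per user who engaged")
```

The ratio attributes the whole ITT to the users who engaged, which is only justified if assignment affects the metric through actual use and nothing else; this is the exclusion restriction again.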

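When does LATE equal ATE? The clean case is homogeneous effects, where treatment acts identically on everyone and it does not matter whom one measures. The two also coincide if compliers happen to have the same average effect as the rest of the population, but nothing guarantees that. In practice homogeneity is rare, and modern IV literature always discusses who the compliers are in any specific study.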

What is LATE in the context of IV?

LATE (Local Average Treatment Effect) is the causal effect only on compliers - those whose behavior shifts in response to the instrument. Without monotonicity this interpretation falls apart.

Where IV connects further

Instrumental variables are not an isolated tool but a link in the broader chain of causal analysis.

Potential Outcomes Framework — IV estimator is defined within the LATE framework - applies only to compliers, not the full population
Regression Discontinuity Design — RDD and IV share the logic of exploiting exogenous variation - a threshold vs an instrument

Key ideas

Endogeneity - correlation between X and the error - biases OLS and breaks causal inference
An instrument Z must be relevant, satisfy exclusion, and be independent
The exclusion restriction cannot be tested statistically, only argued
2SLS implements IV in two stages: clean X through Z, then regress Y on X_hat
Weak instruments (F below 10) explode variance and produce invalid confidence intervals
IV estimates LATE - the effect on compliers only, not the population ATE
Natural experiments (lotteries, weather, policy discontinuities) supply valid instruments
In ML, position serves as an instrument that cleans clicks of attention bias

Questions for reflection

What sources of randomness inside app data could be repurposed as instruments?
If one had to defend exclusion for quarter of birth, which alternative channels would need to be ruled out?
Why are compliers in a distance-to-college study likely to be lower-income families?
How does LATE differ from the effect a product manager cares about when evaluating a new feature?
How does a Bayesian view help interpret an IV estimate under a weak instrument?

Related lessons

stat-41-causal-potential-outcomes — potential outcomes are the language of causality
stat-44-causal-did — alternative identification strategy
stat-39-causal-confounders — IV handles unobserved confounders
stat-40-causal-rct — natural analog of randomization
la-06-gauss