Statistics

Difference-in-Differences

Lesson objectives

Understand the 2x2 design and the regression form of DiD
Know the parallel trends assumption and how to check it indirectly
Distinguish classic DiD from staggered designs and spot TWFE pitfalls
Apply modern estimators such as Callaway-Sant'Anna and Sun-Abraham
Master diagnostics: placebo tests, event studies, synthetic control

Prerequisites

Basic notions of causality and potential outcomes
Linear regression with fixed effects
Panel data analysis and clustered standard errors

When randomization is impossible, two differences replace one experiment - and that simple idea remade labor economics.

**Economic policy**: evaluating minimum-wage hikes, tax credits, and trade tariffs across states and countries
**Health care**: the effect of Medicaid expansion - comparing adopting states with non-adopters
**ML systems**: rollout of a new app version to one geography with matched control markets
**Epidemiology**: Snow's classic cholera study - the first empirical DiD, 140 years before formal methodology
**Finance**: the effect of Basel rules on bank lending - countries before and after adoption

From Snow to Goodman-Bacon

John Snow's 1854 cholera investigation in London is often called the first DiD study. He compared mortality in households served by different water companies before and after the Lambeth Company changed its intake. The formal econometric framework took shape in the 1980s and 1990s. The Card and Krueger 1994 paper on the New Jersey minimum wage turned DiD into the default tool of policy evaluation. The staggered-adoption problem was uncovered by Goodman-Bacon in a 2018 working paper and addressed in 2020-2021 by de Chaisemartin-D'Haultfoeuille, Callaway-Sant'Anna, and Sun-Abraham.

The 2x2 design

It is 1992. New Jersey raises its minimum wage from 4.25 to 5.05 dollars per hour. Every economics textbook predicts that employment will fall: when the price of labor goes up, demand should drop. David Card and Alan Krueger propose an unexpected move: compare New Jersey fast-food restaurants with restaurants in neighboring Pennsylvania, where the minimum wage did not change. Before and after, in one group and in the other. Employment in New Jersey did not drop - it actually rose slightly relative to Pennsylvania. This 1994 paper, part of the work for which Card shared the 2021 Nobel Prize, challenged the consensus and turned difference-in-differences into the workhorse of empirical social science.

DiD takes a difference of differences: the change over time in the treated group minus the change over time in the control group. A simple 2x2 table strips out both stable group differences and common time trends.

Group	Pre	Post	Difference
Treated	Y_T0	Y_T1	Y_T1 - Y_T0
Control	Y_C0	Y_C1	Y_C1 - Y_C0
DiD	-	-	(Y_T1 - Y_T0) - (Y_C1 - Y_C0)

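The regression form of this design is compact and convenient for adding controls and fixed effects:

$$Y_{it} = \alpha + \beta\,\mathrm{Treated}_i + \gamma\,\mathrm{Post}_t + \delta\,(\mathrm{Treated}_i \times \mathrm{Post}_t) + \varepsilon_{it}$$

Here Treated_i equals 1 for units in the treated group and Post_t equals 1 in the post-treatment period. The interaction coefficient δ is the DiD estimate of the causal effect. β captures the stable difference between groups, γ captures the common time trend.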
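As a quick check of the regression form, the sketch below fits the dummy-variable regression by ordinary least squares on a made-up eight-observation dataset and compares the interaction coefficient with the 2x2 table calculation. The data and the helper `ols` are illustrative assumptions, not part of the lesson; in practice a statistics library would also supply clustered standard errors.

```python
from statistics import mean

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * v for r, v in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting keeps the elimination numerically stable.
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Hypothetical observations: (treated, post, outcome).
rows = [(1, 0, 10), (1, 0, 12), (1, 1, 15), (1, 1, 17),
        (0, 0, 8), (0, 0, 10), (0, 1, 10), (0, 1, 12)]

# Design matrix: intercept, Treated, Post, Treated x Post.
X = [[1, tr, po, tr * po] for tr, po, _ in rows]
y = [out for _, _, out in rows]
alpha, beta, gamma, delta = ols(X, y)

def cell_mean(tr, po):
    """Mean outcome in one cell of the 2x2 table."""
    return mean(out for t, p, out in rows if (t, p) == (tr, po))

did_table = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(round(delta, 6), did_table)
```

Because the regression is saturated (one parameter per cell), the interaction coefficient and the table calculation agree up to floating-point error.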

Card and Krueger on minimum wage

What the numbers showed

In New Jersey average full-time-equivalent employment per restaurant rose from 20.4 to 21.0 after the minimum-wage hike. In Pennsylvania over the same window it fell from 23.3 to 21.2. DiD effect: (21.0 - 20.4) - (21.2 - 23.3) = 0.6 - (-2.1) = 2.7 workers per restaurant in favor of the treated group. The decades-long consensus on the negative employment effect of the minimum wage was thrown into serious doubt.
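The arithmetic can be reproduced in a few lines. This is a minimal sketch using the rounded figures quoted in the text; the function name `did_2x2` is a hypothetical helper. It also prints the naive before-after estimate for New Jersey alone, which misses most of the effect.

```python
def did_2x2(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Rounded per-restaurant employment figures quoted in the text.
nj_pre, nj_post = 20.4, 21.0
pa_pre, pa_post = 23.3, 21.2

before_after = nj_post - nj_pre  # naive estimate using New Jersey only
did = did_2x2(nj_pre, nj_post, pa_pre, pa_post)

print(round(before_after, 1))  # 0.6
print(round(did, 1))           # 2.7
```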

Netflix, Airbnb, and large retailers routinely use DiD to evaluate a new feature. The launch is rolled out in one geography, while matched markets with similar pre-launch dynamics serve as controls. The difference-in-differences isolates the feature effect from seasonality and industry-wide shifts.

What exactly does DiD remove that a simple before-after comparison does not?

A before-after comparison mixes the treatment effect with any general time shift in the economy. DiD subtracts the time trend of the control group and isolates the treatment effect.

The parallel trends assumption

DiD identifies a causal effect only under one key assumption: in the absence of treatment, both groups would have moved in parallel. Otherwise the post-treatment gap may reflect divergent trajectories rather than the treatment effect. This assumption cannot be tested directly - it concerns the counterfactual world in which the treated group is not treated. It can, however, be checked indirectly by asking whether the groups moved in parallel before treatment.

Parallel trends is an identification assumption about an unobserved counterfactual, not a fact that can be read off the data. If the two lines stayed parallel before the intervention, there is reason to believe, though no guarantee, that they would have continued that way.

Event study

How to look at trends

Instead of a single pre point and a single post point, a regression is run with indicators of relative time: -3 years, -2, -1, 0, +1, +2, +3 from treatment, with one pre-period (usually -1) omitted as the reference. Coefficients before 0 close to zero are consistent with parallel trends. Rising coefficients after 0 reveal a dynamic effect. This format has become a standard in top journals after the work of Dobbie-Waldfogel and Clarke.
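With a single common treatment date, the event-study coefficients reduce to the treated-minus-control gap at each relative period, measured against the gap in the reference period. The sketch below assumes that simple setting, uses -1 as the reference, and runs on invented numbers; `event_study` is a hypothetical helper, and a real analysis would estimate the same quantities by regression with standard errors.

```python
from statistics import mean

def event_study(panel, reference=-1):
    """panel maps relative time -> (treated outcomes, control outcomes).
    Returns the treated-control gap at each period minus the gap at `reference`."""
    gap = {k: mean(t) - mean(c) for k, (t, c) in panel.items()}
    return {k: gap[k] - gap[reference] for k in sorted(gap)}

# Hypothetical outcomes by years relative to treatment.
panel = {
    -3: ([12, 14], [10, 12]),
    -2: ([13, 15], [11, 13]),
    -1: ([14, 16], [12, 14]),
    0: ([16, 18], [13, 15]),
    1: ([18, 20], [14, 16]),
    2: ([20, 22], [15, 17]),
    3: ([21, 23], [16, 18]),
}

coefs = event_study(panel)
print(coefs)  # {-3: 0, -2: 0, -1: 0, 0: 1, 1: 2, 2: 3, 3: 3}
```

Flat zeros before period 0 are what parallel pre-trends look like; the rising values afterwards are the dynamic effect.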

If treated regions were growing faster before treatment, plain DiD will attribute that growth to the intervention. Large shocks that hit regions unevenly, such as COVID, make such divergence especially severe. Fixes: synthetic control (Abadie et al., 2010), staggered DiD estimators that allow heterogeneous effects, and group-specific time trends added to the regression.

Pick a fake treatment date before the actual intervention and estimate DiD using only pre-treatment data. The estimate should be close to zero. If it shows a significant effect where none can exist, the parallel trends assumption is in question.
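A placebo check of this kind might look as follows. The sketch works on group means from pre-treatment periods only; the series and the helper `placebo_did` are made up for illustration, and a real test would also need a standard error to judge significance.

```python
from statistics import mean

def placebo_did(treated, control, fake_start):
    """DiD on pre-treatment data only, pretending treatment began at index fake_start."""
    def change(series):
        return mean(series[fake_start:]) - mean(series[:fake_start])
    return change(treated) - change(control)

# Hypothetical group means for four periods, all before the real intervention.
control = [7, 8, 9, 10]
parallel_treated = [10, 11, 12, 13]
diverging_treated = [10, 12, 14, 16]

print(placebo_did(parallel_treated, control, fake_start=2))   # 0.0
print(placebo_did(diverging_treated, control, fake_start=2))  # 2.0
```

The second series grows faster than the control before any treatment, so the placebo "effect" of 2.0 is pure pre-trend.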

Sometimes parallelism has to be enforced through matching: for every treated unit a control is selected with similar pre-trend dynamics. DiD is then applied to the matched sample. This is the popular matching+DiD workflow used in labor-market and health-policy research.
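One simple way to sketch the matching+DiD workflow is nearest-neighbor matching on the pre-period slope, followed by a pairwise DiD. The version below is an assumption-laden toy: it matches with replacement on a single summary of the pre-trend, uses invented units, and the helpers `pre_trend` and `matched_did` are hypothetical. Applied work typically matches on richer pre-treatment information.

```python
def pre_trend(pre):
    """Average per-period change over the pre-treatment window."""
    return (pre[-1] - pre[0]) / (len(pre) - 1)

def matched_did(treated, controls):
    """Match each treated unit to the control with the closest pre-trend
    (with replacement), then average the pairwise DiD estimates."""
    effects = {}
    for name, (pre, post) in treated.items():
        match = min(controls,
                    key=lambda c: abs(pre_trend(controls[c][0]) - pre_trend(pre)))
        c_pre, c_post = controls[match]
        # Change from the last pre-period to the post-period, treated minus matched control.
        effects[name] = (match, (post - pre[-1]) - (c_post - c_pre[-1]))
    att = sum(e for _, e in effects.values()) / len(effects)
    return effects, att

# Hypothetical units: name -> (pre-period outcomes, post-period outcome).
treated = {"T1": ([10, 12], 17), "T2": ([20, 21], 24)}
controls = {"C1": ([8, 10], 12), "C2": ([30, 31], 32), "C3": ([5, 9], 13)}

effects, att = matched_did(treated, controls)
print(effects)  # {'T1': ('C1', 3), 'T2': ('C2', 2)}
print(att)      # 2.5
```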

Comparing a cohort of users who registered in January with a cohort from March can violate parallel trends - later cohorts have different motivations and a different environment. Network effects in social products create the same issue: the control group gets contaminated through friends in the treatment group.

The parallel trends assumption cannot be tested directly. What is done instead?

The assumption itself concerns the counterfactual and cannot be tested. Indirect checks include pre-trend parallelism, placebo tests, and event studies with relative time.

Staggered adoption and TWFE

Real-world policy is rarely introduced for everyone on the same day. US states legalize cannabis in different years, countries impose lockdowns weeks apart, banks roll out new credit rules in stages. When different units receive treatment at different times, the classic regression with two-way fixed effects (TWFE) turns causal inference into a minefield.

TWFE is a regression with unit and time fixed effects plus a treatment indicator. Until 2018 it was widely considered a natural multi-period extension of DiD. Goodman-Bacon (2021) showed that it is a weighted average of many pairwise DiD comparisons, and under heterogeneous effects the coefficient can take the wrong sign.

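The trouble: the coefficient δ from this regression is a weighted average of all pairwise 2x2 DiD comparisons across cohorts, including comparisons in which an already-treated cohort serves as the control for a later-treated one. If the effect in the already-treated cohort is still changing, that change is subtracted as if it were a time trend, so some cohort-period treatment effects enter δ with negative weights. Under heterogeneous dynamic effects the resulting average need not resemble any effect of interest.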

Goodman-Bacon decomposition

What hides inside TWFE

With three cohorts (early, middle, late) and a treatment effect that grows over time, TWFE will use the early cohort as control for the late one. But by the time the late cohort is treated, the effect in the early cohort is already growing - the control is itself moving. The result: the early cohort's later effects enter the estimate with negative weights, and the final estimate may be biased toward the wrong sign.
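The wrong-sign problem can be reproduced in a deliberately tiny setting. The sketch below assumes two cohorts instead of three, three periods, no noise, and an effect that jumps from 1 to 5 in the second treated period; all numbers are invented. It computes the TWFE coefficient on a balanced panel by double-demeaning the treatment indicator (the Frisch-Waugh-Lovell route), and `twfe_coefficient` is a hypothetical helper.

```python
def twfe_coefficient(y, d):
    """TWFE slope on a balanced panel: partial unit and period fixed effects
    out of d by double-demeaning, then regress y on the residual."""
    n, t = len(d), len(d[0])
    row = [sum(r) / t for r in d]
    col = [sum(d[i][j] for i in range(n)) / n for j in range(t)]
    grand = sum(row) / n
    d_res = [[d[i][j] - row[i] - col[j] + grand for j in range(t)]
             for i in range(n)]
    # d_res is orthogonal to the fixed effects, so raw y can be used here.
    num = sum(d_res[i][j] * y[i][j] for i in range(n) for j in range(t))
    den = sum(v * v for r in d_res for v in r)
    return num / den

effect_path = [1, 5]   # effect in the first and second treated period
start = [1, 2]         # first treated period of the early and late cohort
unit_fe = [10, 20]
time_fe = [0, 1, 2]

d = [[1 if t >= s else 0 for t in range(3)] for s in start]
effects = [[effect_path[t - s] if t >= s else 0 for t in range(3)] for s in start]
y = [[unit_fe[i] + time_fe[t] + effects[i][t] for t in range(3)] for i in range(2)]

treated_effects = [effects[i][t] for i in range(2) for t in range(3) if d[i][t]]
true_att = sum(treated_effects) / len(treated_effects)
twfe = twfe_coefficient(y, d)
print(round(true_att, 3), round(twfe, 3))  # 2.333 -1.0
```

Every individual effect is positive, yet the TWFE coefficient is negative: in this panel the early cohort's second-period effect of 5 enters with weight -1/2.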

Method	Staggered treatment	Heterogeneous effects	Year
TWFE	formally works	breaks	pre-2018
Callaway-Sant'Anna	correct	correct	2021
Sun-Abraham	correct	correct	2021
Borusyak-Jaravel-Spiess	correct	correct	2024

Callaway and Sant'Anna (2021) proposed estimating ATT(g, t) for each cohort-period combination separately and then aggregating with carefully chosen weights. Sun and Abraham (2021) provided the analogous correction for event studies. Borusyak, Jaravel, and Spiess (2024) built an imputation estimator that reconstructs counterfactual values for every treated observation. All three remove the negative weights.
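The cohort-period idea can be illustrated by hand. The sketch below is a simplified stand-in, not the published estimator or any package API: it assumes a never-treated comparison group, no covariates, no inference, and equal weights when averaging. The panel is invented, with true effects of 1 in the first treated period and 5 in the second, and `att_gt` is a hypothetical helper.

```python
def att_gt(y, start, never, g, t):
    """ATT(g, t): outcome change of cohort g from its last pre-period (g - 1)
    to period t, minus the same change for the never-treated group."""
    cohort = [u for u, s in start.items() if s == g]
    def avg_change(units):
        return sum(y[u][t] - y[u][g - 1] for u in units) / len(units)
    return avg_change(cohort) - avg_change(never)

# Hypothetical outcomes for periods 0, 1, 2.
y = {"early": [10, 12, 17], "late": [20, 21, 23], "never": [30, 31, 32]}
start = {"early": 1, "late": 2}   # first treated period of each cohort
never = ["never"]

atts = {(g, t): att_gt(y, start, never, g, t)
        for g in sorted(set(start.values())) for t in range(g, 3)}
overall = sum(atts.values()) / len(atts)   # equal weights, an illustrative choice
print(atts)  # {(1, 1): 1.0, (1, 2): 5.0, (2, 2): 1.0}
```

Each comparison uses only clean controls, so no already-treated cohort is ever subtracted as if it were untreated.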

A platform changes moderation rules across 50 states over a year. Naive TWFE can give a misleading result because of effect heterogeneity and varied launch timing. The Callaway-Sant'Anna estimator instead reports an effect for each launch cohort, and that can flip the rollout decision.

The literature of 2018-2024 has effectively rewritten the methodology of panel data. What stood as the standard for decades turned out to be a trap, and many older results have been revisited. It is a rare case where statistical theory caught up with practice thirty years late.

What did Goodman-Bacon show about TWFE under staggered treatment?

In 2021 Goodman-Bacon proved that the TWFE estimate under staggered adoption is a weighted average of many pairwise DiD comparisons across cohorts. Some of these comparisons use already-treated cohorts as controls, so under heterogeneous dynamic effects some treatment effects receive negative weights, and the final estimate can carry the wrong sign.

Applications and diagnostics

The most famous DiD study predates the methodology by about 140 years. In London in 1854 John Snow compared cholera mortality in houses served by two water companies - Southwark & Vauxhall (dirty water from the Thames) and Lambeth (clean water from upstream). During the 1849 epidemic both companies drew water from the polluted stretch of the river; in 1852 Lambeth moved its intake upstream. Comparing mortality before and after the move across houses served by the two companies is essentially DiD, long before econometrics formalized the method.

Snow's cholera map

The first empirical DiD

Houses served by Southwark & Vauxhall had 315 deaths per 10000 households. Lambeth houses had 37 per 10000 - an 8.5x gap. The picture becomes especially compelling when the same neighborhoods are compared before and after Lambeth changed its intake. This contrast helped reject the miasma theory and establish waterborne cholera transmission decades before Koch.

Modern requirements for a DiD study: demonstrate pre-trend parallelism, run placebo tests on pre-periods, check sensitivity to alternative control groups, and show stability under different time windows.

Diagnostic	What it checks	What to do on failure
Pre-trend event study	parallelism before treatment	matching, synthetic control
Placebo on pre-periods	false effects	respecify the model
Alternative controls	robustness	weight the controls
Goodman-Bacon decomposition	weights in TWFE	Callaway-Sant'Anna
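For the alternative-controls row, one simple sensitivity check is to drop each control unit in turn and recompute the 2x2 estimate. The sketch below assumes pre and post means per unit and uses invented numbers; `leave_one_out` is a hypothetical helper rather than a standard routine.

```python
from statistics import mean

def leave_one_out(treated, controls):
    """treated: (pre, post) means; controls: name -> (pre, post) means.
    Returns the DiD estimate obtained after dropping each control unit in turn."""
    estimates = {}
    for dropped in controls:
        kept = [v for name, v in controls.items() if name != dropped]
        c_pre = mean(pre for pre, _ in kept)
        c_post = mean(post for _, post in kept)
        estimates[dropped] = (treated[1] - treated[0]) - (c_post - c_pre)
    return estimates

# Hypothetical pre/post means.
treated = (10, 15)
controls = {"A": (8, 10), "B": (9, 11), "C": (7, 12)}

estimates = leave_one_out(treated, controls)
print(estimates)  # {'A': 1.5, 'B': 1.5, 'C': 3.0}
```

Here the estimate doubles when unit C is dropped, a sign that one control is driving the result.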

When a suitable control group simply does not exist (think the reunification of Germany or Brexit), synthetic control (Abadie et al., 2010, 2015) is used. It builds a weighted combination of available controls so the weighted series matches the treated unit as closely as possible in the pre-treatment window. It is DiD on steroids for the single-treated-unit case.
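The weighting step can be sketched as follows. This toy version assumes exactly three donor units and replaces the constrained optimizer used in practice with a coarse grid search over non-negative weights that sum to one. All numbers are invented, and the treated series is constructed to be an exact 0.6/0.4 mix of the first two donors, which real data would not be. `synthetic_weights` is a hypothetical helper.

```python
def synthetic_weights(treated_pre, donors_pre, steps=20):
    """Coarse grid search over non-negative weights summing to one (three donors),
    minimizing squared pre-treatment distance to the treated unit."""
    best, best_loss = None, float("inf")
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            w = (a / steps, b / steps, (steps - a - b) / steps)
            synth = [sum(wi * d[k] for wi, d in zip(w, donors_pre))
                     for k in range(len(treated_pre))]
            loss = sum((s - v) ** 2 for s, v in zip(synth, treated_pre))
            if loss < best_loss:
                best, best_loss = w, loss
    return best

# Hypothetical pre-treatment series for three donors and the treated unit.
donors_pre = [[10, 12, 14, 16], [20, 19, 18, 17], [5, 5, 6, 6]]
treated_pre = [14.0, 14.8, 15.6, 16.4]

# One post-treatment observation per unit.
donors_post = [18, 16, 7]
treated_post = 20.0

weights = synthetic_weights(treated_pre, donors_pre)
synthetic_post = sum(w * d for w, d in zip(weights, donors_post))
gap = treated_post - synthetic_post   # estimated effect in the post period
print(weights, round(gap, 2))  # (0.6, 0.4, 0.0) 2.8
```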

When many controls exist but a direct match is poor, a two-step procedure helps: first propensity score matching on pre-treatment characteristics, then DiD on the matched sample. This combines selection-on-observables with the removal of time-invariant confounders.

Measuring the impact of an app redesign on retention through DiD: users who got the update earlier form the treatment cohort, the rest are control. On user-level panel data an event study with relative time since update install is computed. The method resists seasonality and platform-wide shifts in a way that simple before-after cannot.

Over thirty years DiD evolved from an elegant teaching trick into an entire industry of methods with dozens of variations. The core idea remained the same - subtract away the rest through a double difference - but the technical scaffolding changed radically. A modern empirical researcher must master both basic DiD and its staggered extensions.

Which tool is used when a suitable control group for DiD simply does not exist?

Synthetic control (Abadie et al. 2010, 2015) builds an artificial control from a weighted combination of available units so that it matches the treated unit in the pre-treatment period. It is the answer to unique cases such as German reunification or Brexit.

Where DiD connects with the course

Difference-in-differences relies on regression technique and panel data and branches into many extensions.

Randomized Controlled Trials - DiD approximates an RCT in observational data when the parallel trends assumption holds
Regression Discontinuity Design - DiD exploits variation over time between groups; RDD exploits a discontinuity at a threshold of a running variable. The two are complementary identification strategies

Key ideas

DiD subtracts the control group's time trend from the treated group's time trend
Regression form: the interaction Treated x Post coefficient is the effect estimate
Parallel trends cannot be tested directly; pre-trend behavior is the indirect check
Event studies with relative time are the standard for visualizing dynamic effects
TWFE under staggered treatment can yield biased estimates when effects are heterogeneous, because some treatment effects enter with negative weights
Modern estimators (Callaway-Sant'Anna, Sun-Abraham) handle heterogeneity correctly
Synthetic control covers the single-treated-unit case
Matching + DiD is the standard workflow for observational research

Questions for reflection

Which product experiment could be reframed as a DiD with a matched control group?
If parallel trends fail in the pre-period, which alternatives are still on the table?
Why can TWFE produce the wrong sign under staggered treatment?
How does synthetic control differ from plain matching?
How do network effects in social products threaten the DiD conclusion?

Related lessons

stat-40-causal-rct — DiD is an alternative to randomization
stat-43-causal-iv — different identification strategies
stat-39-causal-confounders — DiD removes time-invariant confounders
stat-45-causal-rdd — neighboring approaches to local identification
la-06-gauss