Machine Learning

A/B Testing ML Models

In 2012, one A/B test at Microsoft (Bing) generated $100 million in annual revenue - simply by changing the shade of a link color. But incorrect experiments cost even more: Netflix once lost millions after rolling out a model that passed all offline tests but failed with real users. Between offline metrics and real user behavior lies a chasm, and A/B testing is the only bridge across it.

  • **Search engines (Google, Bing, Yandex)** - run tens of thousands of A/B tests and interleaving experiments annually to evaluate ranking changes: every search algorithm update goes through a multi-level funnel of offline -> interleaving -> A/B -> gradual rollout
  • **Recommendation systems (Netflix, Spotify, YouTube)** - use multi-armed bandits to optimize homepage content: Thompson Sampling dynamically redistributes impressions between thumbnail, title, and recommendation variants, minimizing losses on ineffective options
  • **E-commerce (Amazon, Booking, Ozon)** - A/B test pricing models, product ranking, and personalization: every experiment is accompanied by sample size calculations, guardrail metric monitoring, and correction for multiple comparisons (Bonferroni/FDR)

Prerequisites

  • Search & Ranking

From Fisher's field trials to online controlled experiments

The statistical backbone of A/B testing comes from Ronald A. Fisher, who in the 1920s and 1930s developed randomized controlled experiments while working on agricultural trials at Rothamsted. His 1935 book The Design of Experiments established randomization, replication, and the null hypothesis as the foundations of valid causal inference, the same ideas that justify splitting users into control and treatment groups today. Decades later the web turned these ideas into a daily engineering practice. In the 2000s Ron Kohavi, working at Microsoft and earlier at Amazon, championed online controlled experiments at internet scale, building the infrastructure and methodology that let companies run thousands of A/B tests at once. Kohavi documented surprising results, such as how tiny interface changes could move revenue by millions, and how teams routinely overestimate their own ideas. His work made experimentation a core competency of modern product and ML teams, where every ranking model or recommendation change ships behind a controlled experiment.

Online Evaluation

You trained a model and checked metrics on the test set - accuracy 95%, NDCG@10 improved by 3%. It looks like time to deploy. But here's the problem: **offline metrics and online metrics often diverge**. A model that beats the baseline on AUC may actually hurt CTR with real users. The reason: the offline dataset reflects the past, while user behavior keeps changing. Moreover, offline metrics don't account for response latency, visual element placement, or interaction context.

That's why production ML systems use three levels of metrics. **Business metrics** - the reason the product exists: revenue, subscriptions, time in app. **Product metrics** - intermediate indicators: CTR, conversion, scroll depth. **Guardrail metrics** - constraints that must not be violated: latency (response time), crash rate, user churn. A new model may improve CTR, but if latency grew from 50ms to 500ms, the experiment has failed.

**Three levels of metrics for online experiments:**

**Business metrics** (Overall Evaluation Criteria):

  • Revenue per User
  • Number of purchases / subscriptions
  • Retention (returning users)

**Product metrics** (direct indicators):

  • CTR (Click-Through Rate)
  • Conversion Rate
  • Average session time
  • Actions per session

**Guardrail metrics** (must not degrade):

  • Latency (p50, p95, p99)
  • Error rate / crash rate
  • Churn rate
  • Unsubscribe rate

If a guardrail is violated, the experiment is stopped, even if business metrics improved.
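The guardrail rule can be sketched as a small decision function. This is an illustration only: the `MetricResult` class, the `decide` function, and the 5% tolerance are hypothetical names and values, and the sketch deliberately ignores statistical significance, which a real decision would also need.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Observed value of one metric in control and treatment (hypothetical helper)."""
    name: str
    control: float
    treatment: float
    higher_is_better: bool = True

    def relative_degradation(self) -> float:
        """Relative change in the harmful direction (positive means worse)."""
        change = (self.treatment - self.control) / self.control
        return -change if self.higher_is_better else change


def decide(primary: MetricResult, guardrails: list, max_degradation: float = 0.05) -> str:
    """Guardrails are checked first: a violation stops the experiment
    regardless of what the primary metric did."""
    for metric in guardrails:
        if metric.relative_degradation() > max_degradation:
            return f"stop: guardrail {metric.name} degraded"
    if primary.relative_degradation() < 0:  # negative degradation = improvement
        return "candidate for rollout"
    return "no rollout: primary metric did not improve"


# CTR improved, but latency grew from 50 ms to 500 ms: the guardrail wins.
ctr = MetricResult("ctr", control=0.041, treatment=0.043)
latency = MetricResult("latency_p95_ms", control=50, treatment=500, higher_is_better=False)
print(decide(ctr, [latency]))
```

Checking guardrails before the primary metric mirrors the rule above: an improvement in CTR cannot compensate for a violated constraint.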

The key principle of A/B testing is **randomization**: users are randomly distributed between variants, and each user sees only one variant throughout the entire experiment. Deterministic hashing (hash of user_id) guarantees that the same user always lands in the same group. Without this, results will be noisy: a user will see different models across different sessions and their behavior cannot be correctly attributed.
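A minimal sketch of deterministic hash-based assignment, assuming string user IDs and a per-experiment salt. The function name `assign_variant` and the choice of SHA-256 are illustrative assumptions, not something the lesson prescribes.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps assignments in different
    # experiments from being correlated with each other.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same group for a given experiment.
print(assign_variant("user-42", "ranker-v2"))
print(assign_variant("user-42", "ranker-v2"))
```

Because the assignment is a pure function of the user ID and the experiment name, no assignment table has to be stored or synchronized between services.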

**Network effects and SUTVA:** A/B testing assumes SUTVA (Stable Unit Treatment Value Assumption) - one user's outcome doesn't depend on which group other users are in. But in social networks this is violated: if a user's friend from the treatment group shares content found by the new algorithm, it affects the control-group user. Solutions: cluster-based randomization (randomize by friend clusters), geo-based experiments (different cities/regions), or switchback experiments (alternating variants over time).
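Cluster-based randomization can reuse the same hashing idea, with the hash applied to a cluster ID instead of a user ID. The sketch below assumes a precomputed `cluster_of` mapping from users to friend clusters; that mapping and both function names are hypothetical.

```python
import hashlib


def bucket(key: str) -> float:
    """Map an arbitrary string key to a stable number in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 2**32


def assign_by_cluster(user_id: str, cluster_of: dict, experiment: str,
                      treatment_share: float = 0.5) -> str:
    """All users in the same cluster receive the same variant."""
    cluster_id = cluster_of[user_id]
    in_treatment = bucket(f"{experiment}:{cluster_id}") < treatment_share
    return "treatment" if in_treatment else "control"


# Hypothetical friend clusters: alice and bob are connected, carol is not.
cluster_of = {"alice": "c1", "bob": "c1", "carol": "c2"}
print(assign_by_cluster("alice", cluster_of, "feed-ranker"))
print(assign_by_cluster("bob", cluster_of, "feed-ranker"))
```

The unit of analysis then becomes the cluster rather than the user, so the effective sample size is the number of clusters.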

A new ranking model showed NDCG@10 = 0.82 vs 0.78 for baseline on offline data. However, in an A/B test CTR dropped from 4.1% to 3.7%, and latency grew from 45ms to 120ms. What is the right decision?

Interleaving

The classic A/B test for ranking has a fundamental problem: **it's very slow**. A user sees results from only one model, and the difference between models may be smaller than the difference between users themselves. To detect a 1% CTR improvement, you might need a million users and 2 weeks of experiment. **Interleaving** solves this problem elegantly: instead of showing each user results from only one model, we blend results from two models into one list and observe which model's results the user clicks more often.

The most popular interleaving method is **Team Draft**. Think of two models as team captains in a gym class. They take turns 'picking' results into a shared list. In each round a coin flip decides which model picks first, so that neither model systematically gets the top position. The first model places its best result that has not been picked yet, then the other model does the same, and so on. The user sees the combined list without knowing which result came from which model. Clicks on each team's results determine the winner.
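A minimal sketch of Team Draft interleaving and click attribution, assuming each model returns a ranked list of document IDs. The function names are illustrative, and production implementations handle details (ties, aggregation across queries) that are omitted here.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Merge two rankings; return the merged list and each document's team."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team_of = [], {}
    while len(interleaved) < length:
        # The team with fewer picks goes next; a coin flip breaks ties,
        # which randomizes who picks first in every round.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        candidate = next((d for d in rankings[team] if d not in team_of), None)
        if candidate is None:
            # This team has nothing new to offer; let the other team pick.
            team = "B" if team == "A" else "A"
            candidate = next((d for d in rankings[team] if d not in team_of), None)
            if candidate is None:
                break  # both rankings are exhausted
        interleaved.append(candidate)
        team_of[candidate] = team
        picks[team] += 1
    return interleaved, team_of


def credit_clicks(clicked_docs, team_of):
    """Count clicks per team; the team with more clicks wins this impression."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins


merged, team_of = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], length=4
)
print(merged, credit_clicks(["d1"], team_of))
```

Over many impressions, the share of impressions won by each team is the comparison statistic.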

**Why is interleaving 10-100x more sensitive than A/B testing?**

In A/B testing we compare **between users**:

  • User X (control): CTR = 3.2%
  • User Y (treatment): CTR = 3.5%
  • The difference is 0.3 percentage points, but variance between users is huge (0% to 20%+)

In interleaving we compare **within one user**:

  • The same user sees results from both models
  • Their preferences, context, and timing are the same for both
  • Variance drops dramatically

This is a paired comparison, like in medicine: giving one patient both drugs (at different times) is more accurate than giving different drugs to different patients. But interleaving only works for **ranking tasks** (search, recommendations), where results from two models can be mixed.
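The variance argument can be made concrete with a simplified additive model. The model and its symbols are assumptions introduced for illustration: the outcome of user $i$ under model $m$ is a model effect $\mu_m$, plus a user effect $u_i$, plus independent noise $\varepsilon_{i,m}$.

```latex
\begin{align*}
y_{i,m} &= \mu_m + u_i + \varepsilon_{i,m},
  \qquad \operatorname{Var}(u_i) = \sigma_u^2,
  \quad \operatorname{Var}(\varepsilon_{i,m}) = \sigma_\varepsilon^2 \\[4pt]
\text{Between users ($n$ per group):}\quad
  \operatorname{Var}\!\left(\bar{y}_B - \bar{y}_A\right)
  &= \frac{2\left(\sigma_u^2 + \sigma_\varepsilon^2\right)}{n} \\[4pt]
\text{Within user ($n$ users, $d_i = y_{i,B} - y_{i,A}$):}\quad
  \operatorname{Var}\!\left(\bar{d}\right)
  &= \frac{2\sigma_\varepsilon^2}{n} \\[4pt]
\text{Variance ratio:}\quad
  \frac{\operatorname{Var}(\bar{y}_B - \bar{y}_A)}{\operatorname{Var}(\bar{d})}
  &= 1 + \frac{\sigma_u^2}{\sigma_\varepsilon^2}
\end{align*}
```

In the paired difference the user effect $u_i$ cancels, so the larger the spread between users relative to within-user noise, the larger the gain. Under this toy model, a 10-100x gain corresponds to $\sigma_u^2$ being roughly 9 to 99 times $\sigma_\varepsilon^2$.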

Interleaving has limitations. It only works for **ranking tasks** where results from two models can be merged into one list. For tasks like pricing, UI changes, or chatbots, interleaving is not applicable. Additionally, interleaving is good at determining **which model is better**, but poor at measuring **by how much** (absolute effect size). That's why in practice, interleaving is used for rapid screening of candidate models, while the final rollout decision is made based on a full A/B test.

Why does interleaving require 10-100x less traffic than a classic A/B test to detect the same difference between models?

Multi-Armed Bandit

The classic A/B test has an unpleasant property: while the experiment runs, half the traffic goes to the worse variant. If the new model is better, we're losing money on the control group. If it's worse, we're losing on the treatment group. **Multi-Armed Bandit (MAB)** addresses this: instead of a fixed 50/50 split it dynamically reallocates traffic toward the better variant while continuing to collect data on the worse one. This is a tradeoff between **exploitation** (use the best known option) and **exploration** (investigate the unknown).

The name 'multi-armed bandit' is a metaphor from probability theory. Picture a row of slot machines (one-armed bandits) with different win probabilities. You don't know these probabilities in advance. Each time you choose a machine and receive a reward. The goal is to maximize total winnings. If you always play the same machine - you might be missing the best one. If you try all of them in sequence - you waste money on bad ones. You need a balance.

**Three main MAB strategies:**

**Epsilon-Greedy:**

  • With probability (1 - epsilon): choose the best variant (exploit)
  • With probability epsilon: choose a random variant (explore)
  • Simple but not adaptive - epsilon is fixed

**UCB (Upper Confidence Bound):**

  • For each variant compute: mean reward + uncertainty bonus
  • Rarely-chosen variants get a larger bonus
  • Deterministic - no randomness in selection

**Thompson Sampling:**

  • Maintain a probability distribution for each variant
  • Sample from each distribution, choose the maximum
  • Distributions narrow as data accumulates
  • Often the best strategy in practice
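A minimal sketch of Thompson Sampling for binary rewards (click / no click), assuming a Beta(1, 1) prior for each variant. The class name, the `simulate` helper, and the click rates in the usage example are illustrative assumptions.

```python
import random


class BetaBernoulliThompson:
    """Thompson Sampling with a Beta posterior per arm (binary rewards)."""

    def __init__(self, n_arms, rng=None):
        self.alpha = [1.0] * n_arms  # 1 + observed successes
        self.beta = [1.0] * n_arms   # 1 + observed failures
        self.rng = rng or random.Random()

    def select_arm(self):
        # Sample a plausible success rate for every arm, play the largest.
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Each observation narrows the posterior of the arm that was played.
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against simulated users; return pulls per arm."""
    rng = random.Random(seed)
    bandit = BetaBernoulliThompson(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = bandit.select_arm()
        reward = rng.random() < true_rates[arm]
        bandit.update(arm, reward)
        pulls[arm] += 1
    return pulls


# Made-up click rates: most traffic should drift toward the last variant.
print(simulate([0.05, 0.10, 0.30], rounds=5000))
```

Exploration happens automatically: an arm with little data has a wide posterior, so it occasionally produces the largest sample and gets played.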

**When to use MAB vs A/B test?** MAB is a good fit when the cost of showing a worse variant accrues in real time (ads, pricing, recommendations), when there are many variants (dozens of email subject lines), or when a quick result is needed. A/B testing is better when **statistical rigor** is required: MAB doesn't provide clean p-values and confidence intervals because the traffic allocation adapts to the observed outcomes during the experiment. If you need to show stakeholders that model B beats model A at a 5% significance level, use an A/B test.

You have 20 email subject line variants and need to quickly find the best one while minimizing losses on bad variants. What method is optimal?

Statistical Significance

You launched an A/B test. After a week you see: control group CTR 3.2%, treatment group CTR 3.4%. A 0.2 percentage point difference. The question is: is this a real improvement or random fluctuation? **Statistical significance** is the formal way to answer this question. We formulate a **null hypothesis** (H0): 'there is no difference between the models, the observed difference is random.' Then we compute the **p-value** - the probability of obtaining this or a greater difference if the null hypothesis is true.
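For CTR, one common way to compute such a p-value is a two-sided two-proportion z-test. The sketch below assumes independent users and samples large enough for the normal approximation; the function name and the group sizes in the usage example are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for H0: both groups have the same click rate."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Under H0 the groups share one rate, so the variance uses the pooled rate.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# The same 3.2% vs 3.4% CTR at two hypothetical group sizes.
print(two_proportion_z_test(320, 10_000, 340, 10_000))
print(two_proportion_z_test(3_200, 100_000, 3_400, 100_000))
```

The same 0.2 percentage point gap gives a much smaller p-value with 100,000 users per group than with 10,000, which is why the sample size has to be planned before launch.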

Before launching an experiment, you need to determine the **required sample size**. If the sample is too small, you won't detect a real improvement (low statistical power). If it is too large, you waste traffic and time. For a given baseline rate (which sets the variance of the metric), sample size calculation depends on three parameters: **Minimum Detectable Effect (MDE)** - the minimum difference you want to detect, **significance level (alpha)** - p-value threshold (usually 0.05), and **statistical power (1 - beta)** - probability of detecting an effect if it exists (usually 0.80).
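A sketch of the standard normal-approximation formula for comparing two proportions, assuming a two-sided test and equal group sizes. The function name and the example baseline of 3.2% are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Users needed in EACH group to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Baseline CTR 3.2%, detecting +0.2 and +0.1 percentage points.
print(sample_size_per_group(0.032, 0.002))
print(sample_size_per_group(0.032, 0.001))
```

The required sample grows with the inverse square of the MDE: halving the effect you want to detect roughly quadruples the traffic you need.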

**Multiple Testing Problem:** If you test 20 independent metrics simultaneously with alpha = 0.05, the probability of getting at least one false positive is 1 - (1 - 0.05)^20 ≈ 64%!

**Bonferroni correction** - the simplest fix: alpha_adjusted = alpha / n_tests.

  • 20 metrics: alpha_adjusted = 0.05 / 20 = 0.0025
  • Each metric must pass the 0.0025 threshold, not 0.05

Bonferroni is conservative (may miss real effects). A less conservative alternative is **Benjamini-Hochberg (FDR)**, which controls the false discovery rate instead of the probability of any false positive.

A related problem is checking results repeatedly while the experiment runs, which inflates false positives in the same way. **Sequential testing (mSPRT)** addresses it by allowing you to check results at any point without inflating the false positive rate.

In practice, ML experiments usually have 1-2 primary metrics (with full alpha) and 5-10 secondary metrics (with Bonferroni or FDR correction).
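Both corrections fit in a few lines. The sketch below takes a list of p-values, one per metric, and returns which null hypotheses are rejected; the function names and the example p-values are made up for illustration.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) for n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure that controls the false discovery rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is below (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


p_values = [0.001, 0.008, 0.02, 0.035, 0.3]  # hypothetical, one per metric
print(round(family_wise_error_rate(20), 3))
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

On these example p-values Bonferroni rejects two hypotheses while Benjamini-Hochberg rejects four, which illustrates why FDR control is described as less conservative.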

Finally, it's critically important to distinguish **statistical significance** from **practical significance**. A result can be statistically significant (p < 0.05) but practically useless. If CTR grew from 3.000% to 3.005%, the p-value can be tiny given a large enough sample, but an increase of 0.005 percentage points doesn't justify the cost of deploying and maintaining a new model. Always evaluate the **absolute effect size** and its business impact, not just the p-value.
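One way to formalize practical significance is to compare a confidence interval for the absolute lift against a minimum effect worth shipping. The sketch below assumes a normal approximation; the 0.1 percentage point threshold and the function names are hypothetical and would come from a business estimate in practice.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(clicks_a, n_a, clicks_b, n_b, confidence=0.95):
    """Confidence interval for the absolute difference p_b - p_a."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def is_practically_significant(interval, min_practical_effect):
    """Require even the lower bound of the lift to clear the business threshold."""
    lower, _ = interval
    return lower >= min_practical_effect


# Hypothetical counts: 4.10% vs 4.12% CTR with 10 million users per group.
interval = lift_confidence_interval(410_000, 10_000_000, 412_000, 10_000_000)
print(interval)
print(is_practically_significant(interval, min_practical_effect=0.001))
```

In this example the interval excludes zero, so the lift is statistically significant, yet the whole interval lies below the 0.1 percentage point threshold, so the change would not be worth shipping under that assumption.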

**Misconception:** p < 0.05 means the new model is better with 95% probability.

**Reality:** The p-value is the probability of observing the data (or more extreme) given that the null hypothesis is true (the models are equal). It is not the probability that the model is better.

The p-value answers P(data | H0), not P(H1 | data). To estimate the probability that a model is truly better, you need a Bayesian approach with a prior distribution. In practice p < 0.05 means: 'a difference this large would be unlikely if the models were truly equal,' but does NOT mean: 'there's a 95% chance treatment is better.' Confusing these two statements is one of the most common errors in data science.

An A/B test of a new model showed: CTR control = 4.10%, CTR treatment = 4.12%, p-value = 0.02, sample size = 10 million per group. What is the correct conclusion?

Key takeaways

  • **Online evaluation:** offline metrics (AUC, NDCG) often diverge from online results - production evaluation requires business metrics (CTR, revenue), product metrics (conversion), and guardrail metrics (latency, error rate) that must not be violated
  • **Interleaving:** for ranking tasks it's 10-100x more sensitive than A/B testing because it compares models within a single user (paired comparison), eliminating inter-user variance - ideal for rapid screening of candidate models
  • **Multi-Armed Bandit:** instead of a fixed 50/50 split it dynamically redistributes traffic toward the better variant via Thompson Sampling - the exploration/exploitation balance minimizes losses, but doesn't provide strict p-values
  • **Statistical significance:** p-value is the probability of the data under the null hypothesis, not the probability that the model is better; with large sample sizes even a tiny difference can yield p < 0.05, so always assess practical significance - like that Bing link color shade that turned out to be worth $100 million

Related topics

A/B testing of ML models connects production system monitoring with the decision-making process for rollouts:

  • ML Monitoring — Monitoring is continuous observation of a model after rollout, while A/B testing is a one-time comparison before rollout. Guardrail metrics from an A/B test become monitoring metrics in production. If monitoring shows degradation - a new experiment is launched
  • Search & Ranking — Interleaving is the primary method for rapidly evaluating ranking models. Metrics like NDCG@10 serve as an offline filter, but the final decision to roll out a new ranking is always based on online experiment results with real users

Questions for reflection

  • Why do companies like Google and Microsoft run tens of thousands of A/B tests per year rather than trusting offline metrics? What real-world factors can't a test dataset capture?
  • In what situations might a multi-armed bandit give worse results than a classic A/B test? Hint: think about non-stationarity and delayed rewards.
  • If p-value = 0.001, but the absolute CTR difference is 0.01 percentage points - should you roll out the model? How can you formalize the concept of 'practical significance'?

Related lessons

  • ml-47-model-monitoring — Monitoring metrics feed A/B decisions
  • ml-52-search-ranking — Ranking changes are validated via A/B tests
  • ml-05-evaluation — Offline metrics complement online tests
  • stat-05-hypothesis — A/B testing is hypothesis testing
  • stat-19-multiple-testing — Multiple metrics need correction