Statistics

Student's t-test: The Statistic Born in a Brewery

In 1908, a Guinness brewer named William Sealy Gosset invented the t-test but could not publish it under his own name: Guinness treated the work as a trade secret. He used the pseudonym "Student". Today his formula runs inside A/B tests at Booking.com, Netflix, and Google, informing a very large number of product decisions every year.

A/B tests at Booking.com and Netflix: the t-test is the default experiment engine
FDA clinical trials: two-sample Welch t-test for drug vs placebo comparison
Google Ads: conversion rate comparison between ad variants
Sklearn feature selection: SelectKBest with f_classif ranks features by the ANOVA F-statistic, which for two classes equals the squared pooled t-statistic
Paired t-test: measuring fine-tuning improvement on the same eval set
Quality control: does a production batch meet the specification mean?

Prerequisites

(no prerequisites)

Hypothesis Testing: How p-values Killed 64,000 Studies

Why the Normal Distribution Is Not Enough

**Dublin, 1906. William Sealy Gosset works as a chemist at the Guinness brewery.** His task: select the best barley variety. Experiments are expensive, so only 5-10 samples are feasible. The standard statistics of the time required a "sufficiently large sample". What does "sufficient" mean? Gosset derives an exact distribution for small n. Guinness forbids publication under his name (trade secret), so Gosset publishes the result in 1908 under the pseudonym **"Student"**. His distribution is still called "Student's t-distribution". Student's t-test is among the most widely used statistical tests in the world: A/B tests, clinical trials, comparisons of ML models. All of it from a brewer with a pseudonym.

**What this lesson actually teaches**: not "how to plug numbers into t = (X̄ - μ₀)/(S/√n)", but **why a separate distribution is needed for small samples**. With small n, the estimate of σ is itself imprecise - and this adds extra uncertainty. The Student's t-distribution accounts for that honestly. After the lesson: three types of t-tests, Cohen's d, and why the Welch t-test should be the default.

The z-test uses the statistic (X̄ - μ)/(σ/√n) ~ N(0,1). This works when σ is known. But σ is almost never known - it gets replaced by the sample S. With large n, S ≈ σ and everything is fine. With small n (5-30 observations), S is itself a random variable with substantial spread. Substituting S for σ makes the ratio "heavier" - more probability in the far tails.
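The effect of substituting S for σ can be checked with a small simulation. The sketch below is illustrative only: the function name `rejection_rate`, the trial count, and the seed are assumptions, not part of the lesson. It draws samples for which H₀ is true and counts how often the ratio falls outside the normal threshold ±1.96.

```python
import math
import random

def rejection_rate(n, threshold=1.96, trials=20_000, seed=0):
    """Share of N(0, 1) samples of size n with |mean / (S / sqrt(n))| > threshold.

    H0 (mu = 0) is true for every sample, so each exceedance is a false alarm.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(xs) / n
        # Sample standard deviation with the n - 1 denominator (ddof=1)
        s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
        if abs(mean / (s / math.sqrt(n))) > threshold:
            rejections += 1
    return rejections / trials

rate_n5 = rejection_rate(5)
rate_n50 = rejection_rate(50)
print(f"n=5:  false-alarm rate with the z threshold = {rate_n5:.3f}")
print(f"n=50: false-alarm rate with the z threshold = {rate_n50:.3f}")
```

With n = 5 the false-alarm rate lands near 12% instead of the nominal 5%, while with n = 50 it is already close to 5%. That gap is the extra tail mass the t-distribution accounts for.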

If X₁,...,Xₙ ~ N(μ, σ²) and σ is UNKNOWN:

T = (X̄ - μ) / (S/√n) ~ t(n-1)

where S = √(1/(n-1)·Σ(Xᵢ-X̄)²) is the sample standard deviation (ddof=1).

The t(ν) distribution with ν = n-1 degrees of freedom:
- Symmetric around 0 (like the normal)
- Heavier tails at small ν (honestly reflects uncertainty)
- Converges to N(0,1) as ν → ∞

Comparison of critical values for 95% (two-tailed):
- z = 1.960 (normal, n=∞)
- t₀.₀₂₅,₂₉ = 2.045 (n=30)
- t₀.₀₂₅,₉ = 2.262 (n=10)
- t₀.₀₂₅,₄ = 2.776 (n=5)

At n=5 the critical value is 1.42 times larger. More probability in the extremes -> harder to reject H₀. This is honest: with 5 measurements there is less information about σ.
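The critical values above can be reproduced with SciPy's quantile function. This is a minimal sketch assuming `scipy` is installed; the variable names are illustrative.

```python
from scipy import stats

# Two-tailed 95% test: 2.5% in each tail, so we need the 0.975 quantile.
z_crit = stats.norm.ppf(0.975)
t_crit = {n: stats.t.ppf(0.975, df=n - 1) for n in (5, 10, 30)}

print(f"normal: z = {z_crit:.3f}")
for n, crit in t_crit.items():
    print(f"n={n:>2}  df={n - 1:>2}  t = {crit:.3f}  ratio to z = {crit / z_crit:.2f}")
```

The ratio column makes the penalty for a small sample explicit: it shrinks toward 1 as the degrees of freedom grow.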

Why must you use the t-distribution (not the normal) for a small sample (n < 30) with unknown σ?

Three Types of t-Test

1. One-Sample: Comparison Against a Constant

H₀: μ = μ₀ (true mean equals the specified value)
H₁: μ ≠ μ₀ (two-tailed)

T = (X̄ - μ₀) / (S/√n) ~ t(n-1)

Example: API latency should be ≤ 200 ms (μ₀ = 200). Sample of 20 requests: X̄ = 215 ms, S = 30 ms.

T = (215 - 200) / (30/√20) = 15 / 6.71 = 2.236
df = 19, critical t₀.₀₂₅,₁₉ = 2.093
2.236 > 2.093 -> reject H₀ at α = 0.05: mean latency is significantly above 200 ms.
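The latency example can be recomputed from its summary statistics alone. The sketch below assumes `scipy` is available and uses only the numbers given in the example; the variable names are illustrative.

```python
from math import sqrt
from scipy import stats

mu0, x_bar, s, n = 200.0, 215.0, 30.0, 20   # values from the latency example

standard_error = s / sqrt(n)
t_stat = (x_bar - mu0) / standard_error
df = n - 1

t_crit = stats.t.ppf(0.975, df)                  # two-tailed, alpha = 0.05
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)    # sf = 1 - cdf

print(f"T = {t_stat:.3f}, critical = {t_crit:.3f}, p = {p_two_sided:.4f}")
print("reject H0" if abs(t_stat) > t_crit else "fail to reject H0")
```

With raw measurements instead of summary statistics, `scipy.stats.ttest_1samp(sample, popmean=200)` performs the same test in one call.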

2. Two-Sample: Comparing Two Groups

H₀: μ₁ = μ₂ (means of two groups are equal)
H₁: μ₁ ≠ μ₂

Welch t-test (does not assume equal variances):

T = (X̄₁ - X̄₂) / √(S₁²/n₁ + S₂²/n₂)

Degrees of freedom (Welch-Satterthwaite formula):

df = (S₁²/n₁ + S₂²/n₂)² / ((S₁²/n₁)²/(n₁-1) + (S₂²/n₂)²/(n₂-1))

Why Welch and not Student (pooled)? Student's pooled test assumes σ₁ = σ₂. If that assumption fails, especially when the group sizes also differ, its type I error rate drifts away from the nominal α. Welch works in both cases: with equal and unequal σ. Modern standard: **use Welch by default**.
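The two formulas can be coded directly and cross-checked against SciPy. This is a sketch on synthetic data: the helper `welch_t`, the group sizes, and the distribution parameters are assumptions chosen to make the variances clearly unequal.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch t-statistic, Welch-Satterthwaite df, and two-sided p-value."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size          # S1^2 / n1
    vb = b.var(ddof=1) / b.size          # S2^2 / n2
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p_value

# Synthetic groups with unequal sizes and unequal spread (illustrative only)
rng = np.random.default_rng(42)
group_a = rng.normal(100.0, 5.0, size=12)
group_b = rng.normal(104.0, 20.0, size=40)

t_manual, df_manual, p_manual = welch_t(group_a, group_b)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)   # Student (pooled)

print(f"manual Welch: T = {t_manual:.3f}, df = {df_manual:.1f}, p = {p_manual:.4f}")
print(f"scipy Welch:  T = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
print(f"scipy pooled: T = {pooled.statistic:.3f}, p = {pooled.pvalue:.4f}")
```

Note that `scipy.stats.ttest_ind` defaults to `equal_var=True`, so the Welch variant has to be requested explicitly.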

3. Paired: Before and After

Each subject is measured twice: before and after treatment.

H₀: mean change = 0
Dᵢ = Xᵢ_after - Xᵢ_before
T = D̄ / (S_D/√n) ~ t(n-1)

Advantage: removes between-subject variability. Example: service latency before and after optimization across 15 servers.

Without pairing: Var(X̄_after - X̄_before) = σ²_after/n + σ²_before/n
With pairing: Var(D̄) = σ²_D/n, where σ²_D = σ²_after + σ²_before - 2ρ·σ_after·σ_before

When the two measurements on the same subject are strongly positively correlated (ρ close to 1), σ²_D << σ²_after + σ²_before. On such data the paired test can be 2-5 times more powerful than the two-sample test; the exact gain depends on ρ.
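A simulated version of the 15-server example shows what pairing buys. Everything numeric here is an assumption for illustration: servers differ a lot from each other (SD 40 ms), measurement noise is small (SD 3 ms), and the optimization removes 10 ms.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_servers = 15

# Hypothetical data: large between-server spread, small within-server noise
baseline = rng.normal(200.0, 40.0, size=n_servers)
before = baseline + rng.normal(0.0, 3.0, size=n_servers)
after = baseline - 10.0 + rng.normal(0.0, 3.0, size=n_servers)

# Paired test works on the per-server differences D_i = after_i - before_i
diffs = after - before
manual_t = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n_servers))
paired = stats.ttest_rel(after, before)

# Treating the same numbers as two independent groups ignores the pairing
unpaired = stats.ttest_ind(after, before, equal_var=False)

print(f"paired:   T = {paired.statistic:.2f}, p = {paired.pvalue:.5f}")
print(f"unpaired: T = {unpaired.statistic:.2f}, p = {unpaired.pvalue:.5f}")
```

The paired statistic is driven by the spread of the differences, which no longer contains the server-to-server variation, so the same 10 ms shift stands out far more clearly.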

Which t-test do you use to compare service latency BEFORE and AFTER an optimization measured on the same 15 servers?

Effect Size: Statistical Significance Does Not Equal Practical Importance

The p-value depends on n: with a large enough sample, even a practically negligible conversion difference becomes statistically significant. Cohen's d measures the **effect size in units of standard deviation**, so it does not grow with n.

Cohen's d	Interpretation	ML Example
< 0.2	Negligible	Model beats baseline by 0.01% accuracy
0.2 - 0.5	Small	New feature improves F1 by 0.5%
0.5 - 0.8	Medium	Inference latency drops 18 ms when S = 30 ms (d = 0.6)
0.8 - 1.2	Large	Transformer vs RNN on seq2seq tasks
> 1.2	Very large	Vaccine vs placebo in COVID trials
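Cohen's d is short enough to compute by hand. The sketch below uses one common definition (difference in means divided by the pooled standard deviation). The function names and toy data are illustrative assumptions. The second helper relies on the identity T = d·√(n/2), which holds for the pooled test with two equal-size groups, to show how the same tiny effect crosses the significance threshold once n is large.

```python
import math
import statistics

def cohens_d(treatment, control):
    """Cohen's d: mean difference in units of the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    var_t = statistics.variance(treatment)   # sample variance (ddof=1)
    var_c = statistics.variance(control)
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

def pooled_t_from_d(d, n_per_group):
    """Pooled t-statistic implied by effect size d with equal group sizes."""
    return d * math.sqrt(n_per_group / 2)

d_example = cohens_d([2, 3, 4, 5, 6], [1, 2, 3, 4, 5])
print(f"toy data: d = {d_example:.3f}")

# Same negligible effect, growing sample: only the test statistic changes
for n in (100, 10_000, 1_000_000):
    print(f"d = 0.01, n = {n:>9,} per group -> T = {pooled_t_from_d(0.01, n):.2f}")
```

The effect size stays at 0.01 in every row; only the sample size pushes T past the critical value.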

**Production rule**: always report three numbers: p-value (is the difference significant), Cohen's d or relative lift (how important the difference is), confidence interval for the difference. A p-value alone is not enough to make a deployment decision.
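One way to bundle the three numbers into a single report is sketched below. The helper `ab_report`, the click rates, and the sample sizes are hypothetical; the interval is a Welch confidence interval for the difference in means, computed by hand rather than through any library convenience method.

```python
import numpy as np
from scipy import stats

def ab_report(control, variant, alpha=0.05):
    """Welch p-value, Cohen's d, relative lift, and CI for mean(variant) - mean(control)."""
    control = np.asarray(control, dtype=float)
    variant = np.asarray(variant, dtype=float)
    n_c, n_v = control.size, variant.size
    var_c, var_v = control.var(ddof=1), variant.var(ddof=1)

    diff = variant.mean() - control.mean()
    se = np.sqrt(var_c / n_c + var_v / n_v)
    # Welch-Satterthwaite degrees of freedom
    df = se ** 4 / ((var_c / n_c) ** 2 / (n_c - 1) + (var_v / n_v) ** 2 / (n_v - 1))
    p_value = 2 * stats.t.sf(abs(diff / se), df)

    half_width = stats.t.ppf(1 - alpha / 2, df) * se
    pooled_sd = np.sqrt(((n_c - 1) * var_c + (n_v - 1) * var_v) / (n_c + n_v - 2))
    return {
        "p_value": p_value,
        "cohens_d": diff / pooled_sd,
        "relative_lift": diff / control.mean(),
        "ci": (diff - half_width, diff + half_width),
    }

# Hypothetical click data: 1 = click, 0 = no click
rng = np.random.default_rng(7)
control = rng.binomial(1, 0.040, size=20_000)
variant = rng.binomial(1, 0.045, size=20_000)

report = ab_report(control, variant)
ci_low, ci_high = report["ci"]
print(f"p-value:       {report['p_value']:.4f}")
print(f"Cohen's d:     {report['cohens_d']:.3f}")
print(f"relative lift: {report['relative_lift']:+.1%}")
print(f"95% CI (diff): [{ci_low:+.4f}, {ci_high:+.4f}]")
```

For rare binary outcomes such as clicks, Cohen's d is tiny even when the relative lift is commercially meaningful, which is why the rule allows reporting lift instead of d.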

An A/B test on 1,000,000 users finds a 0.01% conversion difference "statistically significant" (p < 0.001). What does that mean for shipping?

Where the t-Test Lives in Real Systems

Which approach does MODERN practice recommend by default for comparing two independent groups?

Practice: A/B Test for a Recommendation Service

A/B test: control CTR = 4.0% (n=14,000), variant CTR = 4.5% (n=14,000). Two-sample test p-value ≈ 0.04. What is a sound report?

Key Takeaways

**t(ν) distribution**: heavier tails than the normal because σ is estimated from the sample. As ν = n-1 → ∞ it converges to N(0,1)
**Three types**: one-sample (X̄ vs constant), two-sample Welch (two independent groups), paired (same subjects before/after)
**Welch = default**: does not assume σ₁=σ₂, works in both cases with minimal power loss
**Cohen's d**: effect size independent of n. Negligible (<0.2), small (0.2-0.5), medium (0.5-0.8), large (>0.8). Always report alongside p-value
**Paired is more powerful**: removes between-subject noise; on strongly correlated before/after data it can be 2-5 times more efficient
**Statistical significance is not practical significance**: with large enough n even a negligible difference becomes significant. Look at lift + CI + ROI

What's Next

The t-test is for numerical data. For categorical data - a different tool.

Chi-Square Test — Test for categorical data: SRM in A/B tests, distribution fit, independence
Rank Tests (Mann-Whitney) — When data are non-normal: the non-parametric alternative to the t-test
Bootstrap — t-test without normality assumptions via resampling
ANOVA — Generalization of the t-test to k > 2 groups without inflating the type I error rate

Related lessons

ml-53-ab-testing-ml