Statistics
Student's t-test: The Statistic Born in a Brewery
In 1908, William Gosset, a chemist at the Guinness brewery, published the t-test under the pseudonym "Student": Guinness treated the work as a trade secret and would not let him publish under his own name. Today his formula runs inside A/B tests at Booking.com, Netflix, and Google - hundreds of millions of decisions per year.
- A/B tests at Booking.com and Netflix: the t-test is the default experiment engine
- FDA clinical trials: two-sample Welch t-test for drug vs placebo comparison
- Google Ads: conversion rate comparison between ad variants
- Sklearn feature selection: SelectKBest with f_classif ranks features by the ANOVA F-statistic, which for two classes equals the square of the pooled two-sample t-statistic
- Paired t-test: measuring fine-tuning improvement on the same eval set
- Quality control: does a production batch meet the specification mean?
Prerequisites
- (no prerequisites)
Why the Normal Distribution Is Not Enough
**Dublin, 1906. William Sealy Gosset works as a chemist at the Guinness brewery.** His task: select the best barley variety. Experiments are expensive - only 5-10 samples are feasible. The standard statistics of the time required a "sufficiently large sample". What does "sufficient" mean? Gosset derives an exact distribution for small n. Guinness forbids publication under his own name (trade secret), so Gosset publishes the result in 1908 under the pseudonym **"Student"**. The distribution is still called "Student's t-distribution". Student's t-test is one of the most widely used statistical tests in the world: it appears in A/B tests, clinical trials, and comparisons of ML models. All of it - from a brewer with a pseudonym.
**What this lesson actually teaches**: not "how to plug numbers into t = (X̄ - μ₀)/(S/√n)", but **why a separate distribution is needed for small samples**. With small n, the estimate of σ is itself imprecise - and this adds extra uncertainty. The Student's t-distribution accounts for that honestly. After the lesson: three types of t-tests, Cohen's d, and why the Welch t-test should be the default.
The z-test uses the statistic (X̄ - μ)/(σ/√n) ~ N(0,1). This works when σ is known. But σ is almost never known, so it gets replaced by the sample standard deviation S. With large n, S ≈ σ and everything is fine. With small n (5-30 observations), S is itself a random variable with substantial spread. Substituting S for σ makes the distribution of the ratio heavier-tailed: more probability sits in the far tails.
If X₁,...,Xₙ ~ N(μ, σ²) and σ is UNKNOWN:

T = (X̄ - μ) / (S/√n) ~ t(n-1)

where S = √(1/(n-1)·Σ(Xᵢ-X̄)²) is the sample standard deviation (ddof=1).

The t(ν) distribution with ν = n-1 degrees of freedom:

- Symmetric around 0 (like the normal)
- Heavier tails at small ν (honestly reflects uncertainty)
- Converges to N(0,1) as ν → ∞

Comparison of critical values for 95% (two-tailed):

| Critical value | Sample size | Value |
|---|---|---|
| z (normal) | n = ∞ | 1.960 |
| t₀.₀₂₅,₂₉ | n = 30 | 2.045 |
| t₀.₀₂₅,₉ | n = 10 | 2.262 |
| t₀.₀₂₅,₄ | n = 5 | 2.776 |

At n=5 the critical value is 1.42 times larger than the normal one (2.776 / 1.960). More probability in the extremes -> harder to reject H₀. This is honest: with 5 measurements there is less information about σ.
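The cost of using the normal cutoff on a small sample can be checked by simulation. The sketch below is only an illustration under stated assumptions: it draws many samples of n=5 from N(0,1) with H₀ true, and counts how often |T| exceeds the z cutoff (1.960) versus the t cutoff (2.776). The function names, the seed, and the number of trials are arbitrary choices, not part of the lesson.

```python
import math
import random


def t_statistic(sample, mu):
    """Return (mean - mu) / (S / sqrt(n)), with S computed using ddof=1."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (mean - mu) / (s / math.sqrt(n))


def false_positive_rate(critical, n=5, trials=20_000, seed=0):
    """Share of samples drawn under H0 whose |T| exceeds the critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(t_statistic(sample, 0.0)) > critical:
            rejections += 1
    return rejections / trials


rate_z = false_positive_rate(critical=1.960)  # normal cutoff, too small for n=5
rate_t = false_positive_rate(critical=2.776)  # t cutoff for 4 degrees of freedom
print(f"false positives with z cutoff: {rate_z:.3f}")
print(f"false positives with t cutoff: {rate_t:.3f}")
```

With the z cutoff the false positive rate should land well above the nominal 5% (around 12%), while the t cutoff should keep it close to 5%.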
Why must you use the t-distribution (not the normal) for a small sample (n < 30) with unknown σ?
Three Types of t-Test
1. One-Sample: Comparison Against a Constant
- H₀: μ = μ₀ (true mean equals the specified value)
- H₁: μ ≠ μ₀ (two-tailed)

T = (X̄ - μ₀) / (S/√n) ~ t(n-1)

Example: API latency should be ≤ 200 ms (μ₀ = 200). Sample of 20 requests: X̄ = 215 ms, S = 30 ms.

- T = (215 - 200) / (30/√20) = 15 / 6.71 = 2.236
- df = 19, critical t₀.₀₂₅,₁₉ = 2.093
- 2.236 > 2.093 -> reject H₀: latency significantly exceeds 200 ms.
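The latency example can be reproduced from its summary statistics. This is a minimal sketch: the helper name `one_sample_t` is made up for illustration, and the critical value 2.093 is taken from the text above, not computed.

```python
import math


def one_sample_t(sample_mean, mu0, sample_sd, n):
    """Return the one-sample t statistic and its degrees of freedom."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error, n - 1


# Numbers from the latency example: X̄ = 215 ms, μ₀ = 200 ms, S = 30 ms, n = 20.
t_stat, df = one_sample_t(215.0, 200.0, 30.0, 20)

T_CRIT_19 = 2.093  # two-tailed 5% critical value for df = 19, from the text
reject_h0 = abs(t_stat) > T_CRIT_19
print(f"T = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```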
2. Two-Sample: Comparing Two Groups
- H₀: μ₁ = μ₂ (means of two groups are equal)
- H₁: μ₁ ≠ μ₂

Welch t-test (does not assume equal variances):

T = (X̄₁ - X̄₂) / √(S₁²/n₁ + S₂²/n₂)

Degrees of freedom (Welch-Satterthwaite formula):

df = (S₁²/n₁ + S₂²/n₂)² / ((S₁²/n₁)²/(n₁-1) + (S₂²/n₂)²/(n₂-1))

Why Welch and not Student (pooled)? Student assumes σ₁ = σ₂. If that is not the case, the test is incorrect. Welch works in both cases: with equal and unequal σ. Modern standard: **use Welch by default**.
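The two Welch formulas translate almost line for line into code. The sketch below uses only the Python standard library; the function name `welch_t` and the two tiny samples are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1 = variance(x) / n1  # S1^2 / n1 (variance() uses ddof=1)
    v2 = variance(y) / n2  # S2^2 / n2
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical data: the second group has four times the variance of the first.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
variant = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, df = welch_t(control, variant)
print(f"T = {t_stat:.3f}, df = {df:.2f}")
```

Because the variances differ, the degrees of freedom come out below the pooled value n₁ + n₂ - 2 = 8. In practice a library routine such as `scipy.stats.ttest_ind(x, y, equal_var=False)` would also return the p-value.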
3. Paired: Before and After
Each subject is measured twice: before and after treatment.

- H₀: mean change = 0
- Dᵢ = Xᵢ_after - Xᵢ_before

T = D̄ / (S_D/√n) ~ t(n-1)

Advantage: removes between-subject variability. Example: service latency before and after optimization across 15 servers.

- Without pairing: Var(X̄_after - X̄_before) = σ²_after/n + σ²_before/n
- With pairing: Var(D̄) = σ²_D/n, where σ²_D = σ²_after + σ²_before - 2ρ·σ_after·σ_before

When the two measurements on the same subject are strongly positively correlated (ρ close to 1), σ²_D << σ²_after + σ²_before. In that case the paired test can be 2-5 times more powerful than the two-sample test on the same paired data.
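The effect of pairing is easiest to see on data where subjects differ a lot from each other but the change within each subject is consistent. The sketch below builds such a data set artificially: 15 hypothetical servers with baselines from 120 ms to 400 ms and an improvement of roughly 10 ms each. All numbers and function names are invented for illustration.

```python
import math
from statistics import mean, stdev, variance


def paired_t(before, after):
    """Paired t statistic computed on the per-subject differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def unpaired_t(x, y):
    """Welch t statistic that ignores which measurements belong together."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


# Hypothetical latencies (ms): servers differ widely, each improves by 8 or 12 ms.
before = [120.0 + 20.0 * i for i in range(15)]
after = [b - 10.0 + (2.0 if i % 2 == 0 else -2.0) for i, b in enumerate(before)]

t_paired = paired_t(before, after)
t_unpaired = unpaired_t(after, before)
print(f"paired T   = {t_paired:.2f}")
print(f"unpaired T = {t_unpaired:.2f}")
```

On this data the unpaired statistic is swamped by the server-to-server spread, while the paired statistic isolates the consistent per-server change.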
Which t-test do you use to compare service latency BEFORE and AFTER an optimization measured on the same 15 servers?
Effect Size: Statistical Significance Does Not Equal Practical Importance
The p-value depends on n: with enough users, even a tiny conversion difference becomes statistically significant. Cohen's d measures the **effect size in units of standard deviation**: d = (X̄₁ - X̄₂) / S_pooled. It does not grow with n.
| Cohen's d | Interpretation | ML Example |
|---|---|---|
| < 0.2 | Negligible | Model beats baseline by 0.01% accuracy |
| 0.2 - 0.5 | Small | New feature improves F1 by 0.5% |
| 0.5 - 0.8 | Medium | Mean latency 200 ms vs 185 ms with S = 25 ms (d = 0.6) |
| 0.8 - 1.2 | Large | Transformer vs RNN on seq2seq tasks |
| > 1.2 | Very large | Vaccine vs placebo in COVID trials |
**Production rule**: always report three numbers: p-value (is the difference significant), Cohen's d or relative lift (how important the difference is), confidence interval for the difference. A p-value alone is not enough to make a deployment decision.
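Why the p-value alone misleads can be shown with a few lines of arithmetic. The sketch below assumes two groups of equal size n with a common standard deviation, in which case the two-sample statistic simplifies to T = d·√(n/2). The function names and the sample data are illustrative assumptions.

```python
import math
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


def t_from_d(d, n_per_group):
    """T for two equal-size groups with equal SD: (m1 - m2) / (s * sqrt(2/n))."""
    return d * math.sqrt(n_per_group / 2)


# Effect size from raw (hypothetical) data.
d_example = cohens_d([10.0, 11.0, 12.0, 13.0, 14.0], [9.0, 10.0, 11.0, 12.0, 13.0])
print(f"d = {d_example:.3f}")

# The same negligible effect (d = 0.02) at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: T = {t_from_d(0.02, n):.2f}")
```

The effect size stays fixed at d = 0.02 in the loop, yet T grows with √n and eventually passes any critical value.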
An A/B test on 1 000 000 users finds a 0.01% conversion difference "statistically significant" (p < 0.001). What does that mean for shipping?
Where the t-Test Lives in Real Systems
Which approach does MODERN practice recommend by default for comparing two independent groups?
Practice: A/B Test for a Recommendation Service
A/B test: control CTR = 4.0% (n=14 000), variant CTR = 4.5% (n=14 000). Two-sample test p-value = 0.04. What is a sound report?
Key Takeaways
- **t(ν) distribution**: heavier tails than the normal due to uncertainty in estimating σ. As ν = n-1 → ∞ it converges to N(0,1)
- **Three types**: one-sample (X̄ vs constant), two-sample Welch (two independent groups), paired (same subjects before/after)
- **Welch = default**: does not assume σ₁=σ₂, works in both cases with minimal power loss
- **Cohen's d**: effect size independent of n. Small (0.2-0.5), medium (0.5-0.8), large (>0.8). Always report alongside p-value
- **Paired is more powerful**: removes between-subject noise, can be 2-5 times more efficient when the paired measurements are strongly correlated and the test is correctly applied
- **Statistical significance is not practical significance**: with large enough n even a negligible difference becomes significant. Look at lift + CI + ROI
What's Next
The t-test is for numerical data. For categorical data - a different tool.
- Chi-Square Test — Test for categorical data: SRM in A/B tests, distribution fit, independence
- Rank Tests (Mann-Whitney) — When data are non-normal: the non-parametric alternative to the t-test
- Bootstrap — t-test without normality assumptions via resampling
- ANOVA — Generalization of the t-test to k > 2 groups without inflating the type I error rate