Statistical Learning Theory
Agnostic PAC: when no true hypothesis exists
Number shock: in the realizable case, halving the error (epsilon -> epsilon/2) costs double the data. In the agnostic case - four times as much. The square is not coincidence. It is the price nature charges for not fitting inside H. Every modern ML task is agnostic. Logistic regression on noisy medical records, a transformer trained on internet text, RLHF on human ratings - c does not lie in H anywhere. The realizable case is a beautiful theorem. Agnostic PAC is the real world.
- **SGD batch size and the epsilon-squared law**: estimating mini-batch gradient variance via Hoeffding is agnostic PAC applied to optimization. Gradient noise variance scales as sigma^2/B (B = batch size). Halving the noise requires four times the batch - the same O(1/epsilon^2) structure. Adam and AdamW implicitly know this: adaptive learning rate compensates for the quadratic cost of precision.
- **Logistic regression and noisy labels**: ECG datasets are labeled by cardiologists with roughly 85% inter-annotator agreement. By definition c does not lie in H - the data is contradictory. Agnostic PAC guarantees: enough examples, and ERM reaches error no worse than the best linear classifier plus epsilon. This is the theoretical justification for why L2 regularization does not overfit even when hat-R > 0.
- **RLHF and non-realizable human preferences**: human feedback is contradictory across raters, non-stationary over time, and noisy. The Bradley-Terry model is ERM on agnostic data. RLHF sample complexity follows directly from Hoeffding: achieving 1% improvement in alignment score requires 100x more human feedback than 10%.
- **Kearns-Schapire-Sellie 1992**: the first formal agnostic theorem. Three authors proved: ERM in the agnostic case with m >= C/epsilon^2 * log(|H|/delta) achieves excess risk <= epsilon. Direct line to modern generalization bounds - VC-dimension, Rademacher complexity, PAC-Bayes - all use the same Hoeffding technique.
- **Regression and MSE**: mean squared error is agnostic PAC for real-valued labels. min_{h in H} E[(h(x)-y)^2] > 0 almost always (real data is not linear). OLS finds the ERM with noise. Sample complexity via Hoeffding for bounded functions: O(1/epsilon^2) examples to get epsilon-close to the oracle predictor.
Kearns-Schapire-Sellie 1992: honesty as a theorem
The paper 'Toward Efficient Agnostic Learning' (COLT 1992) appeared eight years after Valiant. Valiant built theory on ideal data - c in H. Kearns, Schapire, and Sellie asked: what if not? The result is hard: sample complexity rises from O(1/epsilon) to O(1/epsilon^2). No free assumptions - no free lunch. This is not a weaker algorithm, it is a lower bound: no PAC algorithm on agnostic data can do better in the worst case. Schapire later co-invented AdaBoost - the first algorithm operating in agnostic mode with practical guarantees. Straight line from 1992 to modern ML.
Concept 1: agnostic setup - what changes without c in H
Concept 1: agnostic setup - what changes without c in H
The realizable case assumed: c in H, there exists h* with R(h*) = 0. A strong assumption. In practice data is noisy - the same x can carry different labels. Nature is nonlinear, features are incomplete, raters disagree. c not in H is the norm, not the exception.
Logistic regression is trained on noisy medical data (labels contain errors). What does the agnostic PAC bound guarantee?
Agnostic PAC is precisely for noisy data. Zero error is neither possible nor expected. The guarantee is excess risk: the algorithm stays within epsilon of the oracle (best linear classifier on the true distribution). With m >= C/epsilon^2 * log(|H|/delta) this holds with probability >= 1 - delta.
Concept 2: Hoeffding's inequality - where the square comes from
Concept 2: Hoeffding's inequality - where the square comes from
The realizable proof relied on a specific mechanism: a bad hypothesis survives all m examples with probability $(1-\varepsilon)^m$. In the agnostic case this does not work - there is no hypothesis with hat-R = 0, the notion of failure is different. A tool is needed to bound how far a sample mean can stray from its expectation. Hoeffding provides exactly that.
**Hoeffding and mini-batch SGD**: mini-batch gradient variance is sigma^2/B, where B is batch size. This follows directly from Hoeffding applied to the mean gradient. Epsilon-close gradient estimates require B >= sigma^2/epsilon^2. AdamW's adaptive learning rate effectively compensates for exactly this quadratic cost of precision. Agnostic PAC and stochastic optimization share the same mathematics.
Why does agnostic PAC sample complexity contain 1/epsilon^2 rather than 1/epsilon as in realizable?
The realizable proof uses: bad h survives with P <= (1-eps)^m ~ exp(-eps*m). Solving for m: m ~ (1/eps)*ln(|H|/delta). Hoeffding for agnostic: P[|hat-R - R| > eps] <= 2*exp(-2*m*eps^2). Solution: m ~ (1/eps^2)*ln(|H|/delta). The square is a structural consequence of the tool, not a measure of task difficulty.
Concept 3: O(1/epsilon^2 log(|H|/delta)) - numbers and implications
Concept 3: O(1/epsilon^2 log(|H|/delta)) - numbers and implications
The agnostic sample complexity formula looks familiar from realizable, but epsilon squared changes everything. Here is what that means in numbers and in practice.
Example: medical diagnosis (agnostic)
ECG classification with noisy labels
Task: classify ECG signals (infarction / normal). Dataset: fifty thousand records, labels from three cardiologists, 85% agreement. Agnostic PAC formulation: - X = ECG waveforms, Y = {0, 1} - H = linear classifiers in R^1000 (after feature extraction) - c not in H: data is contradictory, optimal R(h*) > 0 Goal: find h_S with R(h_S) <= R(h*) + 0.02 (2% excess risk) with probability >= 0.95. Sample complexity: |H| via VC-dim: VC-dim(linear in R^1000) = 1001 Hoeffding-VC version: m >= C * (d + ln(1/delta)) / eps^2 m >= 8 * (1001 + ln(20)) / 0.0004 = 8 * 1004 / 0.0004 = ~twenty million In practice: fifty thousand records give excess risk ~0.1, not 0.02. This is agnostic PAC in action: theory predicts correctly.
The key step here is moving from finite H to infinite H. In the example above $|H| = \infty$ - and the formula through $\ln|H|$ breaks down. The replacement: VC-dimension $d$ instead of $\ln|H|$. The structure is the same - $O(d/\varepsilon^2)$ - only $d$ is measured differently. That is the topic of lt-04, but the agnostic mechanics stay unchanged: Hoeffding, union bound, squared epsilon.
Agnostic PAC is the pessimistic case; real algorithms do better, so the bound is not useful
Agnostic PAC is an exact lower bound: no algorithm can guarantee O(1/epsilon) for all agnostic distributions. Better bounds for specific classes come from VC-dimension or Rademacher complexity - not from improving Hoeffding.
The lower bound for agnostic PAC: there exist distributions where any algorithm requires Omega(1/epsilon^2) examples. This is not an imprecision of the proof technique - it is a mathematical fact. Better bounds for specific classes (e.g., linear classifiers with margin gamma) exploit the structure of the problem, but on worst-case distributions 1/epsilon^2 is unavoidable.
H is a class of 1024 = 2^10 hypotheses. epsilon = 0.1, delta = 0.05. How many examples are needed for the agnostic PAC guarantee via Hoeffding's bound?
m >= (1/(2*0.01)) * ln(2*1024/0.05) = 50 * ln(40960) = 50 * 10.62 = 531. Standard form with C=2: m >= (2/eps^2)*ln(2|H|/delta) = (2/0.01)*10.62 = 2124 examples. Realizable gave roughly 100. The 20x difference at eps=0.1 directly shows the price of agnosticity.
Key ideas
- **Agnostic setup**: c not in H is the norm for real data. The goal shifts: not R(h_S) <= epsilon, but R(h_S) <= min_H R(h) + epsilon. Excess risk is the only honest metric.
- **Hoeffding's inequality**: P[|hat-R(h) - R(h)| > eps] <= 2*exp(-2*m*eps^2). This is the concentration tool for sample mean - the core technique of the agnostic proof.
- **Squared epsilon**: agnostic bound m >= (1/(2*eps^2))*ln(2|H|/delta). Realizable: O(1/eps). Agnostic: O(1/eps^2). The gap is a lower bound, not a proof artifact.
- **Uniform convergence**: hat-R(h) approx R(h) simultaneously for all h in H - this is what Hoeffding plus union bound actually prove. ERM works because it picks h with small hat-R, and hat-R is close to R.
- **Kearns-Schapire-Sellie 1992**: the first formal result. Direct line to VC-dimension, Rademacher complexity, PAC-Bayes - all are agnostic by nature.
- **GPT-4 and noisy tasks**: failure on physical reasoning is not a bug. Agnostic PAC predicts it: on tasks where c does not lie in the transformer's H, excess risk > 0 is a mathematical law, not an engineering failure.
What this lesson unlocks
Agnostic PAC is the first honest result: no c-in-H assumption. Everything after builds on top.
- VC dimension — Replaces log|H| with VC-dim for infinite H; the agnostic structure is preserved
- Uniform convergence — Rigorous proof that hat-R approx R for all h simultaneously
- No-Free-Lunch — Lower bound: without constraints on H, no agnostic-type guarantee is possible
- PAC-Bayes bounds — Agnostic PAC for stochastic classifiers - tight bounds via KL-divergence
Вопросы для размышления
- GPT-4 failed on physical reasoning tasks. Does this indicate c is not in the transformer's H for those tasks - or insufficient training examples? How would one tell the difference?
- The agnostic bound O(1/eps^2) is a lower bound for worst-case distributions. For specific tasks (e.g., linearly separable data with margin gamma) the bound is tighter. Why?
- RLHF uses contradictory human preferences. How does agnostic PAC explain why large numbers of human feedback pairs are needed for good alignment?
- Realizable PAC gives O(1/eps), agnostic gives O(1/eps^2). What does 'knowing c is in H' mean informationally - why is that knowledge worth so much in sample complexity?
Связанные уроки
- lt-02-realizable-learning — Realizable is the base case; agnostic generalizes it
- lt-04-vc-dimension — VC-dim extends agnostic bounds to infinite hypothesis classes
- lt-07-uniform-convergence — Uniform convergence is the formal foundation of agnostic PAC
- lt-08-no-free-lunch — No-Free-Lunch theorem builds on the agnostic problem formulation
- prob-22-concentration — Hoeffding and concentration inequalities are the core of the proof
- stat-01-sampling