Statistical Learning Theory

PAC Learning Extensions

Standard PAC assumes the correct answer lies in class H. In practice it does not. Agnostic PAC is honest: it guarantees competition with the best in H, nothing more.

**Search ranking:** Google Search uses agnostic PAC guarantees with 10^9 examples and VC-dimension 10^6 - realizability is known to be violated
**Medical diagnosis:** Learning with imprecise or contested diagnoses - a classic agnostic setting where the true label is unknown or noisy
**AutoML:** Model selection from class H without realizability assumption: the agnostic bound guides architecture search
**Financial forecasting:** Markets are not generated by any model in H; agnostic PAC formalizes competition with the best available strategy

Предварительные знания

Standard PAC learning
VC-dimension
Online learning and bandits

Online Learning and Bandits

Agnostic PAC learning

Google Search uses agnostic PAC guarantees for ranking models: with 10^9 training examples and a hypothesis class of VC-dimension d = 10^6, the generalization error stays below epsilon = 0.01 with probability 1 - delta = 0.95.

How does agnostic PAC differ from realisable PAC?

In realisable PAC: best-in-class achieves 0 error (some h* ∈ H with R(h*) = 0). In agnostic PAC: the best hypothesis h* = argmin R(h) may have R(h*) > 0. ERM guarantees R(ERM) <= R(h*) + sqrt(O(VC log(n))/n) - suitable for noisy or misspecified models.

Uniform convergence and ERM

Which property of H guarantees ERM-learnability in the agnostic setting?

Fundamental theorem of statistical learning (Vapnik-Chervonenkis): VC(H) < ∞ ↔ H is agnostically PAC-learnable via ERM with sample complexity O((VC(H) + log(1/delta))/eps²). Rademacher complexity gives finer distribution-aware bounds.

Improper learning and extensions

Agnostic PAC as competing with the best team member - Class H is a team of specialists. The true answer is unknown to everyone. The task is to perform no worse than the best team member, measured by actual (not empirical) performance. Zero error cannot be guaranteed - but competition with the best available tool can. This is the honest formulation for real-world ML.

What is improper learning?

In proper learning A(S) ∈ H. In improper learning A(S) may lie outside H. This often yields strictly better bounds: for disjunctions on n variables proper learning requires Theta(n/eps) examples while improper learning needs only O(log(n)/eps). DNNs can be viewed as improper learners for classes of simple functions.

Connections

Agnostic PAC connects classical learnability theory with modern guarantees for neural networks and online methods.

VC theory — Related topic
Rademacher complexity — Related topic
Online learning — Related topic
Noise models — Related topic

Итоги

Agnostic PAC requires no realizability: the goal is to compete with opt_H, not find c*
ERM bound: L(h_S) no worse than opt_H + O(sqrt(log|H|/m)) for finite H
Uniform convergence is necessary and sufficient for agnostic PAC learnability
Sample size: m = O((log|H| + log(1/delta)) / epsilon^2) for an (epsilon, delta)-agnostic guarantee
Massart noise (eta < 1/2) allows a smaller sample complexity than the fully agnostic case

What is the key difference between agnostic and standard PAC learning?

Agnostic PAC: the true concept may not lie in H. The goal is to find h with error no worse than opt_H + epsilon. No realizability assumption is made.

Agnostic PAC learning

How does agnostic PAC differ from realisable PAC?

Uniform convergence and ERM

Which property of H guarantees ERM-learnability in the agnostic setting?

Improper learning and extensions

What is improper learning?

Итоги

Agnostic PAC requires no realizability: the goal is to compete with opt_H, not find c*

ERM bound: L(h_S) no worse than opt_H + O(sqrt(log|H|/m)) for finite H

Uniform convergence is necessary and sufficient for agnostic PAC learnability

Sample size: m = O((log|H| + log(1/delta)) / epsilon^2) for an (epsilon, delta)-agnostic guarantee

Massart noise (eta < 1/2) allows a smaller sample complexity than the fully agnostic case