Statistical Learning Theory
Compression-Based Bounds
Can generalization error be bounded without computing VC-dimension? Yes, through compression. If the algorithm stores only k out of m examples, generalization is guaranteed regardless of the input dimension.
- **SVM and support vectors:** SVM naturally implements a compression scheme: k support vectors determine the hyperplane. The PAC bound through k SVs is often sharper than the VC estimate
- **Compressed sensor networks:** Data compression in IoT: storing k key measurements instead of m provides theoretical guarantees on reconstruction accuracy
- **k-NN with pruning:** Coreset methods for k-NN: storing k prototypes instead of m training points with a PAC guarantee via the coreset size
- **Neural network quantization:** Compressing a model to k significant weights is analogous to a compression scheme; the LW theorem provides the theoretical basis for accuracy guarantees
Предварительные знания
- PAC learning and VC-dimension
- Agnostic PAC learning
- Hoeffding inequality and union bound
Sample compression schemes
Niko Littlestone and Manfred Warmuth in 1986 proved: if a learning algorithm compresses the training set to k examples (from which it fully reconstructs), then the generalization error is at most O(k log m / m). In 2016, Livni and Mendelson proved the converse: any class with finite VC dimension has a compression scheme.
What is a compression scheme of size k?
Littlestone-Warmuth (1986): a compression scheme of size k for a class H is a pair (sigma, p) where sigma(S) selects k points from training sample S and p(sigma(S)) reproduces a hypothesis consistent with S. SVM with k support vectors is the canonical example: the hypothesis is determined entirely by them.
Connection to complexity and generalisation
What generalisation bound does a compression scheme of size k give?
Littlestone-Warmuth: if H has a compression scheme of size k and the algorithm returns h consistent with the training sample, then with probability 1-delta: R(h) <= (1/n)[k log(n/k) + log(C(n, k)/delta)]. VC link: VC(H) <= O(k log n).
Algorithms: SVM, boosting, and Sauer-Shelah
Compression scheme as an examiner's cheat sheet - An examiner (algorithm) studied m problems (training sample). The compression scheme: the examiner wrote down only k key problems in notes, from which all solution rules can be reconstructed. Shorter notes means less overfitting. The PAC bound via compression is independent of input dimension: only the number of examples the algorithm needs for full reconstruction matters.
Which algorithms use compression schemes directly?
SVM: the hypothesis is entirely determined by support vectors (often k << n). Boosting: the combined classifier sum h_t depends only on the selected weak learners. 1-NN: each point defines a Voronoi region. These schemes give concrete generalisation bounds through the number of "critical" examples.
Connections
Compression schemes connect classical PAC theory with data compression algorithms, SVMs, and modern neural network compression methods.
- VC theory — Related topic
- SVM — Related topic
- Information-theoretic learning — Related topic
- Neural network compression — Related topic
Итоги
- Compression scheme (kappa, rho) of size k: kappa selects k examples, rho reconstructs a hypothesis with zero training error
- PAC bound: err(h) at most O((k*log(m/k) + log(1/delta)) / m) with probability 1-delta
- The bound depends on k, not on the dimension of the feature space
- SVM is a compression scheme: k support vectors fully determine the solution
- Livni et al. 2016: finite VC is equivalent to finite compression - closing a 30-year open problem
What does a compression scheme of size k guarantee?
The Littlestone-Warmuth theorem. The bound depends on k (compression size), not on the input feature dimension.