Statistical Learning Theory

Compression-Based Bounds

Can generalization error be bounded without computing VC-dimension? Yes, through compression. If the algorithm stores only k out of m examples, generalization is guaranteed regardless of the input dimension.

**SVM and support vectors:** SVM naturally implements a compression scheme: k support vectors determine the hyperplane. The PAC bound through k SVs is often sharper than the VC estimate
**Compressed sensor networks:** Data compression in IoT: storing k key measurements instead of m provides theoretical guarantees on reconstruction accuracy
**k-NN with pruning:** Coreset methods for k-NN: storing k prototypes instead of m training points with a PAC guarantee via the coreset size
**Neural network quantization:** Compressing a model to k significant weights is analogous to a compression scheme; the LW theorem provides the theoretical basis for accuracy guarantees

Предварительные знания

PAC learning and VC-dimension
Agnostic PAC learning
Hoeffding inequality and union bound

Sample compression schemes

Niko Littlestone and Manfred Warmuth in 1986 proved: if a learning algorithm compresses the training set to k examples (from which it fully reconstructs), then the generalization error is at most O(k log m / m). In 2016, Livni and Mendelson proved the converse: any class with finite VC dimension has a compression scheme.

What is a compression scheme of size k?

Littlestone-Warmuth (1986): a compression scheme of size k for a class H is a pair (sigma, p) where sigma(S) selects k points from training sample S and p(sigma(S)) reproduces a hypothesis consistent with S. SVM with k support vectors is the canonical example: the hypothesis is determined entirely by them.

Connection to complexity and generalisation

What generalisation bound does a compression scheme of size k give?

Littlestone-Warmuth: if H has a compression scheme of size k and the algorithm returns h consistent with the training sample, then with probability 1-delta: R(h) <= (1/n)[k log(n/k) + log(C(n, k)/delta)]. VC link: VC(H) <= O(k log n).

Algorithms: SVM, boosting, and Sauer-Shelah

Compression scheme as an examiner's cheat sheet - An examiner (algorithm) studied m problems (training sample). The compression scheme: the examiner wrote down only k key problems in notes, from which all solution rules can be reconstructed. Shorter notes means less overfitting. The PAC bound via compression is independent of input dimension: only the number of examples the algorithm needs for full reconstruction matters.

Which algorithms use compression schemes directly?

SVM: the hypothesis is entirely determined by support vectors (often k << n). Boosting: the combined classifier sum h_t depends only on the selected weak learners. 1-NN: each point defines a Voronoi region. These schemes give concrete generalisation bounds through the number of "critical" examples.

Connections

Compression schemes connect classical PAC theory with data compression algorithms, SVMs, and modern neural network compression methods.

VC theory — Related topic
SVM — Related topic
Information-theoretic learning — Related topic
Neural network compression — Related topic

Итоги

Compression scheme (kappa, rho) of size k: kappa selects k examples, rho reconstructs a hypothesis with zero training error
PAC bound: err(h) at most O((k*log(m/k) + log(1/delta)) / m) with probability 1-delta
The bound depends on k, not on the dimension of the feature space
SVM is a compression scheme: k support vectors fully determine the solution
Livni et al. 2016: finite VC is equivalent to finite compression - closing a 30-year open problem

What does a compression scheme of size k guarantee?

The Littlestone-Warmuth theorem. The bound depends on k (compression size), not on the input feature dimension.

Sample compression schemes

What is a compression scheme of size k?

Connection to complexity and generalisation

What generalisation bound does a compression scheme of size k give?

Algorithms: SVM, boosting, and Sauer-Shelah

Which algorithms use compression schemes directly?

Итоги

Compression scheme (kappa, rho) of size k: kappa selects k examples, rho reconstructs a hypothesis with zero training error

PAC bound: err(h) at most O((k*log(m/k) + log(1/delta)) / m) with probability 1-delta

The bound depends on k, not on the dimension of the feature space

SVM is a compression scheme: k support vectors fully determine the solution

Livni et al. 2016: finite VC is equivalent to finite compression - closing a 30-year open problem