Information Theory
Minimum Description Length (MDL)
Why does a neural network with a billion parameters sometimes generalize worse than one with a million? Why does L2 regularization work? MDL provides a unified principle: the best model is the one that best compresses the data while accounting for its own complexity.
- **Pruning and quantizing LLMs**: LoRA, GPTQ, INT4, weight compression with minimal quality loss, practical MDL in action
- **Decision trees**: the MDL stopping criterion (Fayyad-Irani) is standard in systems like C4.5, scikit-learn
- **BIC in statistics**: selecting the number of GMM components, polynomial degree, and Bayesian network structure
Предварительные знания
The MDL Principle: Model as Compression
Why regularize ML models? Intuitively, to avoid overfitting. But why does it work? **MDL** (Minimum Description Length, Rissanen 1978) gives a deep answer: a good model must **compress the data**. The best model is one that can be described concisely and predicts the data well.
MDL is a practical approximation to Kolmogorov complexity K(data). While K is uncomputable, MDL provides concrete computable criteria for model selection. It is the bridge between the theoretical ideal and practical machine learning.
MDL and Bayesian inference are deeply connected. Assign model M an a priori probability P(M) ∝ 2^(-L(M)); then the MAP estimate (maximum a posteriori) is equivalent to MDL. Code length = -log P: L(M) = -log₂ P(M). Choosing a code = choosing a prior. This lets one interpret any Bayesian method as MDL and vice versa.
A degree-100 polynomial fits 30 data points perfectly (RSS ≈ 0). MDL says this model is:
Two-Part Codes and Stochastic Complexity
Early MDL by Rissanen used **two-part codes**: encode the model and data separately. But there is a problem: how to choose the "length" for continuous parameters? Rissanen solved this with **stochastic complexity**: a more refined version of MDL.
**NML (Normalized Maximum Likelihood)** is the ideal version of MDL. The code that minimizes stochastic complexity is given by the normalized likelihood function. It is invariant to reparametrization and optimal in a minimax sense, worst-case for the "hardest" distribution.
| Criterion | Formula (minimize) | Parameter penalty |
|---|---|---|
| AIC (Akaike) | -2 log L + 2k | 2k (weak) |
| BIC (Bayesian IC) | -2 log L + k log n | k log n (strong) |
| MDL (Rissanen) | -log p(D|θ̂) + k/2 log n | ≈ BIC/2 |
| NML (exact MDL) | -log p_NML(data) | Optimal |
BIC penalizes parameters more strictly than AIC for large n. What does this mean in practice?
MDL and Regularization in ML
L1 and L2 regularization seem like technical tricks. But through the MDL lens they receive a deep justification: they are nothing other than **Bayesian priors** on parameters, equivalent to penalizing the model description length.
**Minimum Description Length of Neural Networks** (Hinton & van Camp, 1993) proposed measuring the "length" of network weights. This led to weight sharing, pruning, and weight quantization. Modern model compression methods (LoRA, QLoRA, INT4 quantization) are practical realizations of the MDL principle: find the shortest description of a model that preserves quality.
L2 (Ridge) regularization, from the MDL perspective, is equivalent to:
MDL in Practice: Trees, Clusters, Features
MDL is especially powerful where model complexity cannot be expressed as a simple parameter count, for example, decision trees, clustering, or feature selection. Here MDL provides a unified principle without additional assumptions.
**MDL for feature selection**: including a feature is justified if the decrease in L(data|M) (better prediction) exceeds the increase in L(M) (more complex model). This automatically accounts for feature informativeness and correlations. Unlike p-values and correlation thresholds, MDL accounts for the cost of false positives.
**Perplexity** of a language model is 2^H, where H is the cross-entropy on the test set. Minimum perplexity corresponds to maximum compression of the text by the model, MDL in action! GPT-4 achieves perplexity ~5-8 on English text, approaching Shannon's theoretical limit (~1.3 bits/character). Lower perplexity means the model describes language more compactly.
The MDL criterion for clustering automatically resolves the problem of choosing the number of clusters k. How does it work?
Key Ideas
- **MDL**: a good model minimizes L(M) + L(data|M), balancing complexity and goodness of fit
- **Connection to BIC**: two-part MDL ≈ BIC; NML is the exact MDL via normalized likelihood
- **Regularization = MDL**: L2 ≡ Gaussian parameter code; L1 ≡ sparse (Laplace) code
- **Practice**: pruning, choosing k for clusters and tree depth, all solved by the same principle
Related Topics
MDL is a practical bridge between information theory and machine learning:
- Kolmogorov Complexity — MDL is a computable approximation to the uncomputable K(x)
- IT in Statistics: Sufficient Statistics — Sufficient statistic = minimal lossless description of data, MDL in statistics
- IT in Machine Learning — MDL explains regularization, pruning, and neural architecture selection
Вопросы для размышления
- Perplexity of GPT-4 ≈ 6. What does this mean from the MDL perspective? How well does the model 'compress' English?
- L2 and L1 regularization correspond to Gaussian and Laplace priors. Which prior should one choose for a task where most features are irrelevant?
- MDL uses code length as a quality measure. But code length depends on the choice of coding language. How does this affect the practical applicability of MDL?