Information Theory
Kolmogorov Complexity
What makes the digits of π ordered rather than random? Why can't one build a compressor that shrinks every file? Kolmogorov complexity answers these questions by providing an individual measure of information-without probabilities.
- **Data compression**: the incompressibility theorem explains why ZIP cannot shrink every file - most files are already incompressible
- **Machine learning**: MDL and regularization are practical approximations of the principle "short model = good model"
- **Cryptography**: randomness tests for keys rely on incompressibility - a good key must be algorithmically random
Предварительные знания
Definition of Kolmogorov Complexity
Two 24-bit strings: `010101010101010101010101` compresses to "repeat '01' 12 times" (15 chars). `1011001110100101110001` cannot be described shorter than itself. Kolmogorov complexity formalizes this: it is the length of the shortest program producing a given string. LLMs exploit low Kolmogorov complexity - language is compressible.
Key difference from Shannon entropy: entropy describes a **source** (a probability distribution), while Kolmogorov complexity describes a **specific string** without any probabilities. It is an individual measure of information, not a statistical one.
K(x) depends on the choice of universal Turing machine U, but only by a **constant**: if K_U(x) and K_V(x) are the complexities for two different machines, then |K_U(x) - K_V(x)| ≤ c_{U,V} for all x. This constant is independent of x, so for asymptotic analysis the choice of machine is irrelevant.
Three independent discoverers
Kolmogorov complexity was independently discovered by three scientists in the 1960s: Soviet mathematician Andrey Kolmogorov (1965), American Ray Solomonoff (1964), and American Gregory Chaitin (1969). Solomonoff actually preceded Kolmogorov, but Kolmogorov's name stuck. Kolmogorov himself called it "algorithmic information theory".
A string of 1,000,000 zeros has Kolmogorov complexity approximately:
Incompressibility and Randomness
Most long strings are **incompressible**. Formally: a string x is c-incompressible if K(x) ≥ |x| - c. Such strings are algorithmically random-they contain no patterns. This is the Kolmogorov definition of randomness: a string is random if it cannot be compressed.
This directly connects to the Kolmogorov definition of **randomness**: a Martin-Löf random string is an incompressible string. If a string is statistically random, it will be incompressible. Conversely, any statistical test for randomness can be used as a test for incompressibility.
Kolmogorov complexity is **uncomputable**: no algorithm can compute K(x) for arbitrary x. This follows from the halting problem: if K(x) were computable, one could solve whether a program halts. one can only approximate K(x) from above: enumerate all programs and find the shortest one outputting x, but one can never know if a shorter one exists.
Why is Kolmogorov complexity uncomputable?
Kolmogorov Complexity vs. Shannon Entropy
Shannon entropy and Kolmogorov complexity both measure "information", but from different angles. Shannon: a source is a probabilistic process, entropy describes an ensemble average. Kolmogorov: a specific string, no probabilities needed. How are they related?
| Aspect | Shannon Entropy | Kolmogorov Complexity |
|---|---|---|
| Object | Distribution p(X) | Specific string x |
| Probabilities | Required | Not needed |
| Computability | Computable | Uncomputable |
| Dependence on n | Averaged over n | Individual property |
| Application | Coding, channel capacity | Complexity theory, randomness |
The Minimum Description Length principle (MDL) is a practical approximation to K(x): choose the model that minimizes the sum of "model length + data length given the model". This is applicable in practice (unlike K) and approximates the Kolmogorov ideal. MDL is covered in lesson 19.
What is the key property that distinguishes Kolmogorov complexity from Shannon entropy?
Applications: Compression, ML, Randomness
Despite being uncomputable, Kolmogorov complexity has an enormous practical influence. It provides the theoretical foundation for data compression, machine learning, and randomness testing.
In **machine learning**, Kolmogorov complexity relates to generalization: a model that "compresses" training data (has a compact description) generalizes better. This underlies MDL (Minimum Description Length) and PAC learning. Modern neural networks seek "simple" solutions via regularization-a practical approximation to minimizing K.
Marcus Hutter (2000) used Kolmogorov complexity to define **AIXI**: a theoretically optimal intelligent agent. AIXI chooses actions maximizing reward, where the future is predicted by a weighted mixture of all computable worlds with weights 2^(-K(world)). This is uncomputable, but provides an ideal theoretical benchmark for AI research.
NCD (Normalized Compression Distance) is a practical application of Kolmogorov complexity. What does NCD(x, y) ≈ 0 indicate?
Key Ideas
- **K(x)** = length of the shortest program outputting x - an individual measure of string complexity
- **Incompressibility**: most strings of length n have K(x) ≈ n - randomness and incompressibility are equivalent
- **Uncomputability**: K(x) cannot be computed exactly, only approximated from above via real-world compressors
- **Connection to Shannon**: E[K(x)] ≈ H(X) - equal on average, but K is an individual property
Related Topics
Kolmogorov complexity connects to several important areas:
- MDL Principle — Practical approximation to K(x) for model selection
- Shannon Entropy — Statistical analogue: H(X) = E[K(x)] for typical strings
- Lossless Coding — Huffman and LZ-practical algorithms approximating K(x)
Вопросы для размышления
- Kolmogorov complexity is uncomputable. How does this affect practical applications - what do we lose and what do we still gain from the theory?
- If randomness is defined as incompressibility, what does this say about pseudo-random numbers from a PRNG? How random are they in the Kolmogorov sense?
- Occam's Razor says the simplest explanation is the best. How does Kolmogorov complexity formalize and justify this principle?