Computability

Kolmogorov Complexity

In 1965, Soviet mathematician Andrei Kolmogorov formulated a question that had haunted statisticians for 50 years: is a specific string '10110100...' random? Before him there was no answer - randomness was a property of the generation process, not the string itself. Kolmogorov proposed measuring randomness by the length of the shortest description. This unified information theory, computability theory, and foundations of mathematics into a single framework.

  • **NCD classification:** Normalized Compression Distance clusters genomes, detects plagiarism, and compares languages without any training data
  • **MDL (Minimum Description Length):** a machine learning principle - choose the model minimizing K(data | model) + K(model). This formalizes Occam's razor and underlies regularization
  • **RNG testing:** NIST SP 800-22 uses statistical tests that approximate incompressibility checks for cryptographic random number generators

Kolmogorov Complexity: Definition

In 1965, Andrei Kolmogorov asked an uncomfortable question: what does it mean for a string to be 'random'? His answer reshaped information theory. The Kolmogorov complexity K(x) of a string x is the length of the shortest program (on a universal Turing machine) that outputs x and halts. The string 'aaaa...a' (one million 'a's) has tiny complexity - the program `print('a' * 1_000_000)` is short. A string of one million random bits has complexity around one million bits - no description is shorter than the string itself.

K(x) depends on the choice of universal Turing machine U, but only up to a constant: for any two machines U1 and U2, |K_U1(x) - K_U2(x)| <= c, where c is independent of x. This invariance property makes K(x) an objective measure of complexity.

What is K(x) for the string x = '000...0' (n zeros)?

Incompressible Strings and the Incompressibility Lemma

Most strings are incompressible - and this is not intuition but a theorem. Out of 2^n strings of length n, exactly 2^(n-c) - 1 strings have K(x) >= n - c. For c = 1 that is more than half. Pigeonhole principle: there are exactly 2^(n-1) programs shorter than n-1 bits - they cannot describe all 2^n strings. The incompressibility lemma is used in theoretical CS just like the probabilistic method: assume a string is c-incompressible (K(x) >= |x| - c) and derive lower bounds.

Incompressibility Lemma: for any c >= 0, a c-incompressible string of length n exists for every n >= c. The fraction of incompressible strings also approaches 1 as n grows.

Why is it impossible to compress all strings of length n by even 1 bit?

Randomness via Complexity

What makes a string random? Before Kolmogorov there was no answer - randomness was defined by the generation process (coin flip), not by the string itself. Martin-Lof and Kolmogorov gave a structural definition: a string x is Martin-Lof random if and only if K(x) >= |x| - c for some constant c. A string that can be compressed contains hidden structure - a regularity that the compression algorithm exploits. A truly random string is one without structure; its shortest description is the string itself.

Martin-Lof-Chaitin Theorem: three approaches to randomness coincide: (1) Kolmogorov incompressibility, (2) passing all statistical tests (Martin-Lof), (3) impossibility of betting strategies (Schnorr martingales). This is one of the deepest results in theoretical CS.

The sequence 0101010101... (n symbols). What is K of this string approximately?

Applications of Kolmogorov Complexity

Kolmogorov complexity is more than a theoretical construct. The Normalized Compression Distance (NCD) allows clustering objects without any domain knowledge: NCD(x,y) = (K(xy) - min(K(x), K(y))) / max(K(x), K(y)). This distance classifies genomes, detects plagiarism, and clusters languages. In ML: Minimum Description Length (MDL) is Bayesian Occam's razor in K's language - choose the model minimizing K(data | model) + K(model). This is the basis for regularization.

K is not computable but can be approximated via real-world compressors (gzip, zstd). NCD based on gzip achieves expert-level accuracy on clustering tasks with zero training data.

Compressibility in real archivers (gzip, zstd) is the same as Kolmogorov complexity

Real compressors give an upper bound on K(x) but never reach the theoretical minimum

K(x) is the minimum over all possible programs on a universal Turing machine. Compressors implement specific algorithms (LZ77, Huffman) and do not enumerate all programs. The gap between K(x) and gzip output can be arbitrarily large.

Why cannot Kolmogorov complexity be computed exactly?

Key Ideas

  • **K(x) is the shortest program length:** minimum bits to describe an object. Regular strings have K = O(log n); random strings have K = n - O(1).
  • **Incompressibility is the norm:** most strings are incompressible. The fraction with K(x) >= n - c approaches 1 as n grows.
  • **Randomness = incompressibility:** a string is Martin-Lof random iff K(x) >= |x| - O(1). Three definitions of randomness coincide.
  • **K is uncomputable but approximable:** via real-world compressors. NCD gives a practically useful distance between any objects.

Related Topics

Kolmogorov complexity connects uncomputability and information theory.

  • Halting Problem — K(x) is uncomputable for the same reason - computing K reduces to HP
  • Mapping Reductions — Uncomputability of K is proved by reduction from the Halting Problem

Вопросы для размышления

  • If K(x) is uncomputable, how can it still be used in theoretical proofs? What exactly does the incompressibility lemma establish if incompressible strings cannot be found algorithmically?
  • The MDL principle says: choose the simplest model explaining the data. How does this differ from Bayesian inference? What is lost when moving from K(x) to real compressors?
  • The decimal expansion of pi (3.14159...) looks random but has a short description: 'first n digits of pi'. What does this reveal about the relationship between mathematical structure and randomness?

Связанные уроки

  • fl-01-intro
Kolmogorov Complexity

0

1

Sign In