Information Theory

Rate-Distortion Theory

Цели урока

Define the R(D) function via mutual information minimization
Derive and apply the closed-form R(D) for the Gaussian source with MSE
Understand water-filling and the source coding theorem with distortion
Connect R(D) theory to neural compression and the VAE framework

Предварительные знания

Differential Entropy
Mutual Information and KL Divergence

What is the minimum rate to transmit a Gaussian source with a given MSE? How should bits be optimally distributed across multiple components?

Netflix 4K at 2 Mbit/s operates within 3 dB of Shannon's R(D) bound
Neural network weight quantization (int4/int8) is a direct R(D) application
JPEG 2000 uses water-filling across wavelet subbands
Beta-VAE selects a point on the rate-distortion curve via the beta parameter

From Sources to the Bound

Claude Shannon laid the foundations of information theory in 1948. The rate-distortion coding theorem appeared in 1959 in 'Coding theorems for a discrete source with a fidelity criterion'. This was the step from binary channel coding to continuous sources. Shannon showed that every source has an R(D) function and that coding below R(D) while maintaining distortion D is fundamentally impossible. Berger systematized the theory in the 1971 monograph 'Rate Distortion Theory'. Practical application came with JPEG (1992), JPEG 2000 (2000), and H.265 (2013) - all operating near but not at the R(D) bound.

The Rate-Distortion Function R(D)

Netflix compresses 15 petabytes of video content using H.265 (HEVC): at a target bitrate of 2 Mbit/s for 4K video, the codec operates within 3 dB of Shannon's R(D) bound from 1959. Three decibels - the cost of not having cracked that 1959 formula completely.

Neural network quantization is a direct application of R(D) theory. Each weight tensor is a random source with variance sigma^2. R(D) predicts the minimum bitrate for a given reconstruction accuracy. This is why int4 quantization achieves 4 bits per weight without catastrophic accuracy loss: neural networks behave close to Gaussian sources.

R(D) is Shannon's lower bound: no encoder can transmit a source with distortion D using fewer than R(D) bits per symbol. The gap between R(D) and a real codec's performance is the rate-distortion gap.

What does R(D) = 0 mean for a discrete source?

R(D_max) = 0: when D >= D_max the optimal decoder outputs a fixed constant (mean or median) without receiving any bits. Infinite distortion is acceptable - no information is needed.

Gaussian R(D) and Water-Filling

1959. Shannon publishes 'Coding theorems for a discrete source with a fidelity criterion'. Among the results: a closed-form R(D) for the Gaussian source. Its elegance lies in having no integral to minimize over.

JPEG 2000 and Water-Filling

JPEG 2000 decomposes an image using wavelet transforms into subbands with different variances. Then water-filling allocates bits: the low-frequency subband (sigma^2 = 100) gets 3 bits per coefficient, high-frequency (sigma^2 = 0.01) gets zero. This is why JPEG 2000 at low bitrates looks softer than classic JPEG: it optimally stays silent where visual quality does not depend on detail.

What is the rate-distortion function R(D) for a Gaussian source N(0, sigma^2) with MSE distortion?

For N(0, sigma^2) with MSE: R(D) = 1/2 * log2(sigma^2/D) for 0 <= D <= sigma^2. The 1/2 factor comes from the continuous nature of the Gaussian source. The formula R(D) = 1/2 * log2(1 + sigma^2/D) is the channel capacity formula, not the source R(D).

Source Coding Theorem with Distortion

Shannon's rate-distortion theorem is symmetric to the channel coding theorem. For channels: reliable transmission is possible if and only if R < C. For sources: coding with distortion D is possible if and only if R >= R(D). Two sides of the same coin - a duality of theories.

The gap between R(D) and a real codec is the rate-distortion gap. H.265 lags behind the theoretical bound by 2-4 dB in PSNR at typical bitrates. AV1 closed this gap by roughly 30% compared to H.265.

VAE (Variational Autoencoder) is an implementation of the rate-distortion coding theorem in neural networks. The latent vector z is the code. MSE reconstruction plus KL divergence form the rate-distortion tradeoff. Beta-VAE with parameter beta selects a point on the R(D) curve: larger beta means lower R and higher D.

What does the achievability part of the R(D) theorem guarantee?

Achievability: at R > R(D) there exists a sequence of (n, 2^{nR})-codes with distortion <= D + epsilon for large enough n. This is an existence guarantee - constructively such codes are built by random codebook selection.

R(D) in Practice: Perceptual Measures and Neural Compression

MSE is convenient mathematics but a poor model of perception. Humans see structure, not pixels. In 2020 Google Brain published 'High-Fidelity Generative Image Compression': a neural codec beating H.265 at 0.1 bit per pixel. The key: a GAN loss function instead of MSE - a different d(x, x-hat) in the R(D) formula.

Neural Compression and Beta-VAE

balle2018 (Google Brain, 2018): a neural codec based on a variational autoencoder. Architecture: encoder f_theta(x) -> z, quantizer Q(z) -> z_hat, decoder g_phi(z_hat) -> x_hat. Loss: R + lambda*D = -log p(z_hat) + lambda * MSE(x, x_hat). The parameter lambda corresponds to a point on the R(D) curve: lambda=0.01 gives low bitrate, lambda=0.1 gives high quality. At 0.1 bit per pixel the result is subjectively twice as good as JPEG 2000.

Why do neural codecs with perceptual loss sometimes outperform H.265 in subjective quality?

R(D) holds for any distortion measure d(x, x-hat). MSE is technically convenient but does not model vision. Neural codecs change d to a perceptual measure and optimize R(D) for that measure. H.265 is optimized for MSE/PSNR and loses when evaluated on perceptual metrics.

Connection to Other Topics

R(D) theory unifies: mutual information (definition via R(D) = min I(X;X-hat)), the method of types (proof of the coding theorem), and water-filling (from portfolio theory via Kelly's criterion). In ML: beta-VAE directly implements the R(D) tradeoff; the information bottleneck (Tishby, 1999) generalizes R(D) to relevance-based tasks. Duality with channel capacity: C = sup_px R(0) over channels; R(D) = inf over sources at given D.

Channel Capacity — Dual maximization problem - capacity maxes mutual information over channels while R(D) minimizes it over sources
Differential Entropy and Continuous Channels — Provides Gaussian source entropy used in R(D) closed-form calculations
Information Theory in Deep Learning — beta-VAE objective is a direct neural implementation of the R(D) tradeoff

Итоги

R(D) = min I(X;X-hat) over all conditional distributions with expected distortion <= D
Gaussian source with MSE: R(D) = 1/2 * log2(sigma^2/D) - the only closed-form result
Water-filling: bits go where sigma_i^2 > theta; weak components stay silent
Theorem: at R > R(D) reliable compression with distortion D is possible; at R < R(D) - not

Вопросы для размышления

Why does the Gaussian source have a closed-form R(D) while most sources do not?
How does the choice of distortion measure d(x, x-hat) affect the optimality of a codec for a specific application?
What is the fundamental difference between R(D) source theory and the channel capacity theorem?

Связанные уроки

it-22 — channel capacity is the dual problem to R(D)
it-20 — differential entropy is needed for Gaussian R(D)
it-06 — noisy channels provide physical context for rate-distortion