Information Theory
Compressed Sensing and RIP
Цели урока
- Formulate the compressed sensing problem and the RIP condition
- Explain why l1 relaxation works when RIP holds
- Describe ISTA/FISTA algorithms and the concept of algorithm unrolling
- Connect CS to information theory via the information lower bound
Предварительные знания
- Rate-Distortion Theory
- Source Coding Theory
How can O(k log n/k) random measurements - far fewer than Nyquist - recover a k-sparse signal from n elements? What does RIP guarantee?
- CS-MRI (Siemens, GE): scan time reduced from 30 to 8-15 minutes
- THz single-pixel cameras: CS makes IR detectors 1000x cheaper
- LISTA: neural CS decoder 20-50x faster than iterative ISTA
- Random projections in ML: Johnson-Lindenstrauss + CS as the theoretical basis
CS: the 2006 Compression Revolution
2006: Candes, Romberg, Tao publish 'Robust Uncertainty Principles' and 'Stable Signal Recovery from Incomplete and Inaccurate Measurements'. Same year: Donoho publishes 'Compressed Sensing'. Simultaneous discovery by two groups - a signal the idea had matured. Foundations in earlier work: Basis Pursuit (Chen-Donoho-Saunders 1998), Uncertainty Principle in signal processing (Donoho-Stark 1989). First application: CS-MRI (Lustig et al., 2007). 2010: LISTA - neural CS decoder. 2012: CS in radar (DARPA, MIT LL). Today CS is part of the standard information theory curriculum.
Compressive Sensing: Fewer Measurements Than Nyquist
The Nyquist-Shannon theorem says: to reconstruct a signal with bandwidth B, one needs 2B samples per second. Compressive sensing (CS) says: if the signal is sparse, a 5-10x smaller number of random measurements suffices. Candes, Romberg, and Tao proved this in 2006. That same year Donoho published parallel results. A new field begins.
MRI (magnetic resonance imaging) is the medical application of CS. MRI signals are sparse in the frequency domain. Standard MRI: scanning time 30 minutes. CS-MRI (Lustig et al., 2007): 8-15 minutes at equal quality. For claustrophobic patients or children this is decisive. CS-MRI is now built into Siemens, GE, and Philips machines.
CS and Shannon's theorem: CS does not contradict the Nyquist-Shannon theorem. Nyquist assumes band-limited signals with no additional structure. CS uses sparsity as additional structure - a different class of signals. Shannon R(D) for a k-sparse signal: R(D) ~ k/2 * log(sigma^2/D), not n/2 * log(sigma^2/D).
Why does CS not contradict the Nyquist-Shannon theorem?
Nyquist: for bandwidth-B signals 2B samples are needed - exact theorem for band-limited signals. CS: for k-sparse signals from n elements, O(k log n) random measurements suffice - theorem for sparse signals. Different signal classes, different theorems.
The Restricted Isometry Property
RIP is the mathematical condition on measurement matrix Phi guaranteeing exact recovery via l1 minimization. Candes and Tao introduced RIP in 2005. Intuition: Phi nearly preserves distances between sparse vectors. This means sparse vectors can be distinguished by their compressed measurements.
Practical Measurement Matrices
Gaussian random matrix: Phi_{ij} ~ N(0, 1/m). Satisfies RIP with probability >= 1 - exp(-cm). Drawback: storing n*m numbers is expensive. Partial DFT matrix: select m random rows from an n-point DFT. Used in CS-MRI: random k-space trajectory instead of uniform. Hadamard matrix: fast Walsh-Hadamard transform, O(n log n). Used in CS spectroscopy.
What does RIP with delta_{2k} < sqrt(2) - 1 guarantee?
RIP with delta_{2k} < sqrt(2)-1 guarantees: l1 minimization (LASSO/basis pursuit) recovers x with error <= C*sigma_k(x)/sqrt(k) + D*noise. Here sigma_k(x) is the best k-sparse approximation error. If x is exactly k-sparse and there is no noise: exact recovery.
Recovery Algorithms and Connection to ML
Basis pursuit (l1 minimization) is a convex problem, solvable via LP or ADMM. For large n (MRI images) iterative algorithms are faster. ISTA/FISTA - proximal methods for LASSO. OMP (Orthogonal Matching Pursuit) - greedy algorithm. Neural network unrollings (LISTA, ISTA-Net) learn an 'accelerated' ISTA.
LISTA (Learned ISTA, Gregor and LeCun, 2010): unroll K steps of ISTA into a neural network and train. Each layer: W_e * y + W_s * x + S_theta. Training: minimize reconstruction error on (x, y) pairs. Result: 20-50x speedup over ISTA at equal quality. This is an early example of 'algorithm unrolling' - now widely applied in MRI, sEEG, and radar.
The Single-Pixel Camera
Rice University, 2006: 'Single Pixel Camera'. Instead of a CCD array - one photodetector and a programmable DMD mirror. Each measurement: random +/-1 mask applied to the image, integrated on the detector. m=10000 measurements from n=65536 pixels (16%). LASSO recovers the image. Application: IR and THz cameras where large CCD arrays are expensive. Each THz pixel costs around USD 10000 - the single-pixel CS camera is 1000x cheaper.
What is 'algorithm unrolling' in the context of LISTA?
Algorithm unrolling: an iterative algorithm (ISTA, AMP, ADMM) is unrolled into a fixed number of neural network layers. Parameters (weight matrices, thresholds) are learned by gradient descent. LISTA (10 layers) recovers the quality of 50-100 ISTA iterations by adapting to the data distribution.
CS Through the Lens of Information Theory
Compressive sensing and information theory are deeply connected. CS is source coding with a special sparsity structure. The number of measurements m = O(k log n/k) is exactly the amount of information needed to describe a k-sparse vector from n elements in the noiseless case.
Approximate Message Passing (AMP, Donoho-Maleki-Montanari, 2009): a CS algorithm optimal for Gaussian measurement matrices. AMP state evolution - analogous to density evolution for LDPC - precisely predicts recovery error for any m, n, k. AMP works orders of magnitude faster than ISTA and achieves the information-optimal bound m = O(k log n/k) for Gaussian Phi.
Why is the information-optimal number of CS measurements O(k log n/k)?
log C(n,k) ~ k*log(n/k) is the entropy of the uniform distribution over k-sparse patterns. Fewer measurements do not provide enough information to uniquely determine the support. CS at m = C*k*log(n/k) is information-theoretically optimal.
Connection to Other Topics
Compressive sensing unifies: information theory (lower bound k*log(n/k)), convex optimization (l1 problem, ADMM, FISTA), and probabilistic methods (RIP for random matrices, AMP state evolution). In ML: LASSO = l1 regularization = feature selection; Johnson-Lindenstrauss = random projections for dimensionality reduction; algorithm unrolling = LISTA, ISTA-Net. CS phase transition is the analog of the threshold SNR in LDPC density evolution.
- Rate-Distortion Theory — Sparse signals make R(D) collapse to k log(n/k) - the compressed sensing lower bound
- Information Theory in Machine Learning — L1 regularization (LASSO) is the convex relaxation used in compressed sensing reconstruction
- Kolmogorov Complexity — Sparsity is a structural notion of complexity that complements algorithmic Kolmogorov complexity
Итоги
- CS: m = O(k log n/k) random measurements suffice to recover k-sparse x from n elements
- RIP delta_{2k} < sqrt(2)-1 guarantees exact recovery via l1 minimization
- Random matrices (Gaussian, Bernoulli) satisfy RIP with high probability at m >= Ck*log(n/k)
- ISTA: soft threshold + gradient step; LISTA accelerates 20-50x through learning
Вопросы для размышления
- Why does replacing l0 with l1 in the CS problem work if these are different norms?
- How does the CS phase transition resemble the threshold SNR in LDPC density evolution?
- How does LISTA break the separation between 'algorithm' and 'data', and why does it work?