Digital Signal Processing
Compression: JPEG, MP3, H.264
People watch about 1 billion hours of YouTube video every day. If that ran as raw 1080p without compression, the network would have to carry tens of petabits per second. The real load is on the order of a few hundred terabits per second, roughly 300x less (assuming 30 fps, 24 bits per pixel, and about 5 Mbit/s per compressed stream). At the core of that delta sits one trick proposed by Nasir Ahmed in 1972: the discrete cosine transform. Not more beautiful math, not more correct math than Fourier. Just slightly better at packing real-image energy into the first coefficients. And on this single formula rests the entire civilization of digital media.
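A back-of-envelope check of the bandwidth claim can be sketched in a few lines. Every constant except the 1 billion hours is an assumption chosen for illustration (frame rate, bit depth, a 5 Mbit/s compressed stream), so the result is an order-of-magnitude estimate, not a measurement.

```python
# Back-of-envelope bandwidth estimate. All inputs except the hours figure
# are illustrative assumptions, not measured values.
HOURS_WATCHED_PER_DAY = 1e9      # figure quoted in the text
WIDTH, HEIGHT = 1920, 1080       # 1080p frame
BITS_PER_PIXEL = 24              # assumed: 8-bit RGB, no chroma subsampling
FPS = 30                         # assumed frame rate
COMPRESSED_BPS = 5e6             # assumed typical 1080p streaming bitrate

# 1e9 viewing hours spread over a 24-hour day = average concurrent streams.
concurrent_streams = HOURS_WATCHED_PER_DAY / 24

raw_bps_per_stream = WIDTH * HEIGHT * BITS_PER_PIXEL * FPS
raw_total_bps = raw_bps_per_stream * concurrent_streams
compressed_total_bps = COMPRESSED_BPS * concurrent_streams
ratio = raw_bps_per_stream / COMPRESSED_BPS

print(f"raw per stream:    {raw_bps_per_stream / 1e9:.2f} Gbit/s")
print(f"raw total:         {raw_total_bps / 1e15:.0f} Pbit/s")
print(f"compressed total:  {compressed_total_bps / 1e12:.0f} Tbit/s")
print(f"compression ratio: {ratio:.0f}x")
```

Changing the assumed frame rate or compressed bitrate moves the totals by a small factor, but the ratio stays in the hundreds.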
- **Netflix transcoding pipeline**: every episode passes through about 40 encoding profiles (H.264/H.265/AV1, different bitrates), perceptual VMAF score picks the optimum. CDN savings - hundreds of millions of USD a year
- **WhatsApp voice messages**: AMR-WB at 12.65 kbps. Shannon would say speech carries about 30 kbps of information. The psychoacoustic model cuts it 2.5x without intelligibility loss
- **Tesla autopilot**: 8 cameras at 1280x960 @ 36 FPS. Raw stream 1.5 GB/s. After H.264 on the FSD chip - 50 MB/s into the neural net. Without compression the internal bus could not carry it
- **Apple ProRes vs H.265**: cinema productions shoot in ProRes (lossy, but at a high bitrate of about 220 Mbps) to keep colour information for post. Distribution downgrades to H.265 at 4 Mbps - 55x smaller
DCT: cosines instead of sines
An analogue of Fourier with one twist: only cosines, no sines. Why the asymmetry? A cosine is an even function, and DCT implicitly reflects an 8x8 block across its boundary - this packs energy into the first coefficients more efficiently. Real images are not white noise. The lion's share of their energy sits at low spatial frequencies (smooth gradients of sky, skin, asphalt), while high-frequency content tends to be sensor noise, grass texture, or JPEG artifacts from a previous generation.
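The effect of the mirrored boundary can be seen on a toy 1-D signal. The sketch below compares an orthonormal DCT-II with a unitary DFT on an 8-sample ramp and counts how many coefficients are needed to hold 99% of the energy. The ramp, the 99% threshold and the helper names are illustrative choices, not part of any standard.

```python
import cmath
import math

def dct2_1d(x):
    """Orthonormal 1-D DCT-II."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)))
    return out

def dft_1d(x):
    """Unitary DFT (scaled by 1/sqrt(n)) so energies are comparable to the DCT."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * i * k / n) for i in range(n))
            / math.sqrt(n) for k in range(n)]

def coeffs_needed(coeffs, fraction=0.99):
    """Smallest number of coefficients (largest first) holding `fraction` of the energy."""
    energies = sorted((abs(c) ** 2 for c in coeffs), reverse=True)
    total = sum(energies)
    running = 0.0
    for count, energy in enumerate(energies, start=1):
        running += energy
        if running >= fraction * total:
            return count
    return len(energies)

# A smooth ramp: the DFT treats it as periodic, so it sees a jump from 7 back to 0.
ramp = [float(v) for v in range(8)]
dct_needed = coeffs_needed(dct2_1d(ramp))
dft_needed = coeffs_needed(dft_1d(ramp))
print(dct_needed, dft_needed)   # 2 8
```

The DFT spends coefficients on the artificial jump at the wrap-around point; the DCT's implicit mirror image has no such jump.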
DCT-II is the transform inside JPEG and MPEG; H.264 uses an integer approximation of it. For an 8x8 pixel block it returns an 8x8 matrix of coefficients: the top-left value (DC) is proportional to the block's mean luminance, the other 63 are AC components at increasing frequencies. The familiar zigzag scan order from textbooks - from DC out to high AC - is statistical-significance ordering: most-likely-nonzero first.
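A minimal sketch of the 8x8 transform, written as the direct textbook formula with orthonormal scaling (real codecs use fast factorizations, and the scaling convention here is an assumption). The test block is a hypothetical sky-like gradient that brightens from left to right.

```python
import math

N = 8

def dct2_8x8(block):
    """Orthonormal 2-D DCT-II of an 8x8 block (direct O(N^4) formula)."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):            # vertical frequency
        for v in range(N):        # horizontal frequency
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos(math.pi * (2 * y + 1) * u / (2 * N))
                              * math.cos(math.pi * (2 * x + 1) * v / (2 * N)))
            coeffs[u][v] = c(u) * c(v) * total
    return coeffs

# Hypothetical sky block: brightness rises smoothly left to right, all rows identical.
sky = [[100 + 2 * x for x in range(N)] for _ in range(N)]
coeffs = dct2_8x8(sky)
mean = sum(map(sum, sky)) / (N * N)

print(round(coeffs[0][0], 3), 8 * mean)          # DC equals 8 x mean in this scaling
print([round(value, 2) for value in coeffs[0]])  # first row carries all the energy
```

With this scaling the DC term is 8 times the block mean, and because every row of the block is identical, every coefficient with a nonzero vertical frequency is zero.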
Why 8x8 and not 4x4 or 16x16? A trade-off. Smaller block - faster, worse energy compaction. Larger block - better decorrelation but visible blocking artifacts. H.265 is already adaptive: coding blocks range up to 64x64 and transforms from 4x4 up to 32x32, and the encoder picks sizes by content. JPEG in 1992 chose 8x8 as the sweet spot for hardware of the era.
A JPEG encoder processes a sky photo with a smooth gradient. What happens to the DCT coefficients of an 8x8 block?
Quantization: where bits die
DCT loses no bits - it is just a change of basis, perfectly invertible. Real compression starts one step down. **Quantization** - each DCT coefficient is divided by an integer and rounded. Small coefficients become zero after division. In JPEG the quantization table is an 8x8 matrix: the upper-left corner has small divisors (10-20), the lower-right has large ones (60-120). That is precisely the 'quality slider' in Photoshop: shifting values of this matrix up or down.
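The divide-and-round step can be sketched as follows. The base table is the example luminance table from Annex K of the JPEG standard; the quality scaling follows the convention used by libjpeg, which is an assumption here (other encoders, Photoshop included, ship their own tables). The coefficient block is a hypothetical decaying spectrum, not data from a real image.

```python
# Example luminance quantization table (JPEG standard, Annex K).
BASE_LUMA_TABLE = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def scale_table(base, quality):
    """Scale a base table by a 1..100 quality value (libjpeg-style convention)."""
    quality = max(1, min(100, quality))
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    return [[max(1, min(255, (q * scale + 50) // 100)) for q in row] for row in base]

def quantize(coeffs, table):
    """Divide each coefficient by its table entry and round: the lossy step."""
    return [[round(c / q) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, table)]

def dequantize(levels, table):
    """Decoder side: multiply back. The rounding error is not recoverable."""
    return [[level * q for level, q in zip(lrow, qrow)]
            for lrow, qrow in zip(levels, table)]

def count_zeros(levels):
    return sum(level == 0 for row in levels for level in row)

# Hypothetical coefficient block whose magnitude decays with frequency.
coeffs = [[600 / (1 + u + v) ** 2 for v in range(8)] for u in range(8)]

zeros_q80 = count_zeros(quantize(coeffs, scale_table(BASE_LUMA_TABLE, 80)))
zeros_q10 = count_zeros(quantize(coeffs, scale_table(BASE_LUMA_TABLE, 10)))
print(zeros_q80, zeros_q10)   # lower quality -> larger divisors -> more zeros
```

In this convention quality 50 leaves the base table unchanged and quality 100 collapses it to all ones.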
Psychovisual model: the eye resolves fine detail poorly in the blue channel and best in green. JPEG quantizes Y (luminance) and Cb/Cr (chrominance) separately, and the Cb/Cr table is coarser. In addition, before the DCT colours are converted from RGB to YCbCr and Cb/Cr are subsampled by 2 in each direction (4:2:0), which a viewer does not notice. The same logic appears in LLM quantization: int4 kills the less-significant bits of weights, model quality drops a few percent, and model size shrinks 8x relative to fp32 (4x relative to fp16).
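The colour conversion and 4:2:0 subsampling can be sketched like this. The coefficients are the full-range BT.601 ones used by JFIF; averaging each 2x2 neighbourhood is one common way to subsample and is an assumption here, as is the tiny hypothetical test image.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion (JFIF convention); inputs in 0..255, no clamping."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def subsample_420(plane):
    """Halve a plane in both directions by averaging 2x2 neighbourhoods.

    Assumes even width and height.
    """
    height, width = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1]
              + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
             for x in range(0, width, 2)]
            for y in range(0, height, 2)]

# Hypothetical 4x4 image: left half pure red, right half pure blue.
RED, BLUE = (255, 0, 0), (0, 0, 255)
image = [[RED, RED, BLUE, BLUE] for _ in range(4)]

converted = [[rgb_to_ycbcr(*pixel) for pixel in row] for row in image]
y_plane = [[p[0] for p in row] for row in converted]
cb_small = subsample_420([[p[1] for p in row] for row in converted])
cr_small = subsample_420([[p[2] for p in row] for row in converted])

samples_before = 3 * 4 * 4
samples_after = 4 * 4 + sum(len(row) for row in cb_small) + sum(len(row) for row in cr_small)
print(samples_before, samples_after)   # 48 24
```

Luminance keeps full resolution while each chroma plane keeps a quarter of its samples, so the sample count halves before any DCT or quantization has happened.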
Artifacts emerge exactly where quantization gets aggressive: blocking on 8x8 boundaries (visible brightness jumps between blocks), Gibbs ringing around contrast edges (high frequencies wiped, edge dissolves into ripples), colour smearing on saturated reds (the Cr channel collapses first).
JPEG quality=10 produces a file 5x smaller than quality=80. What physically happens to the quantization matrix when quality drops?
Entropy coding: the final blow
After quantization an 8x8 matrix is in hand, most values zero, especially in the lower-right corner. That is half the battle, but each zero still occupies a byte. Entropy coding strips the remaining fat. JPEG uses two techniques in sequence: **zigzag + RLE** turns 64 coefficients into a stream like '(skip=5, value=3), (skip=12, value=1), EOB', then **Huffman** gives common pairs a short code and rare ones a long code.
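The zigzag scan and run-length step can be sketched as below. This is a simplification: real JPEG codes the DC term separately, packs each pair as (run, size category) plus amplitude bits, and uses a special symbol for zero runs longer than 15. The example block is hypothetical.

```python
N = 8

def zigzag_order(n=N):
    """(row, col) pairs in zigzag order: walk the anti-diagonals, alternating direction."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_length_encode(block):
    """Turn the 63 AC levels into (zero_run, value) pairs, closed by 'EOB'."""
    ac = [block[r][c] for r, c in zigzag_order()][1:]   # skip DC at index 0
    symbols, run = [], 0
    for level in ac:
        if level == 0:
            run += 1
        else:
            symbols.append((run, level))
            run = 0
    if run:   # trailing zeros collapse into a single end-of-block marker
        symbols.append("EOB")
    return symbols

# Hypothetical quantized block: 60 zeros, DC plus three nonzero AC levels.
block = [[0] * N for _ in range(N)]
block[0][0], block[0][1], block[1][0], block[2][1] = 50, 3, -2, 1

symbols = run_length_encode(block)
print(symbols)   # [(0, 3), (0, -2), (5, 1), 'EOB']
```

Sixty-three AC values shrink to four symbols, which is what the Huffman stage then receives.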
Shannon proved in 1948 the lower bound: no lossless code can, on average, compress a source below its entropy. Huffman comes within 1 bit per symbol of this bound; arithmetic coding gets arbitrarily close to it on long streams. CABAC inside H.264 is smarter still: symbol probabilities are estimated adaptively from context (neighbouring blocks, frame type). The same idea sits in LLMs: token probabilities estimated from a context window. LLMs generate, CABAC compresses - a formal duality.
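The gap between the entropy bound and a Huffman code can be checked numerically. This is a minimal sketch on a hypothetical three-symbol stream; it computes only code lengths, not the bit patterns, and the helper names are made up for illustration.

```python
import heapq
import math
from collections import Counter

def entropy_bits(counts):
    """Shannon entropy in bits per symbol for the empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def huffman_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman code."""
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tiebreak, symbols in this subtree).
    heap = [(weight, i, [sym]) for i, (sym, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:     # every merge adds one bit to these codes
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths

# Hypothetical skewed stream: one very common symbol, two rare ones.
stream = "a" * 70 + "b" * 20 + "c" * 10
counts = Counter(stream)
lengths = huffman_lengths(counts)

entropy = entropy_bits(counts)
average = sum(counts[s] * lengths[s] for s in counts) / len(stream)
print(f"entropy {entropy:.3f} bits/symbol, Huffman {average:.3f} bits/symbol")
```

Huffman cannot spend less than one whole bit on the most common symbol, which is where the gap comes from; arithmetic coding removes that restriction.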
MP3 and AAC are built on a different principle. Not DCT but MDCT - a modified DCT with overlapping windows to avoid audible seams. Not a spatial quantization matrix but a psychoacoustic model: which frequencies are masked by louder neighbours, which sit below the audibility threshold in this band. Quantize what the ear will not hear. The principle is universal: where the senses are blind, bits get cheap.
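The overlapping-window idea can be sketched with a direct O(N^2) MDCT. Each frame of 2N samples produces only N coefficients, and a single frame on its own decodes with time-domain aliasing; adding the overlapping halves of neighbouring frames cancels it. The sine window, the 2/N inverse scaling that pairs with it, the frame size and the test signal are all assumptions of this sketch, not the exact filterbank of MP3 or AAC.

```python
import math

def sine_window(length):
    """Sine window; satisfies w[n]^2 + w[n + N]^2 = 1 for length = 2N."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def mdct(frame):
    """2N windowed samples -> N coefficients."""
    n = len(frame) // 2
    return [sum(frame[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(2 * n))
            for k in range(n)]

def imdct(coeffs):
    """N coefficients -> 2N time-aliased samples (2/N scaling, see lead-in)."""
    n = len(coeffs)
    return [2 / n * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                        for k in range(n))
            for t in range(2 * n)]

def mdct_roundtrip(signal, n=8):
    """Analyse with 50%-overlapping frames, then resynthesise by overlap-add."""
    window = sine_window(2 * n)
    padded = [0.0] * n + list(signal) + [0.0] * n
    padded += [0.0] * (-len(padded) % n)        # make the length a multiple of n
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n + 1, n):
        frame = [padded[start + t] * window[t] for t in range(2 * n)]
        rebuilt = imdct(mdct(frame))
        for t in range(2 * n):
            out[start + t] += rebuilt[t] * window[t]   # aliasing cancels in the sum
    return out[n:n + len(signal)]

signal = [math.sin(0.3 * i) + 0.1 * i for i in range(40)]
rebuilt = mdct_roundtrip(signal)
max_error = max(abs(a - b) for a, b in zip(signal, rebuilt))
print(f"max reconstruction error: {max_error:.2e}")
```

Without quantization the round trip is exact up to floating-point error, even though each frame stores half as many coefficients as it has samples.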
After quantization 60 of 64 coefficients are zero. After zigzag + RLE roughly how many symbols feed into Huffman?
Lossy vs Lossless: where the line sits
Lossless algorithms (PNG, FLAC, gzip) drop no bits and compress by removing statistical redundancy: repeats, neighbour-pixel correlations, predictable patterns. PNG predicts each pixel from neighbours and codes the residual via DEFLATE - typical ratios 2-3x. Lossy formats (JPEG, MP3, H.264) throw away information the eye or ear cannot perceive - and reach 10-50x on the same images. This is not about math. It is about a model of human perception.
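The predict-then-DEFLATE idea can be sketched with Python's `zlib` module. Only PNG's simplest predictor is shown (the 'Sub' filter, which predicts each byte from its left neighbour), applied to a hypothetical synthetic grayscale gradient; a real PNG encoder chooses among five filters per row.

```python
import zlib

WIDTH, HEIGHT = 128, 64

# Hypothetical smooth grayscale gradient, one byte per pixel.
rows = [bytes((2 * x + 3 * y) % 256 for x in range(WIDTH)) for y in range(HEIGHT)]

def sub_filter(row):
    """PNG-style 'Sub' filter: each byte minus its left neighbour, modulo 256."""
    return bytes((row[i] - (row[i - 1] if i else 0)) % 256 for i in range(len(row)))

def unsub_filter(filtered):
    """Inverse of sub_filter: a running sum modulo 256 restores the row exactly."""
    out = bytearray()
    for i, delta in enumerate(filtered):
        out.append((delta + (out[i - 1] if i else 0)) % 256)
    return bytes(out)

raw = b"".join(rows)
residuals = b"".join(sub_filter(row) for row in rows)

raw_size = len(zlib.compress(raw, 9))
filtered_size = len(zlib.compress(residuals, 9))
print(len(raw), raw_size, filtered_size)
```

The residuals are almost all the same small value, so DEFLATE has far less to describe, and the inverse filter shows that no information was lost along the way.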
Modern standards are drifting toward neural compression. NVIDIA's AV1 encoder uses ML for motion-vector estimation. Google's HiFiC applies a GAN for perceptually-optimized image compression. Stability AI shipped the Stable Diffusion VAE as effectively an extreme encoder that maps each 8x8 pixel block to a single latent point - a 64x reduction in spatial positions, lossy but semantically faithful. The boundary between 'compression' and 'generation' is dissolving: the VAE decoder hallucinates details that were not in the source, but plausibly belong.
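How large the reduction is depends on what gets counted. The arithmetic below uses a 512x512 RGB input and a 64x64x4 latent; the storage precisions (8-bit pixels, 16-bit latent values) are assumptions made for the byte-level comparison.

```python
import math

image_shape = (512, 512, 3)     # height, width, RGB channels
latent_shape = (64, 64, 4)      # height, width, latent channels

IMAGE_BYTES_PER_VALUE = 1       # assumed: 8-bit pixels
LATENT_BYTES_PER_VALUE = 2      # assumed: latent stored as float16

# Ratio of spatial positions: each latent point covers an 8x8 pixel patch.
spatial_ratio = (image_shape[0] * image_shape[1]) / (latent_shape[0] * latent_shape[1])

# Ratio of stored numbers: 3 channels in, 4 channels out.
value_ratio = math.prod(image_shape) / math.prod(latent_shape)

# Ratio of bytes under the assumed precisions.
byte_ratio = (math.prod(image_shape) * IMAGE_BYTES_PER_VALUE) / (
    math.prod(latent_shape) * LATENT_BYTES_PER_VALUE)

print(spatial_ratio, value_ratio, byte_ratio)   # 64.0 48.0 24.0
```

The headline figure is a count of spatial positions; counted in values or bytes the ratio is smaller, and unlike JPEG there is no entropy-coded bitstream at all.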
Bitrate vs perceptual quality is not linear. The rate-distortion curve looks like a hyperbola: the first bitrate doublings give a sharp quality lift, then returns flatten to a plateau. YouTube picks bitrate right at the knee, where shaving 30% off is imperceptible but going further becomes visible. The process is automated: every video is transcoded into several profiles and the cheapest one that passes perceptual evaluation gets served.
Common misconception: lossy compression 'damages' data, lossless is 'honest', so serious work always demands lossless.
Lossy and lossless answer different questions. Lossless: 'preserve an exact representation, squeeze the redundancy'. Lossy: 'preserve what the consumer perceives, drop the rest'. Video as a product would not exist without lossy - bitrate would be 100x more expensive.
The choice is by goal, not philosophy. Archiving a source photograph - lossless (DNG, RAW). Distribution - lossy (JPEG, AVIF). Professional audio mastering - WAV/FLAC. Streaming - Opus/AAC. Medical scans before diagnosis - lossless DICOM. After archiving and anonymization - lossy JPEG2000. The tech is picked per pipeline stage, not as a philosophical stance.
Stable Diffusion VAE compresses a 512x512 RGB image into a 64x64x4 latent. What is the compression ratio and why is this not classical JPEG?
Related topics
Media compression rests on three pillars - spectral decorrelation, psychovisual perception models, entropy coding:
- Fourier and spectral analysis - DCT is a special case of Fourier on real even functions
- Wavelet transform - JPEG2000 replaced DCT with wavelets for multiresolution compression
- Shannon entropy - sets the lower bound on compressed file size
- Quantization in ML - same trick in LLMs: int4 weights take 8x less GPU memory than fp32 (4x less than fp16)
Key ideas
- **DCT** concentrates the energy of natural images into the first coefficients - invertible, so not compression by itself, but a cheap base for subsequent quantization
- **Quantization** is the main lossy step (chroma subsampling is the other). Divisors from the table set the size-quality trade-off. Psychovisual model: different aggression for Y vs Cb/Cr
- **Zigzag + RLE + Huffman/CABAC** pushes the quantized data toward Shannon's entropy bound. JPEG artifacts (blocking, ringing) are the physical signature of aggressive high-frequency quantization
- **Lossy vs Lossless** - not philosophy, but a choice per pipeline stage. Archive - lossless. Distribution - lossy with perceptual quality control
- **Neural compression** (VAE, GAN-based) - the next iteration: the decoder hallucinates details from a learned prior, the boundary with generation dissolves
Questions for reflection
- What happens if quantization is run with an all-ones table (Q[i][j]=1)? Is the result lossless? What happens to file size?
- Why does MP3 use MDCT with overlapping windows rather than plain DCT? What would the sound do at block boundaries without overlap?
- Is Stable Diffusion VAE compression or generation? How does the boundary differ from a JPEG decoder 'guessing' detail at quality=10?
Related lessons
- dsp-04 — Fourier and spectral analysis are the foundation for DCT
- dsp-13 — Wavelet in JPEG2000 replaced DCT - same multiresolution idea
- it-04 — Huffman and arithmetic coding are the final stage of compression
- it-01 — Shannon entropy sets the lower bound on compression
- aie-25-multimodal — Vision-LLMs see images already as JPEG/H.264 - artifacts affect quality
- cv-04 — ConvNets train on JPEG-compressed ImageNet datasets
- calc-01-sequences