Trigonometry

Discrete Cosine Transform: JPEG and MP3

Lesson objectives

Compute DCT-II and explain its relationship to the DFT
Understand energy compaction and the connection to KLT
Describe the JPEG pipeline: DCT -> quantization -> zigzag -> Huffman
Explain the role of MDCT in MP3 and the overlap-add principle

Prerequisites

Fourier Series
DFT and FFT
Wavelet Analysis

Why does JPEG compress 12 MB to 1 MB without visible loss - and what exactly is thrown away?

JPEG: 12:1 photo compression, standard since 1992
MP3/AAC: 10:1 audio compression via MDCT and psychoacoustics
HEVC/H.265: Netflix 4K streaming at 15-20 Mbps
WebP: next-generation image format with fewer artifacts than JPEG

History: from the KLT theorem to the ISO standard

The Karhunen-Loeve transform (1946/1960) decorrelates data optimally, but its basis depends on the covariance of the signal and is expensive to compute. In 1974 Ahmed, Natarajan, and Rao showed that the DCT asymptotically coincides with the KLT for highly correlated first-order Markov data, while using a fixed basis that is computable in O(N log N). The JPEG committee adopted DCT-II as the basis of its standard in 1992. MP3 was standardized in 1993 as MPEG-1 Audio Layer III, which uses MDCT. Today DCT-family transforms underlie JPEG, MP3, AAC, H.264, and H.265, that is, virtually all multimedia content on the internet.

DCT-II and Energy Compaction

JPEG can compress an iPhone photograph roughly 12-fold: 12 MB becomes 1 MB. For typical photographs the discrete cosine transform concentrates nearly all of the image energy (on the order of 99%) into a small fraction of the coefficients (around 10%), and most of the rest are quantized to zero. Netflix streams 4K via HEVC, whose block transform coding is analogous to JPEG's but uses transform blocks up to 32x32 inside coding units up to 64x64.

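DCT-II can be computed via FFT: zero-pad the N-point input to length 2N, take the FFT, multiply output k by the phase factor exp(-i*pi*k/(2N)), and keep the real part of the first N outputs. This gives O(N log N) instead of the O(N^2) of direct computation.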
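
As a sanity check on the FFT route, the sketch below compares a direct O(N^2) evaluation of the unnormalized DCT-II, X_k = sum_{n=0}^{N-1} x_n cos(pi (2n+1) k / (2N)), with the following recipe: zero-pad to length 2N, take the FFT, rotate output k by exp(-i*pi*k/(2N)), and keep the real part. The function names and the small recursive radix-2 FFT are illustrative assumptions (N must be a power of two here); a real codec would rely on an optimized FFT library.

```python
import cmath
import math


def dct2_direct(x):
    """Unnormalized DCT-II by direct summation, O(N^2)."""
    N = len(x)
    return [
        sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def fft(a):
    """Recursive radix-2 FFT (illustrative; len(a) must be a power of two)."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dct2_via_fft(x):
    """DCT-II from a 2N-point FFT of the zero-padded input."""
    N = len(x)
    Y = fft([complex(v) for v in x] + [0j] * N)
    # Rotating Y[k] by exp(-i*pi*k/(2N)) turns its real part into the DCT-II sum.
    return [(cmath.exp(-1j * math.pi * k / (2 * N)) * Y[k]).real for k in range(N)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
direct = dct2_direct(x)
fast = dct2_via_fft(x)
max_diff = max(abs(a - b) for a, b in zip(direct, fast))
print(max_diff)  # difference between the two routes (rounding error only)
```

Taking the real part without the phase rotation would give sum x_n cos(pi n k / N), which lacks the half-sample shift of the DCT-II basis.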

Why is DCT-II more effective than DFT for image compression?

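Natural images are well approximated statistically by a highly correlated first-order Markov process. For such data the DCT asymptotically coincides with the optimal Karhunen-Loeve transform (KLT), so it decorrelates the data almost as efficiently as possible. In addition, the DCT implicitly extends each block by even reflection, whereas the DFT extends it periodically; the DFT therefore sees an artificial jump at the block edges, which spreads energy into high-frequency coefficients.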
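
Energy compaction can be checked numerically on a toy signal. The sketch below uses a linear ramp as a stand-in for a smooth image row (an assumption for illustration, not real image data) and measures how much of the total energy the 4 largest coefficients capture under an orthonormal DCT-II versus a DFT. The helper names are hypothetical.

```python
import cmath
import math


def dct2_ortho(x):
    """Orthonormal DCT-II, so that sum(X_k^2) equals sum(x_n^2)."""
    N = len(x)
    out = []
    for k in range(N):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / N)
        out.append(scale * sum(
            x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N)
        ))
    return out


def dft(x):
    """Plain O(N^2) DFT; enough for a small demonstration."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]


def energy_fraction(coeffs, keep):
    """Share of total energy held by the `keep` largest coefficients."""
    energies = sorted((abs(c) ** 2 for c in coeffs), reverse=True)
    return sum(energies[:keep]) / sum(energies)


ramp = [float(n) for n in range(32)]
dct_frac = energy_fraction(dct2_ortho(ramp), keep=4)
dft_frac = energy_fraction(dft(ramp), keep=4)
print(dct_frac, dft_frac)
```

The DFT treats the ramp as periodic, so it sees a jump from 31 back to 0 and its coefficients decay slowly; the DCT's mirrored extension has no such jump.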

JPEG: Quantization and Encoding

JPEG works like this: divide the image into 8x8 blocks, apply the 2-D DCT to each, divide each coefficient by the matching entry of a quantization matrix Q (larger entries for high frequencies), and round to integers. Most high-frequency coefficients, to which the eye is least sensitive, become zero. Zigzag scanning then reads the matrix out as a vector ordered from low to high frequency, so the zeros collect in long trailing runs that compress well with RLE plus Huffman coding.
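
The per-block steps can be sketched as follows. This is a simplified illustration under stated assumptions: the matrix Q is a hypothetical one whose step size grows with frequency (not the table from the JPEG standard), the DC coefficient is run-length coded together with the AC coefficients (real JPEG codes it differentially), and the final Huffman stage is omitted.

```python
import math

N = 8

# Hypothetical quantization matrix: coarser steps at higher frequencies.
# This is NOT the standard JPEG luminance table.
Q = [[8 + 4 * (u + v) for v in range(N)] for u in range(N)]


def dct2_1d(vec):
    """Orthonormal 1-D DCT-II of an 8-sample vector."""
    out = []
    for k in range(N):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / N)
        out.append(scale * sum(
            vec[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N)
        ))
    return out


def dct2_2d(block):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct2_1d(row) for row in block]
    cols = [dct2_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    return [[cols[c][r] for c in range(N)] for r in range(N)]


def quantize(coeffs):
    """Divide by Q entry-wise and round to the nearest integer."""
    return [[round(coeffs[u][v] / Q[u][v]) for v in range(N)] for u in range(N)]


def zigzag_order():
    """Cells sorted by anti-diagonal, alternating direction on each diagonal."""
    cells = [(u, v) for u in range(N) for v in range(N)]
    return sorted(
        cells,
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]),
    )


def run_length(vec):
    """(zero_run, value) pairs for nonzero entries, then an end-of-block marker."""
    pairs, run = [], 0
    for value in vec:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append("EOB")  # trailing zeros are implied by the marker
    return pairs


# A smooth gradient block, level-shifted by 128 as JPEG does for 8-bit samples.
block = [[100 + 2 * r + 3 * c for c in range(N)] for r in range(N)]
shifted = [[p - 128 for p in row] for row in block]
quantized = quantize(dct2_2d(shifted))
scanned = [quantized[u][v] for (u, v) in zigzag_order()]
encoded = run_length(scanned)
print(encoded)
```

For a smooth block like this one, only a handful of low-frequency entries survive quantization, so the run-length output is far shorter than the 64 input samples.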

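JPEG blocking artifacts (visible squares at high compression) result from independent quantization of 8x8 blocks. JPEG 2000 uses DWT and is free of this defect. WebP's lossy mode is still block-based, but it reduces blocking with intra prediction and a deblocking filter.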
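
A 1-D toy example shows where the visible edges come from. The sketch below splits a smooth ramp into two 8-sample blocks and, as an extreme stand-in for harsh quantization (an assumption for illustration), keeps only the DC coefficient of each block. The function names are hypothetical.

```python
import math


def dct2_ortho(x):
    """Orthonormal DCT-II."""
    N = len(x)
    return [
        math.sqrt((1.0 if k == 0 else 2.0) / N)
        * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def idct2_ortho(X):
    """Inverse of dct2_ortho (an orthonormal DCT-III)."""
    N = len(X)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / N) * X[k]
            * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
            for k in range(N)
        )
        for n in range(N)
    ]


def code_block(block, keep):
    """Keep the first `keep` coefficients and zero the rest."""
    X = dct2_ortho(block)
    return idct2_ortho([c if k < keep else 0.0 for k, c in enumerate(X)])


signal = [float(n) for n in range(16)]  # smooth ramp, step of 1 everywhere
decoded = code_block(signal[:8], keep=1) + code_block(signal[8:], keep=1)

inside_step = decoded[1] - decoded[0]    # step inside a block after coding
boundary_jump = decoded[8] - decoded[7]  # step across the block boundary
print(inside_step, boundary_jump)
```

Each block collapses to its own mean, so the gentle slope of the original turns into a single large step exactly at the block boundary.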

Why does JPEG use 8x8 pixel blocks rather than the entire image?

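Small blocks: fast to code and well matched to local spatial correlation, but they cause blocking artifacts at high compression. Large blocks: better use of global statistics, but slower, and because image content changes across a large block, quantization errors such as ringing near edges spread over a wider area.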

MDCT in MP3 and AAC

MP3 compresses audio roughly 10-fold with little perceptible loss. The key tool is MDCT with 50% window overlap, which eliminates the discontinuities between frames that would otherwise be audible as clicks. Spotify streams at up to 320 kbps (Ogg Vorbis, another MDCT-based codec), which requires processing 44100 samples/s per channel through overlapping MDCT windows in real time.

Bitrate and perceived quality in MP3

Quality vs. bitrate trade-off

128 kbps: artifacts audible on close listening
192 kbps: most listeners cannot distinguish it from CD
320 kbps: effectively indistinguishable from CD
FLAC (lossless): 700-1100 kbps
AAC at 256 kbps is generally rated at least as good as MP3 at 320 kbps
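
To relate these bitrates to the "10-fold" figure, the short calculation below assumes the CD reference format (44100 samples/s, 16 bits per sample, 2 channels) and computes the compression ratio at each MP3 bitrate. The constant and function names are illustrative.

```python
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2

# Uncompressed CD audio bitrate in kbps.
cd_kbps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS / 1000


def compression_ratio(coded_kbps):
    """How many times smaller the coded stream is than CD audio."""
    return cd_kbps / coded_kbps


for rate in (128, 192, 320):
    print(rate, compression_ratio(rate))
```

Under these assumptions CD audio runs at 1411.2 kbps, so 128 kbps corresponds to about 11:1 and 320 kbps to about 4.4:1.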

Why does MDCT in MP3 use 50% window overlap?

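Independent coding of non-overlapping blocks creates audible discontinuities at boundaries. MDCT takes 2N windowed samples and produces N coefficients, with consecutive frames shifted by N, so each sample participates in two frames. Each inverse transform on its own contains time-domain aliasing, but the aliasing terms of adjacent frames have opposite signs, so overlap-add cancels them and removes the boundary jumps.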
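
The overlap-add principle can be verified with a minimal sketch. It assumes the common MDCT convention X_k = sum_{n=0}^{2N-1} x_n cos(pi/N (n + 1/2 + N/2)(k + 1/2)), a sine window applied at both analysis and synthesis, and an inverse scaled by 2/N. Real MP3 adds a polyphase filterbank and window switching on top of this, which the sketch leaves out. All names are hypothetical.

```python
import math


def sine_window(N):
    """Length-2N sine window; satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def mdct(frame):
    """2N input samples -> N coefficients."""
    N = len(frame) // 2
    return [
        sum(frame[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
            for n in range(2 * N))
        for k in range(N)
    ]


def imdct(coeffs):
    """N coefficients -> 2N samples that still contain time-domain aliasing."""
    N = len(coeffs)
    return [
        2.0 / N * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                      for k in range(N))
        for n in range(2 * N)
    ]


def analysis_synthesis(signal, N):
    """Window, MDCT, IMDCT, window again, and overlap-add with hop size N."""
    w = sine_window(N)
    # Pad by N on each side so every real sample is covered by two frames.
    padded = [0.0] * N + list(signal) + [0.0] * N
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * N + 1, N):
        frame = [padded[start + n] * w[n] for n in range(2 * N)]
        y = imdct(mdct(frame))
        for n in range(2 * N):
            out[start + n] += y[n] * w[n]
    return out[N:N + len(signal)]


N = 8
signal = [math.sin(0.3 * n) + 0.1 * n for n in range(4 * N)]  # length is a multiple of N
rebuilt = analysis_synthesis(signal, N)
max_error = max(abs(a - b) for a, b in zip(signal, rebuilt))

# A single frame on its own does not reproduce its windowed input.
w = sine_window(N)
one_frame = [signal[n] * w[n] for n in range(2 * N)]
alias_error = max(abs(a - b) for a, b in zip(one_frame, imdct(mdct(one_frame))))
print(max_error, alias_error)
```

The reconstruction error after overlap-add is at rounding level, while a single frame decoded in isolation differs visibly from its input. This is the time-domain aliasing that the neighboring frame cancels.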

Connections to other topics

DCT links trigonometric analysis to engineering compression standards

JPEG and HEVC
MP3 and AAC
KLT and statistics
Filter banks

Summary

DCT-II concentrates energy in a few coefficients and is asymptotically optimal (it approaches the KLT) for Markov data, which is the reason JPEG works
JPEG: DCT of 8x8 block -> quantization by matrix Q -> zigzag scan -> RLE + Huffman; quality controlled by scaling Q
MDCT with 50% overlap eliminates blocking artifacts in MP3/AAC via overlap-add reconstruction
Virtually all multimedia on the internet (JPEG, MP3, H.264, H.265) is built on one idea: a DCT-type transform followed by coarse quantization of high frequencies

Questions for reflection

Why does quantizing high-frequency coefficients produce less visible artifacts than quantizing low-frequency ones?
How does psychoacoustic masking allow non-uniform quantization in MP3?
What is the fundamental improvement of JPEG 2000 over JPEG, and why does JPEG still dominate?

Related lessons

trig-21 — Fourier series are the theoretical foundation of DCT
trig-24-wavelets — DCT and DWT are competing approaches; JPEG 2000 uses DWT
trig-26-trig-poly — DCT basis functions are trigonometric polynomials with symmetric boundary conditions