Digital Signal Processing

Discrete Fourier Transform (DFT)

Spotify analyzes 100 million tracks for recommendations by BPM, key, and genre. All of it starts with DFT - transforming 44,100 numbers per second into a list of frequencies. Shazam identifies a song from 5 seconds of noisy audio against a database of 100 million tracks in under a second. MP3 achieves 10:1 compression with no audible loss. None of this exists without DFT.

**Shazam**: computes DFT every 100 ms, finds characteristic spectral peaks, matches them against 100M tracks in <1 second via locality-sensitive hashing
**MP3/AAC**: MDCT (modified DFT) splits audio into frequencies, a psychoacoustic model identifies what ears cannot hear, the codec discards it - 10:1 compression without audible loss
**ECG monitors**: the 50/60 Hz power-line interference in cardiac signals is identified by DFT and removed with a notch filter without distorting the actual heartbeat

Historical context

In 1965, Cooley and Tukey published the Fast Fourier Transform algorithm in Mathematics of Computation. Before FFT, computing DFT for 1024 samples required about 1,000,000 operations - O(N²). FFT reduced this to N*log₂(N) = 10,240 operations - a 100x speedup. It was later discovered that Gauss knew this algorithm in 1805 but never published it. FFT is consistently ranked among the 10 most important algorithms of the 20th century and made real-time digital signal processing economically viable. The CIA reportedly used this publication to monitor compliance with nuclear test-ban treaties by analyzing seismic frequency data.

DFT: decomposing a signal into frequencies

Bell Labs, 1965. An engineer records speech on tape and needs to remove 400 Hz noise - but with 44,100 samples per second in the time domain, there is no way to identify whether that frequency is even present. The frequency domain reveals what the time domain hides. DFT does exactly this: converts N time-domain samples into N complex coefficients, each describing one frequency component. Formula: X[k] = Σ(n=0..N-1) x[n] * e^(-j*2π*k*n/N). Direct computational complexity: O(N²).

Parameter	Formula	Example at Fs=44100, N=1024
Frequency resolution	Δf = Fs / N	44100 / 1024 ≈ 43 Hz per bin
Max frequency (Nyquist)	Fs / 2	22050 Hz
Time resolution	T = N / Fs	1024 / 44100 ≈ 23 ms
Naive DFT complexity	O(N²)	N=1024: ~1,000,000 operations

A signal recorded at Fs=8000 Hz, block size N=1024 samples. What is the DFT frequency resolution?

Frequency domain: what the coefficients mean

DFT returns N complex numbers X[k]. Each is the projection of the signal onto a sinusoid of frequency k*Fs/N. A complex number encodes both amplitude (|X[k]|) and phase (arg(X[k])). For a real-valued signal: X[N-k] = X[k]* (complex conjugate) - the left half of the spectrum mirrors the right, so only k=0..N/2 carries independent information.

**DC component X[0]:** mean value of the signal (constant offset). If the signal is symmetric around zero, X[0] ≈ 0. **X[N/2]:** component at the Nyquist frequency. For a real signal, X[N/2] is always real-valued.

DFT of a real signal with N samples returns N complex coefficients. How many independent frequency components does the result contain?

Spectrum: amplitude and phase

The amplitude spectrum |X[k]| shows the strength of each frequency. The phase spectrum arg(X[k]) shows its shift. For most practical tasks (EQ, noise reduction, pitch detection) the amplitude spectrum is sufficient. Phase becomes critical when modifying a signal before inverse IDFT - ignoring phase produces audible artifacts.

**Normalization:** raw |X[k]| is scaled by N. For a single-sided spectrum (positive frequencies only): amplitude = 2*|X[k]|/N. For DC (k=0) and Nyquist (k=N/2) the factor of 2 is not applied. Without normalization, comparing spectra from different block sizes is meaningless.

A 1000 Hz signal has amplitude 0.5. After DFT with N=1024 at Fs=44100 Hz, what is the raw |X[k]| at bin k=23 (1000 Hz)?

Spectral Leakage: artifact of the finite window

The main DFT artifact: if the signal frequency does not land exactly on a bin (k*Fs/N), energy 'leaks' into neighboring bins. A 441 Hz tone at Fs=8000, N=1024 falls between bin 56 (440.6 Hz) and bin 57 (448.4 Hz). The spectrum shows a smeared peak with sidelobes instead of a sharp spike at 441 Hz. The cause: DFT implicitly assumes the block repeats periodically - a non-integer frequency creates a discontinuity at the boundary.

Window function	Sidelobe level	Main lobe width	Application
Rectangular	-13 dB	1 bin	No leakage only when frequency is exactly on a bin
Hanning	-31 dB	2 bins	Audio processing, spectral analysis
Hamming	-41 dB	2 bins	Filtering, speech processing
Blackman	-58 dB	3 bins	When maximum leakage suppression is needed
Kaiser	-80+ dB	configurable	High-precision DSP filters

A window function completely eliminates spectral leakage

A window only reduces leakage by redistributing sidelobe energy. This widens the main lobe - reducing frequency resolution.

Fundamental trade-off: narrow peak (good resolution) versus low sidelobes (less leakage). Rectangular window: maximum resolution, maximum leakage. Blackman: minimum leakage, minimum resolution. No window wins on both.

A spectrum shows a wide smeared peak instead of a sharp spike at 1000 Hz. What is the cause and the fix?

Key ideas

DFT converts N time-domain samples into N complex frequency coefficients: X[k] = Σ x[n] * e^(-j2πkn/N)
Frequency resolution Δf = Fs/N; maximum frequency = Fs/2 (Nyquist)
Amplitude at frequency k: |X[k]|, normalization: 2*|X[k]|/N for single-sided spectrum
Spectral leakage occurs when signal frequency is not a multiple of Fs/N - apply window functions
Naive DFT is O(N²); FFT (Cooley-Tukey, 1965) computes the same result in O(N log N)

Вопросы для размышления

Shazam recognizes a song in under 1 second even with background noise. Why does recognition use amplitude spectrum peaks rather than full phase information? What is lost by ignoring phase?
Increasing block size N improves frequency resolution but worsens time resolution. How does this trade-off affect real-time audio applications like live pitch detection?
Why does JPEG use DCT (a DFT variant) on 8x8 pixel blocks rather than the entire image at once?

Связанные уроки

dsp-03 — Sampling theorem and Nyquist - the input data that DFT analyzes
dsp-05 — FFT is the fast O(N log N) algorithm for computing DFT - built entirely on DFT theory
sci-04 — SVD and DFT both decompose data into orthogonal components ranked by importance
dsp-06 — Window functions address spectral leakage - the core artifact introduced here
qc-04 — Quantum Fourier Transform is the quantum analog of DFT, running exponentially faster on quantum hardware
opt-04
trig-10