Digital Signal Processing

FFT: Fast Fourier Transform

DFT: $O(N^2)$. FFT: $O(N \log N)$. At N = 1,000,000 the difference is 50,000x. MP3, JPEG, and 5G all depend on one algorithm - first sketched by Gauss in 1805 and then forgotten for 160 years.

Whisper (OpenAI): mel-spectrogram via STFT (FFT under the hood) as input representation - without FFT, real-time speech recognition is computationally infeasible
5G NR: OFDM uses a 4096-point FFT for every symbol every 71 microseconds - 14,000 FFTs per second on a single carrier
JPEG/HEIC: 2D DCT (a relative of FFT) compresses every 8x8 pixel block - a 35-year-old standard still universal today
GPS: cross-correlation via FFT for satellite signal acquisition - cold start in seconds, not minutes

Cooley-Tukey: divide and conquer

**FFT was invented in 1965 by James Cooley and John Tukey. Then historians found the same algorithm in Gauss's notebooks from 1805.** The algorithm waited 160 years for its moment - until computers became capable enough to make its value concrete. The core insight: a DFT of size N decomposes into two DFTs of size N/2 - one for even-indexed samples, one for odd-indexed samples. This recursion transforms $O(N^2)$ into $O(N \log N)$.

**Direct DFT:** $X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j2\pi kn/N}$. For each of N values of k, N multiplications are needed - total $O(N^2)$. At N = 1,000,000 that is $10^{12}$ operations. FFT reduces this to $N \log_2 N \approx 2 \times 10^7$ operations - a 50,000x difference.

The Cooley-Tukey decomposition exploits the periodicity of $e^{-j2\pi kn/N}$. The complex multiplier $W_N^k = e^{-j2\pi k/N}$ is called a twiddle factor. The DFT of even-indexed samples ($x[0], x[2], x[4], \ldots$) and odd-indexed samples ($x[1], x[3], x[5], \ldots$) are computed independently, then combined via the butterfly operation: $X[k] = E[k] + W_N^k \cdot O[k]$, where $E$ and $O$ are the DFTs of the even and odd halves.

Whisper (OpenAI) uses STFT - a short windowed FFT sliding across the audio. From the STFT output, a mel-spectrogram is constructed, which serves as the transformer input. Without FFT, real-time speech recognition would be computationally infeasible.

What is the core idea of the Cooley-Tukey algorithm?

Radix-2 DIT: bit-reversal and butterfly

**Radix-2 DIT (decimation-in-time) is the most widely used form of FFT.** Requirement: N must be a power of 2 (512, 1024, 2048...). The recursion unfolds into $\log_2 N$ stages of butterfly operations. Each stage processes all N elements - total $N \log_2 N$ multiplications instead of $N^2$. The input must be reordered in bit-reversal permutation: index n is written in binary, its bits are reversed - this is the condition for correct in-place computation.

**Bit-reversal example for N=8:** indices 0-7 in binary - 000,001,010,011,100,101,110,111. After bit-reversal: 000,100,010,110,001,101,011,111 - giving order 0,4,2,6,1,5,3,7. Elements arrive at the first butterfly stage in exactly this order. numpy.fft.fft handles this internally and automatically.

Each butterfly operation connects pairs of elements: $a' = a + W \cdot b$, $b' = a - W \cdot b$. This is in-place: $a'$ and $b'$ overwrite positions $a$ and $b$. After $\log_2 N$ stages the array holds the complete DFT coefficients. In music applications this structure enables reverb, EQ, and pitch-shift in real time - the FFT of a 23 ms block (1024 samples at 44100 Hz) completes in microseconds.

numpy.fft.fft accepts arrays of any length, not just powers of 2 - internally it uses a mixed-radix algorithm (radix-4, radix-8 for lengths of the form 2^a * 3^b * 5^c). For lengths that are powers of 2 it is still fastest. FFTW automatically selects the optimal plan for the specific N and hardware.

Why does radix-2 FFT require the signal length N to be a power of 2?

O(N log N): where the speed comes from

**FFT recurrence:** $T(N) = 2T(N/2) + O(N)$. By the Master Theorem (case 2) this gives $T(N) = O(N \log N)$. Interpretation: a problem of size N splits into two subproblems of size N/2 (recursion) and is merged in $O(N)$ (butterfly stage). At N = $2^{20} \approx 10^6$, naive DFT requires $10^{12}$ operations, FFT requires $2 \times 10^7$. On a modern CPU that is the difference between 16 minutes and 50 milliseconds.

**5G OFDM in practice:** each OFDM symbol uses a 4096-point IFFT at the transmitter and a 4096-point FFT at the receiver. Symbols are transmitted every 71 microseconds. That is 14,000 FFTs per second - for a single carrier. Without $O(N \log N)$ complexity, fifth-generation mobile communication would be physically impossible.

FFTW (Fastest Fourier Transform in the West) automatically selects the optimal algorithm for a given N and hardware. On the first call, FFTW benchmarks multiple factorization plans and stores the result in a WISDOM file. scipy.fft uses FFTW or pocketfft under the hood on most platforms. For prime N, Bluestein or Rader algorithms are used - complexity remains $O(N \log N)$.

GPS receivers use FFT for cross-correlation of the received signal against known PRN codes from satellites. Without FFT, satellite acquisition would take seconds instead of milliseconds - cold-start GPS would be unacceptably slow.

From the recurrence T(N) = 2T(N/2) + O(N) follows O(N log N) complexity. What does the O(N) term represent?

numpy.fft, scipy.fft, cuFFT: practice

**1994. Fraunhofer Institute. MP3 is ready. Under the hood - Modified DCT (MDCT), a close relative of FFT. 128 kbps vs 1411 kbps CD audio - an 11x difference at nearly indistinguishable quality.** In practice FFT appears everywhere: numpy.fft.fft for the general case, scipy.fft.rfft for real-valued signals (output is N/2+1 points instead of N - exploiting Hermitian symmetry), cuFFT on GPU for large batches. Whisper mel-spectrogram pipeline: librosa.stft (FFT) -> magnitudes -> mel filterbank -> log -> normalization -> transformer.

**scipy.fft vs numpy.fft:** scipy.fft.fft is faster for large N due to FFTW or pocketfft backend. rfft takes N real numbers and returns N//2+1 complex values - half the memory and operations. irfft reconstructs the original signal. For batches: scipy.fft.fft accepts a workers=-1 parameter for multithreaded execution.

JPEG uses 2D DCT (discrete cosine transform) on each 8x8 pixel block. DCT is a special case of DFT for real even signals. MRI reconstruction uses 2D FFT (k-space to image space). CT scanners use filtered back-projection with FFT convolution. FFT is a universal signal processing tool, not just for audio.

FFT is an algorithm exclusively for audio and sound signal processing

FFT is a universal algorithm for linear transforms, applicable wherever convolution or spectral analysis is needed: images (JPEG, HEIC use 2D DCT), medical imaging (MRI, CT), telecommunications (OFDM in 5G, LTE, WiFi), GPS, astronomy

Historically FFT first found mass adoption in audio processing (telephony, MP3), creating a strong association with sound. Mathematically, FFT is simply a fast method to compute the DFT, and DFT applies to any discrete signal: pixel arrays, sensor measurements, financial time series

Why does scipy.fft.rfft return only N//2+1 coefficients instead of N for a real-valued input signal?

Key ideas

Cooley-Tukey splits a DFT of size N into two DFTs of size N/2 (divide and conquer on even/odd indices), reducing $O(N^2)$ to $O(N \log N)$ - with no loss of precision
Radix-2 DIT requires N to be a power of 2, uses bit-reversal permutation for in-place computation across $\log_2 N$ butterfly stages
At N = $2^{20}$, FFT is 50,000x faster than naive DFT; FFTW automatically selects the optimal plan for the hardware
scipy.fft.rfft for real-valued signals returns N//2+1 coefficients, exploiting Hermitian symmetry of the DFT

Вопросы для размышления

Gauss's 1805 algorithm and the Cooley-Tukey 1965 algorithm are mathematically identical. What changed in 160 years - and why did that change make the algorithm practically relevant?
5G OFDM uses a 4096-point FFT every 71 microseconds. What latency constraints does this impose on computation - and why is $O(N \log N)$ specifically critical here, rather than simply a faster processor?
scipy.fft.rfft returns half as many coefficients for a real signal. Under what conditions would it make sense to use fft instead of rfft - and does rfft lose any information?