Digital Signal Processing

Spectral Analysis: STFT, Spectrograms and MFCC

Bell Labs, mid-1940s. Engineers hold a printout: a voice made visible as a 2D map, with time running left to right and frequency bottom to top. Some eighty years later, the same principle powers Shazam (70M tracks, identified in 5 seconds), OpenAI Whisper, Google Speech-to-Text, and Apple Siri. The spectrogram is the language in which neural networks hear the world.

**Shazam** stores spectrogram peak hashes - not audio; a pattern of 5-10 points is typically enough to fingerprint a track
**OpenAI Whisper** ingests an 80-channel log-mel spectrogram - exactly what this lesson builds
**Spotify** analyzes mel spectrograms for genre, tempo, and instrument classification across 100+ million tracks
**Edge ASR** on microcontrollers (Arduino Nano 33 BLE) runs on 13 MFCCs per frame (39 numbers with deltas) instead of hundreds of spectral bins

STFT and Windowing

**Bell Labs, mid-1940s.** An engineer holds a printout from the sound spectrograph: horizontal axis - time, vertical - frequency, brightness - loudness. Not an oscilloscope trace. Not a single spectrum. For the first time in history, a human voice is visible as a two-dimensional picture - phoneme by phoneme, word by word. The Vocoder had just grown eyes.

The problem with FFT is that it is global. Feed in a 3-second clip - get one frequency snapshot for the whole thing. A note at the beginning and a note at the end are entirely different events. Global FFT blends them together.

STFT (Short-Time Fourier Transform) solves this bluntly but brilliantly: slice the signal into short overlapping chunks - **frames** - and run a separate FFT on each. Formally:

$$X(m, k) = \sum_{n=0}^{N-1} x(n + mH)\, w(n)\, e^{-j 2\pi k n / N}$$

Here $x(n)$ is the input signal, $N$ is the frame length in samples, $m$ is the frame index, $k$ is the frequency bin, $H$ is the hop size, and $w(n)$ is the window function. The result $X(m, k)$ is a 2D matrix: rows - time (frames), columns - frequencies.
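The framing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any particular library: the function name `stft`, the 400-sample frame, and the 160-sample hop (25 ms and 10 ms at an assumed 16 kHz sample rate) are choices made for the example, and the signal is not padded at its edges.

```python
import numpy as np

def stft(x, frame_len=400, hop=160):
    """Naive STFT: Hann-windowed frames, one real FFT per frame.

    Returns a complex matrix of shape (num_frames, frame_len // 2 + 1):
    rows are frames (time), columns are frequency bins.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)                     # w(n)
    num_frames = 1 + (len(x) - frame_len) // hop       # frames that fit without padding
    frames = np.stack([x[m * hop : m * hop + frame_len] for m in range(num_frames)])
    return np.fft.rfft(frames * window, axis=1)        # one FFT per row

# Example: 1 s of a 1 kHz tone at an assumed 16 kHz sample rate.
fs = 16000
t = np.arange(fs) / fs
X = stft(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(np.abs(X[0]).argmax())
print(X.shape)                 # (frames, bins)
print(peak_bin * fs / 400)     # frequency of the strongest bin in Hz
```

Because each bin is 16000 / 400 = 40 Hz wide, the 1 kHz tone lands exactly on bin 25 in every frame.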

Why windowing matters - and why a rectangle is a poor choice

Hard-truncating the signal at frame boundaries creates a sharp discontinuity. To FFT, that discontinuity looks like broadband noise. **Spectral leakage**: energy from one frequency bleeds into adjacent bins.

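The Hann window smoothly tapers the edges to zero. For a frame of $N$ samples:

$$w(n) = 0.5\left(1 - \cos\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1$$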

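In practice, speech pipelines use a 25 ms window with a 10 ms hop (60% overlap), which is 400 and 160 samples at 16 kHz. This is the default in Kaldi and Whisper - not arbitrary, but the result of decades of speech research. librosa's own defaults (n_fft=2048, hop_length=512, i.e. 75% overlap) are aimed at music analysis instead.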
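Spectral leakage is easy to check numerically. The sketch below, under assumed values (16 kHz sample rate, one 400-sample frame, a 1020 Hz tone chosen so that it falls between two FFT bins), compares how much energy ends up far from the tone with and without a Hann window. The helper name `spectrum_db` is made up for the example.

```python
import numpy as np

fs, n = 16000, 400                       # one 25 ms frame at 16 kHz (assumed)
t = np.arange(n) / fs
tone = np.sin(2 * np.pi * 1020 * t)      # 1020 Hz sits between bins 25 and 26

def spectrum_db(frame):
    """Magnitude spectrum in dB relative to its own peak."""
    mag = np.abs(np.fft.rfft(frame))
    return 20 * np.log10(mag / mag.max() + 1e-12)

rect_db = spectrum_db(tone)                    # hard truncation (rectangular window)
hann_db = spectrum_db(tone * np.hanning(n))    # tapered edges

far = slice(60, 201)                     # bins from 2.4 kHz up, well away from the tone
print("rectangular, worst leakage:", rect_db[far].max(), "dB")
print("Hann, worst leakage:       ", hann_db[far].max(), "dB")
```

With the rectangular window the far-away bins stay only a few tens of dB below the peak; with the Hann window they drop much further, at the cost of a slightly wider main lobe around the tone.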

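**The time-frequency tradeoff (uncertainty principle):** a long window gives sharp frequency resolution but blurry time resolution. A short window gives the opposite. For a window of $N$ samples at sample rate $f_s$, FFT bins are spaced $f_s / N$ apart, so a 25 ms window at 16 kHz gives 40 Hz bins while a 5 ms window gives only 200 Hz bins. This is not a limitation of STFT - it is a mathematical necessity (the Gabor limit), the same Fourier-transform property that underlies the Heisenberg uncertainty principle in quantum mechanics.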
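The tradeoff can be made concrete with a little arithmetic. The snippet below assumes a 16 kHz sample rate and uses a hypothetical helper, `bin_spacing_hz`, to show how the spacing between FFT bins shrinks as the window grows.

```python
def bin_spacing_hz(fs, win_ms):
    """Spacing between FFT bins for a window of win_ms milliseconds."""
    n = round(fs * win_ms / 1000)     # window length in samples
    return fs / n

fs = 16000  # assumed sample rate in Hz
for win_ms in (5, 25, 100):
    print(f"{win_ms:>3} ms window: bin spacing {bin_spacing_hz(fs, win_ms):.0f} Hz")
```

A 100 ms window separates tones 10 Hz apart but smears any event shorter than 100 ms; a 5 ms window localizes a click precisely but cannot tell 1000 Hz from 1100 Hz.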

What happens if the window function is removed and the frame is cut with a hard rectangular boundary?

Spectrograms: Sound as a Picture

Shazam does not store audio. Shazam stores **spectrogram fingerprints**. The algorithm finds brightness peaks in a time-frequency matrix, builds a hash from their pattern - and searches 70 million tracks in under 5 seconds. No spectrograms, no Shazam.
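The peak-finding step can be illustrated with a simple local-maximum search. This is only a sketch of the general idea and not Shazam's actual implementation: the function name `local_peaks`, the 8-neighbour rule, and the threshold are all assumptions made for the example, and the hashing of peak patterns is left out.

```python
import numpy as np

def local_peaks(S, threshold):
    """Return (frame, bin) indices where S beats all 8 neighbours and a threshold."""
    S = np.asarray(S, dtype=float)
    padded = np.pad(S, 1, mode="constant", constant_values=-np.inf)
    rows, cols = padded.shape
    center = padded[1:-1, 1:-1]
    is_peak = center > threshold
    for dm in (-1, 0, 1):
        for dk in (-1, 0, 1):
            if dm == 0 and dk == 0:
                continue
            # Same-sized view of the spectrogram shifted by (dm, dk).
            neighbour = padded[1 + dm : rows - 1 + dm, 1 + dk : cols - 1 + dk]
            is_peak &= center > neighbour
    return np.argwhere(is_peak)

# Toy "spectrogram" with two isolated bright cells.
S = np.zeros((5, 6))
S[0, 0] = 0.8
S[2, 3] = 1.0
peaks = local_peaks(S, threshold=0.5)
print(peaks)
```

A fingerprinting system would then combine nearby peaks into compact hashes; only those hashes, not the audio, need to be stored and searched.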

A spectrogram is simply the squared magnitude of the STFT:

$$S(m, k) = |X(m, k)|^2$$

The matrix $S$ is visualized as a heatmap: horizontal - time, vertical - frequency, brightness - power. Vowels draw horizontal streaks (formants), consonants make blurry vertical bursts, a piano note appears as a crisp horizontal stripe.

Log-power: why linear scale fails

Audio dynamic range spans 60-120 dB. A whisper and a snare drum can differ in power by a factor of one million (60 dB). On a linear scale, the quiet voice collapses into a thin line near zero while a few loud peaks saturate the color map.

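Converting to dB compresses the range:

$$S_{\mathrm{dB}}(m, k) = 10 \log_{10}\big(S(m, k) + \epsilon\big)$$

The small $\epsilon$ (typically 1e-10) prevents $\log(0)$. Now 120 dB of dynamic range fits into a convenient numerical range. This is essentially what librosa.power_to_db and torchaudio.transforms.AmplitudeToDB (with stype="power") do; both clamp the input to a small floor rather than adding $\epsilon$. For magnitude rather than power input, librosa.amplitude_to_db applies $20 \log_{10}$ instead.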
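A direct NumPy version of the dB conversion looks like this. It is a minimal sketch of the formula with an added $\epsilon$, not a reimplementation of the librosa or torchaudio functions; the name `power_to_db` here is local to the example.

```python
import numpy as np

def power_to_db(S, eps=1e-10):
    """10 * log10(S + eps) for a power spectrogram S."""
    return 10.0 * np.log10(np.asarray(S, dtype=float) + eps)

# Power values spanning twelve orders of magnitude, including exact silence.
S = np.array([0.0, 1e-6, 1.0, 1e6])
S_db = power_to_db(S)
print(S_db)
```

Exact silence maps to the floor set by `eps` (-100 dB for 1e-10), and each factor of one million in power becomes a step of 60 dB.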

**For neural networks:** normalize the log spectrogram to [-1, 1] or apply z-score normalization using statistics computed over the full training set. Unnormalized spectrograms are a common cause of unstable audio model training.
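Dataset-level z-score normalization might be sketched as follows. The helper names `dataset_stats` and `zscore` are hypothetical, and the random arrays merely stand in for real dB spectrograms of varying length.

```python
import numpy as np

def dataset_stats(specs):
    """Mean and std over every time-frequency cell of every spectrogram."""
    values = np.concatenate([s.ravel() for s in specs])
    return values.mean(), values.std()

def zscore(spec, mean, std, eps=1e-8):
    """Shift and scale one spectrogram with precomputed dataset statistics."""
    return (spec - mean) / (std + eps)

# Stand-in training set: four fake dB spectrograms of different lengths.
rng = np.random.default_rng(0)
train = [rng.normal(-40.0, 15.0, size=(frames, 80)) for frames in (90, 100, 110, 120)]

mean, std = dataset_stats(train)
normalized = [zscore(s, mean, std) for s in train]
```

The same `mean` and `std` are then reused unchanged at validation and inference time, so that the model always sees inputs on the scale it was trained on.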

Why is logarithmic power (dB) used for spectrogram visualization instead of linear scale?

Mel Scale and MFCC: How the Ear Hears Frequency

The human ear is not linear. The step from 100 Hz to 200 Hz is a full octave and sounds enormous. The step from 8000 Hz to 8100 Hz is nearly inaudible. The linear frequency axis of the STFT spends most of its bins on high frequencies, where the ear resolves pitch only coarsely.

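The mel scale compresses high frequencies and stretches low ones, mimicking the psychoacoustics of the cochlea. A widely used form, with $f$ in Hz:

$$\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$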

In practice, one builds a bank of triangular filters spaced uniformly on the mel scale (typically 40-128 filters) and applies them to the power spectrum. The result - a **mel spectrogram** - has far fewer dimensions but is aligned with human perception.
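A triangular mel filterbank can be built directly from the 2595/700 form of the mel scale. The sketch below is one straightforward construction under assumed parameters (40 filters, a 400-point FFT, 16 kHz audio); libraries differ in details such as filter normalization and the exact mel variant, so the numbers will not match librosa or Kaldi exactly. All function names are local to the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=400, fs=16000, f_min=0.0, f_max=None):
    """Triangular filters with centres evenly spaced on the mel scale.

    Returns a matrix of shape (n_mels, n_fft // 2 + 1).
    """
    f_max = fs / 2 if f_max is None else f_max
    # n_mels + 2 edge frequencies: each filter spans three consecutive edges.
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2))
    bin_hz = np.fft.rfftfreq(n_fft, d=1.0 / fs)       # centre frequency of each FFT bin
    fb = np.zeros((n_mels, len(bin_hz)))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i : i + 3]
        rising = (bin_hz - lo) / (mid - lo)           # ramp up from lo to mid
        falling = (hi - bin_hz) / (hi - mid)          # ramp down from mid to hi
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

fb = mel_filterbank()
print(fb.shape)
```

Given a power spectrogram `S` of shape (frames, 201), the mel spectrogram is then `S @ fb.T`, with shape (frames, 40).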

MFCC: 13 numbers instead of a thousand

Mel-Frequency Cepstral Coefficients (MFCCs) have been the standard for speech recognition since the 1980s and remain in active use. They are derived from the mel spectrogram in three steps: take the log, apply the Discrete Cosine Transform (DCT), keep the first 13 coefficients.

DCT here decorrelates the mel filterbank channels: adjacent mel bands are highly correlated, and DCT rotates the space so the first few coefficients carry most of the information while the rest can be discarded.
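The three steps (log, DCT, truncate) fit in a short NumPy sketch. The DCT-II is written out by hand with orthonormal scaling so that the example has no dependency beyond NumPy; the function names and the choice of natural log are assumptions for the illustration, and real toolkits add steps such as liftering that are omitted here.

```python
import numpy as np

def dct_ii(x, n_out):
    """Orthonormal DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))    # shape (n_out, n)
    scale = np.full(n_out, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return (x @ basis.T) * scale

def mfcc_from_mel(mel_power, n_mfcc=13, eps=1e-10):
    """Log-compress a mel power spectrogram, then keep the first DCT coefficients."""
    log_mel = np.log(mel_power + eps)
    return dct_ii(log_mel, n_mfcc)

# Stand-in mel spectrogram: 50 frames x 40 mel bands of positive values.
rng = np.random.default_rng(0)
mel_power = rng.uniform(0.1, 10.0, size=(50, 40))
mfcc = mfcc_from_mel(mel_power)
print(mfcc.shape)
```

A perfectly flat mel spectrum produces energy only in coefficient 0; the higher coefficients describe progressively finer ripples in the spectral envelope, which is why truncating to 13 keeps the overall shape and discards detail.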

OpenAI Whisper ingests an 80-channel log-mel spectrogram fixed at 3000 frames (30 seconds of 16 kHz audio at a 10 ms hop). Not MFCC - mel spectrogram: modern neural networks prefer to learn from the "raw" 80 mel channels rather than 13 DCT-compressed coefficients. MFCC lives on in classical HMM systems and edge devices with constrained memory.

**Delta and delta-delta:** the standard HMM/Kaldi pipeline appends the first and second time derivatives of the 13 MFCCs (velocity and acceleration of change). Result: 39 numbers per frame - the HTK standard for decades.
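Appending the derivatives can be sketched with simple finite differences. Note the assumption: HTK and Kaldi compute deltas with a regression formula over a window of neighbouring frames, whereas this example uses `np.gradient` (central differences) as a simpler stand-in, so the values will differ from those toolkits. The function name `add_deltas` is made up for the example.

```python
import numpy as np

def add_deltas(mfcc):
    """Stack MFCCs with first and second time differences: (T, 13) -> (T, 39)."""
    delta = np.gradient(mfcc, axis=0)       # velocity: change per frame
    delta2 = np.gradient(delta, axis=0)     # acceleration: change of the change
    return np.concatenate([mfcc, delta, delta2], axis=1)

# Stand-in MFCC matrix: 50 frames x 13 coefficients.
rng = np.random.default_rng(0)
features = add_deltas(rng.normal(size=(50, 13)))
print(features.shape)
```

The static coefficients describe the spectral envelope in one frame; the appended columns tell the classifier how fast that envelope is moving, which helps separate steady vowels from rapid consonant transitions.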

**Myth:** MFCC are obsolete and no longer used

**Reality:** MFCC are alive on edge devices (microcontrollers, hearing aids) and in HMM-based systems

13 numbers per frame vs. 80 is critical when RAM is 256 KB. TinyML pipelines are still built on MFCC.

What does OpenAI Whisper take as input?

Key Ideas

**STFT = FFT per frame:** signal sliced into short overlapping windows, FFT on each - result is a 2D matrix (time x frequency)
**Window functions:** Hann/Hamming windows suppress spectral leakage; a rectangular window smears energy across bins because of its hard edges
**Spectrogram:** |STFT|^2 in dB - visually and numerically convenient; log scale compresses 120 dB of dynamic range
**Mel scale:** logarithmic compression of high frequencies matching cochlear psychoacoustics; 40-128 filters vs. 256+ linear bins
**MFCC:** mel + log + DCT + 13 coefficients = edge ASR standard; neural networks prefer raw 80 mel channels

Connected Topics

STFT and mel spectrograms are the entry point for most modern audio pipelines:

FFT and the Cooley-Tukey Algorithm — STFT = repeated FFT on windowed frames
Whisper (Speech-to-Text) — Ingests 80-channel log-mel spectrogram as the input tensor
Signal Filtering — Frequency-domain filters are applied to the STFT matrix

Questions for Reflection

Why does increasing the STFT window length improve frequency resolution but degrade time resolution - and how does this connect to the Heisenberg uncertainty principle?
Whisper uses 80 mel channels while classical HMM systems use 13 MFCC. In which scenario are 13 MFCC preferable despite carrying less information?
If two different sounds produce the same mel spectrogram, can the original signal be recovered from it - and what does this say about the invertibility of the mel transform?

Related Lessons

dsp-05 — FFT is the building block of STFT - each STFT frame is one FFT call
dsp-07 — Mel spectrograms are the standard input to filterbank and DNN audio models
aie-23-speech-to-text — Whisper ingests an 80-channel mel spectrogram - exactly what is built here
calc-01-sequences — The Hann window is a weighted sequence with smooth decay - same convergence intuition as series
alg-01-big-o — STFT over F frames with an M-point FFT per frame: total cost O(F * M log M)