AI Engineering
Speech-to-Text: Whisper, Deepgram, Browser API - Speech Recognition in Production
Lesson Objectives
- Understand the difference between batch STT (Whisper) and streaming STT (Deepgram) and choose the right one for the task
- Learn to transcribe audio via OpenAI Whisper API with timestamps and subtitles
- Implement real-time transcription via Deepgram WebSocket API with interim/final results
- Build a NestJS endpoint for audio upload with format and size validation
- Optimize STT accuracy through language hints, prompts, audio preprocessing and LLM post-processing
Whisper from OpenAI (September 2022) beat Google Speech on WER across most languages - and shipped as open-source. This wasn't just a new API. It was the end of the era where STT cost `USD 0.024/min` and required an enterprise contract. In months, prices dropped 4x, quality improved, and the barrier vanished. Deepgram Nova-2 answered: `USD 0.0059/min`, 300ms latency, streaming out of the box. The STT market reshuffled in a year.
- Telegram transcribes voice messages for Premium users - millions of requests per day through their own STT infrastructure
- Zoom generates live captions and meeting summaries - streaming STT with speaker diarization in real time
- Otter.ai and Notion AI transcribe meetings with automatic summarization via an LLM pipeline - the general pattern is batch STT (for example Whisper) plus LLM post-processing (for example gpt-4o-mini)
- AssemblyAI sells PII redaction, sentiment analysis, and speaker labels alongside its own Universal-2 speech model - each a separate task layered on top of raw transcription
From Dragon to Whisper: 30 Years of Speech Recognition
**1990s**: Dragon NaturallySpeaking - first mass-market STT, `USD 150` per license, required hours of voice training per user, desktop-only. **2012**: Google Speech API - first cloud STT with a REST interface, but paid and restricted. **2016**: Google launches public Speech-to-Text API - `USD 0.024/min`, enterprise tier. **2019-2021**: AWS Transcribe, Azure Speech - similar pricing, different language support. **September 2022**: OpenAI releases Whisper on GitHub (the large-v3 checkpoint follows in November 2023). MIT license. Best-in-class WER. Free for self-hosted. Hosted API at `USD 0.006/min`. The STT market stopped being an enterprise monopoly in a single day.
Prerequisites
The STT Landscape: From Whisper to Cloud Services
2022. Google Speech-to-Text dominates the market - `USD 0.024` per minute, enterprise contract required for decent WER. AWS Transcribe: same price. Azure: same story. The barrier to entry was real: shipping speech recognition meant paying enterprise prices or building from scratch.
September 2022. OpenAI publishes Whisper - and drops it on GitHub. Open-source, MIT license. WER beats Google on most languages. Zero license cost. Within months, Whisper became the standard - not because of marketing, but because it won the benchmarks.
| Provider | Model | Latency | Price per Minute | Streaming |
|---|---|---|---|---|
| OpenAI | whisper-1 (hosted Whisper) | Batch (2-10 sec) | USD 0.006/min | No |
| Deepgram | Nova-2 | Real-time (~300ms) | USD 0.0059/min | Yes (WebSocket) |
| Google Cloud | Chirp 2 | Real-time (~500ms) | USD 0.012/min | Yes (gRPC) |
| AWS Transcribe | Custom | Real-time (~800ms) | USD 0.024/min | Yes (WebSocket) |
| AssemblyAI | Universal-2 | Batch or RT | USD 0.0065/min | Yes (WebSocket) |
| Whisper (self-hosted) | large-v3 | Depends on GPU | GPU cost only | With modifications |
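To put the per-minute prices in perspective, here is a minimal sketch that turns them into a monthly estimate. The prices are copied from the table; the provider keys and the `monthlyCostUsd` helper are illustrative names, and the estimate ignores free tiers, volume discounts, and billing increments.

```typescript
// Per-minute prices in USD, copied from the comparison table.
const PRICE_PER_MINUTE_USD = {
  "openai-whisper-api": 0.006,
  "deepgram-nova-2": 0.0059,
  "google-chirp-2": 0.012,
  "aws-transcribe": 0.024,
  "assemblyai-universal-2": 0.0065,
} as const;

type Provider = keyof typeof PRICE_PER_MINUTE_USD;

// Rough monthly bill for a given audio volume (hypothetical helper).
function monthlyCostUsd(provider: Provider, minutesPerMonth: number): number {
  // The negated comparison also rejects NaN.
  if (!(minutesPerMonth >= 0)) {
    throw new RangeError("minutesPerMonth must be a non-negative number");
  }
  return PRICE_PER_MINUTE_USD[provider] * minutesPerMonth;
}

// Example: 10,000 minutes of audio per month for every provider in the table.
for (const provider of Object.keys(PRICE_PER_MINUTE_USD) as Provider[]) {
  console.log(provider, monthlyCostUsd(provider, 10000).toFixed(2));
}
```

At this volume the gap between the cheapest and the most expensive row is a few hundred dollars per month, which is why latency and streaming support usually drive the choice rather than price.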
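The key split in that table isn't price - it's architecture. There are two different approaches: batch STT, where a finished audio file is uploaded and the full transcript comes back after processing, and streaming STT, where audio is sent continuously and text arrives while the person is still speaking.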
**Whisper is not an API - it's a model.** OpenAI Whisper is an open-source model that can run locally on a GPU. OpenAI Whisper API is a hosted version on OpenAI's servers. Deepgram Nova-2, Google Chirp - these are proprietary models from their respective companies, unrelated to Whisper.
For a voice assistant that needs to react to speech in real time (latency < 500ms), which STT approach is appropriate?
OpenAI Whisper API: Transcription and Translation
OpenAI Whisper API is the simplest way to add STT to an application. Two endpoints: `/audio/transcriptions` (speech recognition) and `/audio/translations` (translate any language to English). Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm. Maximum file size - 25 MB.
**SRT/VTT subtitles in a single call.** Whisper API can return ready-made subtitles: `response_format: 'srt'` or `response_format: 'vtt'`. No need to parse timestamps manually - the format is ready for use in video players.
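The request fields described above can be sketched as a small builder that checks the documented limits before any network call is made. `buildTranscriptionParams` and `TranscriptionOptions` are hypothetical names for illustration; the field names (`model`, `response_format`, `language`, `prompt`) follow the transcription endpoint.

```typescript
// Formats and size limit as listed in the text above.
const SUPPORTED_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"];
const MAX_FILE_BYTES = 25 * 1024 * 1024;

type ResponseFormat = "json" | "text" | "srt" | "vtt" | "verbose_json";

interface TranscriptionParams {
  model: "whisper-1";
  response_format: ResponseFormat;
  language?: string; // language code such as "en"
  prompt?: string;   // optional vocabulary hint
}

interface TranscriptionOptions {
  responseFormat?: ResponseFormat;
  language?: string;
  prompt?: string;
}

// Hypothetical helper: validates the file locally, then returns request fields.
function buildTranscriptionParams(
  fileName: string,
  fileSizeBytes: number,
  options: TranscriptionOptions = {},
): TranscriptionParams {
  const dot = fileName.lastIndexOf(".");
  const extension = dot === -1 ? "" : fileName.slice(dot + 1).toLowerCase();
  if (SUPPORTED_EXTENSIONS.indexOf(extension) === -1) {
    throw new Error(`Unsupported audio format: "${extension}"`);
  }
  if (fileSizeBytes > MAX_FILE_BYTES) {
    throw new Error(`File is ${fileSizeBytes} bytes; the limit is ${MAX_FILE_BYTES}`);
  }
  const params: TranscriptionParams = {
    model: "whisper-1",
    response_format: options.responseFormat ?? "json",
  };
  if (options.language) params.language = options.language;
  if (options.prompt) params.prompt = options.prompt;
  return params;
}

// Example: ask for ready-made SRT subtitles for an English recording.
console.log(buildTranscriptionParams("meeting.mp3", 12000000, { responseFormat: "srt", language: "en" }));
```

With the official `openai` SDK, the resulting object would typically be spread into `openai.audio.transcriptions.create({ file, ...params })`, where `file` is the audio stream or file object.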
A one-hour meeting recording (180 MB, mp3). How would you transcribe it through the Whisper API?
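One common way to handle a recording above the 25 MB limit is to split it by time and transcribe the pieces in order (re-encoding to a lower bitrate is another option). The sketch below only plans the cut points; it assumes a roughly constant bitrate, uses a 24 MB budget as an arbitrary safety margin, and `planChunks` is a hypothetical helper. The actual cutting would be done by an audio tool such as ffmpeg.

```typescript
interface Chunk {
  index: number;
  startSec: number;
  endSec: number;
}

// Plans time-based chunks so that each one stays under the byte budget.
// Assumes constant bitrate: bytes are spread evenly over the duration.
function planChunks(
  totalBytes: number,
  durationSec: number,
  maxChunkBytes: number = 24 * 1024 * 1024,
): Chunk[] {
  if (totalBytes <= 0 || durationSec <= 0) {
    throw new RangeError("totalBytes and durationSec must be positive");
  }
  // Longest whole number of seconds whose audio fits in the budget.
  const maxChunkSec = Math.floor((maxChunkBytes * durationSec) / totalBytes);
  if (maxChunkSec < 1) {
    throw new RangeError("chunk budget is smaller than one second of audio");
  }
  const chunks: Chunk[] = [];
  for (let start = 0; start < durationSec; start += maxChunkSec) {
    chunks.push({
      index: chunks.length,
      startSec: start,
      endSec: Math.min(start + maxChunkSec, durationSec),
    });
  }
  return chunks;
}

// The recording from the question: 180 MB, one hour.
const meetingPlan = planChunks(180 * 1024 * 1024, 3600);
console.log(meetingPlan.length, meetingPlan[0], meetingPlan[meetingPlan.length - 1]);
```

Cutting at fixed offsets can split a sentence in half, so a production version would likely prefer cut points that fall on pauses.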
Real-time STT: Deepgram Streaming via WebSocket
Deepgram is a specialized STT provider; its Nova-2 model is available through a WebSocket API for real-time transcription. Audio is sent as a continuous stream, and text is returned with ~300ms latency. The key feature is interim results: preliminary text that updates as audio arrives.
VAD (voice activity detection) is built in. Deepgram detects when a phrase starts and ends on its own. No need to implement pause detection manually - `utterance_end_ms` sets the silence threshold for phrase boundaries.
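The streaming options mentioned so far are passed as query parameters on the WebSocket URL. The sketch below builds such a URL; `LiveOptions` and `buildDeepgramListenUrl` are hypothetical names, and the 1000 ms silence threshold in the example is an arbitrary value rather than a recommendation.

```typescript
interface LiveOptions {
  model: string;
  language?: string;
  interimResults?: boolean;
  utteranceEndMs?: number; // silence threshold for phrase boundaries
  smartFormat?: boolean;
}

// Builds the streaming endpoint URL with the chosen options as query parameters.
function buildDeepgramListenUrl(options: LiveOptions): string {
  const query: Array<[string, string]> = [["model", options.model]];
  if (options.language !== undefined) query.push(["language", options.language]);
  if (options.interimResults !== undefined) {
    query.push(["interim_results", String(options.interimResults)]);
  }
  if (options.utteranceEndMs !== undefined) {
    query.push(["utterance_end_ms", String(options.utteranceEndMs)]);
  }
  if (options.smartFormat !== undefined) {
    query.push(["smart_format", String(options.smartFormat)]);
  }
  const queryString = query
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return `wss://api.deepgram.com/v1/listen?${queryString}`;
}

console.log(
  buildDeepgramListenUrl({
    model: "nova-2",
    language: "en",
    interimResults: true,
    utteranceEndMs: 1000,
  }),
);
```

A WebSocket client would open this URL with an `Authorization: Token <API key>` header and then send raw audio chunks as binary messages.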
**Interim vs Final results.** Interim results are preliminary transcriptions that update with each audio chunk. Use interim results for UI display (like live captions), but for processing (search, saving, sending to LLM) - only final results. Otherwise the text will contain duplicates and errors.
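The interim/final rule can be sketched as a small accumulator: interim text only replaces a live preview, while final text is appended to the committed transcript. The `DeepgramResult` interface lists only the subset of message fields used here, and `TranscriptAccumulator` and `makeResult` are hypothetical names for illustration.

```typescript
// Subset of the fields in a streaming "Results" message.
interface DeepgramResult {
  type: "Results";
  is_final: boolean;
  channel: { alternatives: Array<{ transcript: string }> };
}

class TranscriptAccumulator {
  private finalized: string[] = [];
  private interim = "";

  handle(message: DeepgramResult): void {
    const first = message.channel.alternatives[0];
    const text = first ? first.transcript.trim() : "";
    if (message.is_final) {
      // Final text is committed once and the preview is cleared.
      if (text !== "") this.finalized.push(text);
      this.interim = "";
    } else {
      // Interim text overwrites the previous preview instead of appending.
      this.interim = text;
    }
  }

  // For live captions: committed text plus the current preview.
  displayText(): string {
    return [...this.finalized, this.interim].filter((part) => part !== "").join(" ");
  }

  // For search, storage, or an LLM: committed text only.
  finalText(): string {
    return this.finalized.join(" ");
  }
}

// Test helper that builds a message in the shape above.
function makeResult(transcript: string, isFinal: boolean): DeepgramResult {
  return { type: "Results", is_final: isFinal, channel: { alternatives: [{ transcript }] } };
}

const captions = new TranscriptAccumulator();
captions.handle(makeResult("hello wor", false));
captions.handle(makeResult("hello world", true));
captions.handle(makeResult("how are", false));
console.log(captions.displayText()); // committed text plus preview
console.log(captions.finalText());   // committed text only
```

Appending interim results instead of overwriting them is what produces the duplicated text the callout warns about.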
In streaming STT (Deepgram), which results should be used for sending text to an LLM?
NestJS: Endpoint for Audio Upload and Transcription
A production-ready audio endpoint must:
1. Accept files via multipart/form-data
2. Validate format and size
3. Transcribe via an STT API
4. Return the result
For large files, process via a queue and send a WebSocket notification upon completion.
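The validation step can be sketched as a framework-independent function that a NestJS handler would call on the uploaded file before transcription. The `UploadedAudio` fields mirror those Multer attaches to an upload (`originalname`, `mimetype`, `size`); the MIME allow-list and the choice of status codes are assumptions for illustration, not a definitive list.

```typescript
// Fields mirror the ones Multer puts on an uploaded file.
interface UploadedAudio {
  originalname: string;
  mimetype: string;
  size: number;
}

type ValidationResult =
  | { ok: true }
  | { ok: false; status: 400 | 413 | 415; message: string };

// Assumed allow-list covering the container formats named in this lesson.
const ALLOWED_MIME_TYPES = [
  "audio/mpeg",
  "audio/mp4",
  "audio/x-m4a",
  "audio/wav",
  "audio/x-wav",
  "audio/webm",
  "video/mp4",
  "video/webm",
];
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024;

function validateAudioUpload(file: UploadedAudio | undefined): ValidationResult {
  if (!file || file.size === 0) {
    return { ok: false, status: 400, message: "No audio file received" };
  }
  if (ALLOWED_MIME_TYPES.indexOf(file.mimetype) === -1) {
    return { ok: false, status: 415, message: `Unsupported type: ${file.mimetype}` };
  }
  if (file.size > MAX_UPLOAD_BYTES) {
    return { ok: false, status: 413, message: "File exceeds the 25 MB limit" };
  }
  return { ok: true };
}

console.log(validateAudioUpload({ originalname: "call.mp3", mimetype: "audio/mpeg", size: 2048 }));
console.log(validateAudioUpload({ originalname: "notes.txt", mimetype: "text/plain", size: 2048 }));
```

Keeping the check as a pure function makes it easy to unit-test and to reuse in both the synchronous endpoint and the queue worker. The client-reported MIME type can be spoofed, so a stricter version might also inspect the file contents.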
**`toFile()` from the openai SDK.** Whisper API accepts a file, not a Buffer. The `toFile()` function from the `openai` package converts a Buffer into an API-compatible object. Without it, the Buffer would need to be written to disk and read via `fs.createReadStream()`.
When uploading audio via multipart/form-data in NestJS, why can't the Content-Type be set manually on the client?
Improving Accuracy: Language Hints, Prompts, Post-processing
Whisper large-v3 delivers ~5% WER on clean English speech. Sounds great - until the first real meeting recording hits. Office noise, accents, "NestJS" transcribed as "nest jay ess", mixed-language code reviews where half the words are technical terms. WER can easily jump to 15-25% without optimization.
Four levers improve accuracy, and each one is free or nearly free compared to the cost of transcription itself (speaker diarization, the last row, adds speaker labels rather than accuracy):
| Optimization | Effect on WER | Cost |
|---|---|---|
| Language hint | 10-15% fewer errors | Free |
| Prompt with terms | 5-20% fewer errors on domain terms | Free |
| Audio preprocessing | 5-10% fewer errors on noisy recordings | CPU time |
| LLM post-processing | 15-30% fewer errors, restored punctuation | USD 0.001-0.005 per request |
| Speaker diarization | Speaker separation (not accuracy) | +USD 0.01/min |
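Two of the free levers can be prepared before the request is sent: a prompt that lists domain terms, and an audio file normalized to mono 16 kHz. The helpers below are a minimal sketch. `buildGlossaryPrompt` and `ffmpegNormalizeArgs` are hypothetical names, and the 600-character budget is an arbitrary stand-in for the prompt length limit, which is measured in tokens. The ffmpeg flags `-ac` (channel count) and `-ar` (sample rate) are standard options.

```typescript
// Joins unique, trimmed terms into a comma-separated prompt within a size budget.
function buildGlossaryPrompt(terms: string[], maxChars: number = 600): string {
  const picked: string[] = [];
  let length = 0;
  for (const raw of terms) {
    const term = raw.trim();
    if (term === "" || picked.indexOf(term) !== -1) continue;
    // Each term after the first also costs a ", " separator.
    const extra = (picked.length > 0 ? 2 : 0) + term.length;
    if (length + extra > maxChars) break;
    picked.push(term);
    length += extra;
  }
  return picked.join(", ");
}

// Arguments for an ffmpeg run that downmixes to mono and resamples to 16 kHz.
function ffmpegNormalizeArgs(inputPath: string, outputPath: string): string[] {
  return ["-i", inputPath, "-ac", "1", "-ar", "16000", outputPath];
}

console.log(buildGlossaryPrompt(["NestJS", "Deepgram", "NestJS", " WebSocket "]));
console.log(["ffmpeg", ...ffmpegNormalizeArgs("meeting.mp3", "meeting-16k.wav")].join(" "));
```

The prompt string would go into the `prompt` field of the transcription request, and the argument array could be passed to a process spawner rather than concatenated into a shell string.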
**WER (Word Error Rate)** is the standard STT quality metric. WER = (substitutions + insertions + deletions) / number of words in the reference transcript, so it can exceed 100% when the output contains many insertions. Whisper large-v3 achieves ~5% WER on clean English speech, ~8-12% on Russian. On real-world noisy data, WER can be 15-25% without optimization.
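The WER formula can be computed with a word-level edit distance between a reference transcript and the STT output. This is a minimal sketch: `wordErrorRate` is a hypothetical helper, and the tokenizer only lowercases and splits on whitespace, so punctuation differences count as errors unless the text is normalized first.

```typescript
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((word) => word.length > 0);
}

// WER = minimum number of word substitutions, insertions, and deletions
// needed to turn the reference into the hypothesis, divided by reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new RangeError("reference must contain at least one word");
  }
  // prev[j] holds the distance between the processed reference prefix
  // and the first j hypothesis words.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
console.log(wordErrorRate("the cat sat on the mat", "the cat sit on mat"));
// A mangled technical term: 1 substitution and 2 insertions over 3 words.
console.log(wordErrorRate("we use NestJS", "we use nest jay ess"));
```

The second example shows why domain terms hurt so much: a single misheard word can add several errors at once.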
What provides the greatest accuracy boost for STT when transcribing technical meetings with domain terminology?
Misconception: "STT just converts speech to text"
Reality: raw transcription is one layer. Punctuation restoration, speaker diarization, language detection, PII redaction, sentiment analysis - each of these is a separate task solved on top of the base transcript.
Whisper's raw output can come back with sparse or inconsistent punctuation. Deepgram's `smart_format` adds punctuation through a separate post-processing step. Speaker diarization (who said what) is a separate ML task - AssemblyAI charges +`USD 0.01/min` for it. Language detection is another task. VAD (voice activity detection) is another. A production voice pipeline is a stack of 5-7 separate components, not a single API call.
Summary
- Batch STT (Whisper API) - file upload, wait 2-10 sec, high accuracy. Streaming STT (Deepgram Nova-2) - continuous audio stream, text in ~300ms
- Whisper API: whisper-1 model, 25 MB limit, response formats json/verbose_json/text/srt/vtt, language and prompt options for accuracy
- Deepgram streaming: WebSocket connection, interim results for UI, final results for processing, VAD for detecting pauses
- NestJS audio endpoint: FileInterceptor + Multer, mimetype and size validation, openai.toFile() for Buffer to File
- Accuracy optimization: language hint (10-15% fewer errors), prompt with terms, audio preprocessing (16kHz mono), LLM post-processing via gpt-4o-mini
- WER (Word Error Rate) - the quality metric. Whisper: ~5% EN, ~10% RU on clean audio, 15-25% on noisy real-world data
- STT is not one component: punctuation restoration, speaker diarization, language detection, VAD - all separate tasks
Questions for Reflection
- Streaming STT (Deepgram) costs USD 0.0059/min while batch Whisper API is USD 0.006/min - even though streaming is technically more complex. What's the business logic behind that pricing paradox?
- Speaker diarization (who said what) costs +USD 0.01/min at AssemblyAI - more than the transcription itself. Why is this a hard separate ML problem rather than part of base STT?
- If Whisper large-v3 is open-source and best-in-class on WER, why pay Deepgram USD 0.0059/min instead of running self-hosted Whisper? What are the real trade-offs?
What's Next
Speech-to-Text is the entry point of a voice interface. The next step is the reverse: Text-to-Speech for generating voice responses. Together, STT and TTS form the voice pipeline for AI assistants.
- Text-to-Speech: Speech Synthesis — STT to text processing to TTS - the complete voice pipeline for voice assistants
- Multimodal AI — STT is one of the multimodal inputs alongside vision and documents
- Real-time AI — Streaming STT + streaming LLM + streaming TTS = real-time voice conversation
Related Lessons
- aie-05-api-integration — STT providers are consumed through their APIs
- aie-08-streaming — Streaming transcription reuses chunked streaming
- aie-25-multimodal — Speech is one input of a multimodal system
- aie-43-realtime-ai — Streaming STT feeds real-time voice pipelines
- dsp-20-audio-ai — STT front-ends rest on digital audio signal processing
- ml-30-rnn-lstm — Sequence transcription parallels seq2seq modeling
- ml-29-cnn