Generative AI
Text-to-Speech and Voice Cloning
In January 2024, a robocall imitating Joe Biden's voice, reportedly generated with ElevenLabs tools, set off a deepfake scandal. By then the company, founded only in 2022, had already reported more than a million users. Technology that cost millions in 2018 is now available for the price of an API call.
- **Audiobooks on demand:** publishers generate narration in dozens of languages without studio recording - Storytel and similar platforms are testing mass AI narration
- **Voice assistants:** GPT-4o voice mode uses an end-to-end speech model that responds in about 300 ms on average, close to the pace of human turn-taking in conversation
- **Accessibility:** people with dyslexia and visual impairments gain access to any text through personalized TTS without waiting for manual narration
From WaveNet to Instant Voice Cloning
Modern neural speech synthesis began in 2016, when a DeepMind team led by Aäron van den Oord introduced WaveNet, an autoregressive model that generates audio sample by sample and was the first to sound nearly human. In 2017 Google presented Tacotron and Tacotron 2, which map text directly to a spectrogram that a vocoder then turns into audio, simplifying the whole pipeline. The cloning breakthrough came in 2023: Microsoft introduced VALL-E, which synthesizes speech in an unfamiliar voice from a three-second sample using neural codecs and language modeling over audio tokens. In parallel, the startup ElevenLabs (founded in 2022) made high-quality voice cloning available through an API.
Prerequisites
- Transformers and autoregressive generation
- The idea of spectrograms and audio as a sequence
- Diffusion as one approach to generation
TTS Architectures
Modern TTS has moved from concatenative synthesis (stitching recorded fragments) to neural end-to-end models. Tacotron 2 + WaveNet (2017-2018) was the first breakthrough: Tacotron 2 converts text to a mel-spectrogram, and a WaveNet vocoder reconstructs the audio. The FastSpeech family removed autoregression: the original FastSpeech paper reports about 38x faster end-to-end synthesis than an autoregressive Transformer TTS baseline at comparable quality, and FastSpeech 2 keeps the same parallel design. VALL-E from Microsoft (2023) goes further - it treats TTS as language modeling over audio codec tokens rather than as spectrogram regression.
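The speed gap comes down to how many sequential steps each design needs. The following is a minimal sketch on toy NumPy arrays, not a trained model: `length_regulate` (a hypothetical helper name) mirrors the length regulator described in the FastSpeech papers, which repeats each phoneme's hidden vector according to a predicted duration so that all mel frames can be decoded in one parallel pass. The hidden values and durations are made up for illustration.

```python
import numpy as np

def length_regulate(phoneme_hidden: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Expand per-phoneme vectors into per-frame vectors.

    phoneme_hidden: (num_phonemes, hidden_dim)
    durations:      (num_phonemes,) integer number of mel frames per phoneme
    returns:        (sum(durations), hidden_dim)
    """
    return np.repeat(phoneme_hidden, durations, axis=0)

# Toy input: 3 phonemes with 2-dimensional hidden states (illustrative values).
hidden = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
durations = np.array([2, 1, 3])  # in a real model these come from a duration predictor

frames = length_regulate(hidden, durations)
print(frames.shape)  # (6, 2)

# An autoregressive decoder would need one sequential step per frame,
# while a parallel decoder processes all expanded frames in a single pass.
autoregressive_steps = int(durations.sum())
parallel_passes = 1
```

With only six frames the difference looks small, but a few seconds of speech is hundreds of mel frames, which is where the gap between one step per frame and a single pass becomes significant.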
A mel-spectrogram is an intermediate representation: a 2D matrix (frequency x time) whose frequency axis follows the mel scale, which is roughly logarithmic at higher frequencies. A vocoder (WaveNet, HiFi-GAN, Vocos) reconstructs audio from the spectrogram. HiFi-GAN is at least an order of magnitude faster than the original autoregressive WaveNet at comparable quality.
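To make the frequency axis concrete, here is a small sketch of the Hz-to-mel mapping. It assumes the HTK-style formula, which is one common variant (libraries differ in the exact constants), and the band count and 8 kHz upper limit are illustrative choices rather than values from this lesson.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel formula (an assumption; other variants exist)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Place 10 band centers evenly on the mel axis between 0 Hz and 8 kHz.
n_mels = 10
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), n_mels)
centers_hz = mel_to_hz(mel_points)

# Gaps between neighboring centers, measured in Hz.
spacing_hz = np.diff(centers_hz)
print(np.round(centers_hz))
print(np.round(spacing_hz))
```

The bands are evenly spaced in mels but grow wider in Hz as frequency rises, so the representation spends more resolution on low frequencies, where pitch differences are easier to hear. A full mel-spectrogram applies a bank of such bands to every frame of a short-time Fourier transform, giving a matrix of shape (n_mels, n_frames).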
Why can FastSpeech 2 be dozens of times faster than an autoregressive model such as Tacotron 2 at similar quality?
Voice Cloning
Voice cloning is speech synthesis that matches a specific person's vocal characteristics from a short audio sample. XTTS v2 clones a voice from a clip of about 6 seconds. VALL-E from Microsoft demonstrated cloning from a 3-second sample while preserving even the acoustic environment (echo, noise). ElevenLabs raised a 19 million USD Series A in 2023 on exactly this technology, and the voice-over market came under pressure.
Technically, cloning typically works through a speaker embedding: a numerical vector encoding the acoustic characteristics of a voice. A speaker encoder is trained to distinguish voices; its output is passed as a conditioning input to the TTS decoder.
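A toy sketch of both halves of that idea, using made-up 3-dimensional vectors in place of a real speaker encoder's output. `cosine_similarity` and `condition_on_speaker` are hypothetical helper names, and concatenating the speaker vector to every text position is only one common way to condition a decoder, assumed here for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embeddings: close to 1.0 for the same voice."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def condition_on_speaker(text_hidden: np.ndarray, speaker_embedding: np.ndarray) -> np.ndarray:
    """Append the same speaker vector to every text position.

    text_hidden:       (seq_len, hidden_dim)
    speaker_embedding: (spk_dim,)
    returns:           (seq_len, hidden_dim + spk_dim)
    """
    tiled = np.tile(speaker_embedding, (text_hidden.shape[0], 1))
    return np.concatenate([text_hidden, tiled], axis=1)

# Made-up embeddings standing in for a trained speaker encoder's output.
alice_clip_1 = np.array([0.9, 0.1, 0.2])
alice_clip_2 = np.array([0.8, 0.2, 0.1])
bob_clip = np.array([0.1, 0.9, 0.3])

same_speaker = cosine_similarity(alice_clip_1, alice_clip_2)
different_speaker = cosine_similarity(alice_clip_1, bob_clip)
print(same_speaker > different_speaker)  # True

# Condition 4 text positions (hidden_dim=5) on Alice's voice.
text_hidden = np.zeros((4, 5))
decoder_input = condition_on_speaker(text_hidden, alice_clip_1)
print(decoder_input.shape)  # (4, 8)
```

Real speaker embeddings usually have a few hundred dimensions, but the principle is the same: clips of one voice land close together, and the decoder sees that vector at every step.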
What is a speaker embedding in the context of voice cloning?
Prosody Control
Prosody is the rhythm, tempo, stress and intonation of speech. Without control over it, TTS sounds monotonous. Modern models control prosody through: (1) explicit parameters (pitch, speed, energy), (2) style tokens - vectors extracted from reference audio, (3) SSML markup (Speech Synthesis Markup Language), (4) natural language instructions - the new InstructTTS approach.
SSML is an XML-based markup language for TTS, standardized by the W3C. It is supported by Google Cloud TTS, Azure, and Amazon Polly. It allows specifying pauses, stress, speed and pitch at the word and phrase level. InstructTTS (2023) accepts textual style descriptions: 'speak cheerfully with a slight accent'.
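A short illustration of what such markup looks like, built as a Python string and checked for well-formedness with the standard library. The elements used (`speak`, `break`, `prosody`, `emphasis`) come from the W3C SSML specification, but each engine supports a different subset of tags and attribute values, so treat the specific values as assumptions to verify against your provider's documentation. The sentence text is invented.

```python
import xml.etree.ElementTree as ET

ssml = (
    '<speak>'
    'Your order has shipped.'
    '<break time="500ms"/>'  # explicit pause
    '<prosody rate="slow" pitch="+2st">It should arrive on Friday.</prosody>'  # slower, higher
    ' Please keep your <emphasis level="strong">tracking number</emphasis>.'  # stressed phrase
    '</speak>'
)

# Parsing raises xml.etree.ElementTree.ParseError if the markup is malformed,
# which is a cheap check to run before sending SSML to a TTS service.
root = ET.fromstring(ssml)
tags = [element.tag for element in root.iter()]
print(tags)  # ['speak', 'break', 'prosody', 'emphasis']
```

Because SSML is ordinary XML, any user-supplied text inserted into it should be escaped (for example with `xml.sax.saxutils.escape`) so that characters such as `&` and `<` do not break the document.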
What is SSML and what is it used for in TTS?
Real-Time TTS and Streaming
Real-time TTS requires the first audio chunk in 200-300 ms - otherwise a conversation with an AI assistant feels unnatural. This dictates specific architectural choices: chunked generation (synthesis starts before the text is complete), streaming APIs, and latency-optimized models. The OpenAI Realtime API with GPT-4o audio delivers end-to-end voice conversation; OpenAI reports an average response latency of about 320 ms for GPT-4o.
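Chunked generation can be sketched as a generator that forwards each sentence to the synthesizer as soon as it is complete, instead of waiting for the whole reply. This is a simplified illustration: `sentence_chunks` is a hypothetical helper, the token list stands in for an LLM's streamed output, and splitting on `.`, `!` and `?` is a naive rule that will break on abbreviations such as "Dr.".

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")

def sentence_chunks(token_stream: Iterable[str]) -> Iterator[str]:
    """Yield a sentence as soon as it ends, so synthesis can start early."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the text stream ends mid-sentence.
    if buffer.strip():
        yield buffer.strip()

# Stand-in for tokens arriving one by one from a language model.
tokens = ["Hello", " there", ".", " How", " can", " I", " help", "?", " One", " moment"]
chunks = list(sentence_chunks(tokens))
print(chunks)  # ['Hello there.', 'How can I help?', 'One moment']
```

In a real pipeline each yielded chunk would be sent to the TTS engine immediately, so the first audio can play while later sentences are still being written.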
Time-to-First-Byte (TTFB) for TTS is the time from request start to the first audio chunk. Target values: under 300 ms for conversational bots, under 100 ms for live broadcast systems. Buffering strategy: accumulate 0.5 seconds of audio before playback to eliminate jitter.
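Both ideas fit in a few lines. The sketch below assumes raw 16 kHz, 16-bit mono PCM and uses a list of silent chunks in place of a real streaming response. `time_to_first_chunk` and `prebuffer` are hypothetical helper names, not part of any TTS SDK.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple

def time_to_first_chunk(make_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, bytes]:
    """Measure seconds from request start until the first audio chunk arrives."""
    start = time.perf_counter()
    first = next(iter(make_stream()))
    return time.perf_counter() - start, first

def prebuffer(chunks: Iterable[bytes], sample_rate: int, bytes_per_sample: int,
              min_buffer_s: float = 0.5) -> Iterator[bytes]:
    """Hold audio back until min_buffer_s is accumulated, then pass chunks through."""
    threshold = int(min_buffer_s * sample_rate * bytes_per_sample)
    held: List[bytes] = []
    held_bytes = 0
    started = False
    for chunk in chunks:
        if started:
            yield chunk
            continue
        held.append(chunk)
        held_bytes += len(chunk)
        if held_bytes >= threshold:
            started = True
            yield b"".join(held)
            held.clear()
    # The stream ended before the threshold was reached: play what we have.
    if not started and held:
        yield b"".join(held)

# 0.1 s of 16 kHz, 16-bit mono audio is 3200 bytes (silence as a stand-in).
chunk = bytes(3200)
ttfb_s, first_chunk = time_to_first_chunk(lambda: [chunk] * 8)
played = list(prebuffer([chunk] * 8, sample_rate=16000, bytes_per_sample=2))
print([len(c) for c in played])  # [16000, 3200, 3200, 3200]
```

Note the trade-off: holding back 0.5 seconds of audio delays the start of playback unless the model synthesizes faster than real time, so in practice the buffer size is tuned against the TTFB target.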
Misconception: a higher-quality TTS model is always better for voice assistants.
In practice, for conversational systems latency matters more than quality: users notice a 500 ms pause more than a slight reduction in naturalness.
Psychoacoustic research suggests that in dialogue, pauses over roughly 300 ms start to feel like a freeze. OpenAI's tts-1 model (fast) is often preferable to tts-1-hd (high quality) for interactive systems precisely because of TTFB.
What is Time-to-First-Byte (TTFB) in the context of streaming TTS?
Key Ideas
- **FastSpeech 2 vs Tacotron 2:** parallel spectrogram generation vs autoregression - FastSpeech-style models report speedups of around 38x over autoregressive baselines at comparable quality
- **Voice cloning:** speaker embeddings encode acoustic characteristics; XTTS v2 clones from about 6 seconds of audio
- **Latency = UX:** TTFB under 300 ms is critical for conversational systems - streaming + buffering allows audio playback before generation completes
Related Topics
TTS and voice cloning intersect with several areas of generative AI:
- Audio and Music Generation — Shared architectures (vocoders, diffusion) and the task of synthesizing audio signals from latent space
- Diffusion Models — Diffusion-based TTS (VoiceBox, Matcha-TTS) applies diffusion to mel-spectrograms
Questions for Reflection
- VALL-E clones a voice from a 3-second sample while preserving the acoustic environment. What ethical frameworks are needed for deploying such technology - and are current watermarking solutions sufficient?
- SSML allows explicit prosody control, InstructTTS accepts natural language descriptions. Which approach scales better for tasks with unusual intonation requirements?
- Streaming TTS with 200 ms TTFB creates the illusion of an instant response. How does this change what users expect of AI assistants - and what happens when the model does not know the answer but has already started speaking?
Related Lessons
- gai-09 — Modern TTS uses diffusion-based audio synthesis
- gai-13 — TTS techniques generalize to music and audio
- aie-24-text-to-speech — Production TTS integration and APIs
- nlp-06 — Prosody control is sequence modeling over phonemes
- aie-43-realtime-ai — Real-time streaming TTS is a latency problem
- dl-01