Machine Learning
Seq2Seq and the Attention Mechanism in NLP
Machine translation had been a dream since the 1950s - decades of rule-based approaches and statistical models delivered mediocre quality. In 2014, the Seq2Seq architecture and the Attention mechanism changed that: neural translation first matched the best statistical systems and, within a couple of years, approached human quality on some language pairs. The key insight turned out to be simple: let the model itself decide which parts of the input sentence to look at when generating each output word.
- **Google Translate** - in 2016 it switched from a statistical system to Seq2Seq with Attention (GNMT); the service processes 140 billion words per day across 100+ language pairs, and the new system was reported to reduce translation errors by about 60% compared with its statistical predecessor
- **Autocorrect on smartphones** - Google Gboard and Apple iOS keyboards use lightweight Seq2Seq models for next-word prediction and autocorrection in real time, running locally without internet
- **Chatbots and dialogue systems** - Seq2Seq became the foundation for response generation in conversational AI, from simple FAQ bots to voice assistants, where an input sequence (question) is transformed into an output sequence (answer)
Prerequisites
Sequence to sequence learning and the birth of attention
In 2014 a Google Brain team of Ilya Sutskever, Oriol Vinyals, and Quoc Le published "Sequence to Sequence Learning with Neural Networks." Their idea was deceptively simple: read an input sentence with one LSTM, compress it into a single fixed vector, then unroll a second LSTM to generate the output one token at a time. It worked well enough to rival the statistical machine translation systems of the day, and the encoder-decoder template quickly spread across translation, summarization, and dialogue. The same year, Dzmitry Bahdanau, working with Kyunghyun Cho and Yoshua Bengio in Montreal, spotted the weak point: cramming a whole sentence into one vector lost information on long inputs. Their answer was the attention mechanism, which let the decoder look back at every encoder state and weight them per step. Attention removed the bottleneck and, three years later, became the foundation of the Transformer.
Encoder-Decoder for NLP
Tasks like machine translation have a fundamental property: the length of the input sequence **does not have to match** the length of the output. "I love cats" (3 words) translates as "Я люблю кошек" (3 words), but "I like to walk" (4 words) becomes "Мне нравится гулять" (3 words). A standard RNN that emits one output per input step always produces an output of the same length as its input. The **Encoder-Decoder** (Seq2Seq) architecture solves this problem by splitting the model into two parts: the encoder reads the entire input, and the decoder generates output of arbitrary length.
The **encoder** is an RNN (usually LSTM or GRU) that processes the input sequence word by word. Each word first passes through an embedding layer (from the Word Embeddings lesson), then enters the recurrent cell. The final hidden state of the encoder is called the **context vector** - it is a fixed-size vector (for example, 256 or 512 numbers) that compresses *the entire content* of the input sentence.
The **decoder** is a second RNN that takes the context vector as its initial hidden state and generates the output sequence one token at a time. At each step the decoder receives the previously generated word and its current state, and predicts the next word. Generation continues until the decoder produces the special `<END>` token.
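The encoder and decoder roles described above can be sketched in plain Python. This is a minimal illustration, not a trained model: the weights are random, the vanilla tanh RNN cell stands in for an LSTM or GRU, and the tiny vocabularies, the `<START>` token, and all function names are assumptions made for the example.

```python
import math
import random

random.seed(0)

EMB_DIM = 3
HIDDEN_DIM = 4  # shared by encoder and decoder so the context vector fits
SRC_VOCAB = {"i": 0, "love": 1, "cats": 2}
TGT_VOCAB = ["<START>", "<END>", "я", "люблю", "кошек"]


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in, w_h, x, h):
    """One vanilla RNN step: h' = tanh(W_in x + W_h h)."""
    return [math.tanh(p + q) for p, q in zip(matvec(w_in, x), matvec(w_h, h))]


# Untrained (random) parameters: separate embeddings and cells per side.
src_emb = rand_matrix(len(SRC_VOCAB), EMB_DIM)
tgt_emb = rand_matrix(len(TGT_VOCAB), EMB_DIM)
enc_w_in, enc_w_h = rand_matrix(HIDDEN_DIM, EMB_DIM), rand_matrix(HIDDEN_DIM, HIDDEN_DIM)
dec_w_in, dec_w_h = rand_matrix(HIDDEN_DIM, EMB_DIM), rand_matrix(HIDDEN_DIM, HIDDEN_DIM)
w_out = rand_matrix(len(TGT_VOCAB), HIDDEN_DIM)


def encode(tokens):
    """Read the input word by word; the final hidden state is the context vector."""
    h = [0.0] * HIDDEN_DIM
    for tok in tokens:
        h = rnn_step(enc_w_in, enc_w_h, src_emb[SRC_VOCAB[tok]], h)
    return h


def decode(context, max_len=10):
    """Generate tokens one at a time until <END> or max_len is reached."""
    h = context  # context vector becomes the decoder's initial hidden state
    prev = TGT_VOCAB.index("<START>")
    output = []
    for _ in range(max_len):
        h = rnn_step(dec_w_in, dec_w_h, tgt_emb[prev], h)
        logits = matvec(w_out, h)
        prev = max(range(len(logits)), key=logits.__getitem__)  # greedy choice
        if TGT_VOCAB[prev] == "<END>":
            break
        output.append(TGT_VOCAB[prev])
    return output


context = encode(["i", "love", "cats"])
print(len(context), decode(context))
```

Because the weights are random, the generated tokens are meaningless; the point is the data flow: a variable-length input is reduced to `HIDDEN_DIM` numbers, and the output length is decided by the decoder, not by the input.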
**The information bottleneck problem:** The context vector is a fixed-length vector - say, 256 numbers. It must pack *the entire content* of the input sentence. For a short 5-word sentence this works fine. But for a 50-word paragraph? Or a whole document? Experiments showed that Seq2Seq quality drops sharply on long sentences (more than 20–30 words). Information about the first words gets diluted as it passes through dozens of RNN cells. This problem is solved by the **Attention** mechanism - covered in the next concept.
The encoder and decoder can have **different vocabularies** (vocab_size). When translating from English to French, the encoder works with English words (for example, 10,000 tokens) and the decoder with French words (for example, 8,000 tokens). In this basic setup the hidden_dim must match, because the context vector is passed from the encoder to the decoder as the initial hidden state.
Why does the quality of a Seq2Seq model (without Attention) drop sharply on long sentences?
The Attention Mechanism
In 2014, Dzmitry Bahdanau proposed the **Attention** mechanism - an elegant solution to the bottleneck problem. The idea: instead of compressing the entire input into one vector, allow the decoder at each generation step to **look at all encoder hidden states** and decide for itself which parts of the input sentence to focus on. When translating "I love cats" and generating the word "кошек", the decoder can look at the encoder state for the word "cats", largely ignoring "I" and "love".
How are attention weights computed? You need a **score function** that measures how compatible the current decoder state `s_t` is with each encoder state `h_i`. Bahdanau proposed **additive attention**: score(s_t, h_i) = v · tanh(W1 · s_t + W2 · h_i), where W1 and W2 are learnable matrices and v is a learnable vector. Luong in 2015 proposed a simpler variant - **multiplicative (dot-product) attention**: score(s_t, h_i) = s_t · h_i (dot product). Both approaches work, but dot-product attention is faster and, in a scaled form, became the basis for the Transformer.
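The two score functions can be compared side by side. The sketch below uses plain Python lists; the function names and the hand-picked values of `W1`, `W2`, and `v` are assumptions chosen only to make the arithmetic easy to follow (in a real model they are learned).

```python
import math


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def dot_score(s_t, h_i):
    """Luong-style multiplicative score: s_t · h_i."""
    return sum(a * b for a, b in zip(s_t, h_i))


def additive_score(s_t, h_i, w1, w2, v):
    """Bahdanau-style additive score: v · tanh(W1 s_t + W2 h_i)."""
    hidden = [math.tanh(p + q) for p, q in zip(matvec(w1, s_t), matvec(w2, h_i))]
    return sum(v_k * x for v_k, x in zip(v, hidden))


s_t = [1.0, 0.0]  # hypothetical decoder state
h_i = [0.5, 2.0]  # hypothetical encoder state
identity = [[1.0, 0.0], [0.0, 1.0]]

print(dot_score(s_t, h_i))  # 0.5
print(additive_score(s_t, h_i, identity, identity, [1.0, 1.0]))  # tanh(1.5) + tanh(2.0)
```

The dot-product score needs no parameters but requires `s_t` and `h_i` to have the same size; the additive score costs two matrix products per pair but can compare states of different dimensions.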
**Three steps of computing Attention:**
1. **Score** - compute the compatibility of decoder state s_t with each encoder state h_i:
   - Bahdanau: score_i = v · tanh(W1 · s_t + W2 · h_i)
   - Luong: score_i = s_t^T · h_i (dot-product)
2. **Normalize** - convert scores to probabilities via softmax:
   - alpha_i = exp(score_i) / sum_j(exp(score_j))
   - Sum of all alpha = 1.0
3. **Context** - weighted sum of encoder states:
   - context_t = sum_i(alpha_i · h_i)
   - Result: vector of the same size as h_i
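The three steps map directly onto a few lines of code. The following is a minimal sketch using dot-product scoring and plain Python lists; the function names and the toy encoder states are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(s_t, encoder_states):
    # Step 1: score each encoder state against the decoder state (dot product).
    scores = [sum(a * b for a, b in zip(s_t, h_i)) for h_i in encoder_states]
    # Step 2: normalize the scores into weights that sum to 1.
    alphas = softmax(scores)
    # Step 3: context vector = weighted sum of encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(alpha * h_i[k] for alpha, h_i in zip(alphas, encoder_states))
        for k in range(dim)
    ]
    return alphas, context


# Hypothetical 2-dimensional encoder states for a three-word input.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

alphas, context = attention(decoder_state, encoder_states)
print([round(a, 3) for a in alphas])  # the first state gets the smallest weight
print(context)
```

Here the scores are 0, 2, and 2, so the second and third encoder states share almost all of the weight and the context vector is dominated by them.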
Attention brought two key improvements. First, it **eliminated the bottleneck** - now the decoder at each step has direct access to all encoder states, not just the compressed context vector. Second, attention weights create an **interpretable alignment** between input and output words - you can visualize which input words the model was looking at when generating each output word. This visualization helps understand and debug the model.
How does Attention differ from basic Seq2Seq when generating each output word?
Beam Search
The decoder generates one word at a time, choosing from a vocabulary of thousands of tokens. The simplest strategy is **greedy decoding**: at each step, pick the word with the highest probability. Fast, but often produces poor translations. Why? Because the locally best word at step 2 may lead to a dead end at step 5. Example: greedy might pick "I am student" (each word plausible in isolation), even though "I am a student" is more natural.
**Beam Search** is a compromise between greedy decoding (check 1 path) and exhaustive search (check all paths, which is exponential in the output length). Beam Search maintains the **top-k** (beam width) best partial translations at each step. At each step, every one of the k candidates is extended with all possible next words, the cumulative probability of each extended path is computed, and the top-k are selected again. Typical beam width: 4–10 for machine translation.
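A small example shows how keeping more than one candidate changes the result. The sketch below runs beam search over a hand-written probability table; the table `MODEL`, the tokens, and the function names are all made up for illustration and stand in for a real decoder's next-word distribution.

```python
import math

# Toy "decoder": next-token probabilities given the prefix generated so far.
MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def next_probs(prefix):
    # In this toy model every sequence ends after two tokens.
    if len(prefix) == 2:
        return {"<END>": 1.0}
    return MODEL[prefix]


def beam_search(beam_width, max_len=5):
    beams = [(0.0, ())]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        # Extend every live candidate with every possible next token.
        candidates = [
            (logp + math.log(p), prefix + (tok,))
            for logp, prefix in beams
            for tok, p in next_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_width]:  # keep only the top-k
            (finished if seq[-1] == "<END>" else beams).append((logp, seq))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[0])


greedy = beam_search(beam_width=1)
beam = beam_search(beam_width=2)
print(greedy[1], round(math.exp(greedy[0]), 2))  # ('a', 'x', '<END>') 0.3
print(beam[1], round(math.exp(beam[0]), 2))      # ('b', 'z', '<END>') 0.36
```

With beam width 1 the search commits to "a" (0.6) and ends with total probability 0.3; with beam width 2 it keeps "b" alive and finds the path with probability 0.36.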
**Length normalization - an important detail:** The log-probability of a sentence equals the sum of the log-probabilities of its words. Every term is negative, so the longer the sentence, the lower the overall score. Without correction, Beam Search favors **short** sentences. The solution is **length normalization:**

score_normalized = (1 / L^alpha) * sum(log P(w_t))

where L is the sentence length and alpha is a hyperparameter (usually 0.6–0.7):
- alpha = 0: no normalization (prefers short sentences)
- alpha = 1: full normalization (average log-prob per word)
- alpha = 0.6: a common middle ground (values in the 0.6–0.7 range were reported for Google's GNMT system, which uses a slightly different penalty formula)
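A quick numeric check makes the effect of alpha concrete. The sketch below applies the formula above to two hypothetical candidates; the per-word probabilities are invented for the example.

```python
import math


def normalized_score(log_probs, alpha):
    """score = (1 / L^alpha) * sum of per-word log-probabilities."""
    return sum(log_probs) / (len(log_probs) ** alpha)


# Hypothetical candidates: a short one with mediocre word probabilities
# and a longer one whose words are individually more likely.
short = [math.log(0.5)] * 2  # raw sum ≈ -1.386
long_ = [math.log(0.7)] * 5  # raw sum ≈ -1.783

for alpha in (0.0, 0.6, 1.0):
    s = normalized_score(short, alpha)
    l = normalized_score(long_, alpha)
    winner = "short" if s > l else "long"
    print(f"alpha={alpha}: short={s:.3f} long={l:.3f} -> {winner}")
```

With alpha = 0 the short candidate wins purely because it has fewer negative terms; with alpha = 0.6 or 1.0 the longer candidate, whose words are individually more probable, comes out ahead.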
In practice, Beam Search with beam width 4–5 gives the main quality boost over greedy decoding. Increasing the beam width to 10–20 provides marginal improvement while doubling or tripling generation time. Interestingly, a **very large** beam width (100+) can actually *hurt* quality - the model starts finding probable but unnatural translations. This is why production systems typically use a beam width of 4–8.
Why is length normalization applied in Beam Search?
Teacher Forcing
When training a Seq2Seq model, the decoder must generate a sequence word by word. A question arises: what should be fed to the decoder as the "previous word" at each step - its own prediction or the correct word from the training data? If you feed the model's prediction, an early error (wrong first word) will cascade - each subsequent prediction will be based on incorrect context. **Teacher Forcing** solves this: during training the decoder receives the **correct previous word** (ground truth) instead of its own prediction.
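In code, Teacher Forcing amounts to shifting the target sentence by one position: the decoder input at step t is the ground-truth word from step t-1. The sketch below shows only this data preparation, not a training loop; the `<START>` token and the function name are assumptions for illustration.

```python
START, END = "<START>", "<END>"


def teacher_forcing_pairs(target_tokens):
    """Return (decoder input, expected output) for each decoder step."""
    decoder_inputs = [START] + target_tokens  # ground truth, shifted right
    expected_outputs = target_tokens + [END]  # what the decoder should predict
    return list(zip(decoder_inputs, expected_outputs))


for given, expected in teacher_forcing_pairs(["я", "люблю", "кошек"]):
    print(f"input: {given:8} -> should predict: {expected}")
```

Even if the model would have predicted a wrong first word, step 2 still receives "я" as input, so one early mistake does not corrupt the rest of the training example.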
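Teacher Forcing significantly **speeds up training** - the model typically converges roughly 2–5x faster because each decoder step receives high-quality context. It also opens the door to **parallelization**: since all decoder input tokens are known in advance (they are the ground truth), nothing has to wait for a sampled prediction. An RNN decoder still runs step by step, because each hidden state depends on the previous one, but in architectures without recurrence, such as the Transformer, all decoder positions can be computed in parallel. This is especially valuable on a GPU.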
**Exposure Bias - the downside of Teacher Forcing:** During training the decoder always sees correct previous words (ground truth). During inference (real usage) it sees its own predictions, which contain errors. The model was never trained to recover from errors! This is called **exposure bias** - a mismatch between training conditions and deployment conditions. Consequences:
- On training data the model performs excellently
- On real data errors accumulate: one mistake affects all subsequent words
- The longer the generated sequence, the stronger the degradation
**Scheduled Sampling** (Bengio et al., 2015) is a practical fix for exposure bias. The idea: at the start of training use 100% Teacher Forcing (fast convergence), then gradually increase the share of the model's own predictions. By the end of training the model runs almost entirely in autoregressive mode, as in real use. Typical schedules: linear (from 1.0 to 0.0 over N epochs), exponential (ratio = k^epoch, with k < 1). Teacher Forcing and its variants remain a standard training technique not only for Seq2Seq but also for autoregressive Transformer models such as GPT, where Teacher Forcing is implemented via **causal masking** - each position sees only the preceding ground truth tokens. BERT is a different case: it is trained with masked-token prediction rather than left-to-right generation.
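The two schedules and the per-step choice between ground truth and prediction can be sketched as follows. This is an illustrative outline only: the function names, the default `k = 0.9`, and the use of a per-token coin flip via `random.Random` are assumptions for the example.

```python
import random


def linear_ratio(epoch, total_epochs):
    """Teacher Forcing ratio falling linearly from 1.0 to 0.0."""
    return max(0.0, 1.0 - epoch / total_epochs)


def exponential_ratio(epoch, k=0.9):
    """Teacher Forcing ratio k^epoch, with 0 < k < 1."""
    return k ** epoch


def choose_decoder_input(ground_truth_token, predicted_token, ratio, rng):
    """With probability `ratio` feed the ground truth, otherwise the model's own prediction."""
    return ground_truth_token if rng.random() < ratio else predicted_token


rng = random.Random(0)
for epoch in (0, 5, 10):
    ratio = linear_ratio(epoch, total_epochs=10)
    token = choose_decoder_input("кошек", "собак", ratio, rng)
    print(f"epoch {epoch}: ratio={ratio:.1f}, decoder input={token}")
```

At a ratio of 1.0 the decoder always receives the ground truth (pure Teacher Forcing); at 0.0 it always receives its own prediction, which matches inference conditions.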
Myth: Seq2Seq has been obsolete since the Transformer appeared - nobody uses it anymore
Reality: Lightweight GRU-based Seq2Seq models are actively used on mobile devices and in real-time applications where the Transformer is too heavy
The Transformer requires significant compute because self-attention has quadratic complexity O(n^2) in the sequence length. On mobile devices, in IoT, and in tasks with strict latency requirements (autocorrect on keyboards, voice commands, streaming subtitle translation), compact GRU-based Seq2Seq models can run 10–100x faster at acceptable quality. Moreover, the concepts from Seq2Seq (Encoder-Decoder, Attention, Teacher Forcing, Beam Search) are the foundation for understanding the Transformer.
What is the exposure bias problem when using Teacher Forcing?
Key Ideas
- **Encoder-Decoder:** the Encoder (RNN) reads the input and compresses it into a fixed-size context vector; the Decoder (a separate RNN) generates output word by word - this allows working with variable-length sequences but creates a bottleneck on long sentences
- **Attention:** at each step the decoder computes attention weights for all encoder states via score + softmax and creates a weighted context vector - this eliminates the bottleneck and provides interpretable word alignment between input and output
- **Beam Search:** instead of greedily picking one best word at each step, it maintains the top-k candidates (beam width = 4–8), and length normalization prevents a preference for short sentences
- **Teacher Forcing:** during training the decoder receives correct previous words instead of its own predictions, which accelerates convergence but creates exposure bias - Scheduled Sampling addresses this by gradually transitioning the model to autoregressive mode. Together, these techniques paved the way from the 1950s dream of machine translation to the Transformer and GPT
Related Topics
Seq2Seq connects recurrent networks with modern NLP architectures and is the foundation for understanding the Transformer:
- RNN and LSTM — The Encoder and Decoder in Seq2Seq are built on RNN/LSTM/GRU - recurrent networks that process sequences. Without understanding hidden states and vanishing gradients it is impossible to understand why Attention is needed
- Transformer — The Transformer (2017) replaced the RNN in Seq2Seq with self-attention but kept the Encoder-Decoder structure, the Attention mechanism, and Teacher Forcing. Seq2Seq is the direct predecessor of the Transformer
- Word Embeddings — The embedding layer in encoder and decoder converts words into dense vectors. The quality of embeddings directly affects Seq2Seq quality - pretrained Word2Vec or GloVe accelerate convergence
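- BERT and GPT — BERT is built from the Transformer encoder (bidirectional), GPT from the Transformer decoder (autoregressive). Both inherit the Attention mechanism and the Encoder-Decoder lineage, scaled from hundreds of millions to billions of parameters. GPT is also trained with Teacher Forcing, while BERT is trained with masked-token prediction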
Questions for Reflection
- Attention computes a weighted sum of all encoder states at each decoder step. This means the decoder has access to the entire input. Then why is the encoder needed at all - why not feed raw embeddings of input words directly into Attention?
- Beam Search with beam width 100+ can actually hurt translation quality compared to beam width 5. Why doesn't finding a more probable sequence guarantee a better translation? What does this say about the trained model?
- Teacher Forcing is used not only in Seq2Seq but also in GPT (via causal masking). Can exposure bias be eliminated entirely? What alternative training approaches for generative models could you propose?
Related Lessons
- ml-35-word-embeddings — Encoder inputs are word embeddings
- ml-30-rnn-lstm — Classic seq2seq uses recurrent encoders
- ml-31-transformers — Attention generalized seq2seq decoding
- ml-37-bert-gpt — Encoder-decoder evolved into LLMs
- alg-14-dijkstra — Beam search resembles bounded path search
- aie-03-llm-fundamentals