Machine Learning

Transformers and the Attention Mechanism

In 2017, eight researchers from Google published a paper with an audacious title: Attention Is All You Need. They proposed abandoning recurrent networks and convolutions entirely and replacing them with a single mechanism, attention. The community was skeptical: how could one mechanism replace architectures proven over years? Within a few years, that one paper had overturned the whole field. GPT, BERT, ChatGPT, DALL-E, Stable Diffusion, Claude, Gemini, LLaMA - all are built on the Transformer architecture or rely on its attention mechanism. One idea, 8 authors, 15 pages - and the foundation of most of modern AI.

**GPT-4 and ChatGPT** - a decoder-only Transformer generating text token by token via masked self-attention. Models of this family have up to hundreds of billions of parameters and are trained on trillions of text tokens (exact figures for GPT-4 are not public), but at the core is the very same attention mechanism from the 2017 paper
**Machine translation (Google Translate)** - in an encoder-decoder Transformer, the encoder encodes the source sentence and the decoder generates the translation using cross-attention to the encoded input. Quality has improved so much that for many language pairs it approaches human level
**DALL-E and Stable Diffusion** - a Transformer-based text encoder processes the prompt through attention, creating a representation that guides image generation. In Stable Diffusion, cross-attention links text tokens to spatial regions of the image, which is how words in the prompt influence what appears where

Prerequisites

Recurrent Networks: RNN, LSTM, GRU

Attention Is All You Need

In 2017 a team led by Ashish Vaswani at Google published a paper with the bold title "Attention Is All You Need." Until then, the dominant sequence models relied on recurrence or convolution; recurrent networks in particular processed tokens one after another. The Transformer threw that out and kept only the attention mechanism, letting every token attend to every other token in parallel. This made training far faster on modern hardware and removed the long-range memory bottleneck of RNNs. The architecture turned out to scale remarkably well, and within a few years it became the foundation of BERT, GPT, and nearly every large language model that followed.

Self-Attention: who looks at what

Consider the sentence: *The animal didn't cross the street because it was too tired*. What does the word "it" refer to? A human immediately understands - to "animal". But for a model this is a task: it needs to figure out which words in the sentence are **related** to each other. Self-attention solves exactly this task: each token (word) "looks" at all the other tokens and decides which ones to pay more attention to. The word "it" will learn to look at "animal" because the context "was too tired" points to a living creature, not a street.

How does it work technically? Each token is represented by three vectors: **Query** ("what am I looking for?"), **Key** ("what do I contain?"), and **Value** ("what information do I pass on?"). These three vectors are obtained by multiplying the token's embedding by three learnable matrices W_Q, W_K, W_V. Intuition: Query is the question, Key is the answer to "do I match?", Value is the actual information to be passed.
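The projection step can be sketched in a few lines of plain Python. The sizes (d_model = 4, d_k = 2), the embedding, and the matrix values are made-up illustrations; in a real model W_Q, W_K, W_V are learned during training.

```python
def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Toy sizes (assumptions for illustration): d_model = 4, d_k = 2.
embedding = [1.0, 0.0, 2.0, -1.0]

# Hand-picked stand-ins for the learnable matrices W_Q, W_K, W_V.
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0], [0.0, 0.5]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.5], [0.5, 0.0]]
W_V = [[1.0, 1.0], [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

# One token embedding yields three different vectors.
query = matvec(embedding, W_Q)   # "what am I looking for?"
key = matvec(embedding, W_K)     # "what do I contain?"
value = matvec(embedding, W_V)   # "what information do I pass on?"

print(query, key, value)
```

The same embedding produces three distinct vectors because each matrix projects it differently; that is what lets a token ask one question (Query) while advertising something else (Key).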

The self-attention formula: **Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V**. Here Q @ K^T is the matrix of dot products of all Queries with all Keys (size: seq_len x seq_len). Division by sqrt(d_k) is scaling, where d_k is the dimension of the Key vector. Softmax turns scores into probabilities (sum = 1 for each row). Multiplying by V gives the weighted sum of Value vectors.
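A minimal sketch of the formula in plain Python, assuming Q, K, V are already computed and stored as lists of rows. The helper names (`softmax`, `matmul`, `transpose`, `attention`) and the toy matrices are illustrative, not taken from any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Matrix product of A (n x m) and B (m x p), both lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def attention(Q, K, V):
    """softmax(Q @ K^T / sqrt(d_k)) @ V; also returns the attention weights."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                        # seq_len x seq_len
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]              # each row sums to 1
    return matmul(weights, V), weights

# Toy example: 3 tokens, d_k = 2 (values chosen by hand).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Row i of `weights` says how much token i attends to every token, and row i of `output` is the corresponding weighted sum of Value vectors.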

**Why divide by sqrt(d_k)?** If the components of Q and K are independent with zero mean and unit variance, the variance of each dot product in Q @ K^T grows proportionally to d_k, so typical scores have magnitude around sqrt(d_k). With d_k = 512 that is a standard deviation of about 22.6, and individual scores can reach many tens. Softmax of such large numbers gives an almost one-hot vector (one position ~1.0, others ~0.0), and gradients become nearly zero - training stalls. Dividing by sqrt(d_k) brings the variance of the scores back to ~1.0 regardless of dimension. This allows softmax to operate in a "soft" mode, distributing attention across multiple tokens rather than fixating on just one.
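The effect of the scaling can be checked with a small simulation. This sketch assumes Query and Key components are independent standard normal values, which is an idealization rather than a property of trained models; the function name and sample count are arbitrary.

```python
import random
import statistics

random.seed(0)

def dot_product_variance(d_k, samples=2000):
    """Estimate Var(q . k) before and after dividing by sqrt(d_k)."""
    dots = []
    for _ in range(samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    raw = statistics.pvariance(dots)
    scaled = statistics.pvariance([d / d_k ** 0.5 for d in dots])
    return raw, scaled

raw_64, scaled_64 = dot_product_variance(64)
raw_256, scaled_256 = dot_product_variance(256)

print(f"d_k=64:  raw variance ~{raw_64:.1f}, scaled ~{scaled_64:.2f}")
print(f"d_k=256: raw variance ~{raw_256:.1f}, scaled ~{scaled_256:.2f}")
```

Under these assumptions the raw variance tracks d_k, while the scaled variance stays close to 1 for both dimensions.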

In self-attention, each token is represented by three vectors Q, K, V. What exactly does the matrix Q @ K^T compute?

Multi-Head Attention: parallel perspectives

A single self-attention head learns to focus on a particular type of relationship. But language has many types of dependencies simultaneously: syntactic (subject-verb), semantic (synonyms, antonyms), coreference ("it" refers to "animal"), temporal (sequence of events). One head cannot capture all types of relationships at once. The solution - **run several heads in parallel**, each with its own matrices W_Q, W_K, W_V.

How multi-head attention is structured: instead of one set {W_Q, W_K, W_V} of dimensionality d_model x d_model, we create h sets {W_Q_i, W_K_i, W_V_i} of dimensionality d_model x d_k, where d_k = d_model / h. Each head operates on a smaller-dimensional vector, but there are multiple heads. In the original Transformer: d_model = 512, h = 8 heads, d_k = 512 / 8 = 64 per head.
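The structure can be sketched as follows, with toy sizes (d_model = 8, h = 2) instead of the paper's 512 and 8, and random matrices standing in for learned weights. All helper names are illustrative. The final projection W_O (d_model x d_model) mixes the concatenated head outputs.

```python
import math
import random

random.seed(0)
d_model, h = 8, 2            # toy sizes; the original Transformer uses 512 and 8
d_k = d_model // h
seq_len = 3

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(len(K[0])) for s in row]) for row in scores]
    return matmul(weights, V)

X = rand_matrix(seq_len, d_model)   # stand-in for token embeddings

# Each head has its own W_Q_i, W_K_i, W_V_i of size d_model x d_k.
heads = []
for _ in range(h):
    W_Q, W_K, W_V = (rand_matrix(d_model, d_k) for _ in range(3))
    heads.append(attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)))

# Concatenate head outputs per token: seq_len x (h * d_k) = seq_len x d_model.
concat = [sum((head[t] for head in heads), []) for t in range(seq_len)]

# Final output projection W_O.
W_O = rand_matrix(d_model, d_model)
output = matmul(concat, W_O)

print(len(heads), "heads of width", len(heads[0][0]), "->", len(output[0]), "features per token")
```

Production implementations usually compute all heads with one large matrix multiplication and a reshape; the per-head loop here is only for readability.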

**Number of parameters in multi-head attention:** For one head: 3 matrices (W_Q, W_K, W_V) of size d_model x d_k = 512 x 64 = 32,768 parameters x 3 = 98,304. For h = 8 heads: 98,304 x 8 = 786,432 parameters. Plus the final projection W_O of size d_model x d_model = 512 x 512 = 262,144. **Total: 1,048,576 parameters (about 1M)** per multi-head attention layer, not counting bias terms. Important: the parameter count is **the same** regardless of whether we use 1 head with d_k = 512 or 8 heads with d_k = 64. But 8 heads can learn different types of dependencies in parallel!
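The arithmetic above can be reproduced with a short helper. The function name `mha_param_count` is made up for this sketch, and bias terms are ignored, as in the text.

```python
def mha_param_count(d_model, h):
    """Weight counts for multi-head attention, ignoring biases."""
    d_k = d_model // h
    per_head = 3 * d_model * d_k      # W_Q, W_K, W_V for one head
    all_heads = h * per_head
    w_o = d_model * d_model           # output projection W_O
    return per_head, all_heads, w_o, all_heads + w_o

per_head, all_heads, w_o, total = mha_param_count(512, 8)
print(per_head, all_heads, w_o, total)

# Same total with a single head of width 512.
single_head_total = mha_param_count(512, 1)[3]
print(single_head_total)
```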

Research has shown that different heads do indeed specialize. In trained models you can observe: some heads track syntactic dependencies (subject-verb), others positional patterns ("I look at the previous token"), others rare but important semantic connections. Some heads turn out to be redundant - pruning experiments show that 20-40% of heads can be removed without significant loss of quality.

In a Transformer with d_model=512 and 8 heads, what is the dimension of Q, K, V for each individual head?

Positional Encoding: adding order

Self-attention has a fundamental problem: it **knows nothing about the order of tokens**. If you shuffle the words in a sentence, each token still receives exactly the same attention weights and the same output vector; the outputs are simply shuffled in the same way (self-attention is permutation-equivariant). The mechanism itself contains no information about *positions*. For it, "the dog bit the man" and "the man bit the dog" are indistinguishable: the same set of word embeddings produces the same representation for each word. RNNs solved this automatically - they processed tokens sequentially, one by one. The Transformer processes all tokens in parallel, so positional information must be **added explicitly**.
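The order-blindness is easy to verify on a toy example. This sketch uses self-attention with identity projections (Q = K = V = X) and hypothetical three-dimensional embeddings for "dog", "bit", "man"; no positional encoding is added.

```python
import math

def self_attention(X):
    """Self-attention with Q = K = V = X and no positional information."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Hypothetical toy embeddings (values are arbitrary).
dog, bit, man = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 1.0]

out_a = self_attention([dog, bit, man])   # "dog bit man"
out_b = self_attention([man, bit, dog])   # "man bit dog"

# The representation of "dog" is the same in both sentences (up to rounding).
print(out_a[0])
print(out_b[2])
```

Swapping the word order only swaps the rows of the output, so nothing downstream can tell who bit whom unless position is injected separately.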

The original paper "Attention Is All You Need" uses **sinusoidal positional encoding**: for each position and each embedding dimension, a value is computed using sin and cos with different frequencies. Formula: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Here pos is the token's position and i indexes the sin/cos pair (i = 0 ... d_model/2 - 1). Each pair uses its own frequency; the wavelengths form a geometric progression from 2*pi (fast) to 10000 * 2*pi (slow).
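A direct transcription of the formula into plain Python, as a sketch that assumes an even d_model. The function name and the toy sizes (max_len = 50, d_model = 16) are illustrative choices.

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding table of shape max_len x d_model (d_model even)."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)       # even dimensions: sin
            pe[pos][2 * i + 1] = math.cos(angle)   # odd dimensions: cos
    return pe

pe = positional_encoding(max_len=50, d_model=16)

# Position 0 is sin(0) = 0 in even slots and cos(0) = 1 in odd slots.
print(pe[0])
```

In the original architecture this table is added element-wise to the token embeddings before the first layer.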

**Three approaches to positional encoding:**
1. **Sinusoidal (original Transformer):** fixed, not learned. Pro: can in principle extrapolate to lengths not seen during training, though in practice quality tends to degrade. Con: doesn't adapt to the task.
2. **Learned embeddings (BERT, GPT):** a trainable vector per position. Pro: adapts to data, usually slightly better quality. Con: fixed maximum length (BERT = 512, GPT-2 = 1024).
3. **Relative positional (T5, ALiBi):** encodes not the absolute position but the distance between tokens. Pro: better generalization to long sequences. ALiBi adds a penalty to attention scores proportional to distance: the further tokens are from each other, the lower the score.
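The ALiBi penalty can be sketched as a bias matrix that is added to the attention scores before softmax. This is a simplified illustration: the slope schedule follows the geometric sequence described in the ALiBi paper for a power-of-two number of heads, the function names are made up, and the sketch assumes causal (decoder-style) attention, so future positions are set to -inf.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: geometric sequence starting at 2^(-8/num_heads)."""
    start = 2 ** (-8 / num_heads)
    return [start ** (k + 1) for k in range(num_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for keys at or before query i, -inf otherwise."""
    return [[-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])

print(slopes)
for row in bias:
    print(row)
```

Heads with a large slope focus on nearby tokens, while heads with a small slope can still reach far back.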

An important property of sinusoidal encoding: the dot product of PE(pos) and PE(pos + k) depends only on the shift k, not on the absolute position pos. This means the distance between positions 5 and 8 "looks" the same as the distance between positions 100 and 103. The model can learn patterns like "next word" or "two words later" regardless of absolute position in the sentence.
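The shift property follows from the identity cos(a - b) = cos a cos b + sin a sin b. A short derivation, writing ω_i for the frequency of the i-th sin/cos pair (the symbol ω_i is introduced here for convenience and does not appear in the formulas above):

```latex
\omega_i = 10000^{-2i/d_{\text{model}}}, \qquad i = 0, \dots, \tfrac{d_{\text{model}}}{2} - 1

PE(pos) \cdot PE(pos + k)
  = \sum_{i} \Big[ \sin(\omega_i\, pos)\,\sin(\omega_i (pos + k))
                 + \cos(\omega_i\, pos)\,\cos(\omega_i (pos + k)) \Big]
  = \sum_{i} \cos\big(\omega_i (pos + k) - \omega_i\, pos\big)
  = \sum_{i} \cos(\omega_i k)
```

The final sum contains k but not pos, which is exactly the claim: the dot product depends only on the shift.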

Why does a Transformer need positional encoding when attention already computes interactions between all pairs of tokens?

Encoder-Decoder architecture: building the Transformer

Now we have all the building blocks: self-attention, multi-head attention, positional encoding. It's time to assemble the full Transformer. The original architecture (2017) consists of two stacks: the **Encoder** encodes the input sequence into a continuous representation, and the **Decoder** generates the output sequence token by token using that representation. Encoder and Decoder are made of identical blocks, repeated N times (N = 6 in the original paper).

Each Encoder block contains two sub-layers: (1) multi-head self-attention and (2) a feed-forward network (two linear layers with ReLU: FFN(x) = max(0, x @ W1 + b1) @ W2 + b2, dimensions d_model=512 -> d_ff=2048 -> d_model=512). Each Decoder block contains three sub-layers: (1) **masked** multi-head self-attention (masking future positions), (2) **cross-attention** (Key and Value from Encoder, Query from Decoder), and (3) a feed-forward network. Around each sub-layer there is a **residual connection + Layer Normalization**: output = LayerNorm(x + Sublayer(x)).
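The feed-forward sub-layer and the Add & LayerNorm wrapper can be sketched for a single token vector. The sizes (d_model = 2, d_ff = 4) and weight values are toy assumptions, and this `layer_norm` omits the learned gain and bias that real implementations include.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x @ W1 + b1) @ W2 + b2 for a single token vector x."""
    hidden = [max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
              for j in range(len(b1))]
    return [sum(hidden[i] * W2[i][j] for i in range(len(hidden))) + b2[j]
            for j in range(len(b2))]

def add_and_norm(x, sublayer_out):
    """output = LayerNorm(x + Sublayer(x))"""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

# Toy sizes (assumptions): d_model = 2, d_ff = 4; the paper uses 512 and 2048.
W1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
b2 = [0.0, 0.0]

x = [1.0, 2.0]
ffn_out = feed_forward(x, W1, b1, W2, b2)   # expand to d_ff, ReLU, project back
y = add_and_norm(x, ffn_out)                # residual connection, then LayerNorm

print(ffn_out, y)
```

The same `add_and_norm` wrapper would be applied around the attention sub-layers; only the function producing `sublayer_out` changes.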

**Three types of attention in a Transformer:**
1. **Encoder self-attention:** each input token looks at ALL input tokens. No masks - full visibility.
2. **Masked decoder self-attention:** each output token looks only at itself and PREVIOUS output tokens. Scores for future positions are set to -inf before softmax, so their weights become 0. This prevents the model from "peeking" at the answer during training.
3. **Cross-attention (encoder-decoder):** Query from decoder, Key and Value from encoder. The decoder "asks questions" to the encoded input. For example, when translating "I love cats", the decoder at the step of generating the target-language word for "love" focuses on "love" from the encoder through cross-attention.
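The causal mask used in decoder self-attention can be sketched like this. The function name and the score values are illustrative; the scores are assumed to be already scaled.

```python
import math

def causal_attention_weights(scores):
    """Mask future positions of a square score matrix with -inf, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)                             # finite: the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]    # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
weights = causal_attention_weights(scores)

for row in weights:
    print([round(w, 3) for w in row])
```

Even the large score 5.0 in the second row has no effect, because it belongs to a future position and is masked before softmax.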

**Residual connections** - output = x + Sublayer(x) - mitigate the vanishing gradient problem in a deep stack (6+ layers). The gradient can flow directly through the addition (skip connection), bypassing the layer. **Layer Normalization** normalizes activations along the last dimension (d_model), stabilizing training. The Add & LayerNorm combination after every sub-layer is a critical component without which deep Transformers are very hard to train.

Why did the Transformer beat RNNs? Three reasons. 1. **Parallelism**: RNNs process tokens sequentially (each step depends on the previous one), while Transformers process all tokens of a sequence in parallel during training, which makes far better use of GPUs and speeds training up substantially. 2. **Long-range dependencies**: in RNNs information from a distant token must "flow" through all intermediate steps, fading away. In a Transformer any two tokens are connected directly via attention (distance = 1 layer). 3. **Scalability**: increasing the model (more layers, heads, d_model) gives predictable quality gains - the scaling laws described by OpenAI researchers (Kaplan et al., 2020).

Misconception: the Transformer has completely replaced all previous architectures - CNN and RNN are no longer needed

In practice: CNNs are often still more efficient for fixed-size image processing on constrained resources, and RNN/LSTM remain a strong choice for streaming data on edge devices with strict memory and latency requirements

Key ideas

**Self-attention:** each token computes three vectors (Query, Key, Value), the matrix Q @ K^T gives compatibility scores, softmax with sqrt(d_k) scaling turns them into weights, and multiplying by V yields a context-enriched representation of each token
**Multi-head attention:** instead of one large head - h parallel heads with reduced dimension d_k = d_model / h, each specializing in its own type of dependency (syntax, semantics, coreference), results are concatenated and projected back
**Positional encoding:** self-attention is permutation-equivariant (blind to token order), so positional information is added explicitly - via sinusoidal functions (fixed), learned embeddings (BERT/GPT), or relative encodings (ALiBi/T5)
**Encoder-Decoder:** the encoder encodes input through self-attention + FFN (N times), the decoder generates output through masked self-attention + cross-attention to encoder + FFN (N times), with residual connections and LayerNorm around each sub-layer
**One mechanism - all of modern AI:** those 8 Google researchers in 2017 were right - attention truly turned out to be all that was needed. GPT, BERT, Claude, DALL-E all stand on the Transformer foundation, differing mainly in which part of the architecture they use and what data they were trained on

Questions for reflection

The Transformer has quadratic complexity O(n^2) in sequence length n because of the attention matrix. What approaches can address this problem for long texts (books, codebases), and what trade-offs do they introduce?
Decoder-only models (GPT, LLaMA, Claude) dominate in 2024-2026, even though the original Transformer was encoder-decoder. Why did decoder-only turn out to be sufficient for most tasks, and for which tasks is encoder-decoder still preferable?
Positional encoding adds position information but also limits the maximum context length. How do models like GPT-4 or Claude work with contexts of hundreds of thousands of tokens if they were trained on shorter lengths?

Related lessons

ml-30-rnn-lstm — Transformers solve RNN sequential bottleneck
ml-37-bert-gpt — BERT and GPT are transformer architectures
ml-36-seq2seq — Attention generalized the seq2seq decoder
la-07-matrix-multiply — Attention is scaled dot-product matrix math
la-13-eigenvectors — Attention weights project onto value subspaces
aie-03-llm-fundamentals
aie-13-advanced-rag