Linear Algebra
Dot Product: The Operation Holding Neural Networks Together
Every time Spotify recommends a song or Qdrant searches a billion embeddings, it runs one operation: the dot product. Cosine similarity between two embedding vectors is computed billions of times per second across the world's ML infrastructure.
- Recommendation: cosine similarity between embeddings = dot product of normalized vectors
- Transformer attention: Q·Kᵀ is a matrix of all pairwise dot products between tokens
- Physics: work done by force F over displacement d = F·d
- Signal processing: correlation of two signals = their dot product over time
- Computer vision: similarity between feature maps in a neural network layer
**ChatGPT reads a question and decides which words matter to which others - through one operation.** FAISS finds the nearest documents among a billion in milliseconds - through the same one. Spotify compares a user's taste with a track - again, the same. That operation is the dot product (also called the inner product): the sum of element-wise products of two vectors. It turns the question "how similar are two vectors?" into a single number, and that number sits at the core of much of modern ML.
**What this lesson actually teaches**: not "how to multiply components", but why one simple operation underlies cosine similarity, the attention mechanism in Transformers, and projection in PCA. By the end, all three will turn out to be the same thing.
The formula: as simple as it gets
The dot product is the sum of pairwise products of components. No tricks - just multiply the i-th by the i-th and add everything up:
**Formula (n-dimensional case):** a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ = Σ aᵢbᵢ

**Example:** a = (3, 1, 2), b = (1, 4, 2)

a · b = 3×1 + 1×4 + 2×2 = 3 + 4 + 4 = 11

**In numpy - one line:** a @ b, or np.dot(a, b), or (a * b).sum()
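As a quick check of the example above, here is a minimal sketch that computes the same dot product by hand and with the three numpy spellings. The variable names are illustrative only.

```python
import numpy as np

a = np.array([3, 1, 2])
b = np.array([1, 4, 2])

# The definition spelled out: multiply matching components, then add up.
manual = sum(x * y for x, y in zip(a, b))

# Three equivalent numpy spellings of the same operation.
via_matmul = a @ b
via_dot = np.dot(a, b)
via_sum = (a * b).sum()

print(manual, via_matmul, via_dot, via_sum)  # 11 11 11 11
```

All four agree, so the choice between them is mostly a matter of readability; `a @ b` is the shortest.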
Geometric meaning: angle between vectors
The same operation has a second face - through the angle between vectors. Both formulas give the same number, but lead to very different intuitions:
a · b = |a| · |b| · cos(θ), where θ is the angle between the vectors.

**What to read from the formula:**
- cos(0°) = 1 → same direction → maximum
- cos(90°) = 0 → perpendicular → zero
- cos(180°) = -1 → opposite directions → minimum

**From this, the angle formula:** cos(θ) = (a · b) / (|a| · |b|)

For unit vectors (|a| = |b| = 1): cos(θ) = a · b, so a single dot product gives the cosine with no division. This is one reason embedding APIs such as OpenAI's return vectors already normalized to length 1.
| Value of a·b | Angle θ | What it means |
|---|---|---|
| a·b > 0 | 0° ≤ θ < 90° | Vectors point broadly "in the same direction" |
| a·b = 0 | 90° | Vectors are perpendicular (orthogonal): neither has any component along the other |
| a·b < 0 | 90° < θ ≤ 180° | Vectors point broadly "in opposite directions" |
| a·b = \|a\|·\|b\| | 0° | Perfectly aligned - maximum similarity |
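The angle formula can be sketched in a few lines of numpy. The helper name `angle_degrees` is an assumption made for this illustration, and the function assumes both vectors are non-zero.

```python
import numpy as np

def angle_degrees(a, b):
    """Angle between two non-zero vectors, in degrees."""
    cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Rounding error can push the ratio slightly outside [-1, 1],
    # which would make arccos return nan, so clip it first.
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

a = np.array([1.0, 0.0])
same = angle_degrees(a, np.array([2.0, 0.0]))        # same direction
perpendicular = angle_degrees(a, np.array([0.0, 3.0]))
opposite = angle_degrees(a, np.array([-1.0, 0.0]))

print(same, perpendicular, opposite)  # approximately 0, 90, 180
```

Note that the lengths of the vectors (2 and 3 here) do not affect the result: only the direction does.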
Application 1: cosine similarity - the heart of search
**Cosine similarity** is the dot product of normalized vectors. Vector search tools such as Qdrant, FAISS, and pgvector all support dot-product (inner-product) search, and once embeddings are normalized up front, cosine nearest-neighbor search reduces to finding the maximum dot product.
**Formula:** cosine_sim(a, b) = (a · b) / (|a| · |b|)

Range: -1 (opposite directions) to 1 (same direction).

For normalized vectors (|a| = |b| = 1): cosine_sim(a, b) = a · b - just a dot product!

**Example with embeddings (illustrative values):**
- sim("cat", "kitten") ≈ 0.95 - almost the same
- sim("cat", "dog") ≈ 0.70 - similar (both animals)
- sim("cat", "algebra") ≈ 0.10 - unrelated
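**Why not Euclidean distance?** For embeddings, direction usually matters more than length. Two vectors that point in the same direction but differ in magnitude have cosine similarity 1, yet their Euclidean distance is not zero. Cosine similarity ignores magnitude, which makes it robust to effects such as text length that change the size of a vector without changing its meaning.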
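A minimal sketch of both points, assuming tiny made-up 3-dimensional "embeddings" (the numbers and the helper names `cosine_sim` and `normalize_rows` are hypothetical, not taken from any real model or library):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def normalize_rows(m):
    """Scale every row of a 2-D array to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Hypothetical document embeddings; row 1 is row 0 scaled by 2.
docs = np.array([[1.0, 2.0, 0.0],
                 [2.0, 4.0, 0.0],
                 [0.0, 1.0, 5.0]])
query = np.array([1.0, 2.0, 0.1])

# Normalize once, then every cosine similarity is a plain dot product.
docs_unit = normalize_rows(docs)
query_unit = query / np.linalg.norm(query)
scores = docs_unit @ query_unit

# Euclidean distance on the raw vectors is sensitive to length.
distances = np.linalg.norm(docs - query, axis=1)

print(scores)     # rows 0 and 1 get the same cosine score
print(distances)  # but row 1 is much farther away in Euclidean terms
```

The matrix-vector product `docs_unit @ query_unit` scores every document in one call, which is the pattern that scales to large collections.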
Application 2: attention in Transformers
The self-attention mechanism in GPT, BERT, and other Transformers is built on a matrix of dot products. Each token asks every other token: "how relevant are you to me?" - and gets the answer through a dot product.
**Attention score between tokens i and j:** score(i, j) = qᵢ · kⱼ / sqrt(d_k)

where:
- qᵢ - query vector of token i ("what I'm looking for")
- kⱼ - key vector of token j ("what I offer")
- d_k - dimension of the query and key vectors (e.g. 64); dividing by sqrt(d_k) keeps the scores from growing with dimensionality

**All pairs at once (matrix form):** Scores = Q · Kᵀ / sqrt(d_k), shape: (seq_len, seq_len)

Each element of Scores is one dot product. For a model such as GPT-4 with a context of 8192 tokens, that's 8192² ≈ 67 million dot products per attention head in each layer.
**All of attention is dot products.** Scores = Q·Kᵀ is literally a matrix of dot products. Softmax normalizes them into weights, output = weights · V. Understanding the dot product means understanding the core of Transformers.
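The three steps (scores, softmax, weighted sum of V) can be sketched for a single attention head in plain numpy. This is a minimal illustration with random inputs and no masking or learned projections; the function names are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention; returns the output, the weights and the raw scores."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) matrix of dot products
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights, scores

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

output, weights, scores = scaled_dot_product_attention(Q, K, V)
print(scores.shape, weights.sum(axis=1))
```

Entry `scores[i, j]` is exactly `Q[i] @ K[j] / np.sqrt(d_k)`, the per-pair formula above.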
Application 3: projection - the basis of PCA
The projection of vector **a** onto direction **b** is literally the "shadow" of **a** along the axis **b**. In PCA this is repeated for every data point and every principal component: each (mean-centered) data point is projected onto the new axes.
**Scalar projection (shadow length):** (a · b) / |b| = |a| · cos(θ)

If |b| = 1 (unit vector), the scalar projection is a · b - again just a dot product.

**Vector projection (shadow as a vector):** proj_b(a) = ((a · b) / |b|²) · b

**Example:** a = (4, 3), b = (1, 0) (the X axis)

a · b = 4×1 + 3×0 = 4, so the projection onto the X axis is 4 - that's the x-coordinate.

**In PCA:**
- Data X (mean-centered), shape (n_samples, n_features)
- Principal component pc, shape (n_features,), |pc| = 1
- Projections of all points: X @ pc, shape (n_samples,)

One matrix-vector product projects every data point onto one principal component.
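Both projection formulas, and the batched X @ pc step, can be sketched as follows. The function names and the data are illustrative, and the direction `pc` is a hand-picked unit vector used as a stand-in, not a principal component fitted from the data.

```python
import numpy as np

def scalar_projection(a, b):
    """Signed length of the shadow of a along b."""
    return (a @ b) / np.linalg.norm(b)

def vector_projection(a, b):
    """The shadow of a along b, as a vector."""
    return ((a @ b) / (b @ b)) * b

a = np.array([4.0, 3.0])
b = np.array([1.0, 0.0])          # the X axis
print(scalar_projection(a, b))    # 4.0
print(vector_projection(a, b))    # [4. 0.]

# Batched version, PCA-style: project every row onto one unit direction.
X = np.array([[2.0, 1.0],
              [0.0, -1.0],
              [-2.0, 0.0]])
X_centered = X - X.mean(axis=0)
pc = np.array([1.0, 1.0]) / np.sqrt(2.0)   # hypothetical unit direction
coords = X_centered @ pc                   # shape (n_samples,)
print(coords)
```

Each entry of `coords` is the dot product of one centered row with `pc`, so the matrix-vector product is just many scalar projections at once.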
**The link between all three applications**: cosine similarity, attention, and PCA are one operation in three costumes. Cosine similarity = dot product of normalized vectors. Attention = matrix of dot products with softmax. PCA projection = dot product with a principal component. Understand the dot product - understand the foundation of all three.
Connection to vector length
One special case of the dot product deserves a separate note: a vector dotted with itself.
a · a = a₁² + a₂² + ... + aₙ² = |a|²

**From this:** |a| = sqrt(a · a)

**Numerical example:** a = (3, 4), a · a = 9 + 16 = 25, |a| = sqrt(25) = 5 - the Pythagorean theorem.

**In numpy:** np.sqrt(a @ a), the same as np.linalg.norm(a)

**Important:** the angle between a and a is 0°, cos(0°) = 1, so a·a = |a|·|a|·1 = |a|². Both definitions are consistent.
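A short sketch of the same numbers in numpy, including the normalization step that cosine similarity relies on (variable names are illustrative):

```python
import numpy as np

a = np.array([3.0, 4.0])

length = np.sqrt(a @ a)            # length via the dot product
print(length, np.linalg.norm(a))   # 5.0 5.0

# Normalizing means dividing by that length; the result has length 1.
unit = a / length
print(unit, np.sqrt(unit @ unit))  # direction is kept, length is approximately 1
```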
Where dot products run today
Dot product in modern systems
One operation - different names in different contexts
| Use case | Dot product form | Where it appears |
|---|---|---|
| Cosine similarity / vector search | dot product of normalized embeddings | FAISS, Qdrant, pgvector, Pinecone - search across billions of documents |
| Attention (Transformers) | Scores = Q · Kᵀ / sqrt(d_k) | GPT-4, BERT, T5, LLaMA, Stable Diffusion - every attention layer |
| PCA / projections | X @ principal_component | Dimensionality reduction, feature extraction, whitening |
| Recommendations | user_emb · item_emb | Two-tower models: YouTube, Netflix, Spotify - matrix of dot products |
| Linear classifier | score(class) = w_class · x | Final layer of ImageNet classifiers, softmax over dot products |
| 3D graphics: lighting | brightness = max(0, light · normal) | Lambertian diffuse model with unit vectors - a staple of real-time rendering from OpenGL to Unreal |
Practice: a recommender system
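A minimal sketch of a dot-product recommender, assuming hand-made 3-dimensional embeddings. The track names, the taste dimensions, and every number below are invented for illustration; a real system would learn these vectors from data.

```python
import numpy as np

# Hypothetical taste dimensions: [rock, electronic, acoustic]
track_names = ["guitar anthem", "club banger", "campfire ballad", "synth rock"]
track_emb = np.array([[0.9, 0.1, 0.2],
                      [0.0, 1.0, 0.0],
                      [0.2, 0.0, 0.9],
                      [0.7, 0.7, 0.0]])
user_emb = np.array([0.8, 0.6, 0.0])   # likes rock and electronic

def recommend(user, tracks, names, k=2):
    """Return the k tracks with the highest dot-product score for this user."""
    scores = tracks @ user                    # one dot product per track
    top = np.argsort(scores)[::-1][:k]        # indices of the k largest scores
    return [(names[i], float(scores[i])) for i in top]

top_tracks = recommend(user_emb, track_emb, track_names)
for name, score in top_tracks:
    print(f"{name}: {score:.2f}")
```

With several users stacked into a matrix, `user_matrix @ track_emb.T` would score every user against every track at once, which is the "matrix of dot products" used by two-tower models.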
Interview questions
Why does the attention score formula divide by sqrt(d_k)?
- For d_k-dimensional vectors whose components are independent with mean 0 and variance 1, the variance of the dot product grows as d_k
- Dividing by sqrt(d_k) brings the variance back to 1
- Without the scaling, at d_k = 64 the scores have a standard deviation of about 8, so softmax tends to produce nearly one-hot weights
- Nearly one-hot weights push softmax into a region with extremely small gradients, which slows or stalls learning
- Vaswani et al. 2017 (the original Transformer paper) explain this in Section 3.2.1
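The variance claim can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is the idealized setting of the argument rather than what trained models necessarily produce; `score_variances` is a name made up for this example.

```python
import numpy as np

def score_variances(d_k, n_pairs=5000, seed=0):
    """Empirical variance of q·k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    raw = np.einsum("ij,ij->i", q, k)   # one dot product per (q, k) row pair
    scaled = raw / np.sqrt(d_k)
    return raw.var(), scaled.var()

for d_k in (4, 64, 256):
    raw_var, scaled_var = score_variances(d_k)
    print(f"d_k={d_k:4d}  raw variance ~ {raw_var:7.1f}  scaled variance ~ {scaled_var:.2f}")
```

The raw variance should come out close to d_k in each case, while the scaled variance stays close to 1 regardless of dimension.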
Cosine similarity vs. Euclidean distance - what is the difference for normalized vectors?
- |a - b|² = |a|² - 2(a·b) + |b|²
- For normalized vectors: |a - b|² = 2 - 2(a·b)
- So |a - b| = sqrt(2 - 2·cosine_sim) - a monotonically decreasing function of cosine_sim
- For normalized vectors both metrics give the same ranking
- This is why vector databases can serve cosine search with a plain dot product: Qdrant, for example, normalizes vectors on insertion when the Cosine metric is selected
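A small numerical check of the identity and of the ranking claim, using random unit vectors (the sizes and the seed are arbitrary choices for this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.standard_normal((6, 5))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # make every row unit length
query, docs = vectors[0], vectors[1:]

dots = docs @ query                             # cosine similarities
sq_dists = ((docs - query) ** 2).sum(axis=1)    # squared Euclidean distances

# |a - b|² = 2 - 2(a·b) for unit vectors
identity_holds = np.allclose(sq_dists, 2 - 2 * dots)

# Highest similarity first vs. smallest distance first: same order.
same_ranking = np.array_equal(np.argsort(-dots), np.argsort(sq_dists))

print(identity_holds, same_ranking)
```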
Projection of vector x onto unit vector u - how is this related to the dot product?
- proj_u(x) = (x · u) · u, scalar length of projection = x · u
- x · u = |x| cos(θ) - exactly the (signed) length of the shadow of x onto u
- In PCA each (mean-centered) data row is projected onto the principal components - that is X @ Vᵀ, where the rows of V are the components
- The components are mutually orthogonal (every pairwise dot product is 0), so each projection captures a separate direction of the data
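The orthogonality and projection claims can be sketched with numpy's SVD on synthetic data. This is one common way to obtain principal directions, shown here as an assumption of the example rather than the only route; `np.linalg.svd` returns the directions as the rows of `Vt`, so the projection is written `X_centered @ Vt.T`.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: 100 samples, 3 correlated features.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.standard_normal((100, 3)) @ mixing
X_centered = X - X.mean(axis=0)

# Rows of Vt are unit-length principal directions.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

# All pairwise dot products between components: 1 on the diagonal, 0 elsewhere.
gram = Vt @ Vt.T
print(np.round(gram, 6))

# Project every row onto every component in one matrix product.
projected = X_centered @ Vt.T    # shape (100, 3)
```

Column `j` of `projected` is the dot product of each centered row with component `j`.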
What to take from this lesson
- **a · b = Σ aᵢbᵢ = |a||b|cos(θ)** - two formulas, one number
- **Sign** encodes direction: > 0 = aligned, 0 = perpendicular, < 0 = opposite
- **Cosine similarity** = dot product of normalized vectors - the foundation of all vector search
- **Attention** = matrix Q·Kᵀ / sqrt(d_k) - every attention layer in GPT rests on this dot product
- **Projection** of x onto unit u = x · u - exactly what PCA does to each data point
- **a · a = |a|²** - length via dot product; normalization = dividing by sqrt(a·a)
What's next
The dot product is a building block. Here is what is built on top:
- Matrices and operations - matrix multiplication = a collection of dot products of rows and columns
- Eigenvectors and SVD - PCA uses SVD; principal components are orthogonal (dot product = 0)
- Linear transformations - every linear layer of a neural network is a matrix of dot products of input with weights