Linear Algebra

Dot Product: The Operation Holding Neural Networks Together

Every time Spotify recommends a song or Qdrant searches a billion embeddings, the core computation is the same operation: the dot product. Cosine similarity between embedding vectors, itself a dot product of normalized vectors, is computed at enormous scale across the world's ML infrastructure.

Recommendation: cosine similarity between embeddings = dot product of normalized vectors
Transformer attention: Q·Kᵀ is a matrix of all pairwise dot products between tokens
Physics: work done by force F over displacement d = F·d
Signal processing: correlation of two signals = their dot product over time
Computer vision: similarity between feature maps in a neural network layer

**ChatGPT reads a question and decides which word matters for which - through one operation.** FAISS finds the nearest document among a billion in milliseconds - through the same one. Spotify compares a user's taste with a track - again, the same. That operation is the dot product (also called the inner product): the sum of element-wise products of two vectors. It turns the question "how similar are two vectors?" into a single number. And that number governs most of modern ML.

**What this lesson actually teaches**: not "how to multiply components", but why one simple operation underlies cosine similarity, the attention mechanism in Transformers, and projection in PCA. By the end, all three will turn out to be the same thing.

The formula: as simple as it gets

The dot product is the sum of pairwise products of components. No tricks - just multiply the i-th by the i-th and add everything up:

FORMULA (n-dimensional case):

    a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ = Σ aᵢbᵢ

EXAMPLE:

    a = (3, 1, 2)
    b = (1, 4, 2)
    a · b = 3×1 + 1×4 + 2×2 = 3 + 4 + 4 = 11

In numpy it is one line: a @ b, np.dot(a, b), or (a * b).sum()
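The worked example above can be checked directly. This is a minimal sketch, assuming NumPy is installed; the variable names are illustrative only.

```python
import numpy as np

a = np.array([3, 1, 2])
b = np.array([1, 4, 2])

# Definition: multiply matching components, then add everything up.
manual = sum(x * y for x, y in zip(a, b))

# Three equivalent NumPy spellings for 1-D arrays.
via_matmul = a @ b
via_dot = np.dot(a, b)
via_sum = (a * b).sum()

print(manual, via_matmul, via_dot, via_sum)  # 11 11 11 11
```

For 1-D arrays all three spellings agree; `a @ b` is usually the most readable.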

Geometric meaning: angle between vectors

The same operation has a second face - through the angle between vectors. Both formulas give the same number, but lead to very different intuitions:

    a · b = |a| · |b| · cos(θ)

where θ is the angle between the vectors.

WHAT TO READ FROM THE FORMULA:

    cos(0°)   =  1  → same direction      → maximum
    cos(90°)  =  0  → perpendicular       → zero
    cos(180°) = -1  → opposite directions → minimum

FROM THIS, THE ANGLE FORMULA:

    cos(θ) = (a · b) / (|a| · |b|)

For unit vectors (|a| = |b| = 1):

    cos(θ) = a · b    <- one dot product, no division

OpenAI's embeddings, for example, come normalized to length 1, so their cosine similarity can be computed with just a dot product.

Value of a·b	Angle θ	What it means
a·b > 0	0° ≤ θ < 90°	Vectors point broadly "in the same direction"
a·b = 0	θ = 90°	Vectors are perpendicular (orthogonal) - neither has a component along the other
a·b < 0	90° < θ ≤ 180°	Vectors point broadly "in opposite directions"
a·b = |a|·|b|	θ = 0°	Perfectly aligned - maximum similarity
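The angle formula and the sign table above can be sketched in a few lines. The helper name `angle_degrees` is an assumption made for this illustration, not part of any library, and the function expects nonzero vectors.

```python
import numpy as np

def angle_degrees(a, b):
    """Angle between two nonzero vectors, from cos(θ) = (a · b) / (|a| · |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against rounding pushing the value slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(angle_degrees([1, 0], [2, 0]))    # same direction: about 0 degrees
print(angle_degrees([1, 0], [0, 3]))    # perpendicular: about 90 degrees
print(angle_degrees([1, 0], [-1, 0]))   # opposite: about 180 degrees
print(angle_degrees([1, 1], [1, 0]))    # positive dot product: about 45 degrees
```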

Application 1: cosine similarity - the heart of search

**Cosine similarity** is the dot product of normalized vectors. Vector search tools such as Qdrant, FAISS, and pgvector build on this: once embeddings are normalized to unit length, nearest-neighbor search by cosine similarity reduces to finding the maximum dot product.

FORMULA:

    cosine_sim(a, b) = (a · b) / (|a| · |b|)

Range: -1 (opposite directions) to 1 (same direction)

For normalized vectors (|a| = |b| = 1):

    cosine_sim(a, b) = a · b    <- just a dot product!

EXAMPLE with embeddings (illustrative values):

    sim("cat", "kitten")  ~= 0.95    <- almost the same
    sim("cat", "dog")     ~= 0.70    <- similar (both animals)
    sim("cat", "algebra") ~= 0.10    <- unrelated

**Why not Euclidean distance?** For embeddings, direction matters, not length. Two vectors that point the same way but differ in magnitude (say, a short and a long document on the same topic) have an angle of zero between them, yet a nonzero Euclidean distance. Cosine similarity ignores magnitude, which makes it robust to effects such as text length.
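A small sketch of the difference, assuming toy 3-dimensional vectors invented for this illustration (real embeddings have hundreds or thousands of dimensions). The helpers `cosine_sim` and `unit` are hypothetical names, not library functions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of lengths."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def unit(v):
    """Scale a nonzero vector to length 1."""
    return v / np.linalg.norm(v)

doc_short = np.array([1.0, 2.0, 0.5])
doc_long = 10 * doc_short                 # same direction, ten times the length
doc_other = np.array([2.0, -1.0, 0.0])    # perpendicular to doc_short

# Same direction: cosine similarity is (up to rounding) 1, distance is large.
print(cosine_sim(doc_short, doc_long))
print(np.linalg.norm(doc_short - doc_long))

# After normalization, cosine similarity is a plain dot product.
print(unit(doc_short) @ unit(doc_other), cosine_sim(doc_short, doc_other))
```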

Application 2: attention in Transformers

The self-attention mechanism in GPT, BERT, and every modern Transformer is a matrix of dot products. Each token asks every other token: "how relevant is it to me?" - and gets the answer through a dot product.

ATTENTION SCORE between tokens i and j:

    score(i, j) = qᵢ · kⱼ / sqrt(d_k)

where:

    qᵢ  - query vector of token i ("what I'm looking for")
    kⱼ  - key vector of token j ("what I offer")
    d_k - dimension of the query and key vectors (e.g. 64); dividing by sqrt(d_k) keeps the scores from growing with dimensionality

ALL PAIRS AT ONCE (matrix form):

    Scores = Q · Kᵀ / sqrt(d_k)    shape: (seq_len, seq_len)

Each element of Scores is one dot product. For a context of 8192 tokens (the window of the original GPT-4), that's 8192² ≈ 67 million dot products per attention head per pass.

**All of attention is dot products.** Scores = Q·Kᵀ is literally a matrix of dot products. Softmax normalizes them into weights, output = weights · V. Understanding the dot product means understanding the core of Transformers.
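The pipeline described above (scores, softmax, weighted sum of V) can be sketched for a single head in NumPy. This is a simplified illustration with random Q, K, V; it assumes no masking, batching, or learned projections, and the function names are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention without masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) pairwise dot products
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights, scores

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

output, weights, scores = attention(Q, K, V)
print(scores.shape, output.shape)  # (4, 4) (4, 8)

# One entry of the score matrix is exactly one scaled dot product.
print(np.isclose(scores[1, 2], Q[1] @ K[2] / np.sqrt(d_k)))  # True
```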

Application 3: projection - the basis of PCA

The projection of vector **a** onto direction **b** is literally the "shadow" of **a** along the axis **b**. In PCA this is repeated for every data point and every principal component: each data point is projected onto the new axes.

SCALAR PROJECTION (shadow length):

    comp_b(a) = (a · b) / |b| = |a| · cos(θ)

If |b| = 1 (unit vector):

    comp_b(a) = a · b    <- again just a dot product

VECTOR PROJECTION (shadow as a vector):

    proj_b(a) = ((a · b) / |b|²) · b

EXAMPLE:

    a = (4, 3), b = (1, 0)    <- the X axis
    a · b = 4×1 + 3×0 = 4
    Projection onto X axis = 4    <- that's the x-coordinate

IN PCA:

    Data X (mean-centered), shape (n_samples, n_features)
    Principal component pc, shape (n_features,), |pc| = 1
    Projections of all points: X @ pc, shape (n_samples,)

One matrix-vector product projects every data point onto one principal component.
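The projection formulas above translate almost one-to-one into NumPy. In this sketch the helper names, the small matrix `X`, and the unit direction `pc` are made up for illustration; `pc` stands in for a principal component and is not computed from data here.

```python
import numpy as np

def scalar_projection(a, b):
    """Signed length of the shadow of a along b: (a · b) / |b|."""
    return (a @ b) / np.linalg.norm(b)

def vector_projection(a, b):
    """Shadow of a along b as a vector: ((a · b) / |b|²) · b."""
    return ((a @ b) / (b @ b)) * b

a = np.array([4.0, 3.0])
b = np.array([1.0, 0.0])                 # the X axis
print(scalar_projection(a, b))           # 4.0
print(vector_projection(a, b))           # [4. 0.]

# What is left after removing the shadow is perpendicular to b.
residual = a - vector_projection(a, b)
print(residual @ b)                      # 0.0

# Projecting many points at once onto a unit direction.
X = np.array([[4.0, 3.0], [1.0, 2.0], [-2.0, 5.0]])
pc = np.array([0.6, 0.8])                # hypothetical unit direction
projections = X @ pc                     # shape (3,), one dot product per row
print(projections)
```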

**The link between all three applications**: cosine similarity, attention, and PCA are one operation in three costumes. Cosine similarity = dot product of normalized vectors. Attention = matrix of dot products with softmax. PCA projection = dot product with a principal component. Understand the dot product - understand the foundation of all three.

Connection to vector length

One special case of the dot product deserves a separate note: a vector dotted with itself.

    a · a = a₁² + a₂² + ... + aₙ² = |a|²

FROM THIS:

    |a| = sqrt(a · a)

NUMERICAL EXAMPLE:

    a = (3, 4)
    a · a = 9 + 16 = 25
    |a| = sqrt(25) = 5    <- Pythagorean theorem

In numpy: np.sqrt(a @ a), which gives the same result as np.linalg.norm(a)

IMPORTANT: the angle between a and a is 0°, cos(0°) = 1, so a·a = |a|·|a|·1 = |a|². Both definitions are consistent.
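Length via the dot product also gives normalization for free. A minimal sketch, where `normalize` is a helper name chosen for this example:

```python
import numpy as np

def normalize(v):
    """Scale v to unit length using |v| = sqrt(v · v)."""
    length = np.sqrt(v @ v)
    if length == 0:
        raise ValueError("cannot normalize the zero vector")
    return v / length

a = np.array([3.0, 4.0])
print(np.sqrt(a @ a))          # 5.0
print(np.linalg.norm(a))       # 5.0

unit_a = normalize(a)
print(unit_a)                  # [0.6 0.8]
print(unit_a @ unit_a)         # 1.0 up to rounding
```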

Where dot products run today

Dot product in modern systems

One operation - different names in different contexts

Component	Role	Details
Cosine similarity / vector search	dot product of normalized embeddings	FAISS, Qdrant, pgvector, Pinecone - search across billions of documents
Attention (Transformers)	Scores = Q · Kᵀ / sqrt(d_k)	GPT-4, BERT, T5, LLaMA, Stable Diffusion - every attention layer
PCA / projections	X @ principal_component	Dimensionality reduction, feature extraction, whitening
Recommendations	user_emb · item_emb	Two-tower models: YouTube, Netflix, Spotify - matrix of dot products
Linear classifier	score(class) = w_class · x	Final layer of ImageNet classifiers, softmax over dot products
3D graphics: lighting	brightness = max(0, light · normal)	Lambertian model - the diffuse lighting term used in renderers from OpenGL to Unreal

Practice: a recommender system
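As a starting point for practice, here is a toy recommender built on nothing but dot products. Everything in it is hypothetical: the track names, the hand-made 3-dimensional embeddings (loosely "calm", "loud", "acoustic"), and the `recommend` helper. Real systems learn embeddings from data.

```python
import numpy as np

# Hypothetical catalogue with hand-made embeddings: (calm, loud, acoustic).
track_names = ["ambient_a", "metal_b", "ambient_c", "jazz_d"]
track_emb = np.array([
    [0.90, 0.10, 0.30],
    [0.10, 0.95, 0.05],
    [0.80, 0.20, 0.50],
    [0.40, 0.30, 0.90],
])
user_emb = np.array([1.0, 0.0, 0.4])   # a listener who prefers calm music

def recommend(user, items, names, top_k=2):
    """Rank items by cosine similarity to the user embedding."""
    items_unit = items / np.linalg.norm(items, axis=1, keepdims=True)
    user_unit = user / np.linalg.norm(user)
    scores = items_unit @ user_unit          # one dot product per item
    order = np.argsort(-scores)[:top_k]      # highest score first
    return [(names[i], float(scores[i])) for i in order]

for name, score in recommend(user_emb, track_emb, track_names):
    print(f"{name}: {score:.3f}")
```

With these made-up numbers the two ambient tracks rank first and the metal track last, which matches the intuition that the user vector points mostly along the "calm" axis.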

Interview questions

Why does the attention score formula divide by sqrt(d_k)?

- For d_k-dimensional random vectors whose components are independent with mean 0 and variance 1, the variance of the dot product grows as d_k
- Dividing by sqrt(d_k) brings the variance back to 1
- Without this scaling, at d_k=64 the scores have a standard deviation of about 8, so softmax produces nearly one-hot weights
- In that saturated regime the softmax gradient is close to zero - the model stops learning
- Vaswani et al. 2017 (the original Transformer paper) explain this in Section 3.2.1
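The variance argument can be checked empirically. This sketch assumes query and key components drawn independently from a standard normal distribution, which is an idealization of real activations; the function name is chosen for this example.

```python
import numpy as np

def dot_product_variances(d_k, n_pairs=10_000, seed=0):
    """Sample variance of q · k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n_pairs, d_k))
    k = rng.normal(size=(n_pairs, d_k))
    raw = (q * k).sum(axis=1)            # one dot product per row pair
    scaled = raw / np.sqrt(d_k)
    return raw.var(), scaled.var()

for d_k in (4, 64, 256):
    raw_var, scaled_var = dot_product_variances(d_k)
    print(f"d_k={d_k:4d}  raw variance ~ {raw_var:7.1f}  scaled variance ~ {scaled_var:.2f}")
```

The raw variance should track d_k, while the scaled variance should stay near 1 regardless of dimension.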

Cosine similarity vs. Euclidean distance - what is the difference for normalized vectors?

- |a - b|² = |a|² - 2(a·b) + |b|²
- For normalized vectors: |a - b|² = 2 - 2(a·b)
- So |a - b| = sqrt(2 - 2·cosine_sim) - a monotonically decreasing function of cosine_sim
- For normalized vectors both metrics give the same ranking
- This is why engines can serve cosine search with a plain dot product: Qdrant, for instance, normalizes vectors on upload when the Cosine metric is selected
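Both the identity and the ranking claim can be verified on random unit vectors. A minimal sketch, with the sizes (100 vectors, 16 dimensions) chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 random vectors and one query, all scaled to unit length.
vectors = rng.normal(size=(100, 16))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = rng.normal(size=16)
query /= np.linalg.norm(query)

dots = vectors @ query                           # cosine similarities
sq_dists = ((vectors - query) ** 2).sum(axis=1)  # squared Euclidean distances

# |a - b|² = 2 - 2(a · b) for unit vectors.
identity_holds = np.allclose(sq_dists, 2 - 2 * dots)

# Sorting by descending similarity equals sorting by ascending distance.
same_ranking = np.array_equal(np.argsort(-dots), np.argsort(sq_dists))

print(identity_holds, same_ranking)
```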

Projection of vector x onto unit vector u - how is this related to the dot product?

- proj_u(x) = (x · u) · u, scalar length of projection = x · u
- x · u = |x| cos(θ) - exactly the (signed) length of the shadow of x onto u
- In PCA each mean-centered data row is projected onto the principal components - that is X @ V, where the columns of V are the components
- The components are mutually orthogonal (their pairwise dot products equal 0) because they are eigenvectors of the symmetric covariance matrix
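The PCA points can be sketched with NumPy's SVD. The synthetic data and mixing matrix below are arbitrary assumptions made to produce correlated features; the layout with components as columns of `V` is one common convention.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic correlated data: 200 samples, 3 features.
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
X_centered = X - X.mean(axis=0)          # PCA works on mean-centered data

# Rows of Vt are the principal directions; transpose to get them as columns.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
V = Vt.T

# Projecting every row onto every component: a matrix of dot products.
projected = X_centered @ V               # shape (200, 3)

# Pairwise dot products of the components form the identity matrix.
gram = V.T @ V
print(np.round(gram, 6))

# A single entry is one dot product of a data row with one component.
print(np.isclose(projected[5, 0], X_centered[5] @ V[:, 0]))  # True
```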

What to take from this lesson

**a · b = Σ aᵢbᵢ = |a||b|cos(θ)** - two formulas, one number
**Sign** encodes direction: > 0 = aligned, 0 = perpendicular, < 0 = opposite
**Cosine similarity** = dot product of normalized vectors - the foundation of all vector search
**Attention** = matrix Q·Kᵀ / sqrt(d_k) - all of GPT rests on this dot product
**Projection** of x onto unit u = x · u - exactly what PCA does to each data point
**a · a = |a|²** - length via dot product; normalization = dividing by sqrt(a·a)

What's next

The dot product is a building block. Here is what is built on top:

Matrices and operations — Matrix multiplication = a collection of dot products of rows and columns
Eigenvectors and SVD — PCA uses SVD; principal components are orthogonal (dot product = 0)
Linear transformations — Every linear layer of a neural network is a matrix of dot products of input with weights
