Linear Algebra
Vectors: The Language ML Speaks
2012: AlexNet wins ImageNet by a 10-point margin. The entire network is matrix multiplication. 60 million parameters = one giant weight vector. Understanding what happens inside means understanding vectors.
- king - man + woman ≈ queen. Arithmetic over embeddings reproduces meaning. No magic - just vector addition.
- RAG, Spotify search, TikTok recommendations - one pattern: compute a dot product, find the nearest neighbor.
- Every GPT-4 token is a vector of ~12,288 dimensions. Attention is a matrix of dot products. The whole LLM is linear algebra.
Vector as an Arrow: the Geometric View
2012: AlexNet wins ImageNet by a 10-point margin. The entire network is matrix multiplication. 60 million parameters = one giant weight vector. Understanding what happens inside starts with geometry.
A vector is an arrow in space. Two properties define it: **length** (how much) and **direction** (which way). The tail of the arrow is usually pinned at the origin.
**Addition** follows the parallelogram rule. Place the tail of the second arrow at the tip of the first. The sum is a new arrow from the tail of the first to the tip of the second. In coordinates: add component-wise.
For word vectors from a trained word2vec model, approximately: v(king) - v(man) + v(woman) ≈ v(queen) v(king) - v(man) "royalty without the male attribute" + v(woman) "add the female attribute" ≈ v(queen) "queen" Arithmetic over lists of numbers reproduces meaning. There is no neural network in this computation - just addition and subtraction.
**Scalar multiplication** by $k$ preserves the direction and only scales the length. $k = 2$ - twice as long. $k = -1$ - opposite direction. $k = 0$ - collapses to the zero vector.
Vector $\mathbf{v} = (3, 4)$. What is its length $\|\mathbf{v}\|$?
Coordinates, Basis, and the Dot Product
The coordinates of a vector depend on the choice of **basis** - a set of reference arrows. The standard basis of $\mathbb{R}^n$: unit vectors $e_1 = (1, 0, \ldots, 0)$, $e_2 = (0, 1, 0, \ldots)$, and so on. Every vector is a linear combination of basis vectors.
| Object | Dimension | Who uses it |
|---|---|---|
| RGB color | 3 | Every screen |
| MNIST digit 28x28 | 784 | Neural-net textbooks |
| 224x224 RGB photo | 150,528 | ImageNet, ResNet, ViT |
| Word embedding (word2vec) | 300 | Search, translation 2013-2018 |
| Text embedding (OpenAI ada-002) | 1,536 | All RAG systems 2023+ |
| GPT-4 hidden state per token | ~12,288 | ChatGPT, Claude, Gemini |
The **dot product** of two vectors: $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos \theta$, where $\theta$ is the angle between them. The result is a number, not a vector.
**Cosine similarity** = $\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$ measures how similar the directions are, independent of length. On normalized vectors ($\|v\| = 1$) cosine similarity = just the dot product. Every modern vector database and the attention mechanism in Transformers are built on this.
**Why normalize**: when a vector represents meaning, what matters is where it points, not how long it is. After normalization the length is 1 and distances depend only on the angle. OpenAI returns embeddings already normalized - this speeds up search in Qdrant / pgvector.
The dot product of two unit vectors (length 1) equals 0. What does this mean geometrically?
Linear Combinations, Span, and Linear Dependence
A **linear combination** of vectors $v_1, \ldots, v_k$ with coefficients $c_1, \ldots, c_k$ is $c_1 v_1 + c_2 v_2 + \ldots + c_k v_k$. Every linear layer in a neural network computes exactly this: a weight matrix multiplied by an input vector.
The **span** of a set of vectors is the set of all their linear combinations. Geometrically: everything reachable by adjusting the coefficients. The span of two non-collinear vectors in 3D is a plane. The span of three linearly independent vectors is all of $\mathbb{R}^3$.
**Linear dependence**: a set of vectors is linearly dependent if one of them can be expressed as a combination of the others. In ML this matters: a weight matrix with linearly dependent rows is redundant - it can be compressed without loss of information (PCA, SVD).
**Where this runs right now**: every GPT-4 token is a vector of ~12,288 dimensions. The attention mechanism is the dot product of queries and keys, normalized by softmax. YouTube recommendations, Spotify search, RAG in ChatGPT - all «find the nearest neighbor» via dot products.
| Task | Vector operation | Where it runs |
|---|---|---|
| Semantic search (RAG) | Text → embedding → nearest neighbor | ChatGPT plugins, Notion AI, Slack |
| Recommendations | User × Item embeddings, dot product | YouTube, TikTok, Netflix, Spotify |
| Image search | Image → embedding → nearest neighbor | Google Lens, product similarity search |
| Attention in Transformers | QK^T / sqrt(d_k) - dot product matrix | GPT-4, Claude, LLaMA, Gemini |
A linear layer computes $\mathbf{y} = W\mathbf{x} + \mathbf{b}$. What is $W\mathbf{x}$ geometrically?
Summary
- A vector = an ordered list of numbers. Geometrically: an arrow with length and direction
- Length: $\|\mathbf{v}\| = \sqrt{\sum v_i^2}$ - the Pythagorean theorem in any number of dimensions
- Dot product: $\mathbf{a} \cdot \mathbf{b} = \sum a_i b_i = \|\mathbf{a}\|\|\mathbf{b}\|\cos\theta$ - the foundation of cosine similarity and attention
- Linear combination is the basic operation: every linear layer = weight matrix × feature vector
- In ML the dimension is 768, 1536, 12,288. The rules of arithmetic do not change.
What's next
Vectors are letters. Words and sentences come next:
- Matrices and linear maps — A single linear layer is a matrix multiplication
- Eigenvectors and SVD — PCA, spectral clustering, PageRank, recommender SVD systems
- Dot product and projections — Cosine similarity, attention in Transformers, metric spaces
Вопросы для размышления
- Spotify stores listening history. How can it be turned into a vector so that similar users are nearby in vector space?
- Why does OpenAI return embeddings pre-normalized? What changes in the cosine similarity formula?
- A weight matrix with linearly dependent rows is redundant. What does this mean for model size and computational cost?