Linear Algebra

Vectors: The Language ML Speaks

2012: AlexNet wins ImageNet by a 10-point margin. The entire network is matrix multiplication. 60 million parameters = one giant weight vector. Understanding what happens inside means understanding vectors.

king - man + woman ≈ queen. Arithmetic over embeddings reproduces meaning. No magic - just vector addition.
RAG, Spotify search, TikTok recommendations - one pattern: compute a dot product, find the nearest neighbor.
Every GPT-4 token is a vector of ~12,288 dimensions. Attention is a matrix of dot products. The whole LLM is linear algebra.

Vector as an Arrow: the Geometric View

2012: AlexNet wins ImageNet by a 10-point margin. The entire network is matrix multiplication. 60 million parameters = one giant weight vector. Understanding what happens inside starts with geometry.

A vector is an arrow in space. Two properties define it: **length** (how much) and **direction** (which way). The tail of the arrow is usually pinned at the origin.

**Addition** follows the parallelogram rule. Place the tail of the second arrow at the tip of the first. The sum is a new arrow from the tail of the first to the tip of the second. In coordinates: add component-wise.

For word vectors from a trained word2vec model, approximately: v(king) - v(man) + v(woman) ≈ v(queen) v(king) - v(man) "royalty without the male attribute" + v(woman) "add the female attribute" ≈ v(queen) "queen" Arithmetic over lists of numbers reproduces meaning. There is no neural network in this computation - just addition and subtraction.

**Scalar multiplication** by $k$ preserves the direction and only scales the length. $k = 2$ - twice as long. $k = -1$ - opposite direction. $k = 0$ - collapses to the zero vector.

Vector $\mathbf{v} = (3, 4)$. What is its length $\|\mathbf{v}\|$?

Coordinates, Basis, and the Dot Product

The coordinates of a vector depend on the choice of **basis** - a set of reference arrows. The standard basis of $\mathbb{R}^n$: unit vectors $e_1 = (1, 0, \ldots, 0)$, $e_2 = (0, 1, 0, \ldots)$, and so on. Every vector is a linear combination of basis vectors.

Object	Dimension	Who uses it
RGB color	3	Every screen
MNIST digit 28x28	784	Neural-net textbooks
224x224 RGB photo	150,528	ImageNet, ResNet, ViT
Word embedding (word2vec)	300	Search, translation 2013-2018
Text embedding (OpenAI ada-002)	1,536	All RAG systems 2023+
GPT-4 hidden state per token	~12,288	ChatGPT, Claude, Gemini

The **dot product** of two vectors: $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos \theta$, where $\theta$ is the angle between them. The result is a number, not a vector.

**Cosine similarity** = $\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$ measures how similar the directions are, independent of length. On normalized vectors ($\|v\| = 1$) cosine similarity = just the dot product. Every modern vector database and the attention mechanism in Transformers are built on this.

**Why normalize**: when a vector represents meaning, what matters is where it points, not how long it is. After normalization the length is 1 and distances depend only on the angle. OpenAI returns embeddings already normalized - this speeds up search in Qdrant / pgvector.

The dot product of two unit vectors (length 1) equals 0. What does this mean geometrically?

Linear Combinations, Span, and Linear Dependence

A **linear combination** of vectors $v_1, \ldots, v_k$ with coefficients $c_1, \ldots, c_k$ is $c_1 v_1 + c_2 v_2 + \ldots + c_k v_k$. Every linear layer in a neural network computes exactly this: a weight matrix multiplied by an input vector.

The **span** of a set of vectors is the set of all their linear combinations. Geometrically: everything reachable by adjusting the coefficients. The span of two non-collinear vectors in 3D is a plane. The span of three linearly independent vectors is all of $\mathbb{R}^3$.

**Linear dependence**: a set of vectors is linearly dependent if one of them can be expressed as a combination of the others. In ML this matters: a weight matrix with linearly dependent rows is redundant - it can be compressed without loss of information (PCA, SVD).

**Where this runs right now**: every GPT-4 token is a vector of ~12,288 dimensions. The attention mechanism is the dot product of queries and keys, normalized by softmax. YouTube recommendations, Spotify search, RAG in ChatGPT - all «find the nearest neighbor» via dot products.

Task	Vector operation	Where it runs
Semantic search (RAG)	Text → embedding → nearest neighbor	ChatGPT plugins, Notion AI, Slack
Recommendations	User × Item embeddings, dot product	YouTube, TikTok, Netflix, Spotify
Image search	Image → embedding → nearest neighbor	Google Lens, product similarity search
Attention in Transformers	QK^T / sqrt(d_k) - dot product matrix	GPT-4, Claude, LLaMA, Gemini

A linear layer computes $\mathbf{y} = W\mathbf{x} + \mathbf{b}$. What is $W\mathbf{x}$ geometrically?

Summary

A vector = an ordered list of numbers. Geometrically: an arrow with length and direction
Length: $\|\mathbf{v}\| = \sqrt{\sum v_i^2}$ - the Pythagorean theorem in any number of dimensions
Dot product: $\mathbf{a} \cdot \mathbf{b} = \sum a_i b_i = \|\mathbf{a}\|\|\mathbf{b}\|\cos\theta$ - the foundation of cosine similarity and attention
Linear combination is the basic operation: every linear layer = weight matrix × feature vector
In ML the dimension is 768, 1536, 12,288. The rules of arithmetic do not change.

What's next

Vectors are letters. Words and sentences come next:

Matrices and linear maps — A single linear layer is a matrix multiplication
Eigenvectors and SVD — PCA, spectral clustering, PageRank, recommender SVD systems
Dot product and projections — Cosine similarity, attention in Transformers, metric spaces

Вопросы для размышления

Spotify stores listening history. How can it be turned into a vector so that similar users are nearby in vector space?
Why does OpenAI return embeddings pre-normalized? What changes in the cosine similarity formula?
A weight matrix with linearly dependent rows is redundant. What does this mean for model size and computational cost?

Связанные уроки

calc-17-multivariable