Linear Algebra

Vectors: The Language ML Speaks

2012: AlexNet wins ImageNet by a 10-point margin. The entire network is matrix multiplication. 60 million parameters = one giant weight vector. Understanding what happens inside means understanding vectors.

  • king - man + woman ≈ queen. Arithmetic over embeddings reproduces meaning. No magic - just vector addition.
  • RAG, Spotify search, TikTok recommendations - one pattern: compute a dot product, find the nearest neighbor.
  • Every GPT-4 token is a vector of ~12,288 dimensions. Attention is a matrix of dot products. The whole LLM is linear algebra.

Vector as an Arrow: the Geometric View

2012: AlexNet wins ImageNet by a 10-point margin. The entire network is matrix multiplication. 60 million parameters = one giant weight vector. Understanding what happens inside starts with geometry.

A vector is an arrow in space. Two properties define it: **length** (how much) and **direction** (which way). The tail of the arrow is usually pinned at the origin.

**Addition** follows the parallelogram rule. Place the tail of the second arrow at the tip of the first. The sum is a new arrow from the tail of the first to the tip of the second. In coordinates: add component-wise.

For word vectors from a trained word2vec model, approximately: v(king) - v(man) + v(woman) ≈ v(queen) v(king) - v(man) "royalty without the male attribute" + v(woman) "add the female attribute" ≈ v(queen) "queen" Arithmetic over lists of numbers reproduces meaning. There is no neural network in this computation - just addition and subtraction.

**Scalar multiplication** by $k$ preserves the direction and only scales the length. $k = 2$ - twice as long. $k = -1$ - opposite direction. $k = 0$ - collapses to the zero vector.

Vector $\mathbf{v} = (3, 4)$. What is its length $\|\mathbf{v}\|$?

Coordinates, Basis, and the Dot Product

The coordinates of a vector depend on the choice of **basis** - a set of reference arrows. The standard basis of $\mathbb{R}^n$: unit vectors $e_1 = (1, 0, \ldots, 0)$, $e_2 = (0, 1, 0, \ldots)$, and so on. Every vector is a linear combination of basis vectors.

ObjectDimensionWho uses it
RGB color3Every screen
MNIST digit 28x28784Neural-net textbooks
224x224 RGB photo150,528ImageNet, ResNet, ViT
Word embedding (word2vec)300Search, translation 2013-2018
Text embedding (OpenAI ada-002)1,536All RAG systems 2023+
GPT-4 hidden state per token~12,288ChatGPT, Claude, Gemini

The **dot product** of two vectors: $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos \theta$, where $\theta$ is the angle between them. The result is a number, not a vector.

**Cosine similarity** = $\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$ measures how similar the directions are, independent of length. On normalized vectors ($\|v\| = 1$) cosine similarity = just the dot product. Every modern vector database and the attention mechanism in Transformers are built on this.

**Why normalize**: when a vector represents meaning, what matters is where it points, not how long it is. After normalization the length is 1 and distances depend only on the angle. OpenAI returns embeddings already normalized - this speeds up search in Qdrant / pgvector.

The dot product of two unit vectors (length 1) equals 0. What does this mean geometrically?

Linear Combinations, Span, and Linear Dependence

A **linear combination** of vectors $v_1, \ldots, v_k$ with coefficients $c_1, \ldots, c_k$ is $c_1 v_1 + c_2 v_2 + \ldots + c_k v_k$. Every linear layer in a neural network computes exactly this: a weight matrix multiplied by an input vector.

The **span** of a set of vectors is the set of all their linear combinations. Geometrically: everything reachable by adjusting the coefficients. The span of two non-collinear vectors in 3D is a plane. The span of three linearly independent vectors is all of $\mathbb{R}^3$.

**Linear dependence**: a set of vectors is linearly dependent if one of them can be expressed as a combination of the others. In ML this matters: a weight matrix with linearly dependent rows is redundant - it can be compressed without loss of information (PCA, SVD).

**Where this runs right now**: every GPT-4 token is a vector of ~12,288 dimensions. The attention mechanism is the dot product of queries and keys, normalized by softmax. YouTube recommendations, Spotify search, RAG in ChatGPT - all «find the nearest neighbor» via dot products.

TaskVector operationWhere it runs
Semantic search (RAG)Text → embedding → nearest neighborChatGPT plugins, Notion AI, Slack
RecommendationsUser × Item embeddings, dot productYouTube, TikTok, Netflix, Spotify
Image searchImage → embedding → nearest neighborGoogle Lens, product similarity search
Attention in TransformersQK^T / sqrt(d_k) - dot product matrixGPT-4, Claude, LLaMA, Gemini

A linear layer computes $\mathbf{y} = W\mathbf{x} + \mathbf{b}$. What is $W\mathbf{x}$ geometrically?

Summary

  • A vector = an ordered list of numbers. Geometrically: an arrow with length and direction
  • Length: $\|\mathbf{v}\| = \sqrt{\sum v_i^2}$ - the Pythagorean theorem in any number of dimensions
  • Dot product: $\mathbf{a} \cdot \mathbf{b} = \sum a_i b_i = \|\mathbf{a}\|\|\mathbf{b}\|\cos\theta$ - the foundation of cosine similarity and attention
  • Linear combination is the basic operation: every linear layer = weight matrix × feature vector
  • In ML the dimension is 768, 1536, 12,288. The rules of arithmetic do not change.

What's next

Vectors are letters. Words and sentences come next:

  • Matrices and linear maps — A single linear layer is a matrix multiplication
  • Eigenvectors and SVD — PCA, spectral clustering, PageRank, recommender SVD systems
  • Dot product and projections — Cosine similarity, attention in Transformers, metric spaces

Вопросы для размышления

  • Spotify stores listening history. How can it be turned into a vector so that similar users are nearby in vector space?
  • Why does OpenAI return embeddings pre-normalized? What changes in the cosine similarity formula?
  • A weight matrix with linearly dependent rows is redundant. What does this mean for model size and computational cost?

Связанные уроки

  • calc-17-multivariable
Vectors: The Language ML Speaks

0

1

Sign In