Linear Algebra

Vector Operations: The Arithmetic ML Runs On

A vector is not just a list of numbers - it is a point in space, a direction, a word embedding, a pixel. Every ML model, from linear regression to GPT, is written in the language of vector operations. Understanding them is understanding the grammar of modern AI.

  • ML: every training example is a feature vector
  • NLP: the word 'king' is a vector of 768 numbers in BERT-base's embedding space
  • Computer graphics: position, normal, and color of a vertex are all vectors
  • Physics: velocity, acceleration, force are all vector quantities
  • Finance: a portfolio is a weight vector over assets

**A 4-layer neural network is 4 rounds of matrix multiplication and addition, each followed by a simple element-wise nonlinearity.** Blending images or styles in Stable Diffusion's latent space can be as simple as a linear combination of latent vectors. Batch normalization in a ResNet subtracts the mean vector and divides by the standard deviation vector. Vector operations look like dull math, but together with those element-wise nonlinearities they make up almost the entire forward pass of a typical neural network.

**What this lesson actually teaches**: not "how to add components", but why the same four operations produce school geometry, word2vec, and the linear layers of neural networks. By the end, "add vectors" and "run a forward pass" will turn out to be the same thing.

Addition: component-wise, always

Two vectors are added strictly component-by-component: the i-th with the i-th. No mixing across positions.

FORMULA:
  a + b = (a₁ + b₁, a₂ + b₂, ..., aₙ + bₙ)

EXAMPLE:
  a     = (3, 1, 4)
  b     = (1, 5, 2)
  ─────────────────
  a + b = (4, 6, 6)   <- each component separately

In numpy: a + b   <- element-wise, one vectorized operation across the whole array

**Geometric meaning**: place the tail of the second vector at the tip of the first. The sum is the arrow from the start of the first to the end of the second. Bias addition in a linear layer works the same way: the same vector b is added to every vector in the batch, shifting each one by the same displacement.
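A minimal NumPy sketch of both ideas, using the example vectors from this section. The `batch` and `bias` values are made up for illustration; they are not taken from any real model.

```python
import numpy as np

a = np.array([3, 1, 4])
b = np.array([1, 5, 2])

# Element-wise sum: the i-th component is added to the i-th component.
total = a + b

# Bias addition: a batch of 2 vectors (shape (2, 3)) plus one bias vector
# (shape (3,)). NumPy broadcasts the bias across every row of the batch.
batch = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # illustrative values
bias = np.array([0.5, -1.0, 2.0])     # illustrative values
shifted = batch + bias

print(total)    # [4 6 6]
print(shifted)
```

Here broadcasting applies only to the bias step, where the shapes differ; adding `a` and `b` is plain element-wise addition.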

The properties of addition are not an academic list - they are tools for simplifying computations:

  • **Commutativity**: a + b = b + a (order does not matter)
  • **Associativity**: (a + b) + c = a + (b + c) (brackets can be moved freely)
  • **Zero vector**: a + 0 = a (adding zero changes nothing)
  • **Additive inverse**: a + (-a) = 0 (subtracting a vector from itself)

Subtraction: difference as displacement

Subtraction is adding the opposite vector. Geometrically **a - b** is the arrow **from the tip of b to the tip of a** (when both start at the origin). This is exactly the operation used to find the "direction" from one object to another in embedding space.

FORMULA:
  a - b = (a₁ - b₁, a₂ - b₂, ..., aₙ - bₙ)

EXAMPLE:
  a     = (5, 3, 1)
  b     = (2, 1, 4)
  ─────────────────
  a - b = (3, 2, -3)

INTUITION (word2vec):
  v(king) - v(man)     = "royalty vector without the male attribute"
  v(Paris) - v(France) = "capital-city direction vector"

This is literally vector subtraction - nothing more.
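The analogy arithmetic can be sketched with toy vectors. The 2-D "embeddings" below are hand-made so that one axis means royalty and the other gender; they are an assumption for illustration, not real word2vec vectors, which have hundreds of dimensions and only approximately satisfy such analogies.

```python
import numpy as np

a = np.array([5, 3, 1])
b = np.array([2, 1, 4])
diff = a - b                 # displacement from the tip of b to the tip of a
back_at_a = b + diff         # walking that displacement from b lands on a

# Hypothetical toy embeddings: axis 0 = royalty, axis 1 = gender.
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

# v(king) - v(man) + v(woman): remove the "male" offset, add the "female" one.
target = toy["king"] - toy["man"] + toy["woman"]

# Nearest toy word to the result, by Euclidean distance.
nearest = min(toy, key=lambda word: np.linalg.norm(toy[word] - target))

print(diff)      # [ 3  2 -3]
print(nearest)   # queen
```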

Scalar multiplication: the volume knob

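Multiplying by a scalar k scales every component. For k > 0 the direction stays and only the "intensity" (length) changes; a negative k also flips the vector to point the opposite way, and k = 0 collapses it to the zero vector.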

FORMULA:
  k · a = (k·a₁, k·a₂, ..., k·aₙ)

WHAT HAPPENS FOR DIFFERENT k:
  k = 2:    same direction, twice as long
  k = 0.5:  same direction, half as long
  k = -1:   180° reversal, same length
  k = 0:    vector collapses to zero

EXAMPLE:
  v    = (2, -1, 3)
  3v   = (6, -3, 9)
  -v   = (-2, 1, -3)
  0.5v = (1, -0.5, 1.5)

**ML application - style steering**: researchers found that adding a scaled "steering" vector to a model's activations shifts the "tone" of an LLM's response. In the "Activation Addition" paper (2023), the steering vector is built from the difference between the activations of two contrasting prompts, multiplied by a coefficient, and added to the model's hidden state. Schematically, hidden += 1.5 * v("happy") nudges responses toward a more positive tone. No fine-tuning, no RLHF - just scalar multiplication and vector addition.
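A small sketch of scaling and of the steering update. The `steer` helper, the `hidden` state, and the `steering_vector` are all hypothetical stand-ins chosen for illustration; a real setup would read and modify activations inside a model.

```python
import numpy as np

v = np.array([2.0, -1.0, 3.0])

# Scaling changes the length by |k|; the sign of k decides the direction.
for k in (2.0, 0.5, -1.0, 0.0):
    scaled = k * v
    print(k, scaled, np.linalg.norm(scaled))

def steer(hidden, steering_vector, scale):
    """Shift a hidden state along steering_vector by the factor `scale`."""
    return hidden + scale * steering_vector

hidden = np.array([0.2, -0.4, 0.1])            # made-up hidden state
steering_vector = np.array([1.0, 0.0, -1.0])   # made-up steering direction
steered = steer(hidden, steering_vector, scale=1.5)
print(steered)
```

The scale acts as the volume knob: a larger value pushes the hidden state further along the steering direction, and a negative value pushes it the opposite way.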

Linear combination: the foundation of neural networks

A **linear combination** is a weighted sum of vectors. Literally every neuron rests on this operation.

GENERAL FORM:
  α₁v₁ + α₂v₂ + ... + αₙvₙ

EXAMPLE - decomposition through the standard basis of R²:
  e₁ = (1, 0), e₂ = (0, 1)
  (3, 5) = 3·e₁ + 5·e₂   <- linear combination of basis vectors

ONE NEURON = LINEAR COMBINATION + NONLINEARITY:
  output = σ(w₁x₁ + w₂x₂ + ... + wₙxₙ + b) = σ(w · x + b)
  where w · x is exactly a linear combination of the components of x with weights w.

Matrix × vector = a set of linear combinations = the forward pass of one linear layer.
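As a sketch, one neuron and one linear layer in NumPy. The weights, bias, and input are arbitrary illustrative numbers, and the sigmoid is just one possible choice for σ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])      # input vector (illustrative)
w = np.array([0.5, -1.0, 0.25])    # weights of one neuron (illustrative)
b = 0.75                           # bias of that neuron (illustrative)

# One neuron: weighted sum of the components of x, plus bias, then sigma.
pre_activation = w @ x + b
output = sigmoid(pre_activation)

# One linear layer: each row of W is the weight vector of one neuron.
W = np.array([[0.5, -1.0, 0.25],
              [1.0,  0.0, -1.0]])
layer = W @ x

# Two equivalent readings of W @ x:
by_rows = np.array([row @ x for row in W])               # one dot product per row
by_cols = sum(x[j] * W[:, j] for j in range(len(x)))     # combination of columns

print(pre_activation, output)
print(layer, by_rows, by_cols)
```

The row view matches "one neuron per row"; the column view shows the same product as a linear combination of the columns of W with the components of x as coefficients.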

**Critical fact**: a linear combination of linear combinations is itself a linear combination. Therefore a network with no nonlinear activations is equivalent to a single linear layer, regardless of depth. Nonlinearities (ReLU, GELU, sigmoid) are what turn a stack of matrix multiplications into a powerful function approximator.
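This collapse can be checked numerically. The sketch below uses random weights (seeded, with arbitrary shapes) for the two-layer case and a tiny hand-picked example to show that inserting a ReLU changes the result; none of the values come from a real network.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...equal one linear layer with merged weights and bias.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b
print(np.allclose(two_layers, one_layer))   # True

def relu(z):
    return np.maximum(z, 0.0)

# Hand-picked case where a ReLU in the middle changes the output.
x_small = np.array([1.0, -1.0])
A = np.eye(2)                    # first layer: identity, no bias
B = np.array([[1.0, 1.0]])       # second layer: sum of the components
linear_out = B @ (A @ x_small)         # 1 + (-1) = 0
relu_out = B @ relu(A @ x_small)       # 1 + 0 = 1
print(linear_out, relu_out)
```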

Normalization: unit length as a standard

Normalizing a vector means dividing it by its length. The result is a vector with the same direction but unit length. Normalized vectors are the foundation of cosine similarity and vector databases.

FORMULA:
  v̂ = v / ||v||   where ||v|| = sqrt(v₁² + v₂² + ... + vₙ²)

CHECK:
  ||v̂|| = || v / ||v|| || = ||v|| / ||v|| = 1   OK

EXAMPLE:
  v     = (3, 4)
  ||v|| = sqrt(9 + 16) = 5
  v̂     = (3/5, 4/5) = (0.6, 0.8)
  ||v̂|| = sqrt(0.36 + 0.64) = 1   OK

SPECIAL CASE:
  The zero vector cannot be normalized (division by zero).
  In numpy: np.linalg.norm([0, 0, 0]) = 0.0

Because the zero vector has no direction, production code should always guard the division: normalize only if np.linalg.norm(v) > 1e-8, and decide explicitly what to do otherwise.
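One way to write that guard, as a sketch. The function name `normalize` and the choice to raise an error are assumptions; returning the input unchanged or a zero vector are equally valid policies, depending on the application.

```python
import numpy as np

def normalize(v, eps=1e-8):
    """Return v scaled to unit L2 length; refuse (near-)zero vectors."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm <= eps:
        # A zero vector has no direction, so there is nothing to return.
        raise ValueError("cannot normalize a (near-)zero vector")
    return v / norm

unit = normalize([3, 4])
print(unit, np.linalg.norm(unit))
```

Raising makes the failure visible at the point where the bad vector appears, instead of letting NaN values propagate into a similarity search later.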

Batch Normalization: subtraction and scaling in neural networks

**Batch Normalization** (BatchNorm) is one of the key techniques in deep learning. Inside it is vector subtraction and division by a vector - literally the operations from this lesson.

**Why BatchNorm works**: without normalization, activations in deep networks "drift": their mean and variance shift from layer to layer and over the course of training, which makes gradients prone to exploding or vanishing. BatchNorm re-centers and re-scales the activations after each layer, so training tolerates higher learning rates and deep networks such as ResNet-50 converge much faster. All operations are vector subtraction and vector division.
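A minimal sketch of the training-time forward pass, assuming a 2-D input of shape (batch, features). The small `eps` added for numerical stability and the function name `batch_norm` are assumptions; real implementations also track running statistics for inference, which this sketch leaves out.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = X.mean(axis=0)                  # mean vector, one entry per feature
    std = np.sqrt(X.var(axis=0) + eps)     # std vector, one entry per feature
    X_hat = (X - mean) / std               # vector subtraction and division
    return gamma * X_hat + beta            # learnable scale and shift

# Illustrative batch: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
out = batch_norm(X, gamma=np.ones(2), beta=np.zeros(2))

print(out.mean(axis=0))   # approximately 0 for every feature
print(out.std(axis=0))    # approximately 1 for every feature
```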

Where matrix operations actually run

Vector operations in ML systems

From basic operation to real application

| Operation: application | Expression | Details |
|---|---|---|
| Addition: bias in a linear layer | output = W @ x + b | Every linear layer in a neural network adds the bias vector b to each element of the batch |
| Subtraction: word2vec analogies | v(king) - v(man) + v(woman) ~ v(queen) | Difference vectors in embedding space act as semantic axes |
| Scalar multiply: style steering in LLMs | hidden += scale * steering_vector | Activation Addition: tone control without fine-tuning |
| Linear combination: forward pass | W @ x = one dot product per row of W | Every Dense/Linear layer in PyTorch, TensorFlow, JAX |
| Normalization: vector databases | v / np.linalg.norm(v) before upsert into Qdrant/FAISS | OpenAI ada-002 returns pre-normalized vectors; for other models, check and normalize manually if needed |
| Batch ops: BatchNorm / LayerNorm | x_hat = (X - mean) / std, then gamma * x_hat + beta | ResNet (BatchNorm), ViT and Transformers (LayerNorm): applied around every layer or block |

Practice: L2 normalization

Interview questions

Why can two consecutive linear layers be replaced by a single linear layer?

  • First layer: y = W₁x + b₁
  • Second layer: z = W₂y + b₂ = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂)
  • This is also a linear layer: z = Wx + b, where W = W₂W₁, b = W₂b₁ + b₂
  • Any number of linear layers without activations = one linear layer
  • This is exactly why nonlinear activations (ReLU, GELU) are necessary

What is the point of normalizing embeddings before cosine similarity?

  • cosine_sim(a, b) = (a·b) / (||a||·||b||); normalization makes the denominator 1
  • Normalized form: just a · b, one dot product instead of a dot product plus two norms
  • For most embedding models, direction carries the semantic meaning and vector length does not
  • FAISS's inner-product metric equals cosine similarity only when the inputs are normalized
  • OpenAI ada-002 pre-normalizes its embeddings: an engineering convention, not a mathematical requirement
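A short sketch of why this matters in practice: once the rows of an embedding matrix are L2-normalized, a single matrix product yields every pairwise cosine similarity. The helper names and the tiny matrix `E` are made up for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity from the definition: dot product over both norms."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l2_normalize_rows(X, eps=1e-8):
    """Scale every row to unit length; eps keeps zero rows from dividing by 0."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

# Illustrative "embeddings": three 2-D vectors.
E = np.array([[ 3.0,  4.0],
              [ 4.0,  3.0],
              [-3.0, -4.0]])

E_hat = l2_normalize_rows(E)
sims = E_hat @ E_hat.T     # all pairwise cosine similarities at once

print(sims)
print(cosine_sim(E[0], E[1]))   # matches sims[0, 1]
```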

BatchNorm vs LayerNorm - what is the fundamental difference?

  • BatchNorm: mean/std over the batch axis (axis=0), statistics from the whole batch
  • LayerNorm: mean/std over the feature axis (axis=-1), statistics from one sample
  • BatchNorm degenerates at batch_size=1 in training mode (the batch variance is zero); LayerNorm does not depend on batch size
  • Transformers use LayerNorm: per-sample statistics cope with variable-length sequences and with autoregressive generation, which runs one token at a time
  • CNNs trained with large batches use BatchNorm: ResNet, EfficientNet, and the BatchNorm variants of VGG
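The difference comes down to which axis the statistics are taken over. A sketch with a hypothetical `standardize` helper, leaving out the learnable gamma and beta and assuming a (batch, features) input:

```python
import numpy as np

def standardize(X, axis, eps=1e-5):
    """Subtract the mean and divide by the std along the given axis."""
    mean = X.mean(axis=axis, keepdims=True)
    var = X.var(axis=axis, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

# Illustrative batch: 2 samples, 3 features.
X = np.array([[ 1.0,  2.0,  3.0],
              [10.0, 20.0, 60.0]])

bn = standardize(X, axis=0)    # BatchNorm-style: per feature, across the batch
ln = standardize(X, axis=-1)   # LayerNorm-style: per sample, across features

# With a single sample, batch statistics carry no information:
single = X[:1]
bn_single = standardize(single, axis=0)    # every entry becomes 0
ln_single = standardize(single, axis=-1)   # same as the first row of ln

print(bn_single)
print(ln_single)
```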

What to take from this lesson

  • **Addition/subtraction** is component-wise; a + b in a neural network = adding bias b to every element of the batch
  • **Scalar multiplication** scales length and keeps the direction for k > 0 (a negative k reverses it); the foundation of style steering in LLMs
  • **Linear combination** = one neuron = one row of weight matrix W dotted with input x
  • **Normalization** v/||v|| - the standard before vector search; makes cosine_sim equal to the dot product
  • **BatchNorm** - subtracting the mean vector and dividing by the std vector; not magic, just vector operations
  • Two linear layers without activation = one linear layer; nonlinearity is not optional

What's next

Vector operations are the alphabet. Matrices and what to do with them come next.

  • Dot product: the foundation of cosine similarity, attention, and projections
  • Matrices and operations: matrix times vector = a batch of linear combinations = one neural network layer
  • Eigenvectors and SVD: PCA projects data onto principal components through the same dot products
