Algebra

Linear Maps

A linear layer without a bias term is a linear map; with a bias it is affine. Convolution in a CNN is a linear map on patch space. In attention, the query, key, and value projections (Q, K, V) are three linear maps applied before the softmax. Understanding the kernel and image of these maps is key to analyzing what a network sees and what it discards.

**Linear layers in DNNs:** torch.nn.Linear (with bias=False) is a matrix W; ker(W) = input directions invisible to the layer; im(W) = reachable activations
**PCA as a linear map:** projecting centered data onto the top k principal components is a rank-k linear operator; kernel = orthogonal complement of the top-k subspace
**Attention (pre-softmax):** the projections Q = XW_Q, K = XW_K, V = XW_V are each linear in the input X; the scores QKᵀ are quadratic in X, and softmax adds a further nonlinearity, so attention as a whole is more than a linear operation
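
The PCA claim above can be checked numerically. The following is a minimal NumPy sketch, assuming synthetic random data and k = 2 (both chosen only for illustration); it builds the projector onto the top-k principal directions and inspects its rank and kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples, 5 features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)                      # PCA operates on centered data

# Rows of Vt are the principal directions, ordered by explained variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
Vk = Vt[:k].T                                # 5 x k, orthonormal columns
P = Vk @ Vk.T                                # projector onto the top-k subspace

# Rank = number of singular values of P above a small tolerance.
rank_P = int(np.sum(np.linalg.svd(P, compute_uv=False) > 1e-10))

# The remaining directions span the orthogonal complement, i.e. ker(P).
discarded = Vt[k:].T

print("rank of projector:", rank_P)
print("P is idempotent:", np.allclose(P @ P, P))
print("discarded directions map to 0:", np.allclose(P @ discarded, 0.0))
```

The projector has rank k, and every direction outside the top-k subspace is sent to zero, which is the "kernel = orthogonal complement" statement in matrix form.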

Prerequisites

Vector Spaces

Definition and Matrix Representation

A map f: V → W between vector spaces over a field F is linear if f(αx + βy) = αf(x) + βf(y) for all x,y ∈ V and α,β ∈ F. This combines two conditions: additivity f(x+y) = f(x)+f(y) and homogeneity f(αx) = αf(x). If V and W are finite-dimensional with dim V = n and dim W = m, then once bases for V and W are fixed, every linear map corresponds to exactly one matrix A ∈ F^{m×n} (A ∈ ℝ^{m×n} when F = ℝ).

**Matrix entry A[i][j]** is the i-th coordinate of the image of the j-th basis vector. To build the matrix of f, compute f(e₁), f(e₂), …, f(eₙ) and stack the results as columns.
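
The column-stacking recipe can be sketched in a few lines of NumPy. The map `f` and the helper `matrix_of` below are hypothetical names introduced only for this illustration; the standard bases of ℝ³ and ℝ² are assumed.

```python
import numpy as np

def f(v):
    # Hypothetical linear map R^3 -> R^2, chosen only for illustration.
    x, y, z = v
    return np.array([x + 2.0 * y, 3.0 * z - y])

def matrix_of(linear_map, n):
    # Column j of the matrix is the image of the j-th standard basis vector.
    basis = np.eye(n)
    return np.column_stack([linear_map(basis[:, j]) for j in range(n)])

A = matrix_of(f, 3)
v = np.array([1.0, -2.0, 0.5])

print(A)
print("A @ v matches f(v):", np.allclose(A @ v, f(v)))
```

Once the matrix is built from the basis images, multiplying by it reproduces f on every vector, because any v is a linear combination of the basis vectors.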

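torch.nn.Linear(in_features, out_features, bias=False) is exactly a weight matrix W ∈ ℝ^{out×in} implementing x ↦ Wx (PyTorch stores W with shape (out_features, in_features) and computes xWᵀ on row vectors, which is the same map). With bias=True it adds a shift: x ↦ Wx + b, which is affine, not linear.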
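
The difference between the two cases can be tested directly against the definition of linearity. This sketch uses plain NumPy arrays in place of torch.nn.Linear (an assumption made to keep the example dependency-light); `respects_linearity` is a hypothetical helper, and the weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))           # stands in for a layer's weight matrix
b = np.array([1.0, -1.0, 0.5])        # a nonzero bias

def linear(x):
    return W @ x                      # bias=False case

def affine(x):
    return W @ x + b                  # bias=True case

def respects_linearity(g, x, y, alpha, beta):
    # Checks g(alpha*x + beta*y) == alpha*g(x) + beta*g(y) on one sample.
    lhs = g(alpha * x + beta * y)
    rhs = alpha * g(x) + beta * g(y)
    return bool(np.allclose(lhs, rhs))

x = rng.normal(size=4)
y = rng.normal(size=4)

linear_ok = respects_linearity(linear, x, y, 2.0, -3.0)
affine_ok = respects_linearity(affine, x, y, 2.0, -3.0)

print("x -> Wx passes:", linear_ok)
print("x -> Wx + b passes:", affine_ok)
print("affine map sends 0 to:", affine(np.zeros(4)))
```

A single failing sample is enough to rule out linearity; a passing sample is only evidence, not a proof. The quickest diagnostic is the last line: a linear map must send 0 to 0, and the affine map sends it to b.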

Is f(x) = Ax + b (with b ≠ 0) a linear map?

Kernel and Image

The kernel of f: V → W is ker(f) = {v ∈ V : f(v) = 0}. The image is im(f) = {f(v) : v ∈ V} ⊆ W. Both are subspaces: the kernel lives in V, the image in W. The kernel captures 'what is lost'; the image captures 'what is reachable'.

| Term | Notation | Meaning in ML |
| --- | --- | --- |
| Kernel (null space) | ker(f) = Null(A) | Features the layer cannot see |
| Image (column space) | im(f) = Col(A) | Reachable activations |
| Nullity | dim ker(f) | Degree of information loss |
| Rank | dim im(f) | Effective capacity of the layer |
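
One common way to compute bases for the kernel and image of a concrete matrix is the singular value decomposition. The sketch below assumes a small hand-picked matrix and a tolerance of 1e-10 for deciding which singular values count as zero; both are illustrative choices.

```python
import numpy as np

# Rank-1 example: the second row is twice the first.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])

# Full SVD: U is 2x2, s has 2 entries, Vt is 3x3.
U, s, Vt = np.linalg.svd(A)

tol = 1e-10                           # assumed threshold for "zero"
rank = int(np.sum(s > tol))

image_basis = U[:, :rank]             # orthonormal basis of im(A) in R^2
kernel_basis = Vt[rank:].T            # orthonormal basis of ker(A) in R^3

print("rank:", rank)
print("nullity:", kernel_basis.shape[1])
print("A sends the kernel basis to 0:", np.allclose(A @ kernel_basis, 0.0))
```

The left singular vectors with nonzero singular values span the image, and the right singular vectors beyond the rank span the kernel, so both subspaces in the table come out of a single decomposition.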

A linear map A: ℝ⁵ → ℝ³ has rank(A) = 2. What is dim ker(A)?

Rank-Nullity Theorem

The rank-nullity theorem: for any linear map f: V → W with V finite-dimensional, rank(f) + nullity(f) = dim(V). Equivalently: dim(im f) + dim(ker f) = dim V. This is one of the central results of linear algebra: the dimensions preserved plus the dimensions lost equal the total input dimension.

In neural networks: if input dim > output dim, then nullity > 0, so some input directions are mapped to zero and cannot be recovered from the output (normal for compression). If input dim < output dim, the map cannot be surjective: rank ≤ input dim, so not all output activations are reachable.
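
The theorem can be checked numerically by computing rank and nullity independently. The following sketch assumes a random 4×6 matrix built as a product of thin factors so that its rank is 3; the sizes and the 1e-9 tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A map R^6 -> R^4 built from a 4x3 and a 3x6 factor, so its rank is at most 3.
A = rng.normal(size=(4, 3)) @ rng.normal(size=(3, 6))
input_dim = A.shape[1]

# Rank: dimension of the image.
rank = int(np.linalg.matrix_rank(A))

# Nullity, computed separately: count orthonormal input directions
# (rows of Vt) that A sends to zero within the tolerance.
_, _, Vt = np.linalg.svd(A)
nullity = sum(1 for v in Vt if np.linalg.norm(A @ v) < 1e-9)

print("rank:", rank)
print("nullity:", nullity)
print("rank + nullity == input dim:", rank + nullity == input_dim)
```

Because the rows of Vt form an orthonormal basis of the input space, each one is either carried into the image or sent to zero, which is exactly the bookkeeping the theorem describes.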

Under what condition is a linear map f: V → V an isomorphism?

Composition and Matrix Multiplication

The composition of linear maps f: U → V and g: V → W is the linear map (g∘f): U → W. In matrix form: if f has matrix A and g has matrix B, then g∘f has matrix BA. This is why matrix multiplication is defined the way it is: the product BA is constructed so that (BA)x = B(Ax) for every x. It is not an arbitrary convention.
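
A small numeric check makes the ordering concrete. The matrices below are arbitrary values chosen for illustration, with f: ℝ² → ℝ³ and g: ℝ³ → ℝ².

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 0.0]])            # matrix of f: R^2 -> R^3
B = np.array([[1.0, 0.0, -1.0],
              [2.0, 1.0, 0.0]])       # matrix of g: R^3 -> R^2

def f(u):
    return A @ u

def g(v):
    return B @ v

u = np.array([1.0, -1.0])

composed = g(f(u))                    # apply f first, then g
via_product = (B @ A) @ u             # one multiplication by the product BA

print("g(f(u)) =", composed)
print("(BA)u   =", via_product)
print("shape of BA:", (B @ A).shape, "shape of AB:", (A @ B).shape)
```

The product BA is 2×2 and acts on the input space of f, while AB is 3×3 and represents the other composition f∘g, so the order of the factors encodes the order of application.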

**Why nonlinearities are necessary:** without relu/gelu/sigmoid, any sequence of linear layers collapses into a single linear map (a product of matrices). Nonlinearities are what allow neural networks to approximate a broad class of nonlinear functions (universal approximation theorem).
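
The collapse can be demonstrated with three bias-free layers. This is a sketch with random weights and assumed layer sizes; `relu` and `net` are illustrative helpers, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(8, 8))
W3 = rng.normal(size=(2, 8))
x = rng.normal(size=4)

# Three linear layers applied in sequence...
deep = W3 @ (W2 @ (W1 @ x))

# ...equal one layer whose weight is the matrix product.
W_single = W3 @ W2 @ W1
collapsed = W_single @ x

def relu(z):
    return np.maximum(z, 0.0)

def net(v):
    # Same weights, with ReLU between the layers.
    return W3 @ relu(W2 @ relu(W1 @ v))

# A linear map satisfies f(-x) = -f(x); the ReLU network does not.
symmetry_gap = net(x) + net(-x)

print("stack equals single matrix:", np.allclose(deep, collapsed))
print("net(x) + net(-x) =", symmetry_gap)
```

With the nonlinearity in place, no single matrix reproduces the network, which is why depth adds expressive power only when the layers are separated by nonlinear functions.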

Why does matrix product AB mean 'apply B first, then A'?

Key Ideas

**Linearity:** f(αx+βy) = αf(x)+βf(y); a linear map ↔ a matrix under fixed bases
**Kernel and image:** ker(f) = what is lost; im(f) = what is reachable; both are subspaces
**Rank-nullity:** rank + nullity = dim V; dimensions preserved plus dimensions lost always add up to the input dimension
**Composition = matrix multiplication:** B first, then A gives product AB; without nonlinearities a network is a single matrix

Questions for Reflection

A transformer has hundreds of linear layers separated by nonlinearities (LayerNorm, GELU). What would happen if all nonlinearities were removed?
LoRA replaces weight matrix W with W + AB where rank(AB) ≪ rank(W). How does this relate to the kernel and image of the perturbation AB?
The attention score matrix QKᵀ ∈ ℝ^{n×n} is computed for a sequence of length n. What is the maximum rank of this matrix, and what does that mean for the attention pattern?

Related Lessons

la-06-transformations
