Algebra

Linear Maps

Every linear layer in a neural network is a linear map (up to its bias term). Convolution in a CNN is a linear map on patch space. The query, key, and value projections in attention are three linear maps (Q, K, V). Understanding the kernel and image of these maps is the key to analyzing what the network sees and what it discards.

  • **Linear layers in DNNs:** torch.nn.Linear with bias=False applies a weight matrix W; ker(W) = input directions invisible to the layer; im(W) = reachable activations
  • **PCA as a linear map:** projecting onto k principal components is a rank-k linear operator; kernel = orthogonal complement of the top subspace
  • **Attention (pre-softmax):** the projections Q = XW_Q, K = XW_K, V = XW_V are linear maps of the input X; the score matrix QKᵀ is bilinear in X, and softmax adds a further nonlinearity on top
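The PCA bullet above can be checked numerically. The following is a minimal sketch on synthetic data, assuming NumPy; the data matrix `X`, the choice `k = 2`, and the variable names are illustrative assumptions, not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # synthetic data: 200 samples, 5 features
X = X - X.mean(axis=0)                 # PCA works on centered data

# Rows of Vt are the principal directions, ordered by explained variance.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
U_k = Vt[:k].T                         # 5 x k matrix with orthonormal columns
P = U_k @ U_k.T                        # 5 x 5 projector onto the top-k subspace

rank_P = np.linalg.matrix_rank(P)
print(rank_P)                          # equals k
print(np.allclose(P @ P, P))           # projecting twice changes nothing

# Any direction orthogonal to the top-k subspace lies in the kernel of P.
v = Vt[k]                              # the (k+1)-th principal direction
print(np.allclose(P @ v, 0))
```

The projector `P` is a linear map from ℝ⁵ to itself whose image is the top-k subspace and whose kernel is everything orthogonal to it.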

Prerequisites

  • Vector Spaces

Definition and Matrix Representation

A map f: V → W between vector spaces over a field F is linear if f(αx + βy) = αf(x) + βf(y) for all x, y ∈ V and α, β ∈ F. This combines two conditions: additivity f(x+y) = f(x)+f(y) and homogeneity f(αx) = αf(x). When V and W are finite-dimensional with dim V = n and dim W = m, fixing bases makes every linear map correspond to exactly one matrix A ∈ F^{m×n} (A ∈ ℝ^{m×n} for real vector spaces).
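As a quick numerical illustration of the definition, the sketch below tests the linearity condition on random vectors, assuming NumPy. The matrix `A`, the scalars, and the counterexample `g` (elementwise squaring) are arbitrary choices made for this example. A passing check on random inputs is evidence, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))            # arbitrary matrix, so f maps R^4 to R^3

def f(x):
    """Matrix-vector product: linear by construction."""
    return A @ x

def g(x):
    """Elementwise square: a simple nonlinear map for contrast."""
    return x ** 2

x, y = rng.normal(size=4), rng.normal(size=4)
alpha, beta = 2.5, -1.0

# Linearity condition: f(alpha*x + beta*y) == alpha*f(x) + beta*f(y)
f_is_linear_here = np.allclose(f(alpha * x + beta * y),
                               alpha * f(x) + beta * f(y))
g_is_additive_here = np.allclose(g(x + y), g(x) + g(y))

print(f_is_linear_here)                # True
print(g_is_additive_here)              # False: (x + y)^2 has a cross term 2xy
```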

**Matrix entry A[i][j]** is the i-th coordinate of the image of the j-th basis vector. To build the matrix of f, compute f(e₁), f(e₂), …, f(eₙ) and stack the results as columns.
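The column-stacking recipe can be sketched as follows, assuming NumPy. The map `f` below is a hypothetical example from ℝ³ to ℝ², chosen only to make the columns easy to verify by hand.

```python
import numpy as np

def f(v):
    """Hypothetical linear map R^3 -> R^2, written without any matrix."""
    x1, x2, x3 = v
    return np.array([x1 + 2 * x2, 3 * x3 - x1])

n = 3
basis = np.eye(n)                      # rows are the standard basis e_1, e_2, e_3

# Column j of A is f(e_j).
A = np.column_stack([f(e) for e in basis])
print(A)                               # rows: (1, 2, 0) and (-1, 0, 3)

# The matrix now reproduces f on any input.
v = np.array([4.0, -1.0, 2.0])
print(np.allclose(A @ v, f(v)))
```

Here f(e₁) = (1, −1), f(e₂) = (2, 0), and f(e₃) = (0, 3), which are exactly the three columns of `A`.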

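torch.nn.Linear(in_features, out_features, bias=False) holds a weight matrix W ∈ ℝ^{out×in} and implements x ↦ Wx (PyTorch stores the weight with shape (out_features, in_features) and computes xWᵀ on row-vector batches). With bias=True it adds a shift, x ↦ Wx + b, which is affine, not linear.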
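The difference between the two cases shows up at the origin: a linear map must send 0 to 0. The sketch below mimics both layer variants in plain NumPy rather than calling PyTorch; `W`, `b`, and the function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))            # stand-in for a layer weight, R^3 -> R^2
b = np.array([1.0, -2.0])              # a nonzero bias

def linear(x):
    return W @ x                       # bias=False case

def affine(x):
    return W @ x + b                   # bias=True case

zero = np.zeros(3)
print(linear(zero))                    # the zero vector
print(affine(zero))                    # equals b, so the origin is moved

x, y = rng.normal(size=3), rng.normal(size=3)
linear_is_additive = np.allclose(linear(x + y), linear(x) + linear(y))
affine_is_additive = np.allclose(affine(x + y), affine(x) + affine(y))
print(linear_is_additive, affine_is_additive)
```

The affine version fails additivity because affine(x) + affine(y) contains the bias twice, while affine(x + y) contains it once.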

Is f(x) = Ax + b (with b ≠ 0) a linear map?

Kernel and Image

The kernel of f: V → W is ker(f) = {v ∈ V : f(v) = 0}. The image is im(f) = {f(v) : v ∈ V} ⊆ W. Both are subspaces: the kernel lives in V, the image in W. The kernel captures 'what is lost'; the image captures 'what is reachable'.
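One common way to compute both subspaces numerically is the singular value decomposition. The sketch below assumes NumPy and uses a small hand-picked matrix; the tolerance rule is a conventional choice, not something prescribed by the lesson.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],         # 2 * row 1, so the rank drops to 2
              [1.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A)

# Treat singular values below a small relative tolerance as zero.
tol = max(A.shape) * np.finfo(float).eps * s.max()
r = int((s > tol).sum())               # numerical rank

image_basis = U[:, :r]                 # orthonormal basis of im(A), lives in W
kernel_basis = Vt[r:].T                # orthonormal basis of ker(A), lives in V

print(r, kernel_basis.shape[1])        # rank and nullity
print(np.allclose(A @ kernel_basis, 0))  # kernel vectors are sent to 0
```

For this matrix the kernel is the line spanned by (1, 1, −1): both independent rows give zero when applied to it.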

| Term | Notation | Meaning in ML |
|---|---|---|
| Kernel (null space) | ker(f) = Null(A) | Features the layer cannot see |
| Image (column space) | im(f) = Col(A) | Reachable activations |
| Nullity | dim ker(f) | Degree of information loss |
| Rank | dim im(f) | Effective capacity of the layer |

A linear map A: ℝ⁵ → ℝ³ has rank(A) = 2. What is dim ker(A)?

Rank-Nullity Theorem

The rank-nullity theorem: for any linear map f: V → W with V finite-dimensional, rank(f) + nullity(f) = dim(V). Equivalently: dim(im f) + dim(ker f) = dim V. This is one of the central results of linear algebra: the 'information preserved' plus the 'information lost' equals the total input dimension.
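A small numerical check of the theorem, assuming NumPy. The 3×5 matrix is hand-built so that its rank is known in advance. Note that reading the kernel off the SVD already uses the decomposition's structure, so the meaningful checks are that the candidate kernel vectors really map to zero and that the remaining directions do not.

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 3.0, 1.0, 1.0]])   # row 3 = row 1 + row 2

rank = int(np.linalg.matrix_rank(A))

_, _, Vt = np.linalg.svd(A)            # Vt is 5 x 5: an orthonormal basis of R^5
kernel_basis = Vt[rank:].T             # candidate basis of ker(A)
row_space_basis = Vt[:rank].T          # the complementary directions

# Kernel candidates are annihilated; the other directions are not.
kernel_ok = np.allclose(A @ kernel_basis, 0)
others_survive = bool(np.all(np.linalg.norm(A @ row_space_basis, axis=0) > 1e-8))

nullity = kernel_basis.shape[1]
print(rank, nullity, rank + nullity == A.shape[1])
print(kernel_ok, others_survive)
```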

In neural networks: if input dim > output dim, then rank ≤ output dim forces nullity ≥ input dim − output dim > 0, so information is irreversibly lost (normal for compression). If input dim < output dim, then rank ≤ input dim < output dim, so the map cannot be surjective: not all output activations are reachable.
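Both cases can be made concrete with random weights. This is a sketch assuming NumPy; the layer shapes (4 → 2 and 2 → 4) and all variable names are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Compressing layer: R^4 -> R^2, so the kernel is at least 2-dimensional.
W_down = rng.normal(size=(2, 4))
_, _, Vt = np.linalg.svd(W_down)
n = Vt[-1]                             # a unit vector in ker(W_down)

x = rng.normal(size=4)
x_other = x + 3.0 * n                  # a genuinely different input
same_output = np.allclose(W_down @ x, W_down @ x_other)
same_input = np.allclose(x, x_other)
print(same_output, same_input)         # outputs agree, inputs do not

# Expanding layer: R^2 -> R^4, so the image is at most 2-dimensional.
W_up = rng.normal(size=(4, 2))
target = rng.normal(size=4)            # a generic point of R^4
coeffs, _, rank_up, _ = np.linalg.lstsq(W_up, target, rcond=None)
reachable = np.allclose(W_up @ coeffs, target)
print(rank_up, reachable)              # even the best fit misses the target
```

The first half shows two distinct inputs that the compressing layer cannot tell apart; the second shows a target activation that no input to the expanding layer can produce.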

Under what condition is a linear map f: V → V an isomorphism?

Composition and Matrix Multiplication

The composition of linear maps f: U → V and g: V → W is the linear map (g∘f): U → W. In matrix form: if f has matrix A and g has matrix B, then g∘f has matrix BA. This is why matrix multiplication is defined the way it is - not an arbitrary convention.
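The ordering can be verified directly. The sketch below assumes NumPy; the dimensions 3 → 4 → 2 are arbitrary and chosen so that the reversed product is not even defined.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))            # matrix of f: R^3 -> R^4
B = rng.normal(size=(2, 4))            # matrix of g: R^4 -> R^2

def f(x):
    return A @ x

def g(y):
    return B @ y

x = rng.normal(size=3)
BA = B @ A                             # matrix of g∘f, shape (2, 3)
composition_matches = np.allclose(g(f(x)), BA @ x)
print(BA.shape, composition_matches)

# The reversed product A @ B has mismatched inner dimensions (3 vs 2).
try:
    A @ B
    reversed_defined = True
except ValueError:
    reversed_defined = False
print(reversed_defined)
```

The shapes enforce the rule: the output space of the first map must match the input space of the second, which is exactly the inner-dimension condition of matrix multiplication.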

**Why nonlinearities are necessary:** without relu/gelu/sigmoid, any sequence of linear layers collapses into a single linear map (a product of matrices). Nonlinearities are what allow neural networks to approximate arbitrary continuous functions on compact domains (universal approximation theorem).
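The collapse is easy to see numerically. In this sketch (NumPy assumed) three bias-free layers with made-up shapes are first stacked directly, then separated by ReLU; the names `W1`, `W2`, `W3`, and `mlp` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(8, 8))
W3 = rng.normal(size=(3, 8))

x, y = rng.normal(size=4), rng.normal(size=4)

# Three stacked linear layers equal one 3 x 4 matrix.
deep_output = W3 @ (W2 @ (W1 @ x))
W_single = W3 @ W2 @ W1
collapses = np.allclose(deep_output, W_single @ x)
print(W_single.shape, collapses)

def relu(z):
    return np.maximum(z, 0.0)

def mlp(v):
    """The same layers with ReLU in between."""
    return W3 @ relu(W2 @ relu(W1 @ v))

# With ReLU the network is no longer additive, hence not a single matrix.
mlp_is_additive = np.allclose(mlp(x + y), mlp(x) + mlp(y))
print(mlp_is_additive)
```

Depth without nonlinearity buys no extra expressive power: the stacked version can only represent maps that a single 3×4 matrix already represents.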

Why does matrix product AB mean 'apply B first, then A'?

Key Ideas

  • **Linearity:** f(αx+βy) = αf(x)+βf(y); a linear map ↔ a matrix under fixed bases
  • **Kernel and image:** ker(f) = what is lost; im(f) = what is reachable; both are subspaces
  • **Rank-nullity:** rank + nullity = dim V; the dimensions preserved and the dimensions lost always add up to the input dimension
  • **Composition = matrix multiplication:** B first, then A gives product AB; without nonlinearities a network is a single matrix

Related Topics

Linear maps connect all of linear algebra:

  • Eigenvalues and Eigenvectors — Eigenvectors are the invariant directions of a linear map
  • SVD — SVD gives the complete geometric picture of any linear map via three simple operations
  • Linear Algebra in ML — Attention as three linear maps; LoRA as a low-rank perturbation

Questions for Reflection

  • A transformer has hundreds of linear layers separated by nonlinearities (LayerNorm, GELU). What would happen if all nonlinearities were removed?
  • LoRA replaces weight matrix W with W + AB where rank(AB) ≪ rank(W). How does this relate to the kernel and image of the perturbation AB?
  • For a sequence of length n, the attention score matrix is QKᵀ ∈ ℝ^{n×n}. What can you say about the rank of this matrix, and what does it mean for the attention pattern?

Related Lessons

  • la-06-transformations