Linear Algebra
Tensors: From Matrices to Image Batches
A batch of 32 RGB images of size 224x224 in PyTorch is a tensor of shape (32, 3, 224, 224). Every convolutional layer operation is a tensor contraction. Without a working understanding of tensors it is hard to write correct PyTorch code: shape mismatches and silent broadcasting bugs are among the most common mistakes in ML code.
- PyTorch: image batch (N, C, H, W) - the 4D tensor at the heart of computer vision
- NLP: sequence batch (batch, seq_len, d_model) - the 3D tensor in transformers
- Physics simulation: stress tensor - a 3x3 matrix at each point of a 3D solid
- Video: (T, C, H, W) - 4D tensor for video classification in Video Transformers
- Multi-head attention: weights (batch, n_heads, seq, seq) - a 4D tensor
Tensor definition and rank
**PyTorch. TensorFlow. TensorRT.** All three are built around the same object, and two of them carry it in their name; PyTorch's core data type is simply `torch.Tensor`. A batch of 32 images of 224x224 RGB is not "three matrices" - it is one four-dimensional tensor of shape (32, 3, 224, 224). Attention in a transformer is a contraction of three-dimensional tensors Q, K, V. Convolutional weights are stored as a tensor (out_channels, in_channels, kH, kW). All of modern deep learning is tensor algebra on GPUs.
**Two meanings of 'tensor'**: physicists mean a multilinear map with coordinate transformation laws (metric tensor, stress tensor). ML engineers mean a multidimensional array. This lesson is about the second; the first is mentioned to avoid confusion when reading papers.
Tensor rank: from scalar to 4D array
The **rank** of a tensor is the number of indices (axes). Not to be confused with matrix rank. Five ranks that appear in ML every day:
| Rank | Shape | What it is in ML | Example |
|---|---|---|---|
| 0 (scalar) | () | Loss, learning rate, accuracy | loss = 0.3451 |
| 1 (vector) | (d,) | Embedding, bias, 1D signal | emb = (1536,) for ada-002 |
| 2 (matrix) | (n, m) | Weight matrix, attention mask | W = (768, 3072) in BERT FFN |
| 3 (cube) | (B, T, D) | Token batch in transformer | (32, 128, 512) - batch x seq x embed |
| 4 (hypercube) | (B, C, H, W) | Image batch in CNN | (32, 3, 224, 224) - ImageNet batch |
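The rows of the table can be reproduced as arrays. The sketch below uses NumPy, where `ndim` reports the number of axes; the variable names are illustrative and the shapes mirror the table's examples. The last lines contrast this with matrix rank, which measures linear independence rather than the number of axes.

```python
import numpy as np

# One array per table row; names are illustrative, shapes follow the table.
loss = np.float32(0.3451)                                # rank 0: scalar
emb = np.zeros(1536, dtype=np.float32)                   # rank 1: embedding vector
W = np.zeros((768, 3072), dtype=np.float32)              # rank 2: weight matrix
tokens = np.zeros((32, 128, 512), dtype=np.float32)      # rank 3: batch x seq x embed
images = np.zeros((32, 3, 224, 224), dtype=np.float32)   # rank 4: image batch

examples = {"loss": loss, "emb": emb, "W": W, "tokens": tokens, "images": images}
for name, t in examples.items():
    print(f"{name:7s} ndim={np.ndim(t)}  shape={np.shape(t)}")

# Matrix rank is a different notion: this 2-axis array (tensor rank 2)
# is an outer product of two vectors, so its matrix rank is 1.
outer = np.outer(np.arange(1.0, 4.0), np.arange(1.0, 5.0))   # shape (3, 4)
tensor_rank = outer.ndim
matrix_rank = int(np.linalg.matrix_rank(outer))
print("tensor rank:", tensor_rank, " matrix rank:", matrix_rank)
```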
What does "tensor rank" mean in ML frameworks like PyTorch?
In ML frameworks (PyTorch, TF) "tensor rank" is the number of axes: a scalar has rank 0, a vector 1, a matrix 2, an RGB image 3, an image batch 4. Not to be confused with "tensor rank" from CP decomposition, which is the minimal number of rank-one terms a ⊗ b ⊗ c whose sum equals the tensor.
Operations: einsum, broadcasting, reshape
Einstein summation: one syntax for everything
**Einstein notation** is a compact way to write tensor operations: a repeated index means summation. In numpy and PyTorch this is `einsum`. Half the operations in deep learning fit in a single line of this syntax.
| Operation | Index formula | einsum string |
|---|---|---|
| Vector dot product | a.b = sum_i a_i b_i | `'i,i->'` |
| Matrix multiply | C[i,k] = sum_j A[i,j] B[j,k] | `'ij,jk->ik'` |
| Transpose | B[j,i] = A[i,j] | `'ij->ji'` |
| Matrix trace (contraction) | tr(A) = sum_i A[i,i] | `'ii->'` |
| Batched matrix multiply (transformer) | Y[b,t,h] = sum_d X[b,t,d] * W[d,h] | `'btd,dh->bth'` |
| Attention scores | S[b,t,s] = sum_d Q[b,t,d] * K[b,s,d] | `'btd,bsd->bts'` |
| Outer product | M[i,j] = a_i * b_j | `'i,j->ij'` |
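Each of these index strings can be checked against the equivalent built-in operation. The following is a small sketch in NumPy with made-up shapes and random data; it only confirms that the einsum form and the conventional call agree.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=3), rng.normal(size=3)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 5))
S = rng.normal(size=(4, 4))                               # square, for the trace
X, W = rng.normal(size=(2, 6, 8)), rng.normal(size=(8, 16))

# Each entry compares an einsum string with the conventional NumPy call.
checks = {
    "dot":        np.allclose(np.einsum("i,i->", a, b), a @ b),
    "matmul":     np.allclose(np.einsum("ij,jk->ik", A, B), A @ B),
    "transpose":  np.allclose(np.einsum("ij->ji", A), A.T),
    "trace":      np.allclose(np.einsum("ii->", S), np.trace(S)),
    "projection": np.allclose(np.einsum("btd,dh->bth", X, W), X @ W),
    "outer":      np.allclose(np.einsum("i,j->ij", a, b), np.outer(a, b)),
}

for name, ok in checks.items():
    print(f"{name:10s} {'ok' if ok else 'MISMATCH'}")
```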
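**In PyTorch**: `torch.einsum('btd,bsd->bts', Q, K)` computes exactly the attention scores S[b,t,s] = sum_d Q[b,t,d] * K[b,s,d]. The same result comes from `torch.bmm(Q, K.transpose(1, 2))` or `torch.matmul(Q, K.transpose(-2, -1))`. `torch.bmm` accepts only 3D tensors, while `torch.matmul` treats the last two axes as matrices and broadcasts over any leading batch axes.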
Multi-head attention: the tensor picture
Multi-head attention in transformers (BERT, GPT, T5) runs H independent attention mechanisms simultaneously. The implementation via reshape and einsum does this without any Python loops.
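A minimal sketch of the reshape-plus-einsum pattern, written in NumPy rather than PyTorch so it runs without a GPU stack. It assumes no masking, no biases and no dropout, and the function and parameter names (`multi_head_attention`, `Wq`, `Wk`, `Wv`, `Wo`) are illustrative rather than taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Self-attention over x of shape (B, T, D); assumes D % n_heads == 0."""
    B, T, D = x.shape
    d_head = D // n_heads

    def split_heads(t):
        # (B, T, D) -> (B, T, H, d_head) -> (B, H, T, d_head)
        return t.reshape(B, T, n_heads, d_head).transpose(0, 2, 1, 3)

    Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # All heads at once: contract over d_head, keep batch, head and both positions.
    scores = np.einsum("bhtd,bhsd->bhts", Q, K) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)                    # (B, H, T, T)
    heads = np.einsum("bhts,bhsd->bhtd", weights, V)      # (B, H, T, d_head)

    # Merge heads back: (B, H, T, d_head) -> (B, T, H, d_head) -> (B, T, D)
    merged = heads.transpose(0, 2, 1, 3).reshape(B, T, D)
    return merged @ Wo, weights

rng = np.random.default_rng(0)
B, T, D, H = 2, 5, 16, 4                                  # toy sizes
x = rng.normal(size=(B, T, D))
Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(4))

out, weights = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads=H)
print(out.shape, weights.shape)
```

The head axis is just another batch axis as far as the contraction is concerned, which is why no loop over heads is needed.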
Tensor product and outer product
The tensor product a ⊗ b of two vectors is a matrix where element (i,j) = a[i] * b[j]. The rank of the tensor product of two rank-1 tensors is 2. This operation is fundamental to both attention and convolutions.
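$$a \in \mathbb{R}^m,\quad b \in \mathbb{R}^n \quad\Rightarrow\quad a \otimes b \in \mathbb{R}^{m \times n}$$

Example with a = [1, 2, 3] (shape: (3,)) and b = [4, 5] (shape: (2,)):

$$a \otimes b = \begin{bmatrix} 1\cdot 4 & 1\cdot 5 \\ 2\cdot 4 & 2\cdot 5 \\ 3\cdot 4 & 3\cdot 5 \end{bmatrix} = \begin{bmatrix} 4 & 5 \\ 8 & 10 \\ 12 & 15 \end{bmatrix}$$

shape: (3, 2) = (3,) ⊗ (2,)

Generalization:

$$T \in \mathbb{R}^{d_1} \otimes \mathbb{R}^{d_2} \otimes \cdots \otimes \mathbb{R}^{d_k}, \qquad \text{shape } (d_1, d_2, \ldots, d_k), \qquad \dim = d_1 \cdot d_2 \cdots d_k$$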
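The worked example can be reproduced in three equivalent ways in NumPy. The third vector `c` in the last step is a made-up value, added only to show the generalization to three axes.

```python
import numpy as np

a = np.array([1, 2, 3])          # shape (3,)
b = np.array([4, 5])             # shape (2,)

M_outer = np.outer(a, b)                    # dedicated function
M_einsum = np.einsum("i,j->ij", a, b)       # no repeated index, so nothing is summed
M_bcast = a[:, None] * b[None, :]           # (3, 1) * (1, 2) broadcasts to (3, 2)
print(M_outer)

# Generalization to three vectors; c is a hypothetical extra vector.
c = np.array([1, 0, 2])
T = np.einsum("i,j,k->ijk", a, b, c)        # shape (3, 2, 3)
print(T.shape, T.size)
```

The broadcasting form makes the link to the rest of the lesson explicit: an outer product is elementwise multiplication after each vector is given its own axis.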
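**LoRA (Low-Rank Adaptation)** - the fine-tuning technique from Hu et al. 2021. Instead of updating the full weight matrix W, it keeps W frozen and learns an additive update delta_W = A @ B, where A has shape (d, r), B has shape (r, k), and the rank r << d. This is an ordinary matrix product of two thin factors, which is the same thing as a sum of r outer products (one per column of A and matching row of B). Variants such as QLoRA and DoRA build on the same idea.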
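A rough numerical sketch of the low-rank update, using NumPy and random values in place of trained factors. The sizes (768 x 768, r = 8) are illustrative assumptions, and real LoRA implementations also apply a scaling factor and a specific initialization that are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 768, 768, 8             # illustrative sizes, not from a real model

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight (stand-in)
A = rng.normal(size=(d_out, r))          # trainable factor, shape (d, r)
B = rng.normal(size=(r, d_in))           # trainable factor, shape (r, k)

delta_W = A @ B                          # full-size update, but only rank r
rank = int(np.linalg.matrix_rank(delta_W))

# The same update written as a sum of r outer products.
as_outer_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(r))

full_params = d_out * d_in               # parameters in a full update
lora_params = r * (d_out + d_in)         # parameters in the two factors
print(f"rank={rank}, full={full_params:,}, lora={lora_params:,}")

# The update never has to be materialized in the forward pass:
x = rng.normal(size=(4, d_in))
y_merged = x @ (W + delta_W).T           # merged weights
y_factored = x @ W.T + (x @ B.T) @ A.T   # frozen path plus low-rank path
```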
What does torch.einsum("btd,bsd->bts", Q, K) compute?
Einstein summation: index d is summed (it repeats on input and is absent on output); indices b, t, s are free. This is the heart of attention: a batched matrix product between Q (B,T,D) and K^T (B,D,S) yields scores of shape (B,T,S).
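The answer can be verified numerically. This sketch uses `np.einsum`, which accepts the same index string as `torch.einsum`, and compares it with the batched matrix product and with an explicit loop over the free indices; the sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, S, D = 2, 4, 6, 8                  # toy sizes
Q = rng.normal(size=(B, T, D))
K = rng.normal(size=(B, S, D))

scores_einsum = np.einsum("btd,bsd->bts", Q, K)
scores_matmul = Q @ K.transpose(0, 2, 1)         # (B,T,D) @ (B,D,S) -> (B,T,S)

# Reference: one dot product over d for every free index triple (b, t, s).
scores_loop = np.zeros((B, T, S))
for b in range(B):
    for t in range(T):
        for s in range(S):
            scores_loop[b, t, s] = np.dot(Q[b, t], K[b, s])

print(scores_einsum.shape)
```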
Tensor decompositions: Tucker, CP, TT
Tucker decomposition: compressing neural network weights
SVD decomposes a matrix into a product of three factors. Tucker decomposition generalizes this idea to tensors with more than two axes (the higher-order SVD is one way to compute it). It is used to compress the weights of convolutional and transformer layers.
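A tensor $T \in \mathbb{R}^{I \times J \times K}$ is decomposed as

$$T \approx G \times_1 U_1 \times_2 U_2 \times_3 U_3$$

where:
- $G \in \mathbb{R}^{R_1 \times R_2 \times R_3}$ - the core tensor
- $U_1 \in \mathbb{R}^{I \times R_1}$, $U_2 \in \mathbb{R}^{J \times R_2}$, $U_3 \in \mathbb{R}^{K \times R_3}$ - factor matrices
- $R_1 \ll I$, $R_2 \ll J$, $R_3 \ll K$ - truncation ranks (a mode can also be left at full rank)

Compression ratio:
- Original: I*J*K parameters
- After Tucker: R1*R2*R3 + I*R1 + J*R2 + K*R3 parameters

Example (convolutional layer in ResNet): T of shape (256, 256, 9), Tucker with ranks (64, 64, 9). The last axis is the flattened 3x3 kernel and is kept at full rank:
- Original: 256*256*9 = 589,824
- After Tucker: 64*64*9 + 256*64 + 256*64 + 9*9 = 36,864 + 16,384 + 16,384 + 81 = 69,713
- ~8.5x compression with minimal accuracy loss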
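The parameter count above, and the decomposition itself, can be sketched in NumPy. The code below uses a truncated higher-order SVD, which is one simple way to obtain a Tucker factorization; libraries such as TensorLy typically refine the result iteratively. All function names here are illustrative.

```python
import numpy as np

def tucker_params(shape, ranks):
    """Core entries plus one (dim x rank) factor matrix per mode."""
    return int(np.prod(ranks)) + sum(d * r for d, r in zip(shape, ranks))

def unfold(T, mode):
    """Matricize T: the chosen mode becomes the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def truncated_hosvd(T, ranks):
    """Tucker factors of a 3-axis tensor via SVD of each unfolding."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])                  # keep the leading r directions
    U1, U2, U3 = factors
    G = np.einsum("ijk,ia,jb,kc->abc", T, U1, U2, U3)   # project onto the factors
    return G, factors

def reconstruct(G, factors):
    U1, U2, U3 = factors
    return np.einsum("abc,ia,jb,kc->ijk", G, U1, U2, U3)

# Parameter count for the ResNet-style example in the text.
original = 256 * 256 * 9
compressed = tucker_params((256, 256, 9), (64, 64, 9))
print(original, compressed, round(original / compressed, 2))

# Small synthetic tensor with exact multilinear rank (3, 3, 2),
# so the truncated HOSVD recovers it up to rounding error.
rng = np.random.default_rng(0)
G0 = rng.normal(size=(3, 3, 2))
F0 = [rng.normal(size=(10, 3)), rng.normal(size=(12, 3)), rng.normal(size=(8, 2))]
T = reconstruct(G0, F0)                           # shape (10, 12, 8)

G, factors = truncated_hosvd(T, (3, 3, 2))
rel_error = np.linalg.norm(T - reconstruct(G, factors)) / np.linalg.norm(T)
print(G.shape, rel_error)
```

On real weights the tensor is not exactly low-rank, so the reconstruction error is nonzero and depends on the chosen ranks.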
Broadcasting and reshape
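Two operations that appear in every ML codebase: **broadcasting** (implicit expansion of size-1 or missing axes so that shapes match, without materializing copies) and **reshape** (reinterpreting the same elements, in the same order, under a new shape; this returns a view without copying when the memory layout allows it).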
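Both operations are easy to get subtly wrong, so a short NumPy sketch may help. It shows per-channel centering of a (B, C, H, W) batch, the shape error that occurs when the channel vector is not reshaped first, and a case where reshape has to copy. The toy sizes and variable names are assumptions made for the illustration.

```python
import numpy as np

# Toy batch in (B, C, H, W) layout.
x = np.arange(2 * 3 * 4 * 4, dtype=np.float64).reshape(2, 3, 4, 4)

# Per-channel mean has shape (C,). Broadcasting aligns shapes from the right,
# so (C,) would be matched against W; reshape it to (1, C, 1, 1) first.
mean = x.mean(axis=(0, 2, 3))
centered = x - mean.reshape(1, 3, 1, 1)

# The unreshaped version fails here because 3 != 4 on the last axis.
# If C happened to equal W it would run and silently give wrong results.
try:
    x - mean
    raised = False
except ValueError:
    raised = True

# reshape keeps element order; on a contiguous array it returns a view.
flat = x.reshape(2, -1)                              # (2, 48)
is_view = np.shares_memory(x, flat)

# After a transpose the layout is no longer contiguous, so reshape must copy.
channels_last = x.transpose(0, 2, 3, 1).reshape(2, -1)
is_copy = not np.shares_memory(x, channels_last)

print(centered.shape, raised, is_view, is_copy)
```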
**Tensors in the modern ML stack**
Where tensors live concretely in production systems:
- **PyTorch / JAX / TensorFlow**: Foundation: all parameters, activations, gradients are tensors. Autograd tracks the computation graph over tensors; compilers and kernel libraries such as XLA and cuDNN optimize the resulting contractions
- **Transformer (BERT, GPT, T5)**: Attention = tensor contraction (B,H,T,D) x (B,H,D,T) -> (B,H,T,T). Flash Attention speeds up exactly this operation on GPU via tiling
- **CNN (ResNet, EfficientNet)**: Convolution = tensor op between the input (B,C_in,H,W) and the weights (C_out,C_in,kH,kW), producing (B,C_out,H',W'). cuDNN can implement it as im2col + GEMM, among other algorithms; Tucker decomposition for compression
- **LoRA / QLoRA**: Fine-tuning via low-rank additions delta_W = A @ B. Rank 4-64 instead of thousands; implemented in the Hugging Face PEFT library
- **TensorRT / ONNX**: Optimizing tensor operations for inference. Op fusion, quantization; the same tensor graph runs faster
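To make the "convolution is a tensor operation" entry concrete, here is a sketch that writes a stride-1, unpadded 2D convolution (cross-correlation, as deep learning frameworks define it) as a single einsum over sliding windows. It is an illustration of the contraction only, not how cuDNN implements it; the sizes are toy values and bias is omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
B, C_in, C_out, H, W_img, k = 2, 3, 4, 8, 8, 3           # toy sizes
x = rng.normal(size=(B, C_in, H, W_img))                 # input batch
w = rng.normal(size=(C_out, C_in, k, k))                 # conv weights

# Every k x k window as extra trailing axes:
# (B, C_in, H-k+1, W-k+1, k, k), a strided view rather than a copy.
patches = sliding_window_view(x, (k, k), axis=(2, 3))

# Contract over input channels and both kernel axes.
y = np.einsum("bchwij,ocij->bohw", patches, w)           # (B, C_out, 6, 6)

# Reference: slide the window explicitly and contract each patch.
y_ref = np.zeros_like(y)
for i in range(H - k + 1):
    for j in range(W_img - k + 1):
        window = x[:, :, i:i + k, j:j + k]               # (B, C_in, k, k)
        y_ref[:, :, i, j] = np.tensordot(window, w, axes=([1, 2, 3], [1, 2, 3]))

print(patches.shape, y.shape)
```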
Practice: RGB Image as 3D Tensor
**A batch of 32 images (224x224 RGB) is fed into ResNet-50. What is the shape of the input tensor and how much memory does it occupy in float32?**
Hints: PyTorch uses the (B, C, H, W) format; float32 = 4 bytes per element
- Shape: (32, 3, 224, 224) - batch x channels x height x width
- Elements: 32*3*224*224 = 4,816,896
- Memory: 4,816,896 * 4 bytes = 19,267,584 bytes, about 19.3 MB for this one batch alone
- The real memory pressure comes from the intermediate activation tensors of every layer, which must be kept in memory for the backward pass

---

**What is the difference between `np.einsum('ij,jk->ik', A, B)` and `A @ B`? When is einsum actually necessary?**
Hints: What does @ do for 2D matrices?; What if the contraction involves a non-standard subset of axes?
- For 2D matrices the result is identical; @ is a special case of einsum
- einsum pays off for batched operations with a non-standard axis layout, where @ would need extra transposes and reshapes
- Example: 'btd,bsd->bts' - contracts only over d, keeps b,t,s
- einsum reads like an explicit summation formula - self-documenting code

---

**What is Tucker decomposition and how is it applied to compressing neural networks?**
Hints: How does SVD factor a matrix?; What if the matrix is actually a 4-axis tensor?
- Tucker is a generalization of SVD: T ~= G x1 U1 x2 U2 x3 U3 where G is a core tensor
- Truncating the ranks R1, R2, R3 below the original dimensions gives compression
- For a convolutional layer (C_out, C_in, kH, kW), Tucker can yield roughly 5-10x compression
- Followed by fine-tuning to recover accuracy; the TensorLy library implements Tucker decomposition
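The arithmetic in the first exercise can be checked in a few lines. This is a plain calculation in NumPy; the float16 figure is an added comparison and not part of the exercise.

```python
import numpy as np

shape = (32, 3, 224, 224)                       # (B, C, H, W)
n_elements = int(np.prod(shape))

bytes_fp32 = n_elements * np.dtype(np.float32).itemsize   # 4 bytes per element
bytes_fp16 = n_elements * np.dtype(np.float16).itemsize   # 2 bytes per element

mb_fp32 = bytes_fp32 / 1e6                      # decimal megabytes
mib_fp32 = bytes_fp32 / 2**20                   # binary mebibytes

print(f"{n_elements:,} elements")
print(f"float32: {bytes_fp32:,} bytes = {mb_fp32:.1f} MB = {mib_fp32:.3f} MiB")
print(f"float16: {bytes_fp16:,} bytes")
```

The 19.3 MB figure uses decimal megabytes; tools that report MiB will show a slightly smaller number for the same tensor.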
Why is Tucker decomposition applied to ResNet convolutional layers?
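A ResNet convolutional layer stores W with shape (out, in, kH, kW). Tucker factors it as a small core tensor G multiplied along each axis by a projection matrix, four in total. With ranks (r1, r2, r3, r4) well below (out, in, kH, kW), the parameter count drops several-fold (5-10x in the examples in this lesson), which speeds up inference on mobile devices, usually with minimal accuracy loss after fine-tuning. In practice the compression comes mostly from the two channel axes, since kH and kW are already small.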
Takeaways from this lesson
- **Tensor rank** = number of axes; rank-0 is a scalar, rank-4 is an image batch (B,C,H,W)
- **einsum** expresses any tensor operation via indices: repeated index = summation
- **Multi-head attention** scores = einsum('bhtd,bhsd->bhts', Q, K) - contraction over d_head
- **LoRA** = low-rank matrix product delta_W = A @ B, rank r << d
- **Tucker decomposition** - SVD for tensors; 5-10x compression of CNN weights with minimal quality loss
- **Broadcasting** combines tensors of different but compatible shapes without explicit data copying
- **PyTorch/JAX** - the entire stack is tensor algebra; understanding tensors means understanding the architecture
What comes next
Tensors are the language all of deep learning is written in. The following topics use this language directly.
- SVD - Tucker decomposition generalizes SVD; understanding SVD means understanding tensor decompositions
- Jordan Normal Form - Spectral theory of matrices is a special case of tensor algebra
- Linear Algebra in Deep Learning - How tensor operations are implemented concretely in attention and convolutions