Linear Algebra
Tensor Decompositions: SVD for High-Dimensional Data
One hour of 1080p 60fps video is terabytes of data. Tucker and CP decompositions compress video tensors at controllable quality levels set by rank. In ML, tensor decompositions compress neural network weights - MobileNet-style models achieve 10x compression - and analyze multi-channel time series.
- Weight compression: Tucker decomposition of conv kernels (C_in, C_out, kH, kW) reduces parameters by 5-10x
- Recommender systems: User-Item-Context tensor, CP decomposition for context-aware recommendations
- Neuroscience: fMRI data (x, y, z, time) - Tucker separates spatial and temporal factors
- Chemistry: tensor of atomic interactions in a molecule for Graph Neural Networks
- Traffic prediction: (nodes, time, features) tensor - STAEformer uses tensor operations
CP decomposition: sum of rank-1 terms
**ResNet-50 weighs 100 MB. Tucker decomposition of its convolutional layers brings that down to 25 MB at the same ImageNet accuracy.** Spotify stores preferences of 600 million users across millions of tracks in different contexts - not as a giant table, but as a CP decomposition of a three-way tensor. Tensor Train compresses an embedding layer from 10^30 down to 20,000 parameters. One idea behind all of this: generalize SVD to tensors.
SVD owned the 2D world for fifty years. But real data is rarely 2D: *users × items × time* (Netflix), *batch × channels × H × W* (vision), *heads × queries × keys × values* (attention). **Tensor decompositions** are SVD pushed up the dimension ladder. Three flavors, three trade-offs, one unified idea: compress high-dim data into low-rank factors that are also interpretable.
CP Decomposition: a tensor as a sum of rank-1 terms
A rank-1 matrix is an outer product of two vectors: M = a·bᵀ. A rank-1 tensor is an outer product of three: T = a ⊗ b ⊗ c. **CP decomposition** (CANDECOMP/PARAFAC) represents any tensor as a sum of R such terms - a direct analogy to decomposing a matrix as a sum of rank-1 matrices in SVD.
TENSOR: T ∈ R^{I×J×K} CP DECOMPOSITION OF RANK R: T[i,j,k] ≈ Σ_{r=1}^R a[i,r] · b[j,r] · c[k,r] PARAMETERS: Original: I·J·K CP of rank R: R·(I + J + K) EXAMPLE: tensor 100×100×100 Original: 1,000,000 parameters CP, R=20: 20·300 = 6,000 parameters (167x compression) WHEN IT WORKS: when data has low-rank structure (user preferences, physical fields, recommender data)
**Tensor rank is NP-hard**: unlike matrix rank, computing the exact tensor rank is NP-hard. Worse, the set of tensors of rank ≤ R is not closed - a sequence of rank-r tensors can converge to a tensor of rank r+1. In practice, iterative algorithms (ALS - Alternating Least Squares) are used with no global optimality guarantees.
How does tensor rank in CP differ from matrix rank?
Matrix rank is easy via SVD: O(mn min(m,n)). For tensors of order >= 3, computing exact rank is NP-hard (Hillar & Lim, 2013). In practice approximate algorithms (ALS) are used with a fixed target rank.
Tucker decomposition and network compression
Tucker Decomposition: compressing neural network weights
Tucker is more flexible: instead of a single rank R, each mode gets its own rank. The core tensor G captures interactions between modes; factor matrices A, B, C project each axis into a smaller space. **Tucker is the standard method for compressing convolutional layers** in production: a 3×3 layer with 256 input and 256 output channels can be shrunk 4x with negligible accuracy loss.
TUCKER DECOMPOSITION, ranks (R1, R2, R3): T[i,j,k] ≈ Σ_{p,q,r} G[p,q,r] · A[i,p] · B[j,q] · C[k,r] PARAMETERS: Original: I·J·K Tucker: R1·R2·R3 + I·R1 + J·R2 + K·R3 EXAMPLE: conv layer (256, 256, 3, 3) Original: 256·256·3·3 = 589,824 Tucker (64,64,3,3): 64·64·3·3 + 256·64 + 256·64 = 69,632 ~8.5x fewer, accuracy loss < 1% on ImageNet CP = Tucker with a diagonal core G (special case)
**How Tucker compression works in practice**: decompose weights of an already trained CNN to get a low-rank approximation. Then fine-tune for a few epochs - almost all lost accuracy is recovered. This is the standard pipeline for compressing models for mobile deployment.
What is the main difference between Tucker and CP decomposition?
CP writes the tensor as a sum of r terms a_k ⊗ b_k ⊗ c_k with one shared rank. Tucker uses a core tensor S in R^(r1×r2×r3) times orthogonal projection matrices on each mode. Flexibility: each axis can have its own "effective dimension".
Tensor Train: matrix product chain
Tensor Train: exponential compression
GPT's embedding layer stores 50,000 tokens × 12,288 dimensions = 614 million parameters. Tensor Train (TT) decomposes the matrix as a chain of small tensors - bringing the parameter count to the thousands. The idea: rewrite indices in mixed-radix representation and factorize the tensor across each digit.
TT DECOMPOSITION: T(i1, i2, ..., in) = G1[i1] · G2[i2] · ... · Gn[in] where Gk[ik] is an r_{k-1} × r_k matrix PARAMETER COUNT: Original: d^n TT of rank r: n · d · r² EXAMPLE: tensor with n=10, d=100, r=10 Original: 100^10 = 10^20 (impossible to store!) TT: 10 · 100 · 100 = 100,000 EMBEDDING LAYER via TT: 50,000 × 12,288 = 614M parameters TT (n=8, ranks r=16): ~50,000 parameters Quality loss: < 2%
**TT = MPS in physics**: Tensor Train is known in quantum physics as Matrix Product State (MPS). The wave function of n qubits is a tensor with 2^n elements. When quantum entanglement is limited (area law), the TT rank is small and the state needs exponentially fewer parameters. That is what allows classical computers to simulate quantum circuits of 100+ qubits.
Comparing the three methods
| Method | Parameters | Best for | ML application |
|---|---|---|---|
| CP (rank R) | R·(I+J+K) | Data with additive structure | Recommender systems (user x item x time) |
| Tucker (R1,R2,R3) | R1R2R3 + IR1+JR2+KR3 | Different axes have different complexity | CNN convolutional layer compression |
| TT/MPS (rank r) | n·d·r² | Very high-dimensional data (n>>3) | Embedding layer compression, quantum simulation |
tensorly: one API for all decompositions
**tensorly** is a Python library for tensor computations with numpy, PyTorch, and TensorFlow backends. It is the standard tool for all three decompositions in both research and production.
**Tensor Decompositions in Industry** From theory to production - **Recommender Systems (CP)**: 3-way tensor user x item x context. Netflix, Amazon: CP decomposition handles temporal context better than matrix SVD. Accuracy improvement of 5-10%. - **CNN Compression (Tucker)**: Low-rank approximation of conv weights. MobileNet, EfficientNet use Tucker-style factorizations. 4x compression with < 1% accuracy drop on ImageNet. - **Embedding Compression (TT)**: Embedding layer as a TT matrix. Language models: Embedding 50k x 768 → TT with ranks 16 → 100x fewer parameters. Faster inference. - **Quantum Simulation (MPS=TT)**: Wave functions of qubits. Google, IBM Q: simulating 100+ qubit circuits on GPU via MPS. Foundation of modern quantum chemistry.
Key takeaways
- **CP decomposition**: T ≈ Σ aᵣ⊗bᵣ⊗cᵣ - additive structure, R·(I+J+K) parameters, used in recommender systems
- **Tucker**: T ≈ G ×₁A ×₂B ×₃C - core + factor matrices, separate ranks per mode, for CNN compression (4x with no accuracy loss)
- **Tensor Train**: T(i1,...,in) = G1[i1]·...·Gn[in] - exponential compression d^n → O(ndr²), for embedding layers
- **CP = Tucker** with a diagonal core - a special case
- **Tensor rank is NP-hard** - unlike matrix rank, there is no exact polynomial algorithm
- **tensorly** - the standard Python tool with numpy/PyTorch/TF backends
- Tucker compression pipeline: decompose → replace with three layers → fine-tune 5-10 epochs
Connections
Tensor decompositions build on SVD and use randomized methods for speed
- SVD and Singular Values — Tucker = multilinear generalization of SVD; HOSVD is SVD applied per mode
- Randomized Linear Algebra — Randomized SVD is the building block for fast Tucker and TT decompositions
- Linear Algebra in Deep Learning — Tucker decomposition for compressing neural network weights (MobileNet, pruning)