Linear Algebra
Jordan Normal Form: When Diagonalization Breaks Down
Most matrices in practice are diagonalizable. But when they are not - repeated eigenvalues with insufficient eigenvectors - the Jordan Normal Form reveals the hidden structure. This situation appears in degenerate control systems, defective RNNs, and certain symmetry-breaking models.
- Control theory: a double pole in the transfer function signals a Jordan block - critical damping
- RNN analysis: defective weight matrices lead to polynomial (not exponential) growth modes
- Numerical analysis: computing f(A) for non-diagonalizable A requires Jordan form theory
- Differential equations: repeated eigenvalues produce t·e^(λt) solutions - from Jordan blocks
- Symmetry breaking: degenerate eigenspaces appear in models with exact symmetry
Defective matrices and Jordan blocks
**Diagonalization fails at the most interesting moment.** A 4x4 matrix with four identical eigenvalues - and only a single eigenvector. No basis can be formed, the decomposition A = PDP⁻¹ collapses. Jordan normal form is what steps in: an almost-diagonal matrix with ones above the diagonal exactly where the space was short a dimension. This structure explains why gradients in RNNs vanish - not just "because lambda is small", but because of polynomial factors hiding in those ones.
Why do some RNNs train and others just explode? Why does the same learning rate that works for one weight matrix blow up for a nearly identical one? **Jordan form** is the answer. When two eigenvalues collide, a normally diagonalizable matrix loses its diagonal form - and iterates of it grow *polynomially*, not exponentially. That polynomial growth is the exact failure mode behind every vanilla RNN ever trained.
Defective matrices: not enough eigenvectors
A matrix is **defective** when it has fewer linearly independent eigenvectors than its dimension. This happens with **repeated eigenvalues** when the ~algebraic multiplicity~{how many times the root appears in the characteristic polynomial} exceeds the ~geometric multiplicity~{dimension of the eigenspace - how many independent eigenvectors exist for that eigenvalue}.
A = [ 3 1 ] characteristic polynomial: (lambda - 3)^2 = 0 [ 0 3 ] single eigenvalue lambda = 3 (multiplicity 2) (A - 3I) = [ 0 1 ] kernel: only e1 = (1, 0) [ 0 0 ] Algebraic multiplicity of lambda=3: 2 Geometric multiplicity of lambda=3: 1 Not enough eigenvectors for a basis - the matrix is defective.
**Quick defectivity check**: if det(A - lambda*I) = 0 and rank(A - lambda*I) > 0, the matrix is defective. In numpy: compare the number of eigenvectors from `np.linalg.eig` against `n - np.linalg.matrix_rank(np.eye(n)*lam - A)`.
Jordan block: almost diagonal
A Jordan block of size k for eigenvalue lambda has lambda on the diagonal and **ones above it**. Each one marks exactly where the space was missing a dimension:
[ lambda 1 0 0 ] [ 0 lambda 1 0 ] J_4(l)= [ 0 0 lambda 1 ] [ 0 0 0 lambda ] Larger block = more "entanglement" in the space. J_1(lambda) = [lambda] - just a plain diagonal entry. The full Jordan matrix is block-diagonal: [ J_{k1}(l1) 0 ... 0 ] J = [ 0 J_{k2}(l2) ... 0 ] [ ... ... ... ... ] [ 0 0 ... J_{km}(lm) ] For a diagonalizable matrix, all blocks have size 1 - ordinary diagonal form.
What makes a matrix defective (non-diagonalisable)?
Algebraic multiplicity = number of times λ appears in the characteristic polynomial. Geometric multiplicity = dimension of the eigenspace. When geometric is strictly smaller, there are not enough eigenvectors to form a basis, so the matrix is non-diagonalisable. Example: [[2,1],[0,2]] has λ=2 twice but only one eigenvector.
Generalised eigenvectors
Generalized eigenvectors: completing the basis
Since ordinary eigenvectors fall short, **generalized eigenvectors** are added - vectors from the kernels of powers of (A - lambda*I). Each next vector in the chain comes from applying (A - lambda*I) to the previous one:
For eigenvalue lambda and a Jordan block of size k, build a chain v1, v2, ..., vk: (A - lI) vk = vk-1 generalized eigenvector of rank k (A - lI) vk-1 = vk-2 ... (A - lI) v2 = v1 (A - lI) v1 = 0 v1 is an ordinary eigenvector Example (the 2x2 block from above): v1 = (1, 0) (A - 3I)v1 = 0 v2: such that (A - 3I)v2 = v1 [0 1; 0 0] v2 = [1; 0] v2 = (0, 1) works: [0*0+1*1; 0*0+0*1] = [1; 0] ok P = [v1 | v2] = I, J = A (the matrix was already in Jordan form).
The nilpotent part and ML
Every Jordan block splits as J_k(lambda) = lambda*I + N, where N is a **nilpotent matrix** (N^k = 0). N is responsible for the polynomial growth that hides inside the block - and in neural networks this has a direct, measurable consequence.
J_2(l)^n = [l^n n*l^(n-1)] [0 l^n ] J_3(l)^n = [l^n n*l^(n-1) C(n,2)*l^(n-2)] [0 l^n n*l^(n-1) ] [0 0 l^n ] Key point: if |lambda| < 1 then lambda^n -> 0, but n*lambda^(n-1) decays more slowly. If |lambda| = 1 (the critical case): lambda^n is bounded, but n*lambda^(n-1) grows without bound. This is polynomial growth fighting exponential decay.
**Connection to vanishing gradients in RNNs**: if the recurrent weight matrix has a Jordan block with |lambda| < 1, the gradient flowing back through t steps is proportional to t * lambda^t. At lambda = 0.9 and t = 100 this is 100 * 0.9^100 = 0.00003 - essentially zero. This is why LSTM and GRU use gating: the gates prevent "bad" Jordan blocks from forming during training.
What is a generalised eigenvector of level k?
Generalised eigenvectors complete the basis when ordinary eigenvectors are insufficient. The chain v_1, v_2, ..., v_k satisfies (A - λI)v_{i+1} = v_i with v_1 an ordinary eigenvector. The chain forms a Jordan block of size k.
Matrix exponential and applications
Matrix exponential: solving systems of ODEs
The system x'(t) = A*x(t) has solution x(t) = e^(At)*x(0). Jordan form makes the matrix exponential explicit and computable.
e^(J_k(l)*t) = e^(lt) * [ 1 t t^2/2! ... t^(k-1)/(k-1)! ] [ 0 1 t ... ] [ 0 0 1 ... ] [ 0 0 0 ... 1 ] For J_2(lambda): e^(Jt) = e^(lt) * [ 1 t ] [ 0 1 ] Resonance: repeated roots produce solutions of the form t*e^(lt), t^2*e^(lt), ... Even when Re(lambda) < 0 (a stable system), transients can last far longer than lambda alone would predict.
Jordan form in control theory and RNNs
Jordan form appears wherever the time-domain behavior of a linear system matters - in control engineering and in sequence models.
| Field | How it is used | What Jordan form reveals |
|---|---|---|
| Control theory | Stability of x' = Ax | Stable iff Re(lambda) < 0 for all eigenvalues |
| RNN / LSTM | Gradient dynamics during BPTT | Block with |lambda| > 1 - explosion; |lambda| < 1 - vanishing |
| Signal processing | Systems with repeated poles (resonance) | Terms t*e^(lt) - resonant solutions |
| Numerical linear algebra | Convergence of iterative methods | Spectral radius rho(A) = max|lambda| sets the rate |
**Jordan form is numerically unstable**: a small perturbation to a matrix can merge two distinct eigenvalues into one repeated one, or split a repeated eigenvalue into two. In practice, **Schur form** (A = Q T Q*, T upper triangular) is used instead - equally informative but stable under floating-point errors.
**The theory shows up wherever exact linear dynamics matter** - **RNN / LSTM / GRU**: Analyzing vanishing/exploding gradients. Spectral radius of the weight matrix; orthogonal/identity initialization (Saxe et al. 2013) - **Control systems (autopilot, robotics)**: Stability check: all Re(lambda) < 0. MATLAB Control Toolbox, Python control library, scipy.signal - **Signal processing**: Filters with repeated poles. IIR filters, resonance analysis, scipy.signal.lti - **Numerical methods**: Convergence of iterations (power method, PageRank). Spectral radius sets how fast the method converges
Practice: Jordan Block Powers
**Q:** A 3x3 matrix has eigenvalue lambda = 2 with algebraic multiplicity 3 and geometric multiplicity 1. What is the only possible Jordan form? - One block J_3(2) of size 3 - the only possibility when geometric multiplicity is 1 - If geometric multiplicity were 2: J_2(2) + J_1(2) - Number of blocks = dim(ker(A - lambda*I)) **Q:** Why is orthogonal initialization recommended for RNN weight matrices? - Orthogonal matrix: all |lambda| = 1 - no decay, no explosion - Jordan blocks collapse to diagonal (no nilpotent part) - Gradient preserves its norm through t backprop steps - vanishing problem disappears - This is the Saxe et al. 2013 orthogonal initialization result **Q:** Why is Schur form preferred over Jordan form for numerical computation? - Jordan form is discontinuous: an epsilon perturbation can change the block structure entirely - Schur form A = QTQ* is continuous - eigenvalues sit on the diagonal of T - For stability analysis, only the diagonal of T is needed - scipy.linalg.schur is numerically reliable; jordan_form is only for exact arithmetic
Takeaways from this lesson
- **Defective matrix**: algebraic multiplicity > geometric; diagonalization impossible
- **Jordan block J_k(lambda)**: lambda on the diagonal, ones above - each one marks a missing dimension
- **J_k(lambda)^n** contains terms n*lambda^(n-1), C(n,2)*lambda^(n-2) - polynomial growth at |lambda| = 1
- **RNN**: Jordan structure of the weight matrix governs the vanishing/exploding gradient regime
- **Matrix exponential**: e^(Jt) produces terms t*e^(lt) - resonance from repeated ODE roots
- **In practice**: Schur form instead of Jordan - stable under floating-point errors
- **Orthogonal initialization** in RNNs is a direct application of Jordan theory
What comes next
Jordan form is the limiting case of spectral theory. The following topics generalize it to non-symmetric and rectangular settings.
- Diagonalization — Jordan form is the generalization of diagonalization to defective matrices
- SVD — Universal decomposition for rectangular matrices, avoids defectiveness entirely
- Tensors — The next level - matrices generalized to multiple indices