Linear Algebra
Diagonalization: A Matrix in Its Own Mirror
Diagonalizing a matrix means finding a coordinate system where the transformation is pure scaling along the axes. This makes computing matrix powers, exponentials, and ODE solutions trivial - three problems that appear daily in ML, physics, and control engineering.
- Neural ODEs: solving dh/dt = Ah via e^(At) = V e^(Λt) V⁻¹ - computed through diagonalization
- Markov chains: long-run behavior of P^n - diagonalize the transition matrix
- Physics: normal modes of a vibrating system = eigenvectors of the coupling matrix
- PageRank: power iteration converges to the dominant eigenvector
- Spectral clustering: clusters from eigenvectors of the graph Laplacian
Diagonalization: A Matrix in Its Own Mirror
**The Markov chain behind PageRank walks the entire Internet graph billions of times - and finds a stationary distribution.** Without diagonalization, this would mean multiplying a 10⁹ × 10⁹ matrix over and over. With diagonalization, it collapses into one formula: Aⁿ = PDⁿP⁻¹, where Dⁿ is just individual numbers raised to the nth power. Diagonalization is the art of translating a matrix into a language where multiplication becomes nearly free.
Same matrix. Same computation. Apply it 10 times, 100, a million. In the *wrong* basis the numbers explode or round away. In the *right* basis - the **eigenbasis** - iteration is direct: raise diagonal entries to a power and the work is done. Diagonalization is the art of switching coordinates until the problem is already solved. It is the reason PCA, Markov chains, and linear ODE solvers work at all.
What is the main point of the concept "Diagonalization: A Matrix in Its Own Mirror"?
Comprehension check for the concept.
Geometric meaning: three steps instead of one matrix
Geometric meaning: three steps instead of one matrix
Any linear transformation, viewed in the eigenvector basis, is just stretching along axes. The decomposition A = PDP⁻¹ has a clean geometric reading:
A = P · D · P⁻¹ STEP 1 - P⁻¹: move into the eigenvector basis The vector x is rewritten in coordinates of the eigenvectors. A rotation that aligns the axes with the eigendirections. STEP 2 - D: scale along the axes Each axis is multiplied by its eigenvalue λᵢ. D = diag(λ₁, λ₂, ..., λₙ) - a diagonal matrix. The cheapest of all linear transformations. STEP 3 - P: return to standard coordinates The result is translated back into ordinary coordinates. SUMMARY: A·x = P · diag(λ) · P⁻¹ · x - not one opaque matrix, but three understandable operations
**Key property of the scene**: A·v₁ lies exactly along v₁, A·v₂ along v₂. Direction is unchanged; only length changes. That is the definition of an eigenvector - and therefore in these directions the matrix behaves like multiplication by a single number.
What is the main point of the concept "Geometric meaning: three steps instead of one matrix"?
Comprehension check for the concept.
Full example: A = PDP⁻¹ step by step
Full example: A = PDP⁻¹ step by step
A = [ 4 2 ] [ 1 3 ] 1. EIGENVALUES (from det(A - λI) = 0): det[ 4-λ 2 ] = (4-λ)(3-λ) - 2 = λ² - 7λ + 10 = 0 [ 1 3-λ ] λ₁ = 5, λ₂ = 2 2. EIGENVECTORS: λ₁=5: (A - 5I)v = 0 --> [-1 2][x] = 0 --> x = 2y --> v₁ = [2, 1] λ₂=2: (A - 2I)v = 0 --> [ 2 2][x] = 0 --> x = -y --> v₂ = [1,-1] 3. DECOMPOSITION MATRICES: P = [ 2 1 ] D = [ 5 0 ] P⁻¹ = [ 1/3 1/3 ] [ 1 -1 ] [ 0 2 ] [ 1/3 -2/3 ] 4. VERIFICATION A = PDP⁻¹: P·D = [ 10 2 ] (P·D)·P⁻¹ = [ 4 2 ] = A ✓ [ 5 -2 ] [ 1 3 ]
What is the main point of the concept "Full example: A = PDP⁻¹ step by step"?
Comprehension check for the concept.
Matrix powers: Aⁿ in O(n) instead of O(n³·k)
Matrix powers: Aⁿ in O(n) instead of O(n³·k)
The main practical payoff of diagonalization is cheap matrix exponentiation. This is where PageRank, Markov chains, and many recurrence formulas get their efficiency.
A = PDP⁻¹ A² = (PDP⁻¹)(PDP⁻¹) = PD(P⁻¹P)DP⁻¹ = PD²P⁻¹ A³ = PD²P⁻¹ · PDP⁻¹ = PD³P⁻¹ ... Aⁿ = PDⁿP⁻¹ And Dⁿ is immediate - a diagonal matrix raised to a power: D = diag(λ₁, λ₂, ..., λₖ) Dⁿ = diag(λ₁ⁿ, λ₂ⁿ, ..., λₖⁿ) EXAMPLE - A¹⁰⁰ for our matrix: D¹⁰⁰ = diag(5¹⁰⁰, 2¹⁰⁰) A¹⁰⁰ = P · diag(5¹⁰⁰, 2¹⁰⁰) · P⁻¹ Instead of 100 matrix multiplications - 2 numbers raised to a power.
What is the main point of the concept "Matrix powers: Aⁿ in O(n) instead of O(n³·k)"?
Comprehension check for the concept.
ML application: PageRank and Markov chains
ML application: PageRank and Markov chains
Google's PageRank is the stationary distribution of a Markov chain on the web graph. The stationary distribution is the limit of Aⁿ·x as n goes to infinity. The power iteration algorithm literally multiplies by A again and again - diagonalization explains why it converges and how fast.
A stochastic transition matrix P has: λ₁ = 1 (always - the stationary state) |λ₂|, ..., |λₙ| < 1 (remaining eigenvalues are strictly inside the unit circle) Then Pⁿ as n grows: Pⁿ = P · diag(1ⁿ, λ₂ⁿ, ...) · P⁻¹ --> P · diag(1, 0, 0, ...) · P⁻¹ --> the stationary distribution Convergence rate is determined by |λ₂| - the second largest eigenvalue. Smaller |λ₂| means faster convergence of PageRank.
**Where diagonalization saves the day**: power iteration for the real PageRank needs ~50-100 iterations on a 10⁹ × 10⁹ matrix. Knowing the spectrum (|λ₂| <= 0.85 by Google's design) guarantees convergence. Understanding the spectrum means understanding why anything converges at all.
What is the main point of the concept "ML application: PageRank and Markov chains"?
Comprehension check for the concept.
Spectral decomposition of symmetric matrices: PCA and the Gram matrix
Spectral decomposition of symmetric matrices: PCA and the Gram matrix
Symmetric matrices (A = Aᵀ) have a special property: their eigenvectors are always orthogonal and eigenvalues are always real. This is the basis of the **spectral decomposition** - a special case of diagonalization where P is orthogonal (P⁻¹ = Pᵀ).
For a symmetric A = Aᵀ: A = Q · Λ · Qᵀ where: Q - orthogonal matrix (columns are orthonormal eigenvectors) Λ = diag(λ₁, ..., λₙ) - real eigenvalues Qᵀ = Q⁻¹ (inversion is free!) ADVANTAGES over general diagonalization: - λᵢ are always real (can be sorted in descending order) - Q is orthogonal: inversion = transposition - Geometrically: pure stretch/shrink along orthogonal axes EXAMPLE - the Gram matrix XᵀX is symmetric and positive semi-definite: Xᵀ·X = Q · Λ · Qᵀ All λᵢ >= 0
**Why `np.linalg.eigh` instead of `eig`**: `eigh` knows the matrix is symmetric and uses Jacobi or Lanczos algorithms - 10-100x faster and numerically more accurate. It always returns real values. For symmetric matrices in ML code, always use `eigh`.
What is the main point of the concept "Spectral decomposition of symmetric matrices: PCA and the Gram matrix"?
Comprehension check for the concept.
When diagonalization is impossible
When diagonalization is impossible
Not every matrix is diagonalizable. The condition is having n linearly independent eigenvectors.
| Matrix type | Diagonalizable? | Reason |
|---|---|---|
| Distinct eigenvalues | Yes | Distinct eigenvalues give independent eigenvectors (theorem) |
| Symmetric (A = Aᵀ) | Yes (orthogonally) | Spectral theorem |
| Repeated lambda, geometric = algebraic multiplicity | Yes | Enough independent eigenvectors |
| Repeated lambda, geometric < algebraic | No (Jordan block) | Not enough eigenvectors |
| Rotation matrix (except 0 and 180 degrees) | No (over R) | No real eigenvalues |
**The Jordan block [[1,1],[0,1]]** is the canonical non-diagonalizable example. It has lambda=1 with algebraic multiplicity 2 but only one eigenvector (1,0). The closest analog is the Jordan normal form. In ML practice, such matrices are rare.
Diagonalization in industry
Where spectral methods run right now
| Component | Role | Details |
|---|---|---|
| PCA (Principal Component Analysis) | Spectral decomposition of the covariance matrix | sklearn.decomposition.PCA, TruncatedSVD for sparse data |
| Google PageRank | Stationary vector of the transition matrix = eigenvector at lambda=1 | Power iteration + damping factor 0.85; 50-100 iterations to converge |
| Spectral clustering | Eigenvectors of the graph Laplacian | sklearn.cluster.SpectralClustering; cuts graphs along weak connections |
| Fast matrix exponentiation | Aⁿ = PDⁿP⁻¹ - raise diagonal elements to a power | Markov chains, Fibonacci recurrences, dynamical systems analysis |
| Stability analysis | Sign of Re(lambda) determines system stability | Re(lambda) < 0 stable, Re(lambda) > 0 unstable; used in reinforcement learning |
What is the main point of the concept "When diagonalization is impossible"?
Comprehension check for the concept.
Practice: diagonalization for population dynamics
Practice: diagonalization for population dynamics
Interview questions
How does the spectral decomposition A = QΛQᵀ differ from the general diagonalization A = PDP⁻¹?
- Spectral decomposition applies only to symmetric matrices - Q is orthogonal: Q⁻¹ = Qᵀ, making inversion essentially free - Eigenvalues are guaranteed real and non-negative (for positive semi-definite matrices) - In ML, symmetric matrices are everywhere: XᵀX, covariance matrices, graph Laplacian
Why does PageRank converge, and how does convergence speed relate to eigenvalues?
- A stochastic matrix has lambda-1 = 1; all others satisfy |lambda-i| < 1 - Pⁿ converges to the projection onto the lambda=1 eigenspace (stationary distribution) - Convergence rate equals |lambda-2| - smaller means faster - Google's damping factor 0.85 guarantees |lambda-2| <= 0.85 - 50 iterations suffice
What is the connection between diagonalization and PCA?
- The covariance matrix C = XᵀX/(n-1) is symmetric and positive semi-definite - PCA is the spectral decomposition C = QΛQᵀ - Columns of Q are principal components (directions of maximum variance) - Eigenvalues lambda-i are the variance along each component - Dropping small lambda-i gives data compression with minimal information loss
What is the main point of the concept "Practice: diagonalization for population dynamics"?
Comprehension check for the concept.
Key takeaways
- **A = PDP⁻¹**: P holds eigenvectors as columns, D is diagonal with eigenvalues
- **Aⁿ = PDⁿP⁻¹**: matrix power = scalar powers on the diagonal - the engine of PageRank and Markov chains
- **Symmetric A**: always diagonalizable as A = QΛQᵀ with orthogonal Q; inversion is free
- **PCA** = spectral decomposition of the covariance matrix; each eigenvalue is the explained variance
- **Not diagonalizable**: repeated eigenvalues with a deficit of eigenvectors (Jordan blocks); rare in ML
- **In practice**: `np.linalg.eigh` for symmetric matrices (fast, real output), `eig` for general ones
Connections
Diagonalization bridges algebra and industry
- Eigenvectors — The foundation of diagonalization - impossible without them
- SVD — Generalization of diagonalization to rectangular matrices - the most powerful tool in ML
- PCA — Direct application of spectral decomposition to dimensionality reduction
- Determinant — det(A) = product of all eigenvalues