Linear Algebra

Diagonalization: A Matrix in Its Own Mirror

Diagonalizing a matrix means finding a coordinate system where the transformation is pure scaling along the axes. This makes computing matrix powers, exponentials, and ODE solutions trivial - three problems that appear daily in ML, physics, and control engineering.

Neural ODEs: solving dh/dt = Ah via e^(At) = V e^(Λt) V⁻¹ - computed through diagonalization
Markov chains: long-run behavior of P^n - diagonalize the transition matrix
Physics: normal modes of a vibrating system = eigenvectors of the coupling matrix
PageRank: power iteration converges to the dominant eigenvector
Spectral clustering: clusters from eigenvectors of the graph Laplacian

Diagonalization: A Matrix in Its Own Mirror

**The Markov chain behind PageRank walks the entire Internet graph billions of times - and finds a stationary distribution.** Without diagonalization, this would mean multiplying a 10⁹ × 10⁹ matrix over and over. With diagonalization, it collapses into one formula: Aⁿ = PDⁿP⁻¹, where Dⁿ is just individual numbers raised to the nth power. Diagonalization is the art of translating a matrix into a language where multiplication becomes nearly free.

Same matrix. Same computation. Apply it 10 times, 100, a million. In the *wrong* basis the numbers explode or round away. In the *right* basis - the **eigenbasis** - iteration is direct: raise diagonal entries to a power and the work is done. Diagonalization is the art of switching coordinates until the problem is already solved. It is the reason PCA, Markov chains, and linear ODE solvers work at all.

What is the main point of the concept "Diagonalization: A Matrix in Its Own Mirror"?

Comprehension check for the concept.

Geometric meaning: three steps instead of one matrix

Any linear transformation, viewed in the eigenvector basis, is just stretching along axes. The decomposition A = PDP⁻¹ has a clean geometric reading:

A = P · D · P⁻¹ STEP 1 - P⁻¹: move into the eigenvector basis The vector x is rewritten in coordinates of the eigenvectors. A rotation that aligns the axes with the eigendirections. STEP 2 - D: scale along the axes Each axis is multiplied by its eigenvalue λᵢ. D = diag(λ₁, λ₂, ..., λₙ) - a diagonal matrix. The cheapest of all linear transformations. STEP 3 - P: return to standard coordinates The result is translated back into ordinary coordinates. SUMMARY: A·x = P · diag(λ) · P⁻¹ · x - not one opaque matrix, but three understandable operations

**Key property of the scene**: A·v₁ lies exactly along v₁, A·v₂ along v₂. Direction is unchanged; only length changes. That is the definition of an eigenvector - and therefore in these directions the matrix behaves like multiplication by a single number.

What is the main point of the concept "Geometric meaning: three steps instead of one matrix"?

Comprehension check for the concept.

Full example: A = PDP⁻¹ step by step

A = [ 4 2 ] [ 1 3 ] 1. EIGENVALUES (from det(A - λI) = 0): det[ 4-λ 2 ] = (4-λ)(3-λ) - 2 = λ² - 7λ + 10 = 0 [ 1 3-λ ] λ₁ = 5, λ₂ = 2 2. EIGENVECTORS: λ₁=5: (A - 5I)v = 0 --> [-1 2][x] = 0 --> x = 2y --> v₁ = [2, 1] λ₂=2: (A - 2I)v = 0 --> [ 2 2][x] = 0 --> x = -y --> v₂ = [1,-1] 3. DECOMPOSITION MATRICES: P = [ 2 1 ] D = [ 5 0 ] P⁻¹ = [ 1/3 1/3 ] [ 1 -1 ] [ 0 2 ] [ 1/3 -2/3 ] 4. VERIFICATION A = PDP⁻¹: P·D = [ 10 2 ] (P·D)·P⁻¹ = [ 4 2 ] = A ✓ [ 5 -2 ] [ 1 3 ]

What is the main point of the concept "Full example: A = PDP⁻¹ step by step"?

Comprehension check for the concept.

Matrix powers: Aⁿ in O(n) instead of O(n³·k)

The main practical payoff of diagonalization is cheap matrix exponentiation. This is where PageRank, Markov chains, and many recurrence formulas get their efficiency.

A = PDP⁻¹ A² = (PDP⁻¹)(PDP⁻¹) = PD(P⁻¹P)DP⁻¹ = PD²P⁻¹ A³ = PD²P⁻¹ · PDP⁻¹ = PD³P⁻¹ ... Aⁿ = PDⁿP⁻¹ And Dⁿ is immediate - a diagonal matrix raised to a power: D = diag(λ₁, λ₂, ..., λₖ) Dⁿ = diag(λ₁ⁿ, λ₂ⁿ, ..., λₖⁿ) EXAMPLE - A¹⁰⁰ for our matrix: D¹⁰⁰ = diag(5¹⁰⁰, 2¹⁰⁰) A¹⁰⁰ = P · diag(5¹⁰⁰, 2¹⁰⁰) · P⁻¹ Instead of 100 matrix multiplications - 2 numbers raised to a power.

What is the main point of the concept "Matrix powers: Aⁿ in O(n) instead of O(n³·k)"?

Comprehension check for the concept.

ML application: PageRank and Markov chains

Google's PageRank is the stationary distribution of a Markov chain on the web graph. The stationary distribution is the limit of Aⁿ·x as n goes to infinity. The power iteration algorithm literally multiplies by A again and again - diagonalization explains why it converges and how fast.

A stochastic transition matrix P has: λ₁ = 1 (always - the stationary state) |λ₂|, ..., |λₙ| < 1 (remaining eigenvalues are strictly inside the unit circle) Then Pⁿ as n grows: Pⁿ = P · diag(1ⁿ, λ₂ⁿ, ...) · P⁻¹ --> P · diag(1, 0, 0, ...) · P⁻¹ --> the stationary distribution Convergence rate is determined by |λ₂| - the second largest eigenvalue. Smaller |λ₂| means faster convergence of PageRank.

**Where diagonalization saves the day**: power iteration for the real PageRank needs ~50-100 iterations on a 10⁹ × 10⁹ matrix. Knowing the spectrum (|λ₂| <= 0.85 by Google's design) guarantees convergence. Understanding the spectrum means understanding why anything converges at all.

What is the main point of the concept "ML application: PageRank and Markov chains"?

Comprehension check for the concept.

Spectral decomposition of symmetric matrices: PCA and the Gram matrix

Symmetric matrices (A = Aᵀ) have a special property: their eigenvectors are always orthogonal and eigenvalues are always real. This is the basis of the **spectral decomposition** - a special case of diagonalization where P is orthogonal (P⁻¹ = Pᵀ).

For a symmetric A = Aᵀ: A = Q · Λ · Qᵀ where: Q - orthogonal matrix (columns are orthonormal eigenvectors) Λ = diag(λ₁, ..., λₙ) - real eigenvalues Qᵀ = Q⁻¹ (inversion is free!) ADVANTAGES over general diagonalization: - λᵢ are always real (can be sorted in descending order) - Q is orthogonal: inversion = transposition - Geometrically: pure stretch/shrink along orthogonal axes EXAMPLE - the Gram matrix XᵀX is symmetric and positive semi-definite: Xᵀ·X = Q · Λ · Qᵀ All λᵢ >= 0

**Why `np.linalg.eigh` instead of `eig`**: `eigh` knows the matrix is symmetric and uses Jacobi or Lanczos algorithms - 10-100x faster and numerically more accurate. It always returns real values. For symmetric matrices in ML code, always use `eigh`.

What is the main point of the concept "Spectral decomposition of symmetric matrices: PCA and the Gram matrix"?

Comprehension check for the concept.

When diagonalization is impossible

Not every matrix is diagonalizable. The condition is having n linearly independent eigenvectors.

Matrix type	Diagonalizable?	Reason
Distinct eigenvalues	Yes	Distinct eigenvalues give independent eigenvectors (theorem)
Symmetric (A = Aᵀ)	Yes (orthogonally)	Spectral theorem
Repeated lambda, geometric = algebraic multiplicity	Yes	Enough independent eigenvectors
Repeated lambda, geometric < algebraic	No (Jordan block)	Not enough eigenvectors
Rotation matrix (except 0 and 180 degrees)	No (over R)	No real eigenvalues

**The Jordan block [[1,1],[0,1]]** is the canonical non-diagonalizable example. It has lambda=1 with algebraic multiplicity 2 but only one eigenvector (1,0). The closest analog is the Jordan normal form. In ML practice, such matrices are rare.

Diagonalization in industry

Where spectral methods run right now

Component	Role	Details
PCA (Principal Component Analysis)	Spectral decomposition of the covariance matrix	sklearn.decomposition.PCA, TruncatedSVD for sparse data
Google PageRank	Stationary vector of the transition matrix = eigenvector at lambda=1	Power iteration + damping factor 0.85; 50-100 iterations to converge
Spectral clustering	Eigenvectors of the graph Laplacian	sklearn.cluster.SpectralClustering; cuts graphs along weak connections
Fast matrix exponentiation	Aⁿ = PDⁿP⁻¹ - raise diagonal elements to a power	Markov chains, Fibonacci recurrences, dynamical systems analysis
Stability analysis	Sign of Re(lambda) determines system stability	Re(lambda) < 0 stable, Re(lambda) > 0 unstable; used in reinforcement learning

What is the main point of the concept "When diagonalization is impossible"?

Comprehension check for the concept.

Practice: diagonalization for population dynamics

Interview questions

How does the spectral decomposition A = QΛQᵀ differ from the general diagonalization A = PDP⁻¹?

- Spectral decomposition applies only to symmetric matrices - Q is orthogonal: Q⁻¹ = Qᵀ, making inversion essentially free - Eigenvalues are guaranteed real and non-negative (for positive semi-definite matrices) - In ML, symmetric matrices are everywhere: XᵀX, covariance matrices, graph Laplacian

Why does PageRank converge, and how does convergence speed relate to eigenvalues?

- A stochastic matrix has lambda-1 = 1; all others satisfy |lambda-i| < 1 - Pⁿ converges to the projection onto the lambda=1 eigenspace (stationary distribution) - Convergence rate equals |lambda-2| - smaller means faster - Google's damping factor 0.85 guarantees |lambda-2| <= 0.85 - 50 iterations suffice

What is the connection between diagonalization and PCA?

- The covariance matrix C = XᵀX/(n-1) is symmetric and positive semi-definite - PCA is the spectral decomposition C = QΛQᵀ - Columns of Q are principal components (directions of maximum variance) - Eigenvalues lambda-i are the variance along each component - Dropping small lambda-i gives data compression with minimal information loss

What is the main point of the concept "Practice: diagonalization for population dynamics"?

Comprehension check for the concept.

Key takeaways

**A = PDP⁻¹**: P holds eigenvectors as columns, D is diagonal with eigenvalues
**Aⁿ = PDⁿP⁻¹**: matrix power = scalar powers on the diagonal - the engine of PageRank and Markov chains
**Symmetric A**: always diagonalizable as A = QΛQᵀ with orthogonal Q; inversion is free
**PCA** = spectral decomposition of the covariance matrix; each eigenvalue is the explained variance
**Not diagonalizable**: repeated eigenvalues with a deficit of eigenvectors (Jordan blocks); rare in ML
**In practice**: `np.linalg.eigh` for symmetric matrices (fast, real output), `eig` for general ones

Connections

Diagonalization bridges algebra and industry

Eigenvectors — The foundation of diagonalization - impossible without them
SVD — Generalization of diagonalization to rectangular matrices - the most powerful tool in ML
PCA — Direct application of spectral decomposition to dimensionality reduction
Determinant — det(A) = product of all eigenvalues

Diagonalization: A Matrix in Its Own Mirror

Geometric meaning: three steps instead of one matrix

Geometric meaning: three steps instead of one matrix

Full example: A = PDP⁻¹ step by step

Full example: A = PDP⁻¹ step by step

Matrix powers: Aⁿ in O(n) instead of O(n³·k)

Matrix powers: Aⁿ in O(n) instead of O(n³·k)

ML application: PageRank and Markov chains

ML application: PageRank and Markov chains

Spectral decomposition of symmetric matrices: PCA and the Gram matrix

Spectral decomposition of symmetric matrices: PCA and the Gram matrix

When diagonalization is impossible

When diagonalization is impossible

Diagonalization in industry

Practice: diagonalization for population dynamics

Practice: diagonalization for population dynamics

Interview questions

Key takeaways

Connections

Связанные уроки