Linear Algebra
Cholesky Decomposition and Schur Form
Gaussian elimination on a covariance matrix can produce negative eigenvalues due to floating-point errors, crashing your Kalman filter. Cholesky decomposition is the solution: it exists if and only if the matrix is positive definite, and it tells you so by failing gracefully rather than silently corrupting results.
- Kalman filters (GPS, robotics, finance): the covariance update $P = FPF^T + Q$ must stay positive definite. Cholesky decomposition is used both to check this and to solve the linear systems efficiently in $O(n^3/3)$ instead of $O(n^3)$.
- Gaussian process regression (scikit-learn GPR): fitting a GP requires solving $K^{-1}y$ where $K$ is a covariance matrix. The implementation uses Cholesky: $K = LL^T$, then solves two triangular systems - faster and numerically stable.
- Monte Carlo simulation of correlated assets: to sample $x \sim \mathcal{N}(0, \Sigma)$, compute $L = \mathrm{chol}(\Sigma)$ and set $x = Lz$ where $z \sim \mathcal{N}(0, I)$. This is exact and runs in $O(n^2)$ per sample after the $O(n^3)$ one-time factorization.
Цели урока
- Compute Cholesky decomposition $A = LL^T$ and use it to solve linear systems efficiently
- State the Schur decomposition theorem and connect it to the spectral theorem for normal matrices
- Identify when a matrix is normal and explain what the Schur form reveals about its spectrum
Предварительные знания
- Eigenvalues and eigenvectors, diagonalization
- Positive definite matrices and their characterizations
- QR decomposition and Gram-Schmidt orthogonalization
Cholesky decomposition: the SPD workhorse
For a symmetric positive definite (SPD) matrix $A$, the Cholesky decomposition $A = LL^T$ exists and is unique when $L$ has positive diagonal entries. The algorithm proceeds column by column: $L_{jj} = \sqrt{A_{jj} - \sum_{k<j} L_{jk}^2}$ and $L_{ij} = (A_{ij} - \sum_{k<j} L_{ik}L_{jk})/L_{jj}$ for $i > j$. Failure (square root of a negative number) proves $A$ is not positive definite - a feature, not a bug.
Schur decomposition and normal matrices
Every square matrix $A$ has a Schur decomposition $A = QUQ^*$ where $Q$ is unitary and $U$ is upper triangular. The diagonal entries of $U$ are the eigenvalues of $A$. This always exists (over $\mathbb{C}$) unlike diagonalization. A matrix is normal if $AA^* = A^*A$. For normal matrices, the Schur form is diagonal: $U = \Lambda$, recovering the spectral theorem. Hermitian (symmetric real) matrices are normal with real eigenvalues.
The Schur decomposition underlies numerical eigenvalue algorithms. The QR algorithm (Francis, 1961) iteratively applies QR decompositions to drive $A$ toward upper triangular form - convergence is guaranteed and the diagonal entries converge to eigenvalues. LAPACK's `dgehrd`/`dhseqr` routines implement this for general matrices.
Cholesky decomposition is only defined for symmetric positive definite matrices. A common pitfall: numerically computed covariance matrices $\hat{\Sigma} = X^TX/n$ may have tiny negative eigenvalues due to floating-point rounding when $n < p$ (more features than samples). Fix: add a small regularization $\hat{\Sigma} + \varepsilon I$ before Cholesky.
André-Louis Cholesky (1875-1918) developed his decomposition while working on ge
André-Louis Cholesky (1875-1918) developed his decomposition while working on geodetic surveying in Crete for the French military. He died in WWI before publishing; his method was posthumously published by his colleague Benoit in 1924. The Schur decomposition is named after Issai Schur (1909), whose work on matrix theory influenced quantum mechanics - the Schur lemma is fundamental in representation theory.
Cholesky Decomposition
Stan (Harvard's statistical compiler) and TensorFlow Probability use Cholesky decomposition for all multivariate normal sampling. The Newton-CG optimizer in scipy relies on it too. In 2023, DeepMind's AlphaFold 3 applied Cholesky to covariance matrices of size 10^5 x 10^5.
For which class of matrices does Cholesky decomposition always exist?
Cholesky requires A = A^T with A > 0. Existence is checked by attempting the factorization: a non-positive diagonal entry signals failure.
Schur Decomposition and Normal Matrices
The Schur form underpins modern eigenvalue algorithms. LAPACK dgees (used by MATLAB and Julia) computes the Schur form for general matrices. The numerical stability of the algorithm follows from the fact that the similarity transformation Q is orthogonal.
What is the necessary and sufficient condition for the Schur form of A to be diagonal?
T diagonal iff A normal: A*A = AA*. The class of normal matrices includes unitary, symmetric, and skew-symmetric matrices.
Solving a Bayesian linear regression system
In Bayesian linear regression the posterior precision is $A = X^TX/\sigma^2 + \Lambda_0^{-1}$ (SPD). To compute the posterior mean $\mu = A^{-1}b$: (1) compute $L = \mathrm{chol}(A)$ in $O(n^3/3)$; (2) solve $Ly_1 = b$ by forward substitution in $O(n^2)$; (3) solve $L^Ty_2 = y_1$ by back substitution in $O(n^2)$. Total: half the flops of LU decomposition, with guaranteed numerical stability for SPD matrices.
Итоги
- Cholesky $A = LL^T$ is the fastest and most numerically stable solver for SPD systems, used in Kalman filters, Gaussian processes, and Monte Carlo sampling.
- Schur decomposition $A = QUQ^*$ always exists and reduces to the spectral theorem for normal matrices.
- Normality ($AA^* = A^*A$) is the condition that guarantees orthogonal diagonalizability.
Connections to other topics
Positive definite matrices (la-21) are exactly the matrices where Cholesky succeeds. The spectral theorem (la-22) is the special case of Schur decomposition for symmetric matrices. In optimization, SPD Hessians guarantee convexity - Cholesky is used to check this and to run Newton's method efficiently.
- La 21 — related
- La 22 — related
Вопросы для размышления
- The Cholesky factor $L$ satisfies $|\det A| = (\prod_i L_{ii})^2$. This gives a fast way to compute the log-determinant of a covariance matrix - a quantity that appears in the log-likelihood of a multivariate Gaussian. Why is computing $\log|\Sigma|$ via Cholesky more numerically stable than via eigendecomposition?