Linear Algebra
QR Decomposition: Orthogonality as a Superpower
Solving linear regression via the normal equations (XᵀX)w = Xᵀy squares the condition number: κ(XᵀX) = κ(X)². With κ(X) = 10⁷, that is catastrophic for float64. numpy.linalg.lstsq uses QR decomposition precisely to avoid this - working directly with X instead of XᵀX.
- numpy.linalg.lstsq: internally uses QR - numerically twice as stable as the normal equations
- Transformer weight initialization: QR orthogonalization of weight matrices
- QR iteration: the standard algorithm for computing all eigenvalues of a matrix
- Ridge regression: solving via QR without loss of numerical precision
- Recursive least squares: online QR update for streaming data
**Linear regression on 100 000 samples.** The textbook says: w = (X^TX)^{-1}X^Ty. NumPy does something else entirely. `np.linalg.lstsq` computes a QR decomposition internally and never forms X^TX. The reason is numerical: squaring the condition number of X turns a manageable problem into an unstable one. QR sidesteps this. That is why, over 60 years, QR decomposition became the industry standard - not for elegance, but for stability on real data.
Numerical engineers share an instinct: **when in doubt, reach for an orthogonal transformation**. They don't amplify floating-point errors. They preserve vector lengths. They are always invertible. **QR** is the formal version of that instinct: *any matrix = something well-behaved (Q) times something simple (R)*. Which is why sklearn's LinearRegression, scipy's eigensolver, and every stable LLM fine-tuner reach for QR underneath.
What QR Decomposition Is
Any m x n matrix A (m >= n) can be written as a product of two matrices with special structure:
**The key property of Q**: multiplying by an orthogonal matrix preserves vector lengths and angles. This is exactly why QR is numerically stable - Q does not amplify rounding errors, it just rotates them.
Gram-Schmidt: Building Q by Hand
The classical algorithm for computing Q is the Gram-Schmidt process: take the columns of A one by one and progressively make them orthonormal.
Classical Gram-Schmidt loses orthogonality on poorly conditioned matrices due to accumulated rounding errors. In practice, NumPy uses Householder reflections - they are numerically superior.
Householder: How NumPy Actually Does It
Production QR implementations use Householder reflections. The idea: a sequence of reflections H1, H2, ... zeros out the below-diagonal entries column by column, turning A into upper triangular R.
ML Application 1: Why lstsq Beats the Normal Equation
The normal equation for linear regression is w = (X^TX)^{-1}X^Ty. Elegant, but dangerous. If X has condition number kappa, then X^TX has kappa^2. Poorly conditioned data - multicollinearity - turns into catastrophically bad coefficients.
sklearn's LinearRegression also uses lstsq via scipy, which calls LAPACK's dgelsd (SVD-based) or dgelsy (QR-based). The explicit (X^TX)^{-1} formula is a textbook artifact - no production code uses it.
ML Application 2: QR Iteration for Eigenvalues
scipy.linalg.eig finds all eigenvalues of a square matrix. Under the hood: the QR iteration algorithm. Successive QR decompositions drive the matrix toward Schur form - a quasi-upper-triangular matrix where eigenvalues appear on the diagonal.
In practice, QR with shifts (Francis algorithm) is used - this accelerates convergence from cubic to near-quadratic for symmetric matrices. It has been the standard since 1961 and is implemented in LAPACK, NumPy, and MATLAB.
ML Application 3: Orthogonal Weight Initialization
When initializing deep network weights from a normal distribution, there is a compounding problem: matrices multiply across layers, and if singular values exceed 1, gradients explode; if they fall below 1, gradients vanish. Orthogonal initialization fixes this - since Q * Q^T = I, all singular values are exactly 1.
Orthogonal initialization matters most for recurrent networks (LSTM, GRU), where the weight matrix is applied hundreds of times per sequence. Xavier/He initialization controls variance but makes no guarantee about singular values. QR initialization delivers both properties.
LU vs QR: When to Use Which
Where QR decomposition is running right now Components: numpy.linalg.lstsq, scipy.linalg.eig / eigh, torch.nn.init.orthogonal_, sklearn LinearRegression
Practice: Least Squares Fit
- Why does numpy.linalg.lstsq avoid the normal equation (X^TX)^{-1}X^Ty? - Why does swapping factors from Q*R to R*Q in QR iteration drive the matrix toward Schur form? - When does orthogonal initialization matter more than Xavier/He initialization?
QR Factorization and Orthogonalization
Eigen (C++) and NumPy solve least squares via QR -- used by every regression model and OLS estimator. Google PageRank and JPEG compression rely on QR decomposition. Running on billions of devices daily.
Why is QR factorization more numerically stable than the normal equations A^T A x = A^T b?
Normal equations square the condition number: kappa(A^T A) = kappa(A)^2. QR works directly with A, preserving kappa(A).
QR Iteration for Eigenvalues
LAPACK dsyev -- the standard for symmetric eigenvalue problems -- uses QR iteration. MATLAB eig(), numpy.linalg.eig() -- all run on QR. Every quantum chemistry package (Gaussian, ORCA) uses this algorithm for molecular orbital computations.
Why do all matrices A_k in QR iteration share the same eigenvalues?
A_{k+1} = Q_k^T A_k Q_k is a similarity transform. Similar matrices share the same characteristic polynomial and hence the same eigenvalues.
A = Q * R A - original m x n matrix Q - orthogonal m x m matrix: Q^T * Q = I R - upper triangular m x n matrix Example for a 3 x 2 matrix: A = | 1 1 | Q = | 1/sqrt(2) 1/sqrt(6) | R = | sqrt(2) 1/sqrt(2) | | 1 0 | | 1/sqrt(2) -1/sqrt(6) | | 0 sqrt(3/2) | | 0 1 | | 0 2/sqrt(6) | Q^T * Q = I (columns of Q are orthonormal)
Given: columns a1, a2, ..., an of matrix A Step 1: e1 = a1 / ||a1|| Step 2: u2 = a2 - (a2 . e1) * e1 // subtract projection onto e1 e2 = u2 / ||u2|| Step k: uk = ak - sum_{i<k} (ak . ei) * ei ek = uk / ||uk|| Matrix Q = [e1 | e2 | ... | en] Matrix R: R[i][j] = ai . ej (dot products) Why it works: each new vector is the old column minus its projections onto the already-built orthonormal set. The remainder is perpendicular to everything before it.
Reflection matrix: H = I - 2 * v*v^T / (v^T * v) Where v is chosen so that H*a = -sign(a1)*||a||*e1 (reflects vector a onto the first coordinate axis) Process: H1 * A zeros out everything below [1,1] H2 * H1 * A zeros out everything below [2,2] (in submatrix 2:n, 2:n) ... Hk * ... * H1 * A = R (upper triangular) Then Q = H1 * H2 * ... * Hk (each H is orthogonal) Key advantage: each H is orthogonal => rounding error accumulates as O(eps*n), not O(eps*n^2) as in classical Gram-Schmidt.
Problem: min ||Xw - y||^2 Normal equation (DANGEROUS with multicollinearity): X^T X w = X^T y => kappa(X^T X) = kappa(X)^2 Via QR (SAFE): X = QR Q^T Q R w = Q^T y R w = Q^T y => kappa(R) = kappa(X) R is upper triangular - solved by back substitution in O(n^2). Condition number is NOT squared.
Given: matrix A0 Iteration k: Ak = Qk * Rk // QR decomposition of the current matrix A(k+1) = Rk * Qk // swap the factors As k -> infinity: Ak -> upper triangular (Schur form) Eigenvalues = diagonal entries Eigenvectors = columns of Q1 * Q2 * ... * Qk Why it works: A(k+1) = Rk * Qk = Qk^T * Ak * Qk This is an orthogonal similarity transformation - same eigenvalues. Iterations progressively isolate invariant subspaces, like a power method running for all eigenvectors simultaneously.
| Task | Method | Why |
|---|---|---|
| Solve Ax = b (square A) | LU | Faster: O(n^3) with smaller constant |
| Linear regression / lstsq | QR | Does not square kappa |
| All eigenvalues | QR iteration | The only production-grade method |
| A few principal components | Truncated SVD | Direct access to singular vectors |
| Neural network weight init | QR | Guarantees sv = 1.0 |
Key Takeaways
- **A = QR**: Q orthogonal (Q^T Q = I), R upper triangular - the entire formula
- **Orthogonal matrices do not amplify errors** - multiplying by Q rotates rounding errors, never scales them
- **lstsq via QR**: kappa(R) = kappa(X), not kappa(X)^2 as with the normal equation
- **QR iteration** is the production standard for finding all eigenvalues (scipy.linalg.eig)
- **Orthogonal weight initialization** via QR guarantees sv = 1.0 at the start of training
- Householder reflections are numerically superior to Gram-Schmidt - they are what LAPACK/NumPy actually use
What Comes Next
QR is a building block for more powerful decompositions
- SVD — Singular value decomposition is built through two QR iterations; generalizes QR to rectangular matrices
- Eigenvalues — QR iteration is the main eigensolver algorithm; Schur form is the intermediate result
- Numerical Stability — Condition number and loss of precision - the main reason to prefer QR over LU