Linear Algebra
Iterative Methods: Conjugate Gradient and Krylov
A matrix with 10 million equations does not fit in RAM for direct LU factorization. Iterative methods solve such systems using only matrix-vector products. Conjugate Gradient underlies physics simulations, FEM solvers, and optimizers throughout scientific computing and ML.
- FEM physics: 10M-equation stiffness matrix solved by preconditioned CG in 100 iterations
- Google PageRank: power iteration - the simplest iterative eigenvector method
- PyTorch L-BFGS: quasi-Newton optimizer uses matrix-vector products with the Hessian
- Computer graphics: preconditioned CG for real-time deformable objects
- Spectral clustering: ARPACK finds eigenvalues of the Laplacian iteratively
Iterative methods in ML and science: **CG** , solving normal equations in large-scale linear regression, PDE-constrained optimization; **GMRES** , computational fluid dynamics (CFD), electromagnetic simulation; **PageRank** , power iteration (a special case of Krylov method); **Krylov in neural networks** , K-FAC (Kronecker-Factored Approximate Curvature) uses CG for Hessian-vector products in second-order optimization.
Jacobi and Gauss-Seidel
**Conjugate Gradient solves Ax=b with n=10⁶ equations in √κ(A) iterations , Google PageRank (n=3×10¹⁰) relies on exactly these iterative methods.** Direct methods (LU, QR) require O(n³) operations and dense matrix storage , impossible for million-scale systems. Iterative methods only need matrix-vector products: for sparse matrices that is O(nnz) operations instead of O(n²).
Stationary Iterative Methods
Before CG, stationary methods existed: Jacobi, Gauss-Seidel, SOR. They converge more slowly (linearly with spectral radius ρ(M)), but are simple to implement and useful as preconditioners.
How does Gauss-Seidel differ from Jacobi?
Jacobi updates each x_i from the previous x^(k). Gauss-Seidel uses the freshly computed x_1,...,x_{i-1} from the current iteration when updating x_i. This roughly doubles convergence speed and removes the need to store the old vector.
GMRES and Krylov Subspaces
Krylov Subspace Methods and GMRES
CG works only for SPD matrices. For non-symmetric systems, GMRES (Generalized Minimal RESidual) is used , it minimizes the residual in the Krylov subspace. At each iteration GMRES orthogonalizes the new vector A^k b against the already-built basis (Arnoldi process) and finds the best approximation.
What subspace does a Krylov method build after k iterations?
The Krylov subspace K_k(A, b) = span(b, Ab, ..., A^(k-1) b). GMRES finds the approximation x_k in this subspace that minimises the residual ‖b - Ax_k‖. Convergence often occurs in k << n iterations on well-conditioned problems.
Conjugate Gradient
Conjugate Gradient Method (CG)
For a symmetric positive-definite matrix A, solving Ax=b is equivalent to minimizing the quadratic function f(x). The conjugate gradient method is an optimization algorithm that chooses search directions that are A-orthogonal (conjugate): dᵢᵀAdⱼ = 0 when i≠j. This guarantees convergence in at most n steps in exact arithmetic.
Preconditioning
Preconditioning is the key technique for accelerating iterative methods. Instead of Ax=b, solve M⁻¹Ax=M⁻¹b where M≈A but M⁻¹ is cheap to compute. The ideal preconditioner: M=A, then κ=1 and convergence in 1 step. Practical preconditioners: incomplete Cholesky (IC) or incomplete LU (ILU).
Convergence Rate and Spectrum
CG convergence speed depends on the spectrum of matrix A. When eigenvalues are clustered (few distinct values), convergence is faster than the theoretical √κ bound. This is why CG for physics problems (eigenvalues cluster naturally) converges much faster than for artificial matrices with a uniform spectrum.
Iterative methods in ML and science: **CG** , solving normal equations in large-scale linear regression, PDE-constrained optimization; **GMRES** , computational fluid dynamics (CFD), electromagnetic simulation; **PageRank** , power iteration (a special case of Krylov method); **Krylov in neural networks** , K-FAC (Kronecker-Factored Approximate Curvature) uses CG for Hessian-vector products in second-order optimization.
For which class of matrices does CG provably converge in n steps in exact arithmetic?
CG minimises the quadratic form (1/2) xᵀAx - bᵀx, which is well-defined only for SPD matrices. In exact arithmetic CG converges in at most n steps; in practice the rate is ~sqrt(cond(A)) with preconditioning.