Linear Algebra
Systems of linear equations: from school to TensorFlow
Linear regression, the Kalman filter, least squares fitting - they all reduce to solving a system of linear equations. Knowing when a system has one solution, none, or infinitely many is knowing the foundation of half the methods in machine learning.
- Linear regression: normal equations (XᵀX)w = Xᵀy - a linear system for the weights
- Computer graphics: finding a ray-plane intersection - a 3-equation system
- Electrical engineering: Kirchhoff's laws for a circuit with n nodes - n equations
- Economics: Walrasian market equilibrium - a large linear system
- GPS: position from four satellite range equations
Systems of linear equations: from school to TensorFlow
**Every second, numpy.linalg.solve is called millions of times worldwide** - inside physics simulators, financial models, neural networks, and computer vision pipelines. Behind every call is exactly one equation: **Ax = b**. The transformation A is known, the result b is known, the unknown is x. A school system of equations is the same thing written row by row. The matrix form allows solving it in O(n²) instead of O(n³) and makes it readily parallelizable on a GPU.
Open a Jupyter notebook, type `np.linalg.solve(A, b)` - one line, done. Today that line gets cracked open. Three practical questions drop out: why `inv(A) @ b` is the slow, dangerous cousin of `solve`, what to actually do when a real ML dataset has 90% more equations than unknowns, and how conjugate gradient lets a matrix be solved without ever fitting it into RAM.
What is the key idea of the concept 'Systems of linear equations: from school to TensorFlow'?
Check that the concept material has been understood.
Three forms of the same equation
Three forms of the same equation
1. CLASSICAL (school-style): 2x + 3y - z = 5 x - y + 2z = 1 3x + y + z = 4 2. MATRIX FORM: A * x = b | 2 3 -1 | | x | | 5 | | 1 -1 2 | * | y | = | 1 | | 3 1 1 | | z | | 4 | 3. COLUMN FORM (linear combination of columns): | 2 | | 3 | | -1 | | 5 | x * | 1 | + y * | -1 | + z * | 2 | = | 1 | | 3 | | 1 | | 1 | | 4 | The column form asks: "Can b be written as a linear combination of A's columns?" The coefficients x, y, z are exactly the solution.
**The column picture is more powerful**: it connects directly to matrix rank and the concept of column space. If b lies in the span of A's columns - a solution exists. If not - no solution exists. This is the geometric foundation of the consistency theory for linear systems.
What is the key idea of the concept 'Three forms of the same equation'?
Check that the concept material has been understood.
Number of solutions: three cases
Number of solutions: three cases
| Case | Geometry (3D) | Condition | In ML |
|---|---|---|---|
| One solution | Three planes meet at a point | det(A) != 0, rank = n | Exact fit (rare) |
| Infinitely many | Planes intersect along a line | det(A) = 0, b in image of A | Underdetermined (n < d features) |
| No solutions | Planes have no common point | b outside image of A | Overdetermined - use lstsq! |
What is the key idea of the concept 'Number of solutions: three cases'?
Check that the concept material has been understood.
np.linalg.solve vs inv: why inv is slower
np.linalg.solve vs inv: why inv is slower
The naive approach to solving Ax = b is to compute A⁻¹ and multiply: x = A⁻¹ * b. This is the **wrong** approach in production code. np.linalg.solve(A, b) is faster and more accurate because it uses LU decomposition directly, without ever forming the inverse matrix.
**Why solve is faster**: np.linalg.inv builds the full inverse matrix in O(n^3), then multiplies A^-1 * b in O(n^2). np.linalg.solve performs LU decomposition with pivoting and solves in a single pass - no extra matrix-matrix multiplication and fewer rounding errors. The underlying LAPACK function is dgesv. When multiple right-hand sides b1...b10 are needed, solve(A, B) reuses the same LU factorization for all of them - another win.
What is the key idea of the concept 'np.linalg.solve vs inv: why inv is slower'?
Check that the concept material has been understood.
Overdetermined systems: lstsq and the normal equation
Overdetermined systems: lstsq and the normal equation
In ML there are almost always **more equations than unknowns**: 10000 training examples, 100 features. An exact solution does not exist - b does not lie in the column space of A. The fix: find x* that **minimizes** ||Ax - b||². This is the ~least squares~{least squares: min ||Ax - b||²} problem.
Problem: min ||Ax - b||^2 for A of size (m x n), m > n Theorem: the minimum is achieved at A^T A * x* = A^T b If A^T A is invertible (true when A has full column rank): x* = (A^T A)^-1 A^T b THIS IS exactly linear regression! X = feature matrix (n_samples x n_features) y = label vector w* = (X^T X)^-1 X^T y PROBLEM with the normal equation: X^T X can be ill-conditioned (kappa >> 1) -> use scipy.linalg.lstsq or sklearn LinearRegression (internally they use SVD, not explicit inversion)
**When lstsq beats the normal equation**: always, when features are potentially collinear or the feature matrix is ill-conditioned. lstsq uses SVD internally - even for rank-deficient matrices it returns the minimum-norm solution ||x*||. sklearn LinearRegression does exactly the same thing.
What is the key idea of the concept 'Overdetermined systems: lstsq and the normal equation'?
Check that the concept material has been understood.
LU, Cholesky, QR: choosing the right algorithm
LU, Cholesky, QR: choosing the right algorithm
np.linalg.solve picks an algorithm automatically, but knowing which one it uses helps diagnose problems and select scipy functions explicitly for better performance.
| Algorithm | When to use | Complexity | Example |
|---|---|---|---|
| LU (DGESV) | General square matrix | O(n^3/3) | np.linalg.solve by default |
| Cholesky | Symmetric positive definite | O(n^3/6) | scipy.linalg.cho_solve - Ridge, GP |
| QR | Overdetermined systems | O(mn^2) | scipy.linalg.lstsq for m >> n |
| SVD | Rank-deficient, pseudoinverse | O(mn^2) | np.linalg.lstsq internally, LoRA |
What is the key idea of the concept 'LU, Cholesky, QR: choosing the right algorithm'?
Check that the concept material has been understood.
Conjugate Gradient: when the matrix does not fit in memory
Conjugate Gradient: when the matrix does not fit in memory
LU decomposition requires O(n^3) operations and O(n^2) memory. At n = 10^6 (typical for graph problems, PDEs, large sparse systems) this is infeasible. The ~conjugate gradient method~{conjugate gradient: iterative method for SPD systems} (CG) solves Ax = b iteratively: each step costs O(n) multiplications and it converges in at most n steps.
Problem: min f(x) = (1/2) x^T A x - b^T x Gradient: grad f(x) = Ax - b = 0 -> this is Ax = b Solving the system IS the same as minimizing f(x)! CG treats it as an optimization problem. EACH STEP: r_k = b - A * x_k (residual) p_k = -r_k + beta * p_{k-1} (conjugate direction) x_{k+1} = x_k + alpha_k * p_k WHERE IT IS USED: scipy.sparse.linalg.cg(A, b) - sparse SPD matrices PyTorch L-BFGS optimizer - NN training with approximate CG Gaussian Process posterior - GPyTorch uses CG instead of Cholesky
**Preconditioned CG in neural network optimization**: plain gradient descent is CG with zero memory of past directions. L-BFGS and Hessian-free optimization use approximations of conjugate directions. This is why L-BFGS in sklearn.LogisticRegression converges faster than SGD on tabular data - it accounts for the curvature of the loss surface.
What is the key idea of the concept 'Conjugate Gradient: when the matrix does not fit in memory'?
Check that the concept material has been understood.
Homogeneous systems: the null space of a matrix
Homogeneous systems: the null space of a matrix
A ~homogeneous system~{homogeneous system: Ax = 0} asks "which vectors does A map to zero?". The set of all such vectors is called the ~null space~{null space / kernel} of A.
Ax = 0 always has the trivial solution x = 0. IF det(A) != 0 -> only x = 0 (null space is trivial) IF det(A) = 0 -> nontrivial null space (infinitely many vectors) GEOMETRY: Projection onto X-axis: null space = entire Y-axis Rotation: null space = only {0} ML CONNECTION: The null space of a weight matrix in a neural network = directions the layer cannot see at all. A nontrivial null space is why underdetermined systems have infinitely many solutions - regularization (L2 / L1 / dropout) picks one specific solution.
Ax = b in ML infrastructure
Where solving systems is happening right now
| Component | Role | Details |
|---|---|---|
| sklearn LinearRegression | lstsq via SVD under the hood | Normal equation X^T X w = X^T y, solved via scipy.linalg.lstsq with rcond threshold |
| Ridge / Lasso Regression | (X^T X + lambda I)w = X^T y - always SPD, Cholesky | Ridge: analytic, Lasso: ADMM / coordinate descent (cannot be solved as a plain linear system) |
| Gaussian Process posterior | mu* = K(X*,X)(K(X,X)+sigma^2 I)^-1 y - SPD, CG in GPyTorch | GPyTorch uses CG with preconditioning instead of Cholesky - O(n) memory vs O(n^2) |
| Kalman filter | State prediction: P = A P A^T + Q (SPD, Cholesky) | Drones, object tracking, SLAM - hundreds of Ax=b solves per second in real time |
What is the key idea of the concept 'Homogeneous systems: the null space of a matrix'?
Check that the concept material has been understood.
Practice: solve a linear system
Practice: solve a linear system
Interview questions
Why is np.linalg.solve(A, b) better than np.linalg.inv(A) @ b? When is inv actually needed?
- solve uses LU directly - O(n^3/3), never builds the inverse matrix - inv(A) @ b = two steps: O(n^3) + O(n^2) - extra multiplications and rounding - With 100 different b: solve(A, B) runs LU once and does 100 forward-back substitutions O(n^2) - inv is needed only when A^-1 itself is an object of interest - rare in practice
A dataset has 10000 samples and 500 features. The matrix X^T X is ill-conditioned. How to solve linear regression reliably?
- np.linalg.lstsq(X, y) uses SVD internally - stable for any rank - Ridge adds lambda*I: (X^T X + lambda I)w = X^T y - matrix becomes SPD with kappa = (sigma_max^2+lambda)/(sigma_min^2+lambda) - Small lambda dramatically improves conditioning: sigma_min^2=0.001, lambda=0.1 reduces kappa by 100x - sklearn Ridge internally calls scipy.linalg.solve_triangular after Cholesky factorization
When is conjugate gradient preferred over direct methods like LU or Cholesky?
- When n > 10^4 and the matrix is sparse: LU needs O(n^2) memory, CG needs O(n) - Sparse matrices (graphs, PDEs): matrix-vector product A*v is cheap, CG converges in sqrt(kappa) iterations - In neural network optimization: L-BFGS is CG with approximate Hessian information - Direct solvers needed when kappa < 10^6 and storing the factorization is feasible
What is the key idea of the concept 'Practice: solve a linear system'?
Check that the concept material has been understood.
Key takeaways
- **Ax = b** - one equation behind three notations; matrix form unlocks GPU and library solvers
- **np.linalg.solve** is faster and more accurate than inv(A) @ b - always use it when solving a system
- **Overdetermined systems** (m > n) have no exact solution; lstsq minimizes ||Ax - b||^2 via SVD
- **Normal equation** A^T A x = A^T b - this is all of linear regression in one formula
- **Algorithm choice**: LU - general, Cholesky - SPD (2x faster), QR/SVD - overdetermined, CG - sparse/huge
- **Null space** = directions the system cannot see; underdetermined systems need regularization to pick one solution
Where to go next
Systems of equations sit at the intersection of all of linear algebra
- Matrix properties — Conditioning and SPD-ness determine which solver to choose
- Inverse matrix — A^-1 exists <=> det(A) != 0 <=> Ax=b has a unique solution
- SVD and pseudoinverse — lstsq = A+ b via SVD; pseudoinverse generalizes solution to rank-deficient cases