Linear Algebra

Systems of linear equations: from school to TensorFlow

Linear regression, the Kalman filter, least squares fitting - they all reduce to solving a system of linear equations. Knowing when a system has one solution, none, or infinitely many is knowing the foundation of half the methods in machine learning.

Linear regression: normal equations (XᵀX)w = Xᵀy - a linear system for the weights
Computer graphics: finding a ray-plane intersection - a 3-equation system
Electrical engineering: Kirchhoff's laws for a circuit with n nodes - n equations
Economics: Walrasian market equilibrium - a large linear system
GPS: position from four satellite range equations

Systems of linear equations: from school to TensorFlow

**Every second, numpy.linalg.solve is called millions of times worldwide** - inside physics simulators, financial models, neural networks, and computer vision pipelines. Behind every call is exactly one equation: **Ax = b**. The transformation A is known, the result b is known, the unknown is x. A school system of equations is the same thing written row by row. The matrix form allows solving it in O(n²) instead of O(n³) and makes it readily parallelizable on a GPU.

Open a Jupyter notebook, type `np.linalg.solve(A, b)` - one line, done. Today that line gets cracked open. Three practical questions drop out: why `inv(A) @ b` is the slow, dangerous cousin of `solve`, what to actually do when a real ML dataset has 90% more equations than unknowns, and how conjugate gradient lets a matrix be solved without ever fitting it into RAM.

What is the key idea of the concept 'Systems of linear equations: from school to TensorFlow'?

Check that the concept material has been understood.

Three forms of the same equation

1. CLASSICAL (school-style): 2x + 3y - z = 5 x - y + 2z = 1 3x + y + z = 4 2. MATRIX FORM: A * x = b | 2 3 -1 | | x | | 5 | | 1 -1 2 | * | y | = | 1 | | 3 1 1 | | z | | 4 | 3. COLUMN FORM (linear combination of columns): | 2 | | 3 | | -1 | | 5 | x * | 1 | + y * | -1 | + z * | 2 | = | 1 | | 3 | | 1 | | 1 | | 4 | The column form asks: "Can b be written as a linear combination of A's columns?" The coefficients x, y, z are exactly the solution.

**The column picture is more powerful**: it connects directly to matrix rank and the concept of column space. If b lies in the span of A's columns - a solution exists. If not - no solution exists. This is the geometric foundation of the consistency theory for linear systems.

What is the key idea of the concept 'Three forms of the same equation'?

Check that the concept material has been understood.

Number of solutions: three cases

Case	Geometry (3D)	Condition	In ML
One solution	Three planes meet at a point	det(A) != 0, rank = n	Exact fit (rare)
Infinitely many	Planes intersect along a line	det(A) = 0, b in image of A	Underdetermined (n < d features)
No solutions	Planes have no common point	b outside image of A	Overdetermined - use lstsq!

What is the key idea of the concept 'Number of solutions: three cases'?

Check that the concept material has been understood.

np.linalg.solve vs inv: why inv is slower

The naive approach to solving Ax = b is to compute A⁻¹ and multiply: x = A⁻¹ * b. This is the **wrong** approach in production code. np.linalg.solve(A, b) is faster and more accurate because it uses LU decomposition directly, without ever forming the inverse matrix.

**Why solve is faster**: np.linalg.inv builds the full inverse matrix in O(n^3), then multiplies A^-1 * b in O(n^2). np.linalg.solve performs LU decomposition with pivoting and solves in a single pass - no extra matrix-matrix multiplication and fewer rounding errors. The underlying LAPACK function is dgesv. When multiple right-hand sides b1...b10 are needed, solve(A, B) reuses the same LU factorization for all of them - another win.

What is the key idea of the concept 'np.linalg.solve vs inv: why inv is slower'?

Check that the concept material has been understood.

Overdetermined systems: lstsq and the normal equation

In ML there are almost always **more equations than unknowns**: 10000 training examples, 100 features. An exact solution does not exist - b does not lie in the column space of A. The fix: find x* that **minimizes** ||Ax - b||². This is the ~least squares~{least squares: min ||Ax - b||²} problem.

Problem: min ||Ax - b||^2 for A of size (m x n), m > n Theorem: the minimum is achieved at A^T A * x* = A^T b If A^T A is invertible (true when A has full column rank): x* = (A^T A)^-1 A^T b THIS IS exactly linear regression! X = feature matrix (n_samples x n_features) y = label vector w* = (X^T X)^-1 X^T y PROBLEM with the normal equation: X^T X can be ill-conditioned (kappa >> 1) -> use scipy.linalg.lstsq or sklearn LinearRegression (internally they use SVD, not explicit inversion)

**When lstsq beats the normal equation**: always, when features are potentially collinear or the feature matrix is ill-conditioned. lstsq uses SVD internally - even for rank-deficient matrices it returns the minimum-norm solution ||x*||. sklearn LinearRegression does exactly the same thing.

What is the key idea of the concept 'Overdetermined systems: lstsq and the normal equation'?

Check that the concept material has been understood.

LU, Cholesky, QR: choosing the right algorithm

np.linalg.solve picks an algorithm automatically, but knowing which one it uses helps diagnose problems and select scipy functions explicitly for better performance.

Algorithm	When to use	Complexity	Example
LU (DGESV)	General square matrix	O(n^3/3)	np.linalg.solve by default
Cholesky	Symmetric positive definite	O(n^3/6)	scipy.linalg.cho_solve - Ridge, GP
QR	Overdetermined systems	O(mn^2)	scipy.linalg.lstsq for m >> n
SVD	Rank-deficient, pseudoinverse	O(mn^2)	np.linalg.lstsq internally, LoRA

What is the key idea of the concept 'LU, Cholesky, QR: choosing the right algorithm'?

Check that the concept material has been understood.

Conjugate Gradient: when the matrix does not fit in memory

LU decomposition requires O(n^3) operations and O(n^2) memory. At n = 10^6 (typical for graph problems, PDEs, large sparse systems) this is infeasible. The ~conjugate gradient method~{conjugate gradient: iterative method for SPD systems} (CG) solves Ax = b iteratively: each step costs O(n) multiplications and it converges in at most n steps.

Problem: min f(x) = (1/2) x^T A x - b^T x Gradient: grad f(x) = Ax - b = 0 -> this is Ax = b Solving the system IS the same as minimizing f(x)! CG treats it as an optimization problem. EACH STEP: r_k = b - A * x_k (residual) p_k = -r_k + beta * p_{k-1} (conjugate direction) x_{k+1} = x_k + alpha_k * p_k WHERE IT IS USED: scipy.sparse.linalg.cg(A, b) - sparse SPD matrices PyTorch L-BFGS optimizer - NN training with approximate CG Gaussian Process posterior - GPyTorch uses CG instead of Cholesky

**Preconditioned CG in neural network optimization**: plain gradient descent is CG with zero memory of past directions. L-BFGS and Hessian-free optimization use approximations of conjugate directions. This is why L-BFGS in sklearn.LogisticRegression converges faster than SGD on tabular data - it accounts for the curvature of the loss surface.

What is the key idea of the concept 'Conjugate Gradient: when the matrix does not fit in memory'?

Check that the concept material has been understood.

Homogeneous systems: the null space of a matrix

A ~homogeneous system~{homogeneous system: Ax = 0} asks "which vectors does A map to zero?". The set of all such vectors is called the ~null space~{null space / kernel} of A.

Ax = 0 always has the trivial solution x = 0. IF det(A) != 0 -> only x = 0 (null space is trivial) IF det(A) = 0 -> nontrivial null space (infinitely many vectors) GEOMETRY: Projection onto X-axis: null space = entire Y-axis Rotation: null space = only {0} ML CONNECTION: The null space of a weight matrix in a neural network = directions the layer cannot see at all. A nontrivial null space is why underdetermined systems have infinitely many solutions - regularization (L2 / L1 / dropout) picks one specific solution.

Ax = b in ML infrastructure

Where solving systems is happening right now

Component	Role	Details
sklearn LinearRegression	lstsq via SVD under the hood	Normal equation X^T X w = X^T y, solved via scipy.linalg.lstsq with rcond threshold
Ridge / Lasso Regression	(X^T X + lambda I)w = X^T y - always SPD, Cholesky	Ridge: analytic, Lasso: ADMM / coordinate descent (cannot be solved as a plain linear system)
Gaussian Process posterior	mu* = K(X*,X)(K(X,X)+sigma^2 I)^-1 y - SPD, CG in GPyTorch	GPyTorch uses CG with preconditioning instead of Cholesky - O(n) memory vs O(n^2)
Kalman filter	State prediction: P = A P A^T + Q (SPD, Cholesky)	Drones, object tracking, SLAM - hundreds of Ax=b solves per second in real time

What is the key idea of the concept 'Homogeneous systems: the null space of a matrix'?

Check that the concept material has been understood.

Practice: solve a linear system

Interview questions

Why is np.linalg.solve(A, b) better than np.linalg.inv(A) @ b? When is inv actually needed?

- solve uses LU directly - O(n^3/3), never builds the inverse matrix - inv(A) @ b = two steps: O(n^3) + O(n^2) - extra multiplications and rounding - With 100 different b: solve(A, B) runs LU once and does 100 forward-back substitutions O(n^2) - inv is needed only when A^-1 itself is an object of interest - rare in practice

A dataset has 10000 samples and 500 features. The matrix X^T X is ill-conditioned. How to solve linear regression reliably?

- np.linalg.lstsq(X, y) uses SVD internally - stable for any rank - Ridge adds lambda*I: (X^T X + lambda I)w = X^T y - matrix becomes SPD with kappa = (sigma_max^2+lambda)/(sigma_min^2+lambda) - Small lambda dramatically improves conditioning: sigma_min^2=0.001, lambda=0.1 reduces kappa by 100x - sklearn Ridge internally calls scipy.linalg.solve_triangular after Cholesky factorization

When is conjugate gradient preferred over direct methods like LU or Cholesky?

- When n > 10^4 and the matrix is sparse: LU needs O(n^2) memory, CG needs O(n) - Sparse matrices (graphs, PDEs): matrix-vector product A*v is cheap, CG converges in sqrt(kappa) iterations - In neural network optimization: L-BFGS is CG with approximate Hessian information - Direct solvers needed when kappa < 10^6 and storing the factorization is feasible

What is the key idea of the concept 'Practice: solve a linear system'?

Check that the concept material has been understood.

Key takeaways

**Ax = b** - one equation behind three notations; matrix form unlocks GPU and library solvers
**np.linalg.solve** is faster and more accurate than inv(A) @ b - always use it when solving a system
**Overdetermined systems** (m > n) have no exact solution; lstsq minimizes ||Ax - b||^2 via SVD
**Normal equation** A^T A x = A^T b - this is all of linear regression in one formula
**Algorithm choice**: LU - general, Cholesky - SPD (2x faster), QR/SVD - overdetermined, CG - sparse/huge
**Null space** = directions the system cannot see; underdetermined systems need regularization to pick one solution

Where to go next

Systems of equations sit at the intersection of all of linear algebra

Matrix properties — Conditioning and SPD-ness determine which solver to choose
Inverse matrix — A^-1 exists <=> det(A) != 0 <=> Ax=b has a unique solution
SVD and pseudoinverse — lstsq = A+ b via SVD; pseudoinverse generalizes solution to rank-deficient cases

Связанные уроки

nm-01

Linear Algebra

Systems of linear equations: from school to TensorFlow

Linear regression: normal equations (XᵀX)w = Xᵀy - a linear system for the weights
Computer graphics: finding a ray-plane intersection - a 3-equation system
Electrical engineering: Kirchhoff's laws for a circuit with n nodes - n equations
Economics: Walrasian market equilibrium - a large linear system
GPS: position from four satellite range equations

Systems of linear equations: from school to TensorFlow

What is the key idea of the concept 'Systems of linear equations: from school to TensorFlow'?

Check that the concept material has been understood.

Three forms of the same equation

What is the key idea of the concept 'Three forms of the same equation'?

Check that the concept material has been understood.

Number of solutions: three cases

Case	Geometry (3D)	Condition	In ML
One solution	Three planes meet at a point	det(A) != 0, rank = n	Exact fit (rare)
Infinitely many	Planes intersect along a line	det(A) = 0, b in image of A	Underdetermined (n < d features)
No solutions	Planes have no common point	b outside image of A	Overdetermined - use lstsq!

What is the key idea of the concept 'Number of solutions: three cases'?

Check that the concept material has been understood.

np.linalg.solve vs inv: why inv is slower

What is the key idea of the concept 'np.linalg.solve vs inv: why inv is slower'?

Check that the concept material has been understood.

Overdetermined systems: lstsq and the normal equation

What is the key idea of the concept 'Overdetermined systems: lstsq and the normal equation'?

Check that the concept material has been understood.

LU, Cholesky, QR: choosing the right algorithm

np.linalg.solve picks an algorithm automatically, but knowing which one it uses helps diagnose problems and select scipy functions explicitly for better performance.

Algorithm	When to use	Complexity	Example
LU (DGESV)	General square matrix	O(n^3/3)	np.linalg.solve by default
Cholesky	Symmetric positive definite	O(n^3/6)	scipy.linalg.cho_solve - Ridge, GP
QR	Overdetermined systems	O(mn^2)	scipy.linalg.lstsq for m >> n
SVD	Rank-deficient, pseudoinverse	O(mn^2)	np.linalg.lstsq internally, LoRA

What is the key idea of the concept 'LU, Cholesky, QR: choosing the right algorithm'?

Check that the concept material has been understood.

Conjugate Gradient: when the matrix does not fit in memory

What is the key idea of the concept 'Conjugate Gradient: when the matrix does not fit in memory'?

Check that the concept material has been understood.

Homogeneous systems: the null space of a matrix

A ~homogeneous system~{homogeneous system: Ax = 0} asks "which vectors does A map to zero?". The set of all such vectors is called the ~null space~{null space / kernel} of A.

Ax = b in ML infrastructure

Where solving systems is happening right now

Component	Role	Details
sklearn LinearRegression	lstsq via SVD under the hood	Normal equation X^T X w = X^T y, solved via scipy.linalg.lstsq with rcond threshold
Ridge / Lasso Regression	(X^T X + lambda I)w = X^T y - always SPD, Cholesky	Ridge: analytic, Lasso: ADMM / coordinate descent (cannot be solved as a plain linear system)
Gaussian Process posterior	mu* = K(X*,X)(K(X,X)+sigma^2 I)^-1 y - SPD, CG in GPyTorch	GPyTorch uses CG with preconditioning instead of Cholesky - O(n) memory vs O(n^2)
Kalman filter	State prediction: P = A P A^T + Q (SPD, Cholesky)	Drones, object tracking, SLAM - hundreds of Ax=b solves per second in real time

What is the key idea of the concept 'Homogeneous systems: the null space of a matrix'?

Check that the concept material has been understood.

Practice: solve a linear system

Interview questions

Why is np.linalg.solve(A, b) better than np.linalg.inv(A) @ b? When is inv actually needed?

A dataset has 10000 samples and 500 features. The matrix X^T X is ill-conditioned. How to solve linear regression reliably?

When is conjugate gradient preferred over direct methods like LU or Cholesky?

What is the key idea of the concept 'Practice: solve a linear system'?

Check that the concept material has been understood.

Key takeaways

**Ax = b** - one equation behind three notations; matrix form unlocks GPU and library solvers
**np.linalg.solve** is faster and more accurate than inv(A) @ b - always use it when solving a system
**Overdetermined systems** (m > n) have no exact solution; lstsq minimizes ||Ax - b||^2 via SVD
**Normal equation** A^T A x = A^T b - this is all of linear regression in one formula
**Algorithm choice**: LU - general, Cholesky - SPD (2x faster), QR/SVD - overdetermined, CG - sparse/huge
**Null space** = directions the system cannot see; underdetermined systems need regularization to pick one solution

Where to go next

Systems of equations sit at the intersection of all of linear algebra

Matrix properties — Conditioning and SPD-ness determine which solver to choose
Inverse matrix — A^-1 exists <=> det(A) != 0 <=> Ax=b has a unique solution
SVD and pseudoinverse — lstsq = A+ b via SVD; pseudoinverse generalizes solution to rank-deficient cases

Связанные уроки

nm-01