Linear Algebra

Matrix properties: the numbers that decide success

If the determinant of a weight matrix is zero, the layer is degenerate and destroys information. The condition number of the Hessian governs how fast gradient-based optimizers converge. Matrix properties are not abstract theory - they are the diagnostics of any numerical system.

Neural network health: a near-zero determinant in a weight matrix signals a degenerate layer
Numerical stability: condition number κ = how much input error gets amplified in the output
Physics: the Jacobian determinant is the volume scaling factor of a transformation
Control theory: the trace of the system matrix equals the sum of its eigenvalues
Statistics: the determinant of the covariance matrix is the generalized variance

**A neural network stopped training at epoch three. Loss froze, gradients exploded.** A common cause is that the weight matrix or Hessian became ill-conditioned. The ~condition number~{condition number: kappa = sigma_max / sigma_min} grew to a billion, while float32 carries only about 7 significant decimal digits (relative precision ~10^-7), so none of them survive. That is not a bug in the code - it is a property of the matrix. Understanding matrix properties means understanding why algorithms work or break.

Three adjectives about matrices look like exam vocabulary - "symmetric", "positive definite", "well-conditioned" - but each one silently controls whether the optimizer converges, which solver a library can use, and how many bits of precision survive in GPU training. Today we strip away the vocabulary and show what these properties actually decide in production.

Condition number: the health check for a matrix

The ~condition number~{kappa = sigma_max / sigma_min, where sigma are the singular values} kappa measures how much a relative error in the input gets amplified in the output. If the input has a relative error of delta, the relative error of the solution can be as large as kappa * delta.

WELL-CONDITIONED (kappa close to 1):
  A = [[3, 0],
       [0, 1]]
  sigma_max = 3, sigma_min = 1
  kappa = 3 / 1 = 3
  Input error 0.1% -> solution error <= 0.3%

ILL-CONDITIONED (kappa >> 1):
  A = [[1.000, 0.999],
       [0.999, 1.000]]
  sigma_max ~= 2, sigma_min ~= 0.001
  kappa ~= 2000
  Input error 0.1% -> solution error can reach 200%

PRACTICAL THRESHOLD:
  float32: carries ~7 digits  -> kappa > 10^7 gives garbage
  float64: carries ~15 digits -> kappa > 10^15 gives garbage
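
The definition kappa = sigma_max / sigma_min and the bound "relative output error <= kappa * relative input error" can be checked numerically. The sketch below is only an illustration: the helper name `condition_number`, the second matrix (an assumed example with nearly parallel rows and kappa close to 2000), and the choice of right-hand side and perturbation are not from the lesson. The perturbation is deliberately aimed at the weakest direction, so the bound is reached almost exactly.

```python
import numpy as np

def condition_number(A):
    """Return kappa = sigma_max / sigma_min from the singular values of A."""
    sigma = np.linalg.svd(A, compute_uv=False)  # singular values, descending
    return sigma[0] / sigma[-1]

well = np.array([[3.0, 0.0],
                 [0.0, 1.0]])
ill = np.array([[1.0, 0.999],
                [0.999, 1.0]])  # assumed example: rows are almost parallel

kappa_well = condition_number(well)
kappa_ill = condition_number(ill)

# Worst case for `ill`: b lies along the strong direction (1, 1),
# the perturbation along the weak direction (1, -1).
b = np.array([1.0, 1.0])
db = 1e-3 * np.array([1.0, -1.0])  # 0.1% relative error in b

x = np.linalg.solve(ill, b)
x_perturbed = np.linalg.solve(ill, b + db)

rel_in = np.linalg.norm(db) / np.linalg.norm(b)
rel_out = np.linalg.norm(x_perturbed - x) / np.linalg.norm(x)
amplification = rel_out / rel_in  # close to kappa_ill in this worst case

print(kappa_well, kappa_ill, amplification)
```

For a random perturbation the amplification is usually smaller; kappa is an upper bound, not a typical value.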

**float32 vs float64 in PyTorch**: PyTorch defaults to float32 (~7 decimal digits). As the condition number kappa approaches 10^7, solving a system in float32 leaves essentially no correct digits. This is why precision-critical solves (for example with np.linalg.solve or scipy.linalg.lstsq) are often run in float64 even when the rest of the pipeline uses float32.
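
A rough way to turn the thresholds into numbers is the common rule of thumb that a solve loses about log10(kappa) decimal digits out of the ones the dtype carries. The helper below is a hypothetical sketch of that estimate (the name `digits_left` is not a NumPy function); it reads the machine epsilon from `np.finfo` and is not a rigorous error bound.

```python
import numpy as np

def digits_left(kappa, dtype):
    """Rule-of-thumb count of reliable decimal digits after a solve.

    Assumes about log10(kappa) digits are lost from the digits the dtype
    carries, which is -log10(machine epsilon).
    """
    digits_available = -np.log10(np.finfo(dtype).eps)
    return max(0.0, float(digits_available - np.log10(kappa)))

for kappa in (1e3, 1e7, 1e12):
    print(kappa, digits_left(kappa, np.float32), digits_left(kappa, np.float64))
```

With kappa = 10^7 the float32 estimate drops to zero while float64 still keeps several digits, which matches the thresholds above.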

Symmetric matrices: a special class

A matrix is ~symmetric~{symmetric matrix: A = A^T, meaning a_ij = a_ji} when A = A^T. This is not just aesthetics - real symmetric matrices always have real eigenvalues and an orthonormal basis of eigenvectors, and they admit faster and more stable algorithms.

GRAM MATRIX:
  Given a set of vectors X (n x d) - for example, a batch of embeddings.
  G = X X^T -> shape (n x n)
  G_ij = <x_i, x_j> (dot product of two vectors)
  G is ALWAYS symmetric: G_ij = <x_i, x_j> = <x_j, x_i> = G_ji

WHERE IT APPEARS:
  Kernel matrix in SVM: K_ij = k(x_i, x_j)
  Covariance matrix: Sigma = (1/n) X^T X (for centered X)
  Self-attention in transformers: QK^T (symmetric when Q = K)
  Hessian in optimization: H_ij = d^2 L / (dw_i dw_j)
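
A minimal sketch of the two constructions, assuming a small random batch `X` as a stand-in for real embeddings (the shapes and the seed are arbitrary illustration choices). It builds the Gram matrix and the covariance matrix and checks that both are symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))  # n = 5 vectors of dimension d = 3

G = X @ X.T                      # Gram matrix, shape (n, n)
gram_is_symmetric = np.allclose(G, G.T)

Xc = X - X.mean(axis=0)          # center the columns first
Sigma = Xc.T @ Xc / X.shape[0]   # covariance matrix, shape (d, d)
cov_is_symmetric = np.allclose(Sigma, Sigma.T)

print(G.shape, gram_is_symmetric, Sigma.shape, cov_is_symmetric)
```

Note the two shapes: X X^T compares samples with each other (n x n), while X^T X compares features (d x d).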

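**eigvalsh vs eig**: for symmetric matrices NumPy provides eigvalsh (eigenvalues only) and eigh (eigenvalues and eigenvectors) - specialized routines that are typically about 2x faster and numerically more stable than the general eigvals / eig, and that return real eigenvalues in ascending order. Avoid eig on a known symmetric matrix - it wastes computation and precision.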
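
The difference between the two families of routines can be seen on a small case. The matrix `S` below is an arbitrary symmetric example chosen for illustration; the sketch compares the symmetric and general eigenvalue routines and checks that the eigenvectors returned by `eigh` are orthonormal.

```python
import numpy as np

S = np.array([[4.0, 2.0],
              [2.0, 3.0]])

w_sym = np.linalg.eigvalsh(S)     # symmetric routine: real, ascending order
w_gen = np.linalg.eigvals(S)      # general routine: no ordering guarantee

w, V = np.linalg.eigh(S)          # eigenvalues and orthonormal eigenvectors
orthonormal = np.allclose(V.T @ V, np.eye(2))
reconstructed = np.allclose(V @ np.diag(w) @ V.T, S)  # S = V diag(w) V^T

print(w_sym, np.sort(w_gen.real), orthonormal, reconstructed)
```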

Positive Definite: a matrix with guarantees

A symmetric matrix A is called ~positive definite~{positive definite: x^T A x > 0 for all x != 0} (SPD, symmetric positive definite) when x^T A x > 0 for every nonzero vector x. All eigenvalues are strictly positive. This is the healthiest class of matrices in numerical methods.

A = [[4, 2],
     [2, 3]]

x^T A x:
  at x = (1, 0): 1*4*1 = 4 > 0
  at x = (0, 1): 1*3*1 = 3 > 0
  at x = (1, 1): 4+2+2+3 = 11 > 0

EIGENVALUES: lambda_1 ~= 5.56, lambda_2 ~= 1.44 - both positive -> PD

CHECK BY CHOLESKY:
  A symmetric matrix A is positive definite <=> the decomposition A = L L^T exists (L lower triangular with a positive diagonal)
  If Cholesky does not throw an error - the matrix is PD.
  The routine does not verify symmetry, so check A = A^T separately.
  (In scipy: scipy.linalg.cholesky)
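
The Cholesky test can be wrapped in a small helper. This is a sketch under stated assumptions: `is_positive_definite` and its tolerance `tol` are hypothetical names, and NumPy's `np.linalg.cholesky` is used in place of the SciPy routine. The explicit symmetry check matters because the factorization itself does not verify it.

```python
import numpy as np

def is_positive_definite(A, tol=1e-10):
    """Check symmetry, then attempt a Cholesky factorization."""
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1]:
        return False
    # cholesky does not verify symmetry, so test it explicitly.
    if not np.allclose(A, A.T, atol=tol):
        return False
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

print(is_positive_definite([[4, 2], [2, 3]]))  # eigenvalues ~5.56 and ~1.44
print(is_positive_definite([[1, 2], [2, 1]]))  # eigenvalues 3 and -1
```

A matrix that is PD in exact arithmetic but extremely ill-conditioned can still fail this test in floating point, so a failure near the threshold is worth a second look.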

**Why Cholesky is the standard for SPD**: LU decomposition of a general matrix costs about 2n^3/3 floating-point operations. Cholesky costs about n^3/3 - half as much - and is stable without pivoting. Libraries reach for Cholesky wherever the matrix is SPD by construction: sklearn's GaussianProcessRegressor factorizes the kernel matrix this way, Ridge offers solver='cholesky', and Kalman filter implementations commonly factorize covariance matrices with it.
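
To make the "factor once, then solve cheaply" idea concrete, here is a sketch of solving A x = b through A = L L^T with hand-written forward and back substitution. The function name `solve_spd` and the example system are illustrative assumptions; the explicit loops are for readability only, and in practice a library pair such as scipy.linalg.cho_factor / cho_solve would do the same job.

```python
import numpy as np

def solve_spd(A, b):
    """Solve A x = b for SPD A via A = L L^T and two triangular solves."""
    L = np.linalg.cholesky(A)        # lower triangular factor
    n = len(b)
    y = np.zeros(n)
    for i in range(n):               # forward substitution: L y = b
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    x = np.zeros(n)
    for i in reversed(range(n)):     # back substitution: L^T x = y
        x[i] = (y[i] - L[i + 1:, i] @ x[i + 1:]) / L[i, i]
    return x

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
b = np.array([2.0, 1.0])
x = solve_spd(A, b)
print(x, A @ x)
```

No pivoting appears anywhere: for an SPD matrix every diagonal entry of L is strictly positive, so the divisions are safe.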

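**PD matrix = convergence guarantee**: if the gradient at a point is zero and the Hessian there is positive definite, that point is a strict local minimum (not a saddle, not a maximum). That is why second-order methods care about the Hessian: Newton's method needs it to be PD for the step to be a descent direction, and L-BFGS maintains a PD approximation of it. First-order methods such as Adam never form the Hessian, but its conditioning still affects how fast they converge.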
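
The second-order test can be sketched as a classifier on the Hessian's eigenvalues. The function name `classify_critical_point`, the tolerance, and the two toy Hessians are assumptions made for illustration; the input is assumed to be the symmetric Hessian at a point where the gradient is already zero.

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of its symmetric Hessian."""
    eig = np.linalg.eigvalsh(H)     # ascending order
    if np.all(eig > tol):
        return "local minimum"      # Hessian is positive definite
    if np.all(eig < -tol):
        return "local maximum"      # Hessian is negative definite
    if eig[0] < -tol and eig[-1] > tol:
        return "saddle point"       # eigenvalues of both signs
    return "inconclusive"           # semidefinite: higher-order terms decide

bowl = np.array([[2.0, 0.0], [0.0, 2.0]])     # Hessian of x^2 + y^2 at (0, 0)
saddle = np.array([[2.0, 0.0], [0.0, -2.0]])  # Hessian of x^2 - y^2 at (0, 0)

print(classify_critical_point(bowl))
print(classify_critical_point(saddle))
```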

Multiplicativity of the determinant

**det(AB) = det(A) * det(B)** - a short formula with a deep consequence: the determinant of a product of transformations equals the product of their determinants. Geometrically, if A scales volumes by k and B scales them by m, then AB scales by k * m.

A = [[2, 0],      det(A) = 6
     [0, 3]]
B = [[1, 1],      det(B) = 2
     [0, 2]]
AB = [[2, 2],     det(AB) = 2*6 - 0*2 = 12 = 6*2 OK
      [0, 6]]

WHERE THIS MATTERS:
  Normalizing flows (NF) in generative models, with z = f(x):
    log p(x) = log p(z) + log |det(df/dx)|
  det(AB) = det(A)*det(B) turns the log-det of a stack of layers into a sum of per-layer log-dets; with triangular per-layer Jacobians each term costs O(n) instead of O(n^3)
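
Both claims can be verified in a few lines. The first part repeats the 2 x 2 example; the second part uses random lower-triangular matrices as hypothetical stand-ins for per-layer Jacobians (real flow layers such as coupling layers are assumed, not implemented, here) and compares the cheap diagonal-only sum with a general log-det of the full product.

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 3.0]])
B = np.array([[1.0, 1.0], [0.0, 2.0]])

det_product = np.linalg.det(A) * np.linalg.det(B)
det_of_product = np.linalg.det(A @ B)

# Stand-ins for per-layer Jacobians: lower triangular, nonzero diagonal.
rng = np.random.default_rng(0)
layers = [
    np.tril(rng.standard_normal((4, 4)), k=-1)
    + np.diag(rng.uniform(0.5, 2.0, size=4))
    for _ in range(3)
]

# Per layer, log|det| is a sum over the diagonal: O(n) work.
logdet_sum = sum(np.sum(np.log(np.abs(np.diag(J)))) for J in layers)

# Reference: log|det| of the full product via a general O(n^3) routine.
_, logdet_full = np.linalg.slogdet(np.linalg.multi_dot(layers))

print(det_product, det_of_product, logdet_sum, logdet_full)
```

Working with log|det| rather than det itself also avoids overflow and underflow when many layers are stacked.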

Row operations and the determinant

Gaussian elimination applies elementary row operations. Knowing how each operation changes the determinant allows computing det correctly through triangularization.

Operation	Effect on det	Where used
Row swap	Sign flips (once per swap)	Pivoting in LAPACK/NumPy
Multiply row by k	det multiplied by k	Scaling for stability
Add a multiple of one row to another	det unchanged	Core Gauss step
Transposition A -> A^T	det unchanged	det(A) = det(A^T)
Triangular matrix	det = product of diagonal	Result of LU decomposition
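
The rules in the table are enough to compute a determinant by triangularization. The following is a teaching sketch, not how LAPACK is implemented: the function name `det_by_elimination` is hypothetical, and it applies only row swaps (each flips the sign) and row additions (no effect), then multiplies the diagonal.

```python
import numpy as np

def det_by_elimination(A):
    """Determinant via Gaussian elimination with partial pivoting."""
    U = np.array(A, dtype=float)   # work on a copy
    n = U.shape[0]
    sign = 1.0
    for k in range(n):
        pivot = k + int(np.argmax(np.abs(U[k:, k])))  # largest entry in column k
        if U[pivot, k] == 0.0:
            return 0.0                                # singular matrix
        if pivot != k:
            U[[k, pivot]] = U[[pivot, k]]             # row swap flips the sign
            sign = -sign
        # Subtracting a multiple of row k from the rows below keeps det unchanged.
        factors = U[k + 1:, k] / U[k, k]
        U[k + 1:] -= np.outer(factors, U[k])
    return sign * float(np.prod(np.diag(U)))          # triangular: product of diagonal

M = np.array([[0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0],
              [3.0, 0.0, 1.0]])
print(det_by_elimination(M), np.linalg.det(M))
```

The zero in the top-left corner of `M` forces a row swap in the first step, so the sign bookkeeping is actually exercised.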

Matrix properties in ML infrastructure

Where these concepts appear in real systems

Component	Role	Details
Condition number in deep learning	kappa = sigma_max / sigma_min - diagnosing divergence	Exploding gradients in RNNs, batch normalization reduces layer kappa, gradient clipping as a workaround
SPD matrices in Gaussian Processes	Kernel matrix K is symmetric PSD by construction and must be numerically PD for Cholesky	sklearn GaussianProcessRegressor, GPyTorch - Cholesky decomposition with a nugget (small diagonal term) for numerical stability
Gram matrix in kernel SVM	K_ij = k(x_i, x_j) - symmetric and PSD	sklearn SVC with RBF/poly kernel - K is PSD for any valid kernel; libsvm solves the dual problem with an SMO-type solver rather than a factorization
det in Normalizing Flows	log|det(J)| in the change-of-variables formula	RealNVP, Glow, NICE - architectures are specifically chosen for O(n) determinant computation

Practice: positive definite check

Interview questions

A neural network stops learning at epoch 5, loss becomes NaN. How can matrix conditioning be checked as the cause?

- np.linalg.cond(W) > 10^6 in float32 leaves about one reliable digit, so backward computations are close to garbage
- When kappa >> 1, gradients are huge in some directions and near-zero in others
- Fixes: batch norm tends to reduce per-layer kappa, gradient clipping, switch to float64 for diagnostics
- Xavier/He initialization keeps activation and gradient scales stable at the start; orthogonal initialization gives kappa = 1 exactly
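
The check from the first bullet can be sketched as a per-layer report. Everything named here is hypothetical: the helper `condition_report`, the layer names, and the 10^5 warning threshold are illustration choices, and the weights are plain NumPy arrays rather than tensors pulled from a real model.

```python
import numpy as np

def condition_report(weights):
    """Map layer name -> condition number of its 2-D weight matrix."""
    return {name: float(np.linalg.cond(W)) for name, W in weights.items()}

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # orthogonal matrix

weights = {
    "orthogonal": Q,                                # kappa = 1
    "stretched": np.diag([1.0, 1.0, 1.0, 1e-6]),    # kappa = 10^6
}
report = condition_report(weights)
suspicious = [name for name, kappa in report.items() if kappa > 1e5]

print(report, suspicious)
```

Running the same report in float64 on a copy of the weights is the "switch to float64 for diagnostics" step: if the NaNs disappear there, precision rather than logic is the likely culprit.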

Why is the matrix (A^T A + lambda I) in Ridge regression always positive definite?

- A^T A is symmetric and positive semidefinite: all eigenvalues >= 0
- Adding lambda I (with lambda > 0) shifts every eigenvalue by lambda: lambda_i + lambda > 0
- Result is strictly positive definite, always invertible, and better conditioned: kappa <= (lambda_max + lambda) / lambda
- This is the mathematical point of regularization: stabilizing the matrix before inversion
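
The eigenvalue shift is easy to observe directly. In this sketch the matrix `A` is random with fewer rows than columns, so that A^T A is guaranteed singular, and `lam = 0.1` is an arbitrary illustrative ridge strength.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # fewer rows than columns: A^T A is singular
lam = 0.1                         # illustrative ridge strength, lam > 0

gram = A.T @ A                    # 5 x 5, rank at most 3
ridge = gram + lam * np.eye(5)

eig_gram = np.linalg.eigvalsh(gram)    # ascending; smallest ones are ~0
eig_ridge = np.linalg.eigvalsh(ridge)  # every eigenvalue shifted up by lam

kappa_ridge = eig_ridge[-1] / eig_ridge[0]
kappa_bound = (eig_gram[-1] + lam) / lam

np.linalg.cholesky(ridge)         # succeeds because ridge is positive definite
print(eig_gram, eig_ridge, kappa_ridge, kappa_bound)
```

The bound also shows the trade-off: a smaller lambda means less bias but a larger possible condition number.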

What is the difference between positive definite and positive semidefinite? When does each appear?

- PD: x^T A x > 0 for all x != 0, all eigenvalues > 0, matrix is invertible
- PSD: x^T A x >= 0, some eigenvalues may be 0, matrix can be singular
- Gram matrix X X^T is PSD - x^T G x can be zero along some directions
- When n > d, the Gram matrix has rank <= d < n, so some eigenvalues are 0 - PSD, not PD
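
The rank argument can be checked on a small random case. The shapes (n = 5, d = 3) and the seed are arbitrary assumptions; the sketch finds a nonzero vector `v` orthogonal to every column of X and shows that the quadratic form vanishes along it.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))      # n = 5 > d = 3
G = X @ X.T                          # 5 x 5 Gram matrix

rank = np.linalg.matrix_rank(G)      # at most d = 3
eig = np.linalg.eigvalsh(G)          # ascending; the first n - d are ~0

# The last right singular vectors of X^T span its null space.
_, _, Vt = np.linalg.svd(X.T)        # Vt is 5 x 5
v = Vt[-1]                           # unit vector with X^T v = 0
quadratic_form = v @ G @ v           # equals ||X^T v||^2, so ~0

print(rank, eig, quadratic_form)
```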

Key takeaways

**Condition number** kappa = sigma_max / sigma_min measures sensitivity to errors; kappa > 10^7 in float32 means garbage output
**Symmetric matrix** (A = A^T) appears everywhere - Gram matrix, covariance, Hessian; specialized algorithms are roughly 2x faster and more stable
**SPD matrix** guarantees a unique minimum for the quadratic form and enables Cholesky decomposition - the foundation of Ridge, GP, SVM kernels
**det(AB) = det(A) * det(B)** - multiplicativity; together with triangular Jacobians, the key to O(n) log-det computation in normalizing flows
**Triangular matrix**: det = product of diagonal; this is why LU is the standard algorithm for computing det
Elementary Gauss operations change det predictably - this is the basis for correct computation with pivoting

Where to go next

Matrix properties are the foundation for advanced topics

Systems of linear equations — LU, Cholesky, QR - the choice of algorithm depends on matrix properties
Eigenvalues and eigenvectors — For SPD matrices - all eigenvalues > 0, orthogonal eigenvectors, spectral theorem
SVD and low-rank approximations — Singular values sigma_i directly determine the condition number kappa = sigma_max / sigma_min

Related lessons

nm-01