Linear Algebra

Matrix properties: the numbers that decide success

If the determinant of a square weight matrix is zero, the layer is degenerate and destroys information. The condition number of the Hessian determines how fast gradient descent converges. Matrix properties are not abstract theory - they are the diagnostics of any numerical system.

  • Neural network health: a zero determinant in a weight matrix signals a degenerate layer (closeness to singularity is better measured by the condition number)
  • Numerical stability: condition number κ = how much input error gets amplified in the output
  • Physics: the Jacobian determinant is the volume scaling factor of a transformation
  • Control theory: the trace of the system matrix equals the sum of its eigenvalues
  • Statistics: the determinant of the covariance matrix is the generalized variance

**A neural network stopped training at epoch three. Loss froze, gradients exploded.** A common cause is that the weight matrix or Hessian became ill-conditioned. The ~condition number~{condition number: kappa = sigma_max / sigma_min} grew to a billion, while float32 carries only about 7 significant decimal digits, so the computation can no longer resolve the small differences that matter. That is not a bug in the code - it is a property of the matrix. Understanding matrix properties means understanding why algorithms work or break.

Three adjectives about matrices look like exam vocabulary - "symmetric", "positive definite", "well-conditioned" - but each one silently controls whether the optimizer converges, which solver a library can use, and how many bits of precision survive in GPU training. Today we rip the vocabulary off and show what these properties actually decide in production.

Condition number: the health check for a matrix

The ~condition number~{kappa = sigma_max / sigma_min, where sigma are the singular values} kappa measures how much a relative error in the input can be amplified in the output. When solving Ax = b, if b carries a relative error of delta, the relative error of the solution can be as large as kappa * delta.

WELL-CONDITIONED (kappa ~= 1):
  A = [[3, 0],
       [0, 1]]
  sigma_max = 3, sigma_min = 1
  kappa = 3 / 1 = 3
  Input error 0.1% -> solution error <= 0.3%

ILL-CONDITIONED (kappa >> 1):
  A = [[1.000, 0.999],
       [0.999, 0.998]]
  det(A) = 0.998 - 0.998001 = -10^-6, so sigma_min ~= 10^-6 / sigma_max
  sigma_max ~= 2, sigma_min ~= 5 * 10^-7
  kappa ~= 4 * 10^6
  Input error 0.1% -> error bound 4 * 10^6 * 0.1% = 4000, the solution can be completely wrong

PRACTICAL THRESHOLD (rule of thumb: about log10(kappa) digits are lost):
  float32: carries ~7 digits -> kappa > 10^7 gives garbage
  float64: carries ~15 digits -> kappa > 10^15 gives garbage
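The two example matrices can be checked numerically. This is a minimal sketch assuming NumPy; the helper name `condition_number` is made up for illustration.

```python
import numpy as np

def condition_number(A):
    """Return kappa = sigma_max / sigma_min from the singular values of A."""
    sigma = np.linalg.svd(A, compute_uv=False)  # sorted in descending order
    return sigma[0] / sigma[-1]

well = np.array([[3.0, 0.0],
                 [0.0, 1.0]])
ill = np.array([[1.000, 0.999],
                [0.999, 0.998]])

kappa_well = condition_number(well)
kappa_ill = condition_number(ill)

print(f"well-conditioned: kappa = {kappa_well:.3g}")
print(f"ill-conditioned:  kappa = {kappa_ill:.3g}")

# np.linalg.cond uses the same 2-norm definition by default
print(np.isclose(kappa_ill, np.linalg.cond(ill), rtol=1e-6))
```

The second matrix looks harmless, yet its rows are almost parallel, which is what drives sigma_min toward zero.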

**float32 vs float64 in PyTorch**: PyTorch defaults to float32 (~7 decimal digits). Once the condition number kappa reaches 10^6 - 10^7, solving a system in float32 leaves almost no correct digits. Do not rely on a library to upcast for you: routines such as np.linalg.solve and scipy.linalg.lstsq generally work in the precision of their inputs, so cast to float64 explicitly for critical or ill-conditioned operations.
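A small experiment makes the precision gap visible. The sketch below assumes NumPy; the test matrix is synthetic (built from chosen singular values so that kappa = 10^6), and the helper `relative_error` is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Build A = U diag(s) V^T with singular values from 1 down to 1e-6 (kappa = 1e6)
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -6, n)
A = (U * s) @ V.T

x_true = np.ones(n)
b = A @ x_true

def relative_error(dtype):
    """Solve A x = b after casting the data to dtype; compare with x_true."""
    x = np.linalg.solve(A.astype(dtype), b.astype(dtype))
    return np.linalg.norm(x.astype(np.float64) - x_true) / np.linalg.norm(x_true)

err64 = relative_error(np.float64)
err32 = relative_error(np.float32)
print(f"float64 relative error: {err64:.2e}")
print(f"float32 relative error: {err32:.2e}")
```

Casting the data to float32 already rounds it to about 7 digits, and the conditioning of A amplifies that rounding in the solution.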

Symmetric matrices: a special class

A matrix is ~symmetric~{symmetric matrix: A = A^T, meaning a_ij = a_ji} when A = A^T. This is not just aesthetics - real symmetric matrices always have real eigenvalues and an orthonormal basis of eigenvectors, and they admit faster and more stable algorithms.

GRAM MATRIX:
  Given a set of vectors X (n x d) - for example, a batch of embeddings.
  G = X * X^T -> shape (n x n)
  G_ij = <x_i, x_j> (dot product of two vectors)
  G is ALWAYS symmetric: G_ij = <x_i, x_j> = <x_j, x_i> = G_ji

WHERE IT APPEARS:
  Kernel matrix in SVM: K_ij = k(x_i, x_j)
  Covariance matrix: Sigma = (1/n) X^T X (for centered X)
  Self-attention in transformers: QK^T (symmetric when Q = K)
  Hessian in optimization: H_ij = d^2 L / (dw_i dw_j)
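The symmetry of the Gram and covariance matrices is easy to confirm on random data. This is a sketch assuming NumPy, with a made-up batch of 5 vectors in 3 dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))     # n = 5 vectors of dimension d = 3

G = X @ X.T                         # Gram matrix, shape (n, n)
Xc = X - X.mean(axis=0)             # center the columns before forming the covariance
Sigma = Xc.T @ Xc / X.shape[0]      # covariance matrix, shape (d, d)

print(G.shape, np.allclose(G, G.T))
print(Sigma.shape, np.allclose(Sigma, Sigma.T))

# G[i, j] is the dot product of rows i and j of X
print(np.isclose(G[1, 2], X[1] @ X[2]))
```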

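**eigvalsh vs eig**: for symmetric matrices NumPy provides np.linalg.eigvalsh (eigenvalues only) and np.linalg.eigh (eigenvalues and eigenvectors) - specialized routines that are faster and numerically more stable than the general solver and that return real eigenvalues in ascending order. Avoid eig on a known symmetric matrix - it wastes computation and precision.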
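A short comparison of the two solvers, as a sketch assuming NumPy and a random symmetric test matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
S = (M + M.T) / 2                   # symmetrize

w_sym = np.linalg.eigvalsh(S)       # real values, returned in ascending order
w_gen = np.linalg.eigvals(S)        # general solver: no ordering guarantee

print(w_sym.dtype, np.all(np.diff(w_sym) >= 0))
print(np.allclose(np.sort(w_gen.real), w_sym))
```

Both calls agree on the spectrum; the symmetric routine simply exploits the structure and hands back sorted real values directly.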

Positive Definite: a matrix with guarantees

A symmetric matrix A is called ~positive definite~{positive definite: x^T A x > 0 for all x != 0} (SPD, symmetric positive definite) when x^T A x > 0 for every nonzero vector x. All eigenvalues are strictly positive. This is the healthiest class of matrices in numerical methods.

A = [[4, 2],
     [2, 3]]

x^T A x at x = (1, 0): 1*4*1 = 4 > 0
        at x = (0, 1): 1*3*1 = 3 > 0
        at x = (1, 1): 4+2+2+3 = 11 > 0

EIGENVALUES: lambda_1 ~= 5.56, lambda_2 ~= 1.44 - both positive -> PD

PROOF BY CHOLESKY:
  A symmetric matrix A is positive definite <=> a decomposition A = L L^T exists (L lower triangular with a positive diagonal)
  If Cholesky does not throw an error - the matrix is PD. (In scipy: scipy.linalg.cholesky)
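The Cholesky test can be wrapped in a small helper. This is a sketch assuming NumPy; `is_positive_definite` is a hypothetical name, and the symmetry check is added because the factorization routine itself reads only one triangle of the matrix.

```python
import numpy as np

def is_positive_definite(A):
    """Return True if A is symmetric (up to np.allclose tolerance) and Cholesky succeeds."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        return False                # not symmetric, so not SPD
    try:
        np.linalg.cholesky(A)       # raises LinAlgError when A is not PD
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[4.0, 2.0], [2.0, 3.0]])
B = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3 and -1: indefinite

print(is_positive_definite(A), np.linalg.eigvalsh(A))
print(is_positive_definite(B), np.linalg.eigvalsh(B))
```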

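**Why Cholesky is the standard for SPD**: LU decomposition of a general matrix costs about 2n^3/3 floating-point operations. Cholesky needs about n^3/3 - half the work - and is stable without pivoting. Libraries use it wherever the matrix is SPD by construction: sklearn's GaussianProcessRegressor factors the kernel matrix with Cholesky, Ridge offers solver='cholesky', and Kalman filter implementations commonly factor the innovation covariance the same way. In scipy.linalg.solve, passing assume_a='pos' selects the Cholesky path.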
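Once the factor L is known, solving A x = b reduces to two triangular systems. The sketch below assumes NumPy only. For simplicity it calls np.linalg.solve on the triangular factors, whereas a production solver would use a dedicated triangular routine that costs O(n^2) per system.

```python
import numpy as np

A = np.array([[4.0, 2.0], [2.0, 3.0]])
b = np.array([2.0, 1.0])

L = np.linalg.cholesky(A)           # lower triangular, A = L @ L.T
y = np.linalg.solve(L, b)           # solves L y = b (forward substitution in a triangular solver)
x = np.linalg.solve(L.T, y)         # solves L^T x = y (back substitution in a triangular solver)

print(np.allclose(L @ L.T, A), np.allclose(A @ x, b))
```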

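**PD matrix = convergence guarantee**: if the gradient is zero at a point and the Hessian there is positive definite, that point is a strict local minimum (not a saddle, not a maximum). Newton's method needs a positive definite Hessian for its step to be a descent direction, and L-BFGS maintains a positive definite approximation of it. First-order methods such as Adam never form the Hessian, but its curvature still governs how fast they converge.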
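The second-order test can be sketched in a few lines. This assumes NumPy and a symmetric Hessian that has already been computed; `classify_critical_point` and its tolerance are illustrative choices.

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of its symmetric Hessian H."""
    w = np.linalg.eigvalsh(H)       # ascending order
    if np.all(w > tol):
        return "local minimum"
    if np.all(w < -tol):
        return "local maximum"
    if w[0] < -tol and w[-1] > tol:
        return "saddle point"
    return "inconclusive"           # some eigenvalue is numerically zero

H_bowl = np.array([[2.0, 0.0], [0.0, 2.0]])     # Hessian of x^2 + y^2
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])  # Hessian of x^2 - y^2

print(classify_critical_point(H_bowl))
print(classify_critical_point(H_saddle))
```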

Multiplicativity of the determinant

**det(AB) = det(A) * det(B)** - a short formula with a deep consequence: the determinant of a product of transformations equals the product of their determinants. Geometrically, if A scales volumes by k and B scales them by m, then AB scales by k * m.

A = [[2, 0],
     [0, 3]]      det(A) = 6
B = [[1, 1],
     [0, 2]]      det(B) = 2
AB = [[2, 2],
      [0, 6]]     det(AB) = 2*6 - 2*0 = 12 = 6*2 OK

WHERE THIS MATTERS:
  Normalizing flows (NF) in generative models, with z = f(x):
  log p(x) = log p(z) + log |det(df/dx)|
  det(AB) = det(A)*det(B) lets us compute the log-det of a stacked flow as a sum of per-layer log-dets; when each layer has a triangular Jacobian, every term costs O(n) instead of O(n^3)
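The same identity in code, as a sketch assuming NumPy and reusing the two example matrices. It also shows the diagonal shortcut for a triangular matrix.

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 3.0]])
B = np.array([[1.0, 1.0], [0.0, 2.0]])

print(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# log|det| of a product is the sum of the per-factor log|det| values
sign_ab, logdet_ab = np.linalg.slogdet(A @ B)
logdet_sum = np.linalg.slogdet(A)[1] + np.linalg.slogdet(B)[1]
print(np.isclose(logdet_ab, logdet_sum))

# For a triangular matrix, log|det| needs only the diagonal: O(n) work
logdet_b_diag = np.sum(np.log(np.abs(np.diag(B))))
print(np.isclose(logdet_b_diag, np.linalg.slogdet(B)[1]))
```

Working with log|det| rather than det itself also avoids overflow and underflow when many layers are multiplied together.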

Row operations and the determinant

Gaussian elimination applies elementary row operations. Knowing how each operation changes the determinant allows computing det correctly through triangularization.

| Operation | Effect on det | Where used |
| --- | --- | --- |
| Row swap | Sign flips (once per swap) | Pivoting in LAPACK/NumPy |
| Multiply row by k | det multiplied by k | Scaling for stability |
| Add a multiple of one row to another | det unchanged | Core Gauss step |
| Transposition A -> A^T | det unchanged | det(A) = det(A^T) |
| Triangular matrix | det = product of diagonal | Result of LU decomposition |
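These rules are enough to compute a determinant by triangularization. Below is a teaching sketch assuming NumPy; `det_by_elimination` is a hypothetical helper, and in practice np.linalg.det (which relies on LAPACK's LU factorization) is the call to use.

```python
import numpy as np

def det_by_elimination(A):
    """Determinant via Gaussian elimination with partial pivoting."""
    U = np.array(A, dtype=float)    # work on a copy
    n = U.shape[0]
    sign = 1.0
    for k in range(n):
        p = k + np.argmax(np.abs(U[k:, k]))  # row with the largest pivot candidate
        if U[p, k] == 0.0:
            return 0.0              # whole column is zero: singular matrix
        if p != k:
            U[[k, p]] = U[[p, k]]   # a row swap flips the sign of det
            sign = -sign
        # subtracting a multiple of row k from the rows below leaves det unchanged
        factors = U[k + 1:, k] / U[k, k]
        U[k + 1:] -= np.outer(factors, U[k])
    return sign * np.prod(np.diag(U))  # triangular: det = product of the diagonal

M = np.array([[0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 3.0]])
print(det_by_elimination(M), np.linalg.det(M))
```

The zero in the top-left corner forces a row swap on the first step, so the sign tracking is exercised.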

Matrix properties in ML infrastructure

Where these properties appear in real systems

| Component | Role | Details |
| --- | --- | --- |
| Condition number in deep learning | kappa = sigma_max / sigma_min - diagnosing divergence | Exploding gradients in RNNs; batch normalization reduces layer kappa; gradient clipping as a workaround |
| SPD matrices in Gaussian Processes | Kernel matrix K is symmetric PSD; a nugget on the diagonal makes it SPD | sklearn GaussianProcessRegressor, GPyTorch - Cholesky decomposition with a nugget for numerical stability |
| Gram matrix in kernel SVM | K_ij = k(x_i, x_j) - symmetric and PSD | sklearn SVC with RBF/poly kernel - symmetric by construction and PSD for valid (Mercer) kernels; libsvm solves the dual problem with an SMO-type solver rather than a factorization |
| det in Normalizing Flows | log\|det(J)\| in the change-of-variables formula | RealNVP, Glow, NICE - architectures are specifically chosen for O(n) determinant computation |

Practice: positive definite check

Interview questions

A neural network stops learning at epoch 5, loss becomes NaN. How can matrix conditioning be checked as the cause?

  • np.linalg.cond(W) > 10^6 in float32 means backward computations can lose nearly all of their precision
  • When kappa >> 1, gradients are huge in some directions and near-zero in others
  • Fixes: batch norm reduces per-layer kappa, gradient clipping, switch to float64 for diagnostics
  • Xavier/He initialization scales the weights to preserve activation and gradient variance, which keeps layers reasonably conditioned at the start

Why is the matrix (A^T A + lambda I) in Ridge regression always positive definite?

  • A^T A is symmetric and positive semidefinite: all eigenvalues >= 0
  • Adding lambda I with lambda > 0 shifts every eigenvalue by lambda: lambda_i + lambda > 0
  • Result is strictly positive definite, always invertible, and better conditioned: kappa <= (lambda_max + lambda) / lambda
  • This is the mathematical point of regularization: stabilizing the matrix before inversion
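The eigenvalue shift can be observed directly. This is a sketch assuming NumPy, with a synthetic design matrix whose last two columns are identical so that A^T A is exactly rank deficient; the value lam = 0.1 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
A[:, 4] = A[:, 3]                   # duplicate a column so that A^T A is singular
lam = 0.1

gram = A.T @ A
ridge = gram + lam * np.eye(5)

w_gram = np.linalg.eigvalsh(gram)   # ascending order
w_ridge = np.linalg.eigvalsh(ridge)

print("smallest eigenvalue of A^T A:          ", w_gram[0])
print("smallest eigenvalue of A^T A + lam * I:", w_ridge[0])
print("every eigenvalue shifted by lam:", np.allclose(w_ridge, w_gram + lam))
print("kappa of the ridge matrix:", w_ridge[-1] / w_ridge[0])
```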

What is the difference between positive definite and positive semidefinite? When does each appear?

  • PD: x^T A x > 0 for all x != 0, all eigenvalues > 0, matrix is invertible
  • PSD: x^T A x >= 0, some eigenvalues may be 0, matrix can be singular
  • Gram matrix X X^T is PSD - it can be zero along some directions
  • When n > d, the Gram matrix has rank <= d < n, so some eigenvalues are 0 - PSD, not PD
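The rank argument is easy to see numerically. A sketch assuming NumPy, with n = 5 random vectors in d = 3 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.standard_normal((n, d))
G = X @ X.T                         # (n, n) Gram matrix with n > d

w = np.linalg.eigvalsh(G)           # ascending order
rank = np.linalg.matrix_rank(G)
print("eigenvalues:", np.round(w, 6))
print("rank:", rank, "of", n)

# A direction orthogonal to every column of X gives v^T G v = 0.
# The last columns of the full U from the SVD span that space.
U, _, _ = np.linalg.svd(X)          # U has shape (n, n)
v = U[:, -1]
print("v^T G v =", v @ G @ v)
```

The n - d smallest eigenvalues are zero up to rounding, so G is PSD but not PD.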

Key takeaways

  • **Condition number** kappa = sigma_max / sigma_min measures sensitivity to errors; kappa > 10^7 in float32 means garbage output
  • **Symmetric matrix** (A = A^T) appears everywhere - Gram matrix, covariance, Hessian; specialized algorithms are 2x faster
  • **SPD matrix** guarantees a unique minimum for the quadratic form and enables Cholesky decomposition - the foundation of Ridge, GP, SVM kernels
  • **det(AB) = det(A) * det(B)** - multiplicativity; it turns the log-det of a normalizing flow into a sum of per-layer terms, each O(n) when the layer Jacobian is triangular
  • **Triangular matrix**: det = product of diagonal; this is why LU is the standard algorithm for computing det
  • Elementary Gauss operations change det predictably - this is the basis for correct computation with pivoting

Where to go next

Matrix properties are the foundation for advanced topics

  • Systems of linear equations: LU, Cholesky, QR - the choice of algorithm depends on matrix properties
  • Eigenvalues and eigenvectors: for SPD matrices - all eigenvalues > 0, orthogonal eigenvectors, spectral theorem
  • SVD and low-rank approximations: singular values sigma_i directly determine the condition number kappa = sigma_max / sigma_min
