Algebra
Expressions and Equations
Al-Khwarizmi wrote a book in Baghdad in 820 CE. The word "algebra" comes from its title. The word "algorithm" comes from his name. One person, one treatise - two words that now define the industry. The problems he solved: division of inheritance. The problems solved with the same methods today: training neural networks on billions of parameters.
- **Normal equation in ML:** $\hat{\theta} = (X^TX)^{-1}X^Ty$ is a quadratic minimization - the same operation as solving a quadratic equation. sklearn LinearRegression uses it for smaller datasets.
- **PolynomialFeatures in sklearn:** linear model + polynomial features = nonlinear regression without changing the algorithm. The RBF kernel is an infinite polynomial in feature space.
- **Neural network forward pass:** each linear layer solves $y = Wx + b$. BLAS routines tuned for each CPU/GPU architecture execute this in nanoseconds.
Polynomials
**Baghdad, 820 CE.** Muhammad al-Khwarizmi writes a book whose title contains the word *al-jabr* - "restoration". That word becomes "algebra". His name becomes "algorithm". One person gave both words to computing. The problems he solved: division of inheritance. The problems solved with the same methods today: training neural networks on billions of parameters.
A **polynomial** is an expression $a_n x^n + a_{n-1} x^{n-1} + \ldots + a_1 x + a_0$, where $a_i$ are coefficients and $n$ is the **degree** (the highest power with a nonzero coefficient). The degree of a sum is at most the maximum of the degrees. The degree of a product equals the sum of the degrees.
| Polynomial | Degree | Leading coefficient | Constant term |
|---|---|---|---|
| $5x^3 - 2x + 7$ | 3 | 5 | 7 |
| $x^2 - 4$ | 2 | 1 | -4 |
| $-3x + 1$ | 1 | -3 | 1 |
| $42$ | 0 | 42 | 42 |
| $0$ | undefined | - | 0 |
A polynomial is a generalization of a number. The integer $237 = 2 \cdot 10^2 + 3 \cdot 10 + 7$ is the polynomial $2x^2 + 3x + 7$ evaluated at $x = 10$. Addition, subtraction, multiplication - same rules. In ML this is applied directly: sklearn `PolynomialFeatures` maps a feature $x$ to $[1, x, x^2, \ldots, x^d]$. A linear model on top of those features learns nonlinear patterns - the kernel trick (RBF kernel = infinite polynomial) takes this further.
What is the degree of the polynomial $(x^2 + 1)(x^3 - x)$?
Factoring
**Factoring** is the reverse of expanding brackets. But there is something deeper here: the backpropagation algorithm in neural networks is essentially a factoring of the computational graph. The chain rule $\frac{dL}{dx} = \frac{dL}{dz} \cdot \frac{dz}{dx}$ breaks a complex derivative into simple factors. PyTorch does this automatically through `autograd` - every `.backward()` call is a factorization.
**Key principle:** if a product equals zero, at least one factor equals zero. This is exactly what makes factoring powerful for equations: instead of one hard equation, several simple ones emerge - $p(x) = 0 \Rightarrow (x - r_1)(x - r_2)\ldots = 0$.
| Technique | Formula | Example |
|---|---|---|
| Factor out GCF | $ab + ac = a(b+c)$ | $6x^3 + 4x^2 = 2x^2(3x+2)$ |
| Difference of squares | $a^2 - b^2 = (a-b)(a+b)$ | $x^2 - 9 = (x-3)(x+3)$ |
| Perfect square (sum) | $a^2 + 2ab + b^2 = (a+b)^2$ | $x^2 + 6x + 9 = (x+3)^2$ |
| Perfect square (diff) | $a^2 - 2ab + b^2 = (a-b)^2$ | $x^2 - 10x + 25 = (x-5)^2$ |
| Sum of cubes | $a^3 + b^3 = (a+b)(a^2-ab+b^2)$ | $x^3 + 8 = (x+2)(x^2-2x+4)$ |
| Difference of cubes | $a^3 - b^3 = (a-b)(a^2+ab+b^2)$ | $x^3 - 27 = (x-3)(x^2+3x+9)$ |
**Not all polynomials factor over the reals.** $x^2 + 1$ cannot be factored over R, but can over the complex numbers: $(x+i)(x-i)$. Such a polynomial is called **irreducible** over R.
Factor $x^4 - 1$ completely:
Quadratic Equations
**The normal equation of linear regression:** $\hat{\theta} = (X^TX)^{-1}X^Ty$. This is a quadratic minimization - take $L = \|X\theta - y\|^2$, differentiate and set to zero. The same operation as solving $ax^2 + bx + c = 0$. The variable is not $x \in \mathbb{R}$ but $\theta \in \mathbb{R}^d$, the weight vector. School algebra, matrix-sized.
From Babylon to the Normal Equation
Babylonian mathematicians solved quadratic equations geometrically by completing the square, 2000 years before the common era. Al-Khwarizmi systematized the methods in the 9th century. The general formula via the discriminant appeared in Europe in the 16th century. In 1801, Gauss applied the same idea of minimizing a quadratic function to predict the orbit of the asteroid Ceres - that became the method of least squares.
The **discriminant** $D = b^2 - 4ac$ determines the nature of roots: $D > 0$ - two distinct real roots; $D = 0$ - one repeated root; $D < 0$ - two complex conjugate roots. Formula: $x = \frac{-b \pm \sqrt{D}}{2a}$. In SVM the discriminant appears in the margin condition: a separating hyperplane exists when $D \geq 0$.
**Vieta's formulas** connect roots to coefficients without solving explicitly. For $ax^2 + bx + c = 0$ with roots $x_1, x_2$: $x_1 + x_2 = -b/a$ and $x_1 x_2 = c/a$. A quick way to spot roots mentally for simple equations. Vieta analogues extend to cubic equations via the Cardano-Vieta relations.
| Property | Formula | Example: $x^2 - 5x + 6 = 0$ |
|---|---|---|
| Sum of roots | $x_1 + x_2 = -b/a$ | $3 + 2 = 5 = -(-5)/1$ |
| Product of roots | $x_1 x_2 = c/a$ | $3 \cdot 2 = 6 = 6/1$ |
The equation $2x^2 + 3x - 2 = 0$ has roots $x_1$ and $x_2$. What is $x_1 \cdot x_2$?
Systems of Equations
**Every second, billions of devices solve systems of linear equations** - that is what a neural network forward pass is. Each linear layer: $\mathbf{y} = W\mathbf{x} + \mathbf{b}$ is a system $Ax = b$ in matrix form. BLAS routines (Basic Linear Algebra Subprograms), tuned for specific hardware, handle this in nanoseconds. NumPy, PyTorch, TensorFlow all use them under the hood. The solution is found not analytically but through matrix multiplication - because multiplication is faster than inversion.
**Three methods for solving linear systems:** 1. **Substitution** - express one variable in terms of another 2. **Elimination** - add equations to cancel a variable 3. **Matrix method** - write as $Ax = b$, solve via `np.linalg.solve`
**Geometric meaning:** each linear equation in two variables defines a line. The solution is the intersection point. Parallel lines: no solution. Coincident lines: infinitely many solutions. The same picture appears in gradient descent: the loss surface is a function, the minimum is where $\nabla L = 0$ - a system of equations.
| Case | Geometry | Algebra | Number of solutions |
|---|---|---|---|
| Lines intersect | One point | $\det(A) \neq 0$ | Exactly one |
| Lines are parallel | No common points | $\det(A) = 0$, contradiction | No solutions |
| Lines coincide | All points shared | $\det(A) = 0$, identity | Infinitely many |
If the discriminant of a quadratic equation is negative, the equation has no solutions
When $D < 0$ there are no real solutions, but two complex conjugate roots exist
The formula $x = \frac{-b \pm \sqrt{D}}{2a}$ works for $D < 0$ too - $\sqrt{D}$ becomes imaginary. For example, $x^2 + 1 = 0$: $D = -4$, roots $x = \pm i$. Complex numbers were invented precisely for this.
System: $x + 2y = 5$, $2x + 4y = 7$. How many solutions?
Key Ideas
- **Polynomial** $a_n x^n + \ldots + a_0$: degree of a product = sum of degrees. sklearn `PolynomialFeatures` builds them from features.
- **Factoring** = decomposition into multipliers. Backprop is a factoring of the computational graph via the chain rule.
- **Quadratic equation** $ax^2+bx+c=0$: discriminant $D=b^2-4ac$, Vieta's formulas. The normal equation of regression is the same quadratic minimization.
- **Systems of equations:** substitution, elimination, matrix method ($Ax=b$). Every neural network forward pass is a system in matrix form.
- Al-Khwarizmi solved inheritance problems. The same algebra now trains neural networks. The tool has not changed.
Related Topics
Equations and polynomials are the language for formulating optimization problems:
- Numbers and Arithmetic — Types of numbers determine in which set solutions are sought
- Inequalities and Absolute Value — Generalization: replace equality with inequality
- Linear Algebra: Matrices — Systems of linear equations are the foundation of matrix methods
Вопросы для размышления
- Al-Khwarizmi created algebra for dividing inheritance. Gauss applied quadratic minimization to compute an asteroid orbit. Now the normal equation trains models. What connects these three problems across 1200 years?
- Backpropagation is a factoring of the computational graph. Without the chain rule (factoring of derivatives), how would gradients need to be computed? What would change about training speed?
- A system of linear equations may have no solution. What does that mean for gradient descent - when $\nabla L = 0$ has no closed-form solution?
Связанные уроки
- alg-01 — Numbers and arithmetic - the base for polynomial operations
- alg-03 — Factoring opens the door to inequalities and absolute value
- alg-05 — Systems of equations lead directly to matrices
- la-06-gauss — Gaussian elimination scales the addition method to matrices
- stat-09-regression — The normal equation is this lesson's quadratic minimization
- la-01-vectors-intro