Numerical Methods
Errors and Floating-Point Arithmetic
February 25, 1991. Dhahran, Saudi Arabia. A Patriot missile failed to intercept a Scud. 28 American soldiers were killed. The cause: the number $0.1$ is not exactly representable in binary floating point. Over 100 hours of system operation the rounding error accumulated to 0.34 seconds of time offset - that is 500 meters of miss in space. The Patriot missile bug became a textbook example of why numerical methods are not an academic exercise. They are lives.
- **Neural networks (IEEE-754)**: every forward pass trains on float32. Mixed precision (float16 + float32) halves GPU memory with less than 0.1% accuracy loss. Accumulated rounding errors affect convergence when training billion-parameter models.
- **Finite element in aerospace**: ANSYS and Nastran solve systems of millions of equations. Accumulated float64 errors over thousands of steps are a fuselage safety question, not an academic curiosity.
- **GPS navigation**: position is computed by integrating IMU data. Float error drift over an hour of flight - tens of meters. That is why GPS needs external correction.
- **Banks**: ISO 20022 requires Decimal, not float. $0.1 + 0.2 \ne 0.3$ in float64 - at a billion transactions that is real money lost.
- **Compilers and JIT**: GCC and LLVM reorder float operations for speed. This is legal under IEEE-754 - but can change the result. That is why `-ffast-math` is off by default.
Floating Point
A computer works with 64 bits. The real numbers between 0 and 1 are uncountably infinite. This means most of them **cannot be represented exactly**. Floating point is a compromise: a wide range ($10^{-308}$ to $10^{308}$) but limited precision (16 significant digits). That same compromise cost 28 lives in 1991.
**Floating point** - representing a real number as: x = ±m × 2^e where m is the mantissa (significant digits) and e is the exponent (scale). Analogy: scientific notation 6.022 × 10²³ - mantissa 6.022, exponent 23. **Problem:** only numbers of the form k/2ⁿ are exactly representable. 0.1 = 1/10 is an infinite binary fraction!
Exactly representable: only fractions of the form $a/2^n$ - 0.5, 0.25, 0.125, 0.375. Numbers 0.1, 0.2, 0.3, 1/3 are infinite binary fractions, truncated to 52 mantissa bits. The error of one operation is $\sim 10^{-16}$, almost invisible. But over 100 hours at 10 ticks per second that is $3.6 \times 10^6$ operations. Errors add up. Patriot accumulated a 0.34-second offset.
Patriot stored time as an integer tick count multiplied by $1/10$ second. But $1/10 = 0.1$ is an infinite binary fraction. In the system's 24-bit arithmetic the error per tick was $9.5 \times 10^{-8}$ seconds. Over $100 \times 3600 \times 10 = 3{,}600{,}000$ ticks that accumulated to $0.34$ seconds. A Scud flies at $\sim 1700$ m/s. $0.34 \times 1700 \approx 578$ meters of miss. Not a story - this is from the official GAO report of 1992.
Why is 0.1 + 0.2 ≠ 0.3 in floating point?
IEEE 754
IEEE 754 is the 1985 standard used by virtually every processor from Cortex-M to A100. A number is stored in three fields: sign (1 bit), exponent (scale), mantissa (significant digits). This same standard determines that $0.1 + 0.2 = 0.30000000000000004$ in Python, JavaScript, C, Java, Rust - identically in all languages.
**IEEE 754 formats:**
| Format | Sign | Exponent | Mantissa | Total bits |
|---|---|---|---|---|
| float32 | 1 | 8 | 23 | 32 |
| float64 | 1 | 11 | 52 | 64 |
Number = (-1)^sign × 2^(exp - bias) × (1 + mantissa) bias = 127 (float32) or 1023 (float64)
| Value | Sign | Exponent | Mantissa | Type |
|---|---|---|---|---|
| +0 | 0 | 00...0 | 00...0 | Zero |
| -0 | 1 | 00...0 | 00...0 | Zero (negative) |
| +∞ | 0 | 11...1 | 00...0 | Infinity |
| NaN | 0 | 11...1 | ≠ 0 | Not a Number |
| Normalized | 0/1 | 00...1 - 11...0 | any | Regular number |
| Denormalized | 0/1 | 00...0 | ≠ 0 | Very small (≈ 0) |
Special values: **+∞** and **-∞** (overflow from division by zero), **NaN** (0/0, $\sqrt{-1}$ - not a number), **-0** (yes, $-0 = +0$ but $1/(-0) = -\infty$). Denormalized numbers represent very small values near zero with reduced precision. In neural networks, `loss = nan` during training means gradient explosion - the first symptom of a learning rate that is too large.
Range of float64: from $\approx 5 \times 10^{-324}$ to $\approx 1.8 \times 10^{308}$. Precision: ~15-17 significant decimal digits. Numbers $10^{15} + 1$ and $10^{15}$ are indistinguishable in float64 (their difference is less than machine epsilon). This is why Patriot's naive tick counter lost precision - as the number grew in magnitude, relative precision fell.
How many significant decimal digits does float64 provide?
Rounding
Every floating-point operation introduces a rounding error. One operation: $\le \varepsilon/2$ relative error - guaranteed by IEEE 754. The problem starts with chains: a million operations can accumulate $\sim 10^6 \cdot \varepsilon/2 \approx 10^{-10}$ error. Whether errors compound or cancel depends on the algorithm.
**Absolute error:** |x̃ − x|, where x̃ is the approximate and x is the exact value. **Relative error:** |x̃ − x| / |x| (when x ≠ 0). **Machine epsilon (ε)** - the smallest number such that fl(1 + ε) ≠ 1: - float32: ε ≈ 1.19 × 10⁻⁷ - float64: ε ≈ 2.22 × 10⁻¹⁶ Meaning: the relative error of any correctly rounded operation is ≤ ε/2.
**IEEE 754 guarantee:** every basic operation (+, -, ×, ÷, √) yields a **correctly rounded** result - as if computed with infinite precision then rounded to the nearest float. Single operation: error $\le \varepsilon/2$. But the Kahan algorithm shows that a correct ordering of steps allows summing $n$ numbers with $O(\varepsilon)$ error instead of $O(n\varepsilon)$.
| Type | float32 | float64 |
|---|---|---|
| Machine epsilon | ≈ 1.2 × 10⁻⁷ | ≈ 2.2 × 10⁻¹⁶ |
| Significant decimal digits | ~7 | ~16 |
| Range | ≈ 10⁻³⁸ - 10³⁸ | ≈ 10⁻³⁰⁸ - 10³⁰⁸ |
| Typical use | GPU, ML inference | Scientific computing |
| Size | 4 bytes | 8 bytes |
What is the value of 1.0 + 1e-17 in float64 arithmetic?
Cancellation
**Catastrophic cancellation** is the most dangerous floating-point trap. Subtracting two nearly equal numbers causes their significant digits to cancel, leaving only rounding noise. This is exactly why the naive quadratic formula works when $b^2 \gg 4ac$ but loses precision when $b^2 \approx 4ac$.
**Catastrophic cancellation:** when x ≈ y, computing x − y loses precision. Example: x = 1.000000000000001, y = 1.000000000000000 Exact result: 10⁻¹⁵ float64 result: may contain only 1-2 correct digits instead of 16 **Rule:** the number of lost digits ≈ −log₁₀(|x−y| / |x|)
Fighting cancellation: **algebraic reformulation**. Instead of $(1+x) - 1$ for small $x$, use $x$ directly. Instead of $\sqrt{x+1} - \sqrt{x}$, multiply by the conjugate: $\frac{1}{\sqrt{x+1}+\sqrt{x}}$. Instead of the naive quadratic formula, use Vieta's identity: $x_1 x_2 = c/a$. NumPy provides `expm1`, `log1p`, `hypot` - precisely for cases where naive formulas accumulate error.
Floating point is not 'broken' arithmetic. It is an engineering compromise providing a range from $10^{-308}$ to $10^{308}$ with 16 significant digits in 8 bytes. Sufficient for 99% of tasks. But that 1% is Patriot 1991, it is unstable neural network training, it is errors in financial software. Knowing the traps is not academicism. It is the competence separating an engineer from a calculator user.
double precision (float64) is sufficient for any computation
float64 with ~16 significant digits is insufficient for financial computations (Decimal needed), long chains of operations (compensation needed), and certain scientific tasks (quad precision or arbitrary precision needed)
In finance, an error of $0.01$ cent across a billion transactions equals $\$10{,}000$ lost. Banks use `Decimal` (base-10 fixed-point). In science: climate models solve $10^9$ equations over $10^4$ steps - errors accumulate to signal level. Quad precision (float128) and arbitrary precision (mpmath, GMP) exist for cases where float64 is not enough. PyTorch trains on float32, verifies on float64 - a deliberate choice.
Which of the following computations is most susceptible to catastrophic cancellation?
Key Ideas
- **Floating point:** $x = \pm m \times 2^e$; only numbers of the form $k/2^n$ are exact. $0.1$ is an infinite binary fraction
- **IEEE 754:** sign + exponent + mantissa; float64 = 52 mantissa bits = ~16 significant decimal digits
- **Machine epsilon:** $\varepsilon \approx 2.2 \times 10^{-16}$ for float64; relative error of any correctly rounded operation $\le \varepsilon/2$
- **Catastrophic cancellation:** subtracting nearly equal numbers destroys significant digits. The Patriot missile bug is literally this
- **Kahan algorithm:** compensated summation - stores lost bits in a separate variable. NumPy uses it inside `np.sum`
- **Callback**: Patriot 1991 - 28 killed by accumulated rounding error in $0.1$. Every neural network in the world runs on IEEE-754
Related Topics
Understanding floating point is the foundation for all numerical methods:
- Interpolation — Floating-point errors limit the practical accuracy of interpolating polynomials
- Splines — Splines are more robust to rounding errors than high-degree polynomials
Вопросы для размышления
- Neural networks train on float32 but often infer on float16 or int8. Why does lower precision not break predictions? What is quantization-aware training?
- Python `decimal.Decimal` solves $0.1 + 0.2 \ne 0.3$ but runs 50-100x slower than float64. When is that overhead justified? Why do banks use Decimal specifically?
- The Patriot missed by $0.34$ seconds after 100 hours due to the representation error of $0.1$. Work it out: the system tick was $1/10$ second. What is the relative error of float32 for $0.1$? Accumulate it over $100 \times 3600 \times 10$ ticks.