Numerical Methods

Errors and Floating-Point Arithmetic

February 25, 1991. Dhahran, Saudi Arabia. A Patriot missile failed to intercept a Scud. 28 American soldiers were killed. The cause: the number $0.1$ is not exactly representable in binary floating point. Over 100 hours of system operation the rounding error accumulated to 0.34 seconds of time offset - that is 500 meters of miss in space. The Patriot missile bug became a textbook example of why numerical methods are not an academic exercise. They are lives.

  • **Neural networks (IEEE-754)**: every forward pass trains on float32. Mixed precision (float16 + float32) halves GPU memory with less than 0.1% accuracy loss. Accumulated rounding errors affect convergence when training billion-parameter models.
  • **Finite element in aerospace**: ANSYS and Nastran solve systems of millions of equations. Accumulated float64 errors over thousands of steps are a fuselage safety question, not an academic curiosity.
  • **GPS navigation**: position is computed by integrating IMU data. Float error drift over an hour of flight - tens of meters. That is why GPS needs external correction.
  • **Banks**: ISO 20022 requires Decimal, not float. $0.1 + 0.2 \ne 0.3$ in float64 - at a billion transactions that is real money lost.
  • **Compilers and JIT**: GCC and LLVM reorder float operations for speed. This is legal under IEEE-754 - but can change the result. That is why `-ffast-math` is off by default.

Floating Point

A computer works with 64 bits. The real numbers between 0 and 1 are uncountably infinite. This means most of them **cannot be represented exactly**. Floating point is a compromise: a wide range ($10^{-308}$ to $10^{308}$) but limited precision (16 significant digits). That same compromise cost 28 lives in 1991.

**Floating point** - representing a real number as: x = ±m × 2^e where m is the mantissa (significant digits) and e is the exponent (scale). Analogy: scientific notation 6.022 × 10²³ - mantissa 6.022, exponent 23. **Problem:** only numbers of the form k/2ⁿ are exactly representable. 0.1 = 1/10 is an infinite binary fraction!

Exactly representable: only fractions of the form $a/2^n$ - 0.5, 0.25, 0.125, 0.375. Numbers 0.1, 0.2, 0.3, 1/3 are infinite binary fractions, truncated to 52 mantissa bits. The error of one operation is $\sim 10^{-16}$, almost invisible. But over 100 hours at 10 ticks per second that is $3.6 \times 10^6$ operations. Errors add up. Patriot accumulated a 0.34-second offset.

Patriot stored time as an integer tick count multiplied by $1/10$ second. But $1/10 = 0.1$ is an infinite binary fraction. In the system's 24-bit arithmetic the error per tick was $9.5 \times 10^{-8}$ seconds. Over $100 \times 3600 \times 10 = 3{,}600{,}000$ ticks that accumulated to $0.34$ seconds. A Scud flies at $\sim 1700$ m/s. $0.34 \times 1700 \approx 578$ meters of miss. Not a story - this is from the official GAO report of 1992.

Why is 0.1 + 0.2 ≠ 0.3 in floating point?

IEEE 754

IEEE 754 is the 1985 standard used by virtually every processor from Cortex-M to A100. A number is stored in three fields: sign (1 bit), exponent (scale), mantissa (significant digits). This same standard determines that $0.1 + 0.2 = 0.30000000000000004$ in Python, JavaScript, C, Java, Rust - identically in all languages.

**IEEE 754 formats:**

FormatSignExponentMantissaTotal bits
float32182332
float641115264

Number = (-1)^sign × 2^(exp - bias) × (1 + mantissa) bias = 127 (float32) or 1023 (float64)

ValueSignExponentMantissaType
+0000...000...0Zero
-0100...000...0Zero (negative)
+∞011...100...0Infinity
NaN011...1≠ 0Not a Number
Normalized0/100...1 - 11...0anyRegular number
Denormalized0/100...0≠ 0Very small (≈ 0)

Special values: **+∞** and **-∞** (overflow from division by zero), **NaN** (0/0, $\sqrt{-1}$ - not a number), **-0** (yes, $-0 = +0$ but $1/(-0) = -\infty$). Denormalized numbers represent very small values near zero with reduced precision. In neural networks, `loss = nan` during training means gradient explosion - the first symptom of a learning rate that is too large.

Range of float64: from $\approx 5 \times 10^{-324}$ to $\approx 1.8 \times 10^{308}$. Precision: ~15-17 significant decimal digits. Numbers $10^{15} + 1$ and $10^{15}$ are indistinguishable in float64 (their difference is less than machine epsilon). This is why Patriot's naive tick counter lost precision - as the number grew in magnitude, relative precision fell.

How many significant decimal digits does float64 provide?

Rounding

Every floating-point operation introduces a rounding error. One operation: $\le \varepsilon/2$ relative error - guaranteed by IEEE 754. The problem starts with chains: a million operations can accumulate $\sim 10^6 \cdot \varepsilon/2 \approx 10^{-10}$ error. Whether errors compound or cancel depends on the algorithm.

**Absolute error:** |x̃ − x|, where x̃ is the approximate and x is the exact value. **Relative error:** |x̃ − x| / |x| (when x ≠ 0). **Machine epsilon (ε)** - the smallest number such that fl(1 + ε) ≠ 1: - float32: ε ≈ 1.19 × 10⁻⁷ - float64: ε ≈ 2.22 × 10⁻¹⁶ Meaning: the relative error of any correctly rounded operation is ≤ ε/2.

**IEEE 754 guarantee:** every basic operation (+, -, ×, ÷, √) yields a **correctly rounded** result - as if computed with infinite precision then rounded to the nearest float. Single operation: error $\le \varepsilon/2$. But the Kahan algorithm shows that a correct ordering of steps allows summing $n$ numbers with $O(\varepsilon)$ error instead of $O(n\varepsilon)$.

Typefloat32float64
Machine epsilon≈ 1.2 × 10⁻⁷≈ 2.2 × 10⁻¹⁶
Significant decimal digits~7~16
Range≈ 10⁻³⁸ - 10³⁸≈ 10⁻³⁰⁸ - 10³⁰⁸
Typical useGPU, ML inferenceScientific computing
Size4 bytes8 bytes

What is the value of 1.0 + 1e-17 in float64 arithmetic?

Cancellation

**Catastrophic cancellation** is the most dangerous floating-point trap. Subtracting two nearly equal numbers causes their significant digits to cancel, leaving only rounding noise. This is exactly why the naive quadratic formula works when $b^2 \gg 4ac$ but loses precision when $b^2 \approx 4ac$.

**Catastrophic cancellation:** when x ≈ y, computing x − y loses precision. Example: x = 1.000000000000001, y = 1.000000000000000 Exact result: 10⁻¹⁵ float64 result: may contain only 1-2 correct digits instead of 16 **Rule:** the number of lost digits ≈ −log₁₀(|x−y| / |x|)

Fighting cancellation: **algebraic reformulation**. Instead of $(1+x) - 1$ for small $x$, use $x$ directly. Instead of $\sqrt{x+1} - \sqrt{x}$, multiply by the conjugate: $\frac{1}{\sqrt{x+1}+\sqrt{x}}$. Instead of the naive quadratic formula, use Vieta's identity: $x_1 x_2 = c/a$. NumPy provides `expm1`, `log1p`, `hypot` - precisely for cases where naive formulas accumulate error.

Floating point is not 'broken' arithmetic. It is an engineering compromise providing a range from $10^{-308}$ to $10^{308}$ with 16 significant digits in 8 bytes. Sufficient for 99% of tasks. But that 1% is Patriot 1991, it is unstable neural network training, it is errors in financial software. Knowing the traps is not academicism. It is the competence separating an engineer from a calculator user.

double precision (float64) is sufficient for any computation

float64 with ~16 significant digits is insufficient for financial computations (Decimal needed), long chains of operations (compensation needed), and certain scientific tasks (quad precision or arbitrary precision needed)

In finance, an error of $0.01$ cent across a billion transactions equals $\$10{,}000$ lost. Banks use `Decimal` (base-10 fixed-point). In science: climate models solve $10^9$ equations over $10^4$ steps - errors accumulate to signal level. Quad precision (float128) and arbitrary precision (mpmath, GMP) exist for cases where float64 is not enough. PyTorch trains on float32, verifies on float64 - a deliberate choice.

Which of the following computations is most susceptible to catastrophic cancellation?

Key Ideas

  • **Floating point:** $x = \pm m \times 2^e$; only numbers of the form $k/2^n$ are exact. $0.1$ is an infinite binary fraction
  • **IEEE 754:** sign + exponent + mantissa; float64 = 52 mantissa bits = ~16 significant decimal digits
  • **Machine epsilon:** $\varepsilon \approx 2.2 \times 10^{-16}$ for float64; relative error of any correctly rounded operation $\le \varepsilon/2$
  • **Catastrophic cancellation:** subtracting nearly equal numbers destroys significant digits. The Patriot missile bug is literally this
  • **Kahan algorithm:** compensated summation - stores lost bits in a separate variable. NumPy uses it inside `np.sum`
  • **Callback**: Patriot 1991 - 28 killed by accumulated rounding error in $0.1$. Every neural network in the world runs on IEEE-754

Related Topics

Understanding floating point is the foundation for all numerical methods:

  • Interpolation — Floating-point errors limit the practical accuracy of interpolating polynomials
  • Splines — Splines are more robust to rounding errors than high-degree polynomials

Вопросы для размышления

  • Neural networks train on float32 but often infer on float16 or int8. Why does lower precision not break predictions? What is quantization-aware training?
  • Python `decimal.Decimal` solves $0.1 + 0.2 \ne 0.3$ but runs 50-100x slower than float64. When is that overhead justified? Why do banks use Decimal specifically?
  • The Patriot missed by $0.34$ seconds after 100 hours due to the representation error of $0.1$. Work it out: the system tick was $1/10$ second. What is the relative error of float32 for $0.1$? Accumulate it over $100 \times 3600 \times 10$ ticks.

Связанные уроки

  • calc-03-limits-intro
Errors and Floating-Point Arithmetic

0

1

Sign In