Calculus
Sequences: How Infinity Converges
In 1821 Cauchy published «Cours d'analyse» and gave the first rigorous definition of a limit. Before him, 150 years of mathematicians had worked with «infinitesimals» without a definition. Today the same idea, a tolerance paired with a step count, sits behind convergence checks throughout numerical software, and every neural-network training run depends on its consequences.
- SGD training loss is a sequence converging to a minimum. The convergence rate sets the cost of training.
- PageRank power iteration: $p_{n+1} = M \cdot p_n$ - on the order of 50 iterations, even on a graph with $10^{11}$ pages. It works because the sequence converges.
- The number $e$ is the limit of $(1 + 1/n)^n$. It underpins softmax, exponential decay, and compound interest.
What a Sequence Is
A sequence is an ordered list of numbers $a_1, a_2, a_3, \ldots$ where each natural number $n$ is paired with exactly one number $a_n$. Formally: a function $a: \mathbb{N} \to \mathbb{R}$.
| Reading | What it is | Who reads it this way |
|---|---|---|
| A list with indices | $a_n$ - a function on $\mathbb{N}$ | Mathematician, textbook |
| An iterative process | Next state from previous state | Programmer (for-loop, `while not converged`) |
| A trajectory in time | System evolving step by step | ML engineer (training loss at epoch $n$) |
Sequences come in two forms: **closed-form** $a_n = f(n)$ - the value is computed directly from the index - and **recurrent** $a_{n+1} = f(a_n)$ - each element depends on the previous one.
In ML, recurrent sequences are everywhere: SGD ($\theta_{n+1} = \theta_n - \eta \nabla L(\theta_n)$), Adam, RNNs, diffusion models - all recurrences. Closed-form expressions are rare in real problems.
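The two forms can be compared side by side. The sketch below uses a toy sequence $a_n = 1/2^n$ and a toy loss $L(\theta) = \theta^2$; both are illustrative choices, not taken from any real training setup, and the function names are hypothetical.

```python
from typing import List


def closed_form(n: int) -> float:
    """a_n = 1 / 2**n, computed directly from the index n."""
    return 1.0 / 2**n


def by_recurrence(n: int, a0: float = 1.0) -> float:
    """The same sequence via a_{k+1} = a_k / 2, starting from a_0 = 1."""
    a = a0
    for _ in range(n):
        a = a / 2
    return a


def gradient_descent(theta0: float, eta: float, steps: int) -> List[float]:
    """Iterate theta_{n+1} = theta_n - eta * L'(theta_n) for the toy loss L(theta) = theta**2."""
    thetas = [theta0]
    for _ in range(steps):
        grad = 2 * thetas[-1]                  # derivative of theta**2
        thetas.append(thetas[-1] - eta * grad)
    return thetas
```

The closed form jumps straight to any index, while the recurrence has to walk through every earlier term. Gradient descent only has the second option: there is no way to get $\theta_{100}$ without computing $\theta_1, \ldots, \theta_{99}$ first.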
**Monotonicity**: a sequence is monotonically increasing if $a_{n+1} \geq a_n$ for all $n$, and monotonically decreasing if $a_{n+1} \leq a_n$. This structural property is central to proving convergence.
| Sequence | What converges | Where it runs |
|---|---|---|
| SGD training loss | $L_n \to$ loss minimum | Every neural network training run |
| Adam optimizer moments | $m_n, v_n$ - exponential moving averages | Default in PyTorch, TensorFlow, JAX |
| PageRank power iteration | $p_n \to$ stationary distribution | Google Search from day one |
| $(1 + 1/n)^n$ | $\to e \approx 2.718$ | Continuous compounding, softmax |
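As a minimal sketch of the PageRank row, here is power iteration $p_{n+1} = M \cdot p_n$ on a made-up three-page graph. The matrix, the tolerance, and the function name are assumptions for illustration; the damping factor used in real PageRank is left out.

```python
from typing import List, Tuple


def power_iteration(M: List[List[float]], tol: float = 1e-10,
                    max_iter: int = 1000) -> Tuple[List[float], int]:
    """Iterate p_{n+1} = M p_n from the uniform distribution.

    Stops when the L1 distance between consecutive iterates drops below tol.
    Returns the last iterate and the number of iterations used.
    """
    size = len(M)
    p = [1.0 / size] * size
    for iteration in range(1, max_iter + 1):
        p_next = [sum(M[i][j] * p[j] for j in range(size)) for i in range(size)]
        change = sum(abs(x - y) for x, y in zip(p_next, p))
        p = p_next
        if change < tol:
            return p, iteration
    return p, max_iter


# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to 2, page 2 links to 0.
# Column j holds the out-link probabilities of page j, so every column sums to 1.
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

ranks, iterations = power_iteration(M)
```

For this graph the stationary distribution can be solved by hand as $(0.4, 0.2, 0.4)$, which gives something to check the iteration against.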
Gradient descent is defined by $\theta_{n+1} = \theta_n - \eta \nabla L(\theta_n)$. Which form of sequence definition is this?
The Limit of a Sequence - Cauchy's Definition
Cauchy's 1821 breakthrough was turning «close» into a game between two players. A number $L$ is the limit of the sequence $a_n$ when $$\forall \varepsilon > 0 \; \exists N: \forall n > N \; |a_n - L| < \varepsilon$$
Read it as a dialogue between two engineers. Skeptic: «I want the sequence inside a window of size $\varepsilon = 0.001$ around $L$.» Defender: «Then wait $N = 10{,}000$ iterations. After that every term fits.» If the defender has an answer for **any** $\varepsilon > 0$, the limit exists and equals $L$.
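The skeptic-defender game can be played in code for one concrete sequence. The sketch below assumes $a_n = 1/n$ with $L = 0$, where the defender's answer has a closed form; the function names are hypothetical, and the numerical check is finite evidence, not a proof.

```python
import math
from typing import Callable


def defender_answer(epsilon: float) -> int:
    """For a_n = 1/n and L = 0: any N >= 1/epsilon works, because n > N gives 1/n < epsilon."""
    return math.ceil(1.0 / epsilon)


def check_window(a: Callable[[int], float], L: float, epsilon: float,
                 N: int, horizon: int = 10_000) -> bool:
    """Spot-check |a_n - L| < epsilon for n = N+1 .. N+horizon."""
    return all(abs(a(n) - L) < epsilon for n in range(N + 1, N + horizon + 1))


def one_over_n(n: int) -> float:
    return 1.0 / n


N = defender_answer(0.001)
fits = check_window(one_over_n, 0.0, 0.001, N)
```

Tightening the window only makes the defender wait longer: a smaller $\varepsilon$ produces a larger $N$, but there is always an answer. That is exactly what convergence means.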
**Why the rigor matters**: without a numerical tolerance, «close» is a feeling. With $\varepsilon$, «close» becomes an integer $N$ a machine can count. That is why the same pattern, a tolerance plus a step count, shows up in numerical solvers, training loops, and control systems.
**Limit arithmetic**: when limits exist, they behave like numbers. $\lim(a_n + b_n) = \lim a_n + \lim b_n$, $\lim(a_n \cdot b_n) = \lim a_n \cdot \lim b_n$, $\lim(a_n / b_n) = \lim a_n / \lim b_n$ when $\lim b_n \neq 0$.
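A quick numerical sanity check of these rules, using two made-up sequences $a_n = 2 + 1/n \to 2$ and $b_n = 3 - 1/n^2 \to 3$. Evaluating at one large $n$ illustrates the rules but does not prove them.

```python
def seq_a(n: int) -> float:
    """a_n = 2 + 1/n, which tends to 2."""
    return 2 + 1 / n


def seq_b(n: int) -> float:
    """b_n = 3 - 1/n**2, which tends to 3."""
    return 3 - 1 / n**2


n = 10**6
sum_term = seq_a(n) + seq_b(n)        # should be close to 2 + 3
product_term = seq_a(n) * seq_b(n)    # should be close to 2 * 3
quotient_term = seq_a(n) / seq_b(n)   # should be close to 2 / 3
```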
Standard trick: divide by the highest power
For rational sequences
Find $\lim_{n \to \infty} \frac{3n^2 + 2n - 1}{n^2 + 5}$. Both numerator and denominator go to $\infty$ - the indeterminate form $\frac{\infty}{\infty}$. Divide numerator and denominator by $n^2$: $$= \lim \frac{3 + 2/n - 1/n^2}{1 + 5/n^2} = \frac{3 + 0 - 0}{1 + 0} = 3$$ For rational sequences only the leading powers matter. That is why big-O analysis writes $O(n^2)$ and drops lower-order terms - same math.
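The same limit can be watched numerically. This is a small sketch with a hypothetical function name; it evaluates the sequence at growing $n$ and shows the lower-order terms fading.

```python
def rational_term(n: int) -> float:
    """a_n = (3n^2 + 2n - 1) / (n^2 + 5)."""
    return (3 * n**2 + 2 * n - 1) / (n**2 + 5)


if __name__ == "__main__":
    for n in (1, 10, 1_000, 1_000_000):
        print(f"n = {n:>9}: a_n = {rational_term(n):.6f}")
```

The gap to the limit is $a_n - 3 = \frac{2n - 16}{n^2 + 5}$, so it shrinks roughly like $2/n$: each tenfold increase in $n$ buys about one more correct digit.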
Sequence $a_n = (-1)^n / n$: $-1, 0.5, -0.33, 0.25, \ldots$ - converges or diverges?
Convergence Criteria
Sometimes finding the limit is hard, but knowing whether it exists at all is what matters - to decide whether to keep training, to prove an algorithm is correct, to bound worst-case runtime.
**Weierstrass theorem (monotone convergence)**: if a sequence is **monotone** (only growing or only shrinking) and **bounded** (lives inside some interval), it **necessarily converges**. Picture stairs under a ceiling. The steps only go up, but the ceiling blocks them. They must level off somewhere at or below the ceiling.
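To see the staircase in numbers, here is a sketch with a sequence that is not in the text above: $a_0 = 0$, $a_{n+1} = \sqrt{2 + a_n}$. It only goes up and never passes the ceiling 2, so the theorem guarantees a limit, which here happens to equal the ceiling itself.

```python
import math
from typing import List


def nested_root_terms(count: int) -> List[float]:
    """First `count` terms of a_0 = 0, a_{n+1} = sqrt(2 + a_n)."""
    terms = [0.0]
    for _ in range(count - 1):
        terms.append(math.sqrt(2.0 + terms[-1]))
    return terms


def is_monotone_increasing(terms: List[float]) -> bool:
    """True if a_{n+1} >= a_n for every consecutive pair."""
    return all(later >= earlier for earlier, later in zip(terms, terms[1:]))


def is_bounded_above(terms: List[float], ceiling: float) -> bool:
    """True if every listed term is at or below the ceiling."""
    return all(t <= ceiling for t in terms)


terms = nested_root_terms(30)
```

The checks run on finitely many terms, so they illustrate the hypotheses without proving them. The actual proof is a short induction on $a_n < 2$ and $a_{n+1} > a_n$.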
Euler's number from finance
$e$ as the limit of continuous compounding
A \$1 deposit, annual rate 100%, compounded $n$ times per year: $$a_n = \left(1 + \frac{1}{n}\right)^n$$ $a_1 = 2$, $a_5 \approx 2.49$, $a_{100} \approx 2.705$, $a_{10^6} \approx 2.71828$... **Monotone**: $a_{n+1} > a_n$ (provable). **Bounded**: $a_n < 3$ always. Weierstrass theorem: the limit exists. We name it $e \approx 2.71828$. This limit is continuous compounding - the cap on what infinite re-investing can give. Every softmax in a neural net runs on $e$.
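The values quoted above are easy to reproduce. A minimal sketch, with a hypothetical function name, that computes $a_n$ and its gap to `math.e`:

```python
import math


def compound(n: int) -> float:
    """Value of a 1-unit deposit after one year at 100% annual rate, compounded n times."""
    return (1.0 + 1.0 / n) ** n


if __name__ == "__main__":
    for n in (1, 5, 100, 10**6):
        gap = math.e - compound(n)
        print(f"n = {n:>7}: a_n = {compound(n):.5f}   gap to e = {gap:.2e}")
```

The gap shrinks roughly like $e / (2n)$, which is slow: a million compounding periods give only about five correct decimal places. For extremely large $n$, floating-point rounding in `1.0 + 1.0 / n` starts to dominate, so this direct formula is a demonstration and not a practical way to compute $e$.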
**Cauchy criterion**: a sequence of real numbers converges if and only if it is Cauchy: $\forall \varepsilon > 0 \; \exists N: \forall m, n > N \; |a_m - a_n| < \varepsilon$. The terms themselves cluster together - that is enough for convergence, even when we do not know the limit.
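In code, the Cauchy criterion is the idea behind stopping rules that never look at the unknown limit. The sketch below is a heuristic under stated assumptions: it checks the spread of the last `window` terms, where the window size and the function names are illustrative choices. A finite window can only suggest that a sequence is Cauchy, never prove it.

```python
from typing import List


def tail_spread(terms: List[float], window: int) -> float:
    """Largest distance between any two of the last `window` terms."""
    tail = terms[-window:]
    return max(tail) - min(tail)


def looks_cauchy(terms: List[float], epsilon: float, window: int = 100) -> bool:
    """Heuristic: the last `window` terms all lie within epsilon of each other."""
    return len(terms) >= window and tail_spread(terms, window) < epsilon


# (-1)^n / n clusters around 0, while (-1)^n keeps jumping by 2.
alternating_decay = [(-1) ** n / n for n in range(1, 10_001)]
pure_oscillation = [float((-1) ** n) for n in range(1, 10_001)]

# Partial sums of the harmonic series: consecutive terms differ by only 1/n,
# yet the sums grow without bound.
harmonic_sums: List[float] = []
total = 0.0
for n in range(1, 10_001):
    total += 1.0 / n
    harmonic_sums.append(total)
```

The harmonic sums are the cautionary case. Comparing only consecutive terms, $|a_{n+1} - a_n| < \varepsilon$, would pass them, because the criterion demands that **all** pairs beyond $N$ be close, not just neighbours. Checking a window of terms catches more of these cases, though still not all.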
**Three divergence modes** worth recognizing:
1. $a_n = n$: runaway to $+\infty$. In ML: «the loss exploded», learning rate too large.
2. $a_n = (-1)^n$: oscillation. In ML: the optimizer bouncing across the valley.
3. $a_n = \sin(n)$: wandering with no trend. In ML: «training is unstable, restart with a different seed».
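The first two modes can be reproduced with gradient descent on a toy loss $L(\theta) = \theta^2$, where the update simplifies to $\theta_{n+1} = (1 - 2\eta)\,\theta_n$. The loss, the learning rates, and the function name are assumptions chosen so that each regime shows up cleanly; real loss surfaces are not this tidy.

```python
from typing import List


def run_gd(eta: float, theta0: float = 1.0, steps: int = 50) -> List[float]:
    """Iterate theta_{n+1} = theta_n - eta * 2 * theta_n for L(theta) = theta**2."""
    theta = theta0
    history = [theta]
    for _ in range(steps):
        theta = theta - eta * 2 * theta
        history.append(theta)
    return history


converging = run_gd(eta=0.1)    # factor 1 - 2*eta = 0.8: shrinks toward 0
oscillating = run_gd(eta=1.0)   # factor -1: flips between +1 and -1 forever
exploding = run_gd(eta=1.1)     # factor -1.2: grows in magnitude every step
```

The third mode, wandering like $\sin(n)$, does not appear in this one-dimensional deterministic toy. In practice it tends to come from noise in the gradients, not from a single badly chosen constant.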
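**Convergence rate sets the cost of training**: the error shrinks like $O(1/\sqrt{n})$ for Monte Carlo estimates, like $O(1/n)$ for averaged SGD, and like $O(\rho^n)$ with $0 < \rho < 1$ for gradient descent, the latter two under standard assumptions such as strong convexity. Which of these rates applies can decide whether a model trains in a day or a month.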
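To make the gap between these rates concrete, the sketch below counts how many steps each one needs before the error bound drops under a tolerance $\varepsilon$. It ignores the constants hidden in the $O(\cdot)$ notation and assumes a contraction factor $\rho = 0.9$, which is an illustrative value only.

```python
import math
from typing import Dict


def steps_needed(epsilon: float, rho: float = 0.9) -> Dict[str, int]:
    """Smallest n (up to rounding) with bound(n) <= epsilon for each rate."""
    return {
        "1/sqrt(n)": math.ceil(1.0 / epsilon**2),                  # 1/sqrt(n) <= eps
        "1/n": math.ceil(1.0 / epsilon),                           # 1/n <= eps
        "rho^n": math.ceil(math.log(epsilon) / math.log(rho)),     # rho**n <= eps
    }


budget = steps_needed(1e-3)
```

For $\varepsilon = 10^{-3}$ the three counts come out at roughly a million, a thousand, and under a hundred steps. The sublinear rates pay polynomially for each extra digit of accuracy, while the geometric rate pays a fixed number of steps per digit.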
Misconception: «approaches zero» = «eventually becomes zero»
A limit is a tendency, not a terminal value
$\frac{1}{n}$ never equals zero - $\frac{1}{10^{12}}$ is still positive. Terms become *arbitrarily close* to zero. In ML: a converging training loss typically never reaches its limiting value exactly either, but it gets within any requested tolerance of it.
The sequence $a_n = (1 + 1/n)^n$ is monotonically increasing and bounded above by 3. What does the Weierstrass theorem say?
Summary
- A sequence = an ordered list of numbers $a_1, a_2, \ldots$ - a function on $\mathbb{N}$
- A limit is a tendency, not a terminal value. Rigorously: $\forall \varepsilon > 0 \; \exists N: \forall n > N \; |a_n - L| < \varepsilon$
- Convergence vs. divergence is the binary question behind every training run and iterative solver
- Weierstrass theorem: monotone + bounded => converges
- Three divergence modes: runaway, oscillation, wandering - each has a direct analogue in ML
What's next
Sequences are the alphabet. Everything in calculus is built on their limits:
- Series - Partial sums form a sequence. Fourier and Taylor series are limits of sequences of sums
- Derivative - A limit of difference quotients. Gradient, backprop, automatic differentiation all live here
- Integral - Limit of Riemann sums. Every expectation, every area, every continuous probability
Questions for reflection
- If a neural network's training loss keeps decreasing but never reaches zero, what does it share with Zeno's paradox of walking to a wall, where each step covers half the remaining distance?
- Gradient descent diverged on a new dataset. In the language of this lesson, which divergence mode is it and why?
- Write down a recurrence from a familiar context (a loop, a compound-interest formula, a game simulation). Is it a convergent sequence? What is the limit?