Calculus

Sequences: How Infinity Converges

In 1821 Cauchy published «Cours d'analyse», which gave one of the first rigorous treatments of the limit. For roughly 150 years before that, mathematicians had worked with «infinitesimals» without a precise definition. Today the same idea, closeness within a tolerance, shows up in the convergence checks of numerical libraries such as PyTorch, and the analysis of neural network training rests on its consequences.

SGD training loss is a sequence converging to a minimum. The convergence rate sets the cost of training.
PageRank power iteration: $p_{n+1} = M \cdot p_n$ - 50 iterations on $10^{11}$ pages. It works because the sequence converges.
The number $e$ is the limit of $(1 + 1/n)^n$. It underpins softmax, exponential decay, and compound interest.

What a Sequence Is

A sequence is an ordered list of numbers $a_1, a_2, a_3, \ldots$ where each natural number $n$ is paired with exactly one number $a_n$. Formally: a function $a: \mathbb{N} \to \mathbb{R}$.

Reading	What it is	Who reads it this way
A list with indices	$a_n$ - a function on $\mathbb{N}$	Mathematician, textbook
An iterative process	Next state from previous state	Programmer (for-loop, `while not converged`)
A trajectory in time	System evolving step by step	ML engineer (training loss at epoch $n$)

Sequences come in two forms: **closed-form** $a_n = f(n)$ - the value is computed directly from the index - and **recurrent** $a_{n+1} = f(a_n)$ - each element depends on the previous one.
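To make the two forms concrete, here is a minimal sketch that computes the same sequence both ways. The sequence $a_n = 1/2^n$ and the function names are illustrative choices, not part of the lesson.

```python
def closed_form(n: int) -> float:
    """a_n = 1 / 2^n, computed directly from the index n."""
    return 1.0 / 2**n


def recurrent(n: int) -> float:
    """The same sequence via a_{k+1} = a_k / 2, starting from a_1 = 1/2."""
    a = 0.5  # a_1
    for _ in range(n - 1):
        a = a / 2  # each term depends only on the previous one
    return a


for n in (1, 2, 5, 10):
    print(n, closed_form(n), recurrent(n))
```

The closed form jumps straight to any index, while the recurrence has to walk through every earlier term. When only the recurrence is known, as with SGD, the walk is the computation.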

In ML, recurrent sequences are everywhere: SGD ($\theta_{n+1} = \theta_n - \eta \nabla L(\theta_n)$), Adam, RNNs, diffusion models - all recurrences. Closed-form expressions are rare in real problems.

**Monotonicity**: a sequence is monotonically increasing if $a_{n+1} \geq a_n$ for all $n$, and monotonically decreasing if $a_{n+1} \leq a_n$. This structural property is central to proving convergence.

Sequence	What converges	Where it runs
SGD training loss	$L_n \to$ loss minimum	Every neural network training run
Adam optimizer moments	$m_n, v_n$ - exponential moving averages	Default in PyTorch, TensorFlow, JAX
PageRank power iteration	$p_n \to$ stationary distribution	Google Search from day one
$(1 + 1/n)^n$	$\to e \approx 2.718$	Continuous compounding, softmax

Gradient descent is defined by $\theta_{n+1} = \theta_n - \eta \nabla L(\theta_n)$. Which form of sequence definition is this?

The Limit of a Sequence - Cauchy's Definition

Cauchy's breakthrough was turning «close» into a game between two players. A sequence $a_n$ converges to a limit $L$ if $$\forall \varepsilon > 0 \; \exists N: \forall n > N \; |a_n - L| < \varepsilon$$

Read it as a dialogue between two engineers. Skeptic: «I want the sequence inside a window of size $\varepsilon = 0.001$ around $L$.» Defender: «Then wait $N = 10{,}000$ iterations. After that every term fits.» If for **any** $\varepsilon$ the defender has an answer - the limit exists.

**Why the rigor matters**: without a numerical tolerance, «close» is a feeling. With $\varepsilon$, «close» becomes an integer $N$ a machine can count. That is why the same formula lives inside a compiler, a training loop, and a control system.
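The skeptic-and-defender game can be played numerically. The sketch below assumes the example sequence $a_n = 1/n$ with $L = 0$. The helper `find_N` and its `horizon` parameter are hypothetical names introduced for illustration. A scan over finitely many terms is evidence, not a proof, because the definition quantifies over all $n > N$.

```python
from typing import Callable


def find_N(a: Callable[[int], float], L: float, eps: float,
           horizon: int = 100_000) -> int:
    """Return the last index n <= horizon with |a(n) - L| >= eps (0 if none).

    Every checked term after the returned index lies inside the eps-window.
    """
    N = 0
    for n in range(1, horizon + 1):
        if abs(a(n) - L) >= eps:
            N = n  # the window is still violated at index n
    return N


def a(n: int) -> float:
    return 1.0 / n  # example sequence, assumed for this sketch


for eps in (0.1, 0.01, 0.001):
    print(f"eps = {eps}: wait N = {find_N(a, 0.0, eps)} terms")
```

A smaller $\varepsilon$ from the skeptic forces a larger $N$ from the defender, but the defender always has an answer. That is what convergence means.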

**Limit arithmetic**: when limits exist, they behave like numbers. $\lim(a_n + b_n) = \lim a_n + \lim b_n$, $\lim(a_n \cdot b_n) = \lim a_n \cdot \lim b_n$, $\lim(a_n / b_n) = \lim a_n / \lim b_n$ when $\lim b_n \neq 0$.

Standard trick: divide by the highest power

For rational sequences

Find $\lim_{n \to \infty} \frac{3n^2 + 2n - 1}{n^2 + 5}$. Both numerator and denominator go to $\infty$ - the indeterminate form $\frac{\infty}{\infty}$. Divide numerator and denominator by $n^2$:

$$\lim_{n \to \infty} \frac{3n^2 + 2n - 1}{n^2 + 5} = \lim_{n \to \infty} \frac{3 + 2/n - 1/n^2}{1 + 5/n^2} = \frac{3 + 0 - 0}{1 + 0} = 3$$

For rational sequences only the leading powers matter. That is why big-O analysis writes $O(n^2)$ and drops lower-order terms - same math.
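A quick numerical sanity check of this limit, as a sketch: evaluate the same rational expression at growing $n$ and watch the gap to 3 shrink. The function name `a` is an illustrative choice.

```python
def a(n: int) -> float:
    """The rational sequence (3n^2 + 2n - 1) / (n^2 + 5)."""
    return (3 * n**2 + 2 * n - 1) / (n**2 + 5)


for n in (10, 100, 1_000, 1_000_000):
    print(n, a(n), abs(a(n) - 3))
```

The gap behaves roughly like $2/n$, the first lower-order term that was dropped.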

Sequence $a_n = (-1)^n / n$: $-1, 0.5, -0.33, 0.25, \ldots$ - converges or diverges?

Convergence Criteria

Sometimes finding the limit is hard, but knowing whether it exists at all is what matters - to decide whether to keep training, to prove an algorithm is correct, to bound worst-case runtime.

**Weierstrass theorem (monotone convergence)**: if a sequence is **monotone** (only growing or only shrinking) and **bounded** (lives inside some interval), it **necessarily converges**. Picture stairs under a ceiling. The steps only go up, but the ceiling blocks them, so they must level off at some height $\leq$ the ceiling.
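As an illustration of stairs under a ceiling, the sketch below uses the recurrence $a_1 = \sqrt{2}$, $a_{n+1} = \sqrt{2 + a_n}$. This example is an assumption of this sketch and does not come from the lesson. Its terms increase and stay below 2, so the theorem guarantees a limit. Checking a finite prefix only illustrates the hypotheses. It does not prove them.

```python
import math
from typing import List


def nested_roots(count: int) -> List[float]:
    """First `count` terms of a_1 = sqrt(2), a_{n+1} = sqrt(2 + a_n)."""
    terms = [math.sqrt(2.0)]
    for _ in range(count - 1):
        terms.append(math.sqrt(2.0 + terms[-1]))
    return terms


terms = nested_roots(30)

# Monotone: each step goes up (or stays level once float precision runs out).
is_increasing = all(later >= earlier for earlier, later in zip(terms, terms[1:]))
# Bounded: no term rises above the ceiling 2.
is_bounded = all(t <= 2.0 for t in terms)

print(is_increasing, is_bounded, terms[-1])
```

Here the limit can also be found by hand: a limit $L$ must satisfy $L = \sqrt{2 + L}$, which gives $L = 2$.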

Euler's number from finance

$e$ as the limit of continuous compounding

A deposit of one dollar, annual rate 100%, compounded $n$ times per year: $$a_n = \left(1 + \frac{1}{n}\right)^n$$

$a_1 = 2$, $a_5 \approx 2.49$, $a_{100} \approx 2.705$, $a_{10^6} \approx 2.71828$...

**Monotone**: $a_{n+1} > a_n$ (provable). **Bounded**: $a_n < 3$ always. Weierstrass theorem: the limit exists. We name it $e \approx 2.71828$.

This limit is continuous compounding - the cap on what infinite re-investing can give. Every softmax in a neural net runs on $e$.
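The compounding sequence is easy to tabulate. This is a small sketch and the function name `compound` is illustrative. It prints each term next to its gap from `math.e`.

```python
import math


def compound(n: int) -> float:
    """Value of one unit after a year at 100% interest, compounded n times."""
    return (1 + 1 / n) ** n


for n in (1, 5, 100, 10_000, 1_000_000):
    print(n, compound(n), math.e - compound(n))
```

The gap shrinks roughly like $e/(2n)$, so each extra digit of $e$ costs about ten times more compounding periods. For extremely large $n$ floating-point rounding takes over: `1 + 1 / n` rounds to exactly `1.0`, and the computed value stops approaching $e$.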

**Cauchy criterion**: a sequence converges if and only if it is Cauchy: $\forall \varepsilon > 0 \; \exists N: \forall m, n > N \; |a_m - a_n| < \varepsilon$. The terms themselves cluster together - that is enough for convergence, even when we do not know the limit.
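Iterative solvers use the spirit of this criterion as a stopping rule: stop when successive terms stop moving. The sketch below applies it to the Babylonian square-root iteration $x_{n+1} = \frac{1}{2}(x_n + 2/x_n)$, an example chosen for this sketch. The helper `iterate_until_settled` is a hypothetical name. Comparing only neighbouring terms is weaker than the full Cauchy condition, which compares all pairs $m, n > N$. The partial sums of the harmonic series, for instance, have shrinking successive differences and still diverge.

```python
from typing import Callable, Tuple


def iterate_until_settled(step: Callable[[float], float], x0: float,
                          tol: float = 1e-12,
                          max_iter: int = 1000) -> Tuple[float, int]:
    """Apply x <- step(x) until two successive terms differ by less than tol."""
    x = x0
    for n in range(1, max_iter + 1):
        x_next = step(x)
        if abs(x_next - x) < tol:
            return x_next, n  # terms have clustered; the limit was never needed
        x = x_next
    raise RuntimeError("sequence did not settle within max_iter steps")


def babylonian(x: float) -> float:
    """One step of the Babylonian iteration for sqrt(2)."""
    return 0.5 * (x + 2.0 / x)


root, n_steps = iterate_until_settled(babylonian, x0=1.0)
print(root, n_steps)
```

The loop never refers to $\sqrt{2}$ itself. It only watches the terms cluster, which is the situation the criterion describes.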

**Three divergence modes** worth recognizing:
1. $a_n = n$: runaway to $+\infty$. In ML: «the loss exploded», learning rate too large.
2. $a_n = (-1)^n$: oscillation. In ML: the optimizer bouncing across the valley.
3. $a_n = \sin(n)$: wandering with no trend. In ML: «training is unstable, restart with a different seed».
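Two of these modes can be reproduced with gradient descent on a toy problem. The sketch below assumes the one-dimensional loss $L(\theta) = (\theta - 3)^2$ and three hand-picked learning rates, all chosen for this sketch rather than taken from any real training setup. The wandering mode does not show up in a deterministic quadratic like this one.

```python
from typing import List


def gradient_descent(eta: float, theta0: float = 0.0,
                     steps: int = 20) -> List[float]:
    """Iterate theta_{n+1} = theta_n - eta * L'(theta_n) for L = (theta - 3)^2."""
    thetas = [theta0]
    for _ in range(steps):
        grad = 2.0 * (thetas[-1] - 3.0)  # derivative of (theta - 3)^2
        thetas.append(thetas[-1] - eta * grad)
    return thetas


converging = gradient_descent(eta=0.1)   # error shrinks by a factor 0.8 per step
oscillating = gradient_descent(eta=1.0)  # error flips sign, never shrinks
exploding = gradient_descent(eta=1.1)    # error grows by a factor 1.2 per step

for name, run in [("converging", converging),
                  ("oscillating", oscillating),
                  ("exploding", exploding)]:
    print(name, [round(t, 3) for t in run[-4:]])
```

On this loss each step multiplies the error $\theta_n - 3$ by $1 - 2\eta$, so the learning rate alone decides which of the three behaviours appears.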

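**Convergence rate sets the cost of training**: the error shrinks like $O(1/\sqrt{n})$ for Monte Carlo estimates, like $O(1/n)$ for averaged SGD, and like $O(\rho^n)$ with $0 < \rho < 1$ for gradient descent, the last two under suitable assumptions such as strong convexity. Which rate applies determines whether a model trains in a day or a month.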
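To see how much the rate matters, the sketch below counts how many steps each rate needs to push the error below a tolerance. It assumes all hidden constants equal 1 and picks $\rho = 0.9$ arbitrarily, so the numbers are orders of magnitude, not predictions. The function `steps_needed` is a hypothetical helper.

```python
import math
from typing import Dict


def steps_needed(eps: float, rho: float = 0.9) -> Dict[str, int]:
    """Approximate smallest n with error(n) <= eps, assuming all constants are 1."""
    return {
        "1/sqrt(n)": math.ceil(1 / eps**2),                      # 1/sqrt(n) <= eps
        "1/n": math.ceil(1 / eps),                               # 1/n <= eps
        "rho^n": math.ceil(math.log(eps) / math.log(rho)),       # rho^n <= eps
    }


for eps in (1e-2, 1e-3):
    print(eps, steps_needed(eps))
```

Tightening the tolerance tenfold multiplies the $1/\sqrt{n}$ cost by about a hundred and the $1/n$ cost by about ten, while the geometric rate only adds a fixed number of steps.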

Common misconception: «approaches zero» means «eventually becomes zero»

A limit is a tendency, not a terminal value

$\frac{1}{n}$ never equals zero - $\frac{1}{10^{12}}$ is still positive. Terms become *arbitrarily close* to zero. In ML: a converging loss usually never lands exactly on its limit either, but it gets within any requested tolerance of it.

The sequence $a_n = (1 + 1/n)^n$ is monotonically increasing and bounded above by 3. What does the Weierstrass theorem say?

Summary

A sequence = an ordered list of numbers $a_1, a_2, \ldots$ - a function on $\mathbb{N}$
A limit is a tendency, not a terminal value. Rigorously: $\forall \varepsilon > 0 \; \exists N: \forall n > N \; |a_n - L| < \varepsilon$
Convergence vs. divergence is the binary question behind every training run and iterative solver
Weierstrass theorem: monotone + bounded => converges
Three divergence modes: runaway, oscillation, wandering - each has a direct analogue in ML

What's next

Sequences are the alphabet. Everything in calculus is built on their limits:

Series — Partial sums form a sequence. Fourier and Taylor series are limits of sequences of sums
Derivative — A limit of difference quotients. Gradient, backprop, automatic differentiation all live here
Integral — Limit of Riemann sums. Every expectation, every area, every continuous probability

Questions for reflection

If a neural network's training loss keeps decreasing but never reaches zero, what does it share with Zeno's paradox of walking to a wall by repeatedly halving the remaining distance?
Gradient descent diverged on a new dataset. In the language of this lesson, which divergence mode is it and why?
Write down a recurrence from a familiar context (a loop, a compound-interest formula, a game simulation). Is it a convergent sequence? What is the limit?

Related lessons

stat-02-estimation