Probability Theory
Random Variables
Lesson objectives
- Understand what a random variable is and why it's needed
- Distinguish discrete and continuous random variables
- Master the distribution function F(x) and density f(x)
- Understand the paradox "probability of an exact value = 0"
- Learn to find probabilities using a distribution
Prerequisites
- Basic concepts of probability
- Event independence
Every token GPT generates is a random variable. With sampling-based decoding the model does not pick deterministically - it builds a probability distribution over its vocabulary (50,257 entries for GPT-2) and samples from it. An entire inference run is a joint distribution over thousands of random variables. The random variable is the mathematical instrument that turns chaos into numbers.
- **GPT inference:** each next token is a discrete RV with a distribution over 50,257 values - the softmax output
- **Loss as an RV:** the loss function depends on a random mini-batch - which is why loss fluctuates on every SGD step
- **Gradient as an RV:** in SGD the gradient computed on a random batch is a random estimate of the true gradient. The variance of this RV strongly affects convergence speed
- **Dropout:** a neuron activation is a random variable (0 with probability $p$, original value / $(1-p)$ with probability $1-p$)
- **Finance:** stock return is a continuous RV - the foundation of all risk management
- **Medicine:** time-to-event (death, relapse) is a continuous RV studied in survival analysis
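The first and fourth examples above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model or framework implements sampling or dropout: the 4-token vocabulary, the logits, and the helper names `softmax` and `dropout` are all hypothetical, chosen only to keep the example small.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary and logits (illustrative values only).
vocab = ["cat", "dog", "sat", "mat"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so the run is reproducible

# One realization of the discrete RV "next token".
token = rng.choices(vocab, weights=probs, k=1)[0]

def dropout(value, p, rng):
    """Inverted dropout: 0 with probability p, value / (1 - p) otherwise."""
    return 0.0 if rng.random() < p else value / (1.0 - p)

# Averaging many realizations of the dropout RV recovers the original value.
samples = [dropout(1.0, 0.2, rng) for _ in range(100_000)]
mean_activation = sum(samples) / len(samples)
```

The rescaling by $1/(1-p)$ is what keeps the average activation close to the original value, which is why `mean_activation` stays near 1.0.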
Kolmogorov: the mathematics of chaos
Before 1933, probability theory was "unserious" mathematics - a collection of recipes for card games. Soviet mathematician Andrey Kolmogorov changed everything with a 60-page book. He defined a random variable as a **measurable function** from the sample space to the real line. That sounds abstract, but it allowed the full machinery of mathematical analysis - integrals, limits, derivatives - to be applied to randomness. Today the loss function in PyTorch is a random variable. The gradient in SGD is a random variable. Inference in a language model is a joint distribution over thousands of random variables. Without Kolmogorov's 1933 formalism, none of this would exist.
Random Variables
**Scene:** a language model just generated the word "cat". The next token is a random variable - not just "something random", but a full mathematical object with a distribution over 50,257 values, a CDF, and concrete probabilities for each outcome.
Now a paradox from the physical world: what is the probability that a random person's height is exactly 175.000000... cm with infinite precision? **Zero.** The probability of any exact value of a continuous quantity is zero - even though every person certainly has some height.
Both cases - GPT and human height - are described by one tool: the **random variable**. A function that maps each experimental outcome to a number and makes that number mathematically tractable.
Formally, a random variable is a function $X: \Omega \to \mathbb{R}$ on the outcome space. It has a distribution, and the two cases studied in this lesson are discrete and continuous.
📐 What is a random variable?
A **random variable (RV)** is a function that assigns a number to each outcome of an experiment. In GPT, the next token is a discrete RV: outcomes are all 50,257 vocabulary tokens, values are their indices, distribution is the softmax over the model's logits.
$$X: \Omega \to \mathbb{R}$$
where $\Omega$ is the sample space and $\mathbb{R}$ is the real line.
🎲 Rolling a die
The simplest random variables
$\Omega = \{1, 2, 3, 4, 5, 6\}$ - all possible outcomes.
**RV "number of pips":** $X(\omega) = \omega$. Simply returns what was rolled.
**RV "parity":**
$$Y(\omega) = \begin{cases} 1, & \text{if } \omega \text{ is even} \\ 0, & \text{if } \omega \text{ is odd} \end{cases}$$
**RV "square of pips":** $Z(\omega) = \omega^2$. Returns 1, 4, 9, 16, 25, or 36.
See? Many different random variables can be extracted from a single experiment!
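Since a random variable is just a function on outcomes, the three die examples translate directly into code. The sketch below assumes a fair die (equally likely outcomes); the helper name `distribution` is made up for this illustration.

```python
from fractions import Fraction

# Sample space of one die roll.
omega = [1, 2, 3, 4, 5, 6]

def X(w):
    """Number of pips: returns the outcome itself."""
    return w

def Y(w):
    """Parity: 1 for an even outcome, 0 for an odd one."""
    return 1 if w % 2 == 0 else 0

def Z(w):
    """Square of the number of pips."""
    return w ** 2

def distribution(rv, outcomes):
    """P(rv = v) for each value v, assuming equally likely outcomes."""
    p = Fraction(1, len(outcomes))
    dist = {}
    for w in outcomes:
        dist[rv(w)] = dist.get(rv(w), 0) + p
    return dist

dist_Y = distribution(Y, omega)  # parity takes each of its two values half the time
dist_Z = distribution(Z, omega)  # six distinct values, each with probability 1/6
```

Note that `Y` merges several outcomes into one value, so its distribution has only two entries even though the experiment has six outcomes.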
- **Random variables** - capital letters: $X, Y, Z$
- **Specific values** - lowercase: $x, y, z$
- **$P(X = 3)$** - the probability that the RV takes the value 3
- **$P(X \leq 5)$** - the probability that the value is no greater than 5
"$X = 3$" is an **event**: the set of all outcomes where $X$ takes the value 3. In ML: "$L > 2.5$" (loss exceeded a threshold) is an event defined through the RV $L$.
Two coins are tossed. X = "number of heads". What values can X take?
Outcomes: HH (2 heads), HT (1 head), TH (1 head), TT (0 heads). $X$ can be 0, 1, or 2. This is a **discrete** random variable - it takes a finite number of values.
🔢 Discrete random variables
An RV is called **discrete** if it takes a finite or countable number of values (they can be numbered: 1st, 2nd, 3rd...).
A discrete RV is described by a **probability distribution table** - a list of values with their probabilities:
| $X$ | $x_1$ | $x_2$ | ... | $x_n$ |
|---|---|---|---|---|
| $P$ | $p_1$ | $p_2$ | ... | $p_n$ |
🎲🎲 Sum of two dice
Complete distribution
$X$ = sum of pips on two dice. Values: from 2 to 12.
| $X$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $P$ | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | **6/36** | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |
**7 is the most likely value!** (6 ways: 1+6, 2+5, 3+4, 4+3, 5+2, 6+1) Check: $\frac{1+2+3+4+5+6+5+4+3+2+1}{36} = \frac{36}{36} = 1$ ✓
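The table can be reproduced by brute-force enumeration of all 36 equally likely pairs. This is a small sketch assuming two fair, independent dice; the variable names are arbitrary.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 ordered pairs (a, b) give each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Turn counts into exact probabilities, ordered by the value of the sum.
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}

most_likely = max(pmf, key=pmf.get)  # the sum with the largest probability
total = sum(pmf.values())            # normalization check
```

`Fraction` reduces automatically, so the entry for 7 is stored as 1/6 rather than 6/36.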
A coin is tossed 3 times. X = number of heads. What is P(X = 2)?
Total $2^3 = 8$ equally likely outcomes. Exactly 2 heads: HHT, HTH, THH - **3 ways**. $P(X = 2) = 3/8$
Full distribution:
- $P(X=0) = 1/8$ (only TTT)
- $P(X=1) = 3/8$ (HTT, THT, TTH)
- $P(X=2) = 3/8$ (HHT, HTH, THH)
- $P(X=3) = 1/8$ (only HHH)
This is the **binomial distribution** - we'll study it later!
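As a quick check, the same distribution can be obtained both by listing the 8 outcomes and by counting with binomial coefficients, $P(X = k) = \binom{3}{k} / 2^3$. The formula is not derived in this lesson; it is used here only as a cross-check, assuming a fair coin.

```python
from fractions import Fraction
from itertools import product
from math import comb

# All 2^3 equally likely sequences of heads (H) and tails (T).
outcomes = list(product("HT", repeat=3))

# Enumeration: count the sequences with exactly k heads.
pmf_enumerated = {
    k: Fraction(sum(1 for o in outcomes if o.count("H") == k), len(outcomes))
    for k in range(4)
}

# Counting formula: C(3, k) ways to place k heads among 3 tosses.
pmf_formula = {k: Fraction(comb(3, k), 2 ** 3) for k in range(4)}
```

Both dictionaries hold the same four probabilities, which is the point of the check.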
📈 The distribution function F(x)
The **distribution function** (also called the cumulative distribution function, CDF) is a universal way to describe any RV (discrete or continuous):
$$F(x) = P(X \leq x)$$
This is the probability that the RV takes a value **no greater than** $x$.
1. $0 \leq F(x) \leq 1$ - probability is always between 0 and 1
2. $F(x)$ is **non-decreasing** - larger $x$ → greater or equal $F(x)$
3. $\lim_{x \to -\infty} F(x) = 0$ - to the left of all values
4. $\lim_{x \to +\infty} F(x) = 1$ - to the right of all values
5. **Key formula:** $P(a < X \leq b) = F(b) - F(a)$
🎲 F(x) for a die
Step function
$F(x) = P(X \leq x)$ for a die:
- $F(x) = 0$ for $x < 1$
- $F(x) = 1/6$ for $1 \leq x < 2$
- $F(x) = 2/6$ for $2 \leq x < 3$
- $F(x) = 3/6$ for $3 \leq x < 4$
- $F(x) = 4/6$ for $4 \leq x < 5$
- $F(x) = 5/6$ for $5 \leq x < 6$
- $F(x) = 1$ for $x \geq 6$
```
F(x)
1.0 ─────────────────●
5/6 ─────────────●
4/6 ─────────●
3/6 ─────●
2/6 ─●
1/6 ●
    └─1──2──3──4──5──6──→ x
```
A **staircase** with jumps at 1, 2, 3, 4, 5, 6.
For a die, F(3.5) = ?
$F(3.5) = P(X \leq 3.5) = P(X \in \{1, 2, 3\}) = 3/6 = 1/2$ The values 4, 5, 6 are greater than 3.5, so they are excluded. The distribution function is defined for **any** $x$, not just for values the RV can take!
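The staircase can be written as a short function, which makes it easy to evaluate $F$ at non-integer points and to apply the key formula $P(a < X \leq b) = F(b) - F(a)$. A minimal sketch for a fair die; the name `die_cdf` is hypothetical.

```python
import math
from fractions import Fraction

def die_cdf(x):
    """F(x) = P(X <= x) for a fair six-sided die."""
    # The number of faces that are <= x is floor(x), clamped to the range 0..6.
    k = min(max(math.floor(x), 0), 6)
    return Fraction(k, 6)

f_at_3_5 = die_cdf(3.5)                # counts the faces 1, 2, 3
p_2_to_5 = die_cdf(5) - die_cdf(2)     # P(2 < X <= 5): faces 3, 4, 5
```

The function is constant between integers and jumps by 1/6 at each face value, matching the step function described above.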
〰️ Continuous random variables
An RV is **continuous** if it can take any value from an interval (not just integers or rationals - **any** value).
For a continuous RV, the probability of any **exact** value is zero:
$$P(X = a) = 0$$
**Why?** Because there are infinitely many (uncountably many) values, and each receives a "zero share" of probability.
**But!** This doesn't mean the event is impossible. A specific height is a concrete number, even though $P(\text{height} = 175.12345...) = 0$.
The paradox resolves like this: we don't ask "exactly equal to", but rather "falls within an interval".
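One way to make the "zero share" argument precise is to squeeze the exact value inside a shrinking interval. The sketch below uses the distribution function $F(x) = P(X \leq x)$ and a density $f$, and it assumes that $f$ is bounded by some constant $M$ near $a$ (an assumption added here for simplicity, not a requirement stated in the lesson). For every $\varepsilon > 0$:

$$P(X = a) \leq P(a - \varepsilon < X \leq a) = F(a) - F(a - \varepsilon) = \int_{a-\varepsilon}^{a} f(t)\,dt \leq M\varepsilon$$

Since this holds for arbitrarily small $\varepsilon$, the only possible value is $P(X = a) = 0$.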
A continuous RV is described by a **probability density function** $f(x)$:
$$P(a < X \leq b) = \int_a^b f(x)\,dx$$
The density is a "concentration of probability". A high $f(x)$ means values near $x$ are more likely.
$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$
$$f(x) = F'(x)$$
Density is the derivative of the distribution function. The distribution function is the integral of the density.
🚌 Uniform distribution
Bus waiting time
A bus arrives at a random moment between 0 and 10 minutes. $X \sim U(0, 10)$.
**Density:**
$$f(x) = \begin{cases} 1/10, & 0 \leq x \leq 10 \\ 0, & \text{otherwise} \end{cases}$$
**Probability of waiting between 2 and 5 minutes:**
$$P(2 \leq X \leq 5) = \int_2^5 \frac{1}{10}\,dx = \frac{5-2}{10} = 0.3 = 30\%$$
**Geometrically:** this is the area of the rectangle under the graph of $f(x)$ from 2 to 5.
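The 30% figure can be checked two ways: exactly, through the distribution function of $U(0, 10)$, and approximately, by simulating many waiting times. A rough sketch; the seed and sample size are arbitrary choices, and `uniform_cdf` is a name introduced only for this example.

```python
import random

def uniform_cdf(x, a=0.0, b=10.0):
    """F(x) for the uniform distribution on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

# Exact answer via F(5) - F(2).
exact = uniform_cdf(5) - uniform_cdf(2)

# Monte Carlo estimate: the fraction of simulated waits that fall in [2, 5].
rng = random.Random(42)  # fixed seed for reproducibility
n = 200_000
hits = sum(1 for _ in range(n) if 2 <= rng.uniform(0, 10) <= 5)
estimate = hits / n
```

With this many samples the estimate typically agrees with the exact value to about two decimal places.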
**Common misconception:** "f(x) is the probability of the value x."
**In fact:** f(x) is the probability DENSITY. On its own, it can be greater than 1!
Probability is the **area under the curve**, not the height.
For example, for $U(0, 0.1)$: $f(x) = 10$ for $0 \leq x \leq 0.1$. So $f(x) = 10 > 1$, but $\int_0^{0.1} 10\,dx = 1$ ✓
Density is "probability per unit length". In normalizing flows (RealNVP, Glow), the model explicitly learns $f(x)$ - and density values >> 1 are perfectly normal there.
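The $U(0, 0.1)$ example is easy to verify numerically: the height of the density is 10, yet the area under it is 1. The sketch below uses a simple midpoint Riemann sum as the integrator; `integrate` is a hypothetical helper written for this illustration, not a library function.

```python
def uniform_pdf(x, a=0.0, b=0.1):
    """Density of the uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, n=10_000):
    """Approximate the integral of f over [lo, hi] with a midpoint Riemann sum."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

height = uniform_pdf(0.05)               # density value, well above 1
area = integrate(uniform_pdf, 0.0, 0.1)  # total probability, close to 1
```

A density value is only meaningful once it is multiplied by a length, which is exactly what the sum does.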
Density of an RV: f(x) = 3x² for 0 ≤ x ≤ 1, otherwise 0. What is P(X > 0.5)?
$$P(X > 0.5) = \int_{0.5}^{1} 3x^2\,dx = x^3 \Big|_{0.5}^{1} = 1^3 - 0.5^3 = 1 - 0.125 = 0.875$$ **87.5%** of the probability is concentrated in the upper half of the interval. This is because the density $3x^2$ increases with $x$ - larger values are more likely.
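The result can also be checked by simulation. The sketch below draws samples with inverse-transform sampling, a technique not covered in this lesson: since $F(x) = x^3$ on $[0, 1]$ for this density, applying $F^{-1}(u) = u^{1/3}$ to a uniform $u$ yields a sample of $X$. The seed and sample size are arbitrary.

```python
import random

rng = random.Random(7)  # fixed seed for reproducibility
n = 200_000

# Inverse-transform sampling: X = U^(1/3) has CDF F(x) = x^3 on [0, 1].
samples = [rng.random() ** (1 / 3) for _ in range(n)]

# Fraction of samples that land in the upper half of the interval.
estimate = sum(1 for x in samples if x > 0.5) / n

# Exact value from the integral: 1 - 0.5^3.
exact = 1 - 0.5 ** 3
```

The estimate should land close to 0.875, reflecting how the increasing density pushes most samples toward 1.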
For a continuous RV with density f(x), it holds that ∫f(x)dx = 1. Why?
$$\int_{-\infty}^{+\infty} f(x)\,dx = P(-\infty < X < +\infty) = 1$$ This is the probability that the RV takes **some** value. And that is always 1 - something must happen! This is the **normalization condition**: the total "area under the density curve" = 1.
🏋️ Practice
Discrete RV X: P(X = -1) = 0.2, P(X = 0) = 0.5, P(X = 2) = 0.3. Find P(X > 0) and P(-1 < X ≤ 2).
$P(X > 0) = P(X = 2) = 0.3$
$P(-1 < X \leq 2) = P(X = 0) + P(X = 2) = 0.5 + 0.3 = 0.8$
Note: $-1 < X$ excludes the value $-1$, while $X \leq 2$ includes the value $2$.
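For a discrete RV, the probability of any event is the sum of the probabilities of the values that satisfy it. A minimal sketch of this exercise, with the distribution stored as a dictionary; the helper name `prob` is made up.

```python
from fractions import Fraction

# Distribution table from the exercise: value -> probability.
pmf = {-1: Fraction(2, 10), 0: Fraction(5, 10), 2: Fraction(3, 10)}

def prob(event):
    """Sum the probabilities of all values x for which event(x) is true."""
    return sum(p for x, p in pmf.items() if event(x))

p_positive = prob(lambda x: x > 0)        # only the value 2 qualifies
p_interval = prob(lambda x: -1 < x <= 2)  # values 0 and 2; -1 is excluded
```

Writing the event as a predicate makes the strict and non-strict inequalities explicit, which is where mistakes usually happen.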
Density of an RV: f(x) = cx² for 0 ≤ x ≤ 1, otherwise 0. Find the constant c.
$$\int_0^1 cx^2\,dx = c \cdot \frac{x^3}{3}\Big|_0^1 = \frac{c}{3} = 1$$
$$c = 3$$
Check: $\int_0^1 3x^2\,dx = x^3\Big|_0^1 = 1$ ✓
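The same constant can be found numerically: integrate the unnormalized shape $x^2$ and take the reciprocal of the area. A small sketch using a midpoint Riemann sum; `integrate` is a hypothetical helper, and the step count is an arbitrary choice.

```python
def integrate(f, lo, hi, n=10_000):
    """Approximate the integral of f over [lo, hi] with a midpoint Riemann sum."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

# Area under the unnormalized shape x^2 on [0, 1] (analytically 1/3).
area = integrate(lambda x: x ** 2, 0.0, 1.0)

# The normalization condition c * area = 1 gives c.
c = 1.0 / area

# Check: the normalized density integrates to 1.
total = integrate(lambda x: c * x ** 2, 0.0, 1.0)
```

This approach works for any non-negative shape with finite area, even when the integral has no closed form.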
Distribution function: F(x) = 0 for x < 0, F(x) = x² for 0 ≤ x ≤ 1, F(x) = 1 for x > 1. Find the density f(x) and P(0.3 < X ≤ 0.7).
**Density:**
$$f(x) = F'(x) = \begin{cases} 2x, & 0 \leq x \leq 1 \\ 0, & \text{otherwise} \end{cases}$$
**Probability:**
$$P(0.3 < X \leq 0.7) = F(0.7) - F(0.3) = 0.7^2 - 0.3^2 = 0.49 - 0.09 = 0.4$$
Or via integral: $\int_{0.3}^{0.7} 2x\,dx = x^2\Big|_{0.3}^{0.7} = 0.4$
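The relation $f(x) = F'(x)$ can be checked numerically by differentiating $F$ with a central finite difference. A minimal sketch for this exercise; the step size `h` is an arbitrary small number, and the approximation is only valid at points where $F$ is differentiable (so not exactly at 0 or 1).

```python
def F(x):
    """Distribution function from the exercise."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x * x

def density(x, h=1e-6):
    """Central finite-difference approximation of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

f_half = density(0.5)        # should be close to 2 * 0.5
f_outside = density(2.0)     # F is flat outside [0, 1], so the slope is 0
p = F(0.7) - F(0.3)          # P(0.3 < X <= 0.7) via the key formula
```

The probability comes straight from two evaluations of $F$, with no integration needed.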
A continuous RV has density $f(x) = 2x$ on $[0,1]$ and 0 elsewhere. What is $P(X \leq 0.5)$?
$P(X \leq 0.5) = \int_0^{0.5} 2x\,dx = x^2\Big|_0^{0.5} = 0.25$. This equals $F(0.5) = (0.5)^2$ for the triangular density on $[0,1]$.
Random variables - the foundation of everything
This is the central concept on which all of probability theory and statistics is built.
- Expected Value — The "average" value of an RV - next lesson!
- Variance — A measure of the "spread" of RV values
- Named Distributions — Binomial, Poisson, normal...
- Central Limit Theorem — Why everything becomes "normal"
Key ideas
- **RV** - a function $X: \Omega \to \mathbb{R}$. The next GPT token is a discrete RV. The training loss is a continuous RV. The SGD gradient is a random estimate of the true gradient
- **Discrete RV:** finite/countable values, described by $P(X = x_i)$. The softmax output of a language model is a probability distribution over a discrete RV with 50,257 values
- **Continuous RV:** any value from an interval, $P(X = a) = 0$ - not a bug. Only intervals carry nonzero probability: $P(a < X \leq b) = \int_a^b f(x)\,dx$
- **Distribution function:** $F(x) = P(X \leq x)$ - universal for all RVs. In model calibration: a well-calibrated model's predicted probabilities match the observed frequencies, so its curve lies close to the diagonal on a reliability diagram
- **Density:** $f(x) = F'(x)$, can be > 1, probability = area. Normalizing flows explicitly learn $f(x)$ - that is exactly why they can compute exact likelihoods
Questions for reflection
- 🤯 Back to the paradox: how is "P(height = exact value) = 0" reconciled with the fact that a specific height exists?
- 📊 Which random variables in everyday life are discrete, and which are continuous? Can money be modeled as a continuous RV?
- 🎮 If sword damage in a game is a uniformly distributed RV from 10 to 20, what is the probability of dealing more than 15 damage?
- 🧬 Why do biologists often use continuous models for discrete quantities (such as the number of bacteria)?