Probability Theory
What is Probability?
In 1654, Pascal and Fermat exchanged letters about gambling. From that correspondence, probability theory was born. 370 years later, the same formulas choose ChatGPT's next token, flag fraud at Stripe, and underlie the loss functions used to train models at every AI lab.
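- ChatGPT: every token is a sample from a distribution over a vocabulary of roughly 100,000 tokens (subword pieces, not whole words). The temperature parameter exposed by LLM APIs rescales that distribution before sampling.
- Gmail-style spam filter: the classic Naive Bayes approach computes $P(\text{spam} \mid \text{words})$ in microseconds, cheap enough to run billions of times per day.
- Stripe fraud detection: $P(\text{fraud} \mid \text{transaction})$ - a single probability, compared against a threshold, blocks or approves the payment.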
Kolmogorov Axioms: Three Rules for Everything
The puzzle behind that 1654 correspondence: Chevalier de Méré asked Pascal, «How is a pot fairly split if the game is stopped?»
Before computing any probability, the **set of all possible outcomes** is required. We call it $\Omega$ (omega) - the sample space. An **event** $A$ is any subset of $\Omega$.
| Experiment | $\Omega$ | Example event $A$ |
|---|---|---|
| Flip a coin | $\{\text{heads}, \text{tails}\}$, size 2 | «heads» |
| Roll a die | $\{1,2,3,4,5,6\}$, size 6 | «even number» = $\{2,4,6\}$ |
| Two dice | all pairs $(i,j)$, size 36 | «sum = 7» = 6 pairs |
| GPT-4 next token | vocabulary, size ~100,000 | «a positive-sentiment word» |
In 1933 Andrey Kolmogorov sidestepped the philosophical debate about the nature of probability with one move, which in paraphrase amounts to: «Let's stop arguing. Let's agree on the **rules** that any probability must obey - and derive everything from those rules.»
**Kolmogorov's three axioms:**

1. **Non-negativity:** $P(A) \geq 0$
2. **Normalization:** $P(\Omega) = 1$ - something in $\Omega$ must happen
3. **Additivity:** for mutually exclusive events ($A \cap B = \emptyset$): $$P(A \cup B) = P(A) + P(B)$$

The full axiom extends additivity to countable collections of pairwise disjoint events; the two-event form is enough for the finite examples here.

All of probability theory - Bayes' theorem, expectation, information theory, neural-network cross-entropy loss - follows from these three rules.
A theorem from the axioms: $P(\bar{A}) = 1 - P(A)$
The complement trick
Event $A$ and its complement $\bar{A}$ cover $\Omega$ and do not overlap, so $A \cup \bar{A} = \Omega$ with $A \cap \bar{A} = \emptyset$. By axiom 3, $P(A) + P(\bar{A}) = P(\Omega)$, and by axiom 2, $P(\Omega) = 1$. Therefore: $P(\bar{A}) = 1 - P(A)$. **This is the single most used trick in probability.** Example: «probability the server is up all day» = $1 -$ «probability it crashes at least once».
**Classical probability** (works when outcomes are equally likely): $P(A) = |A| / |\Omega|$. Sum = 7 with two dice: 6 pairs out of 36, $P = 6/36 \approx 16.7\%$. Sum = 12: only $(6,6)$, $P = 1/36 \approx 2.8\%$. Craps is built on this asymmetry: 7 is the most likely sum, while 2 and 12 are the least likely.
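The classical formula and the complement rule can be checked by brute-force enumeration. The sketch below assumes two fair six-sided dice; the helper name `classical_probability` is illustrative, not part of any library.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair six-sided dice: all ordered pairs (i, j).
omega = list(product(range(1, 7), repeat=2))


def classical_probability(event):
    """P(A) = |A| / |Omega|, valid only when outcomes are equally likely."""
    favourable = [outcome for outcome in omega if event(outcome)]
    return Fraction(len(favourable), len(omega))


p_sum_7 = classical_probability(lambda o: o[0] + o[1] == 7)
p_sum_12 = classical_probability(lambda o: o[0] + o[1] == 12)

# Complement rule: P(not A) = 1 - P(A).
p_not_7 = 1 - p_sum_7

print(len(omega), p_sum_7, p_sum_12, p_not_7)  # 36 1/6 1/36 5/6
```

Exact fractions are used instead of floats so that the results match the hand calculation digit for digit.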
We roll two dice. How many elementary outcomes are in $\Omega$?
Frequentist vs Bayesian Probability
Kolmogorov's axioms define the **rules** for probability. But what does it **mean**? Physicists, statisticians, and engineers have disagreed on this for 300 years.
- Frequentist: probability = limiting frequency over many repetitions, $P(A) = \lim_{n \to \infty} n_A / n$
  - Objective: does not depend on the observer
  - Only works for repeatable experiments
  - Cannot be applied to one-off events
  - Examples: estimating CTR, classifier accuracy, Monte Carlo
- Bayesian: probability = degree of belief of an observer, $P(A \mid \text{data})$ - a belief that updates with evidence
  - Subjective: different priors give different answers
  - Works for one-off events
  - Allows incorporating expert knowledge
  - Examples: spam filter, medical diagnosis, A/B test with a small sample
Where the views diverge
One question, two answers
**Question**: «Probability of rain tomorrow = 0.7»

**Frequentist**: «Probability cannot be assigned to a specific tomorrow - it is a one-off event, not a series of experiments. I can say: in 70% of similar meteorological situations, it rained.»

**Bayesian**: «0.7 is my current degree of belief given current observations. Show me new data and I will update to P(rain | new data).»

**Question**: «Probability that Argentina wins the 2026 World Cup»

**Frequentist**: «The question is ill-formed - we cannot repeat this World Cup a million times.»

**Bayesian**: «0.15 - my belief based on FIFA rankings, squad form, and the draw.»
**In ML both views coexist:**

- Classifier accuracy on a test set - frequentist probability
- Posterior $P(\theta \mid \text{data})$ in Bayesian learning - Bayesian
- Naive Bayes spam filter - Bayesian in name, but its parameters are often fit by frequentist counting
- Dropout kept active at inference (Monte Carlo dropout) as an approximation to Bayesian inference - a Bayesian interpretation of a frequentist method
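The two readings can be put side by side on the same coin-flip data. The sketch below is illustrative only: the data (7 heads in 10 flips), the Beta prior, and the function names are assumptions introduced here, not taken from the text. It relies on the standard fact that a Beta prior updated with heads/tails counts yields a Beta posterior.

```python
from fractions import Fraction


def frequentist_estimate(heads, flips):
    """Relative frequency n_A / n: the finite-sample stand-in for the limit."""
    return Fraction(heads, flips)


def beta_posterior(heads, flips, prior_a=1, prior_b=1):
    """Update a Beta(prior_a, prior_b) belief about P(heads) with flip counts.

    Returns the posterior parameters (a, b) and the posterior mean a / (a + b).
    """
    a = prior_a + heads
    b = prior_b + (flips - heads)
    return a, b, Fraction(a, a + b)


# Hypothetical data: 7 heads in 10 flips.
freq = frequentist_estimate(7, 10)
_, _, mean_uniform = beta_posterior(7, 10)          # uniform prior Beta(1, 1)
_, _, mean_skeptic = beta_posterior(7, 10, 50, 50)  # strong prior belief in a fair coin

print(freq, mean_uniform, mean_skeptic)  # 7/10 2/3 57/110
```

The frequentist number depends only on the data, while the two Bayesian answers differ because the priors differ, which is the «different priors give different answers» point in concrete form.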
A coin was flipped 1000 times; heads appeared 473 times. What is the correct interpretation?
Conditional Probability and Independence
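Conditional probability $P(A \mid B)$ is the probability of event $A$ given that $B$ has already occurred: $P(A \mid B) = P(A \cap B) / P(B)$, defined when $P(B) > 0$. It is an update of information: the sample space shrinks from $\Omega$ to $B$.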
Law of total probability
The basic tool of Bayesian reasoning
If $B_1, B_2, \ldots, B_n$ partition $\Omega$ (disjoint, cover everything), then: $$P(A) = \sum_i P(A \mid B_i) \cdot P(B_i)$$

**Example**: a spam filter. $B_1$ = email is spam, $B_2$ = not spam.

- $P(\text{word 'credit'} \mid B_1) = 0.30$, $P(B_1) = 0.20$
- $P(\text{word 'credit'} \mid B_2) = 0.02$, $P(B_2) = 0.80$

$P(\text{word 'credit'}) = 0.30 \cdot 0.20 + 0.02 \cdot 0.80 = 0.076$
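The spam example can be reproduced in a few lines. This is a minimal sketch: the function name `total_probability` is an assumption for illustration, and the last step (dividing by $P(\text{word 'credit'})$ to get the probability of spam given the word) is Bayes' theorem applied to the same numbers.

```python
import math


def total_probability(likelihoods, priors):
    """P(A) = sum_i P(A | B_i) * P(B_i) for a partition B_1..B_n."""
    if not math.isclose(sum(priors), 1.0):
        raise ValueError("priors must sum to 1: the B_i have to partition Omega")
    return sum(l * p for l, p in zip(likelihoods, priors))


# Numbers from the spam example: B_1 = spam, B_2 = not spam.
p_credit = total_probability(likelihoods=[0.30, 0.02], priors=[0.20, 0.80])

# Bayes' theorem reuses P('credit') as the denominator.
p_spam_given_credit = 0.30 * 0.20 / p_credit

print(round(p_credit, 3), round(p_spam_given_credit, 3))  # 0.076 0.789
```

Although only 20% of email is spam in this example, seeing the word 'credit' raises the probability of spam to roughly 79%.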
**Independence**: events $A$ and $B$ are independent if $P(A \mid B) = P(A)$, meaning knowledge of $B$ does not change the probability of $A$. Equivalently: $P(A \cap B) = P(A) \cdot P(B)$.
**Naive Bayes** - one of the simplest and fastest spam filters - makes one strong assumption: given the class (spam or not spam), the words in an email are independent of each other (they are not, but it works remarkably well). $P(\text{spam} \mid w_1, w_2, \ldots) \propto P(\text{spam}) \prod_i P(w_i \mid \text{spam})$. Each score is a handful of multiplications, which is why it can be computed in microseconds, billions of times per day.
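The proportionality above turns into an actual probability once the scores for both classes are normalised. The sketch below assumes a tiny hand-written table of word likelihoods; `P_WORD`, `PRIOR`, and the numbers in them are hypothetical and not estimated from any real corpus.

```python
import math

# Hypothetical per-word likelihoods P(word | class); not from any real corpus.
P_WORD = {
    "spam": {"credit": 0.30, "free": 0.25, "meeting": 0.01},
    "ham": {"credit": 0.02, "free": 0.05, "meeting": 0.20},
}
PRIOR = {"spam": 0.20, "ham": 0.80}


def naive_bayes_posterior(words):
    """Return P(class | words) under the naive conditional-independence assumption."""
    # Work in log space so that long products of small numbers do not underflow.
    log_scores = {
        c: math.log(PRIOR[c]) + sum(math.log(P_WORD[c][w]) for w in words)
        for c in PRIOR
    }
    # Normalise: this turns the proportionality sign into an equality.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


posterior = naive_bayes_posterior(["credit", "free"])
print(posterior)
```

A word missing from the table raises a `KeyError` here; practical implementations usually add smoothing so that unseen words get a small nonzero probability, which this sketch leaves out.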
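**The base rate fallacy**: a test with 99% accuracy sounds excellent. But if the disease is rare (1% prevalence) and the test has 99% sensitivity and 99% specificity, a positive result means actual disease in only 50% of cases: $0.99 \cdot 0.01 / (0.99 \cdot 0.01 + 0.01 \cdot 0.99) = 0.5$. With a 5% false-positive rate, the figure drops to about 17%. This is not an error - it is conditional probability. That is why medical screenings require a confirmatory test.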
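The effect is easy to verify numerically. The sketch below assumes that «accuracy» is split into sensitivity (true-positive rate) and specificity (true-negative rate), which is an interpretation, since the text gives a single number; the function name is illustrative.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# 1% prevalence, 99% sensitivity, and two assumed specificities.
ppv_99 = positive_predictive_value(0.01, 0.99, 0.99)
ppv_95 = positive_predictive_value(0.01, 0.99, 0.95)

print(round(ppv_99, 3), round(ppv_95, 3))  # 0.5 0.167
```

With a rare condition, the healthy 99% of the population generates about as many positives as the sick 1%, so the specificity dominates what a positive result is worth.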
The weather app says $P(\text{rain}) = 0.4$. What is the probability of no rain?
Summary
- $\Omega$ - sample space, events $A \subseteq \Omega$ are subsets of interest
- Kolmogorov's three axioms: non-negativity, normalization, additivity - all of probability from these
- Complement: $P(\bar{A}) = 1 - P(A)$ - the most used trick in the field
- Classical probability: $P(A) = |A| / |\Omega|$ for equally likely outcomes
- Frequentist vs Bayesian: objective limiting frequency vs subjective degree of belief
- Conditional probability: $P(A \mid B) = P(A \cap B) / P(B)$ - information update
What this unlocks
Three axioms and what they build:
- Combinatorics — Counting outcomes systematically - needed for anything beyond 36 dice pairs
- Conditional probability and Bayes' theorem — Update beliefs with evidence - spam filter, medical test, Bayesian A/B test
- Random variables — Probability with numbers: expectation, variance - the workhorses of statistics
Questions for reflection
- Why does intuition about randomness fail so often? Hint: we think in stories, not in sample spaces.
- «30% chance of rain» - what does it mean operationally for a frequentist and for a Bayesian?
- At what temperature does an LLM become deterministic? At what temperature does it become random noise?