Probability Theory

Conditional Expectation

Netflix paid one million dollars for a recommendation algorithm - and at the heart of the finalist solution is one formula: E[rating|user_profile]. The Kalman filter in SpaceX navigation computes E[position|sensor_measurements] every 10 milliseconds. Gradient Boosting, winner of the Criteo CTR Challenge 2014, builds trees for E[residual|features]. Conditional expectation is not an abstract textbook concept - it is a working tool generating billions of dollars.

Netflix Recommender: E[rating|user_profile] for every user-movie pair
Kalman filter: E[state|observations] for navigation and control
GBDT: each tree = E[residual|features], Criteo CTR dataset
Bayesian statistics: E[theta|data] as a point estimate of a parameter

Предварительные знания

Conditional probability P(A|B) and conditional distributions
Expectation E[X] for discrete and continuous random variables
Joint distributions, marginalization

E[X|Y] as a Random Variable

E[X|Y=y] is a number, E[X|Y] is a random variable

The distinction is fundamental. Let X be a movie rating (1-5) and Y the genre. Then E[X|Y=drama] = 3.8 - a specific number for the genre 'drama'. But E[X|Y] is a random variable: it takes the value 3.8 when genre is drama, 4.1 when genre is thriller, 2.9 when genre is horror. This quantity lives on the same probability space as Y.

X = exam score, Y = department (1=math, 2=humanities) P(Y=1) = 0.4, P(Y=2) = 0.6 P(X=5|Y=1) = 0.6, P(X=3|Y=1) = 0.4 P(X=5|Y=2) = 0.3, P(X=3|Y=2) = 0.7 E[X|Y=1] = 5*0.6 + 3*0.4 = 3.0 + 1.2 = 4.2 E[X|Y=2] = 5*0.3 + 3*0.7 = 1.5 + 2.1 = 3.6 E[X|Y] as a random variable: takes value 4.2 with probability 0.4 takes value 3.6 with probability 0.6

Continuous case

For continuous X and Y the joint density $f_{X,Y}(x,y)$ yields the conditional density $f_{X|Y}(x|y) = f_{X,Y}(x,y) / f_Y(y)$. Then $E[X|Y=y] = \int x \cdot f_{X|Y}(x|y) \, dx$. If X and Y are jointly normal with correlation $\rho$, then $E[X|Y=y] = \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y)$ - a linear function of y.

For a jointly normal vector $(X, Y)$ with zero means: $E[X|Y=y] = \frac{Cov(X,Y)}{Var(Y)} \cdot y$. The coefficient $Cov(X,Y)/Var(Y)$ is precisely the slope in the linear regression of X on Y.

Numpy: computing E[X|Y] via simulation

X is server response time (ms), Y is requests per second. E[X|Y=100] = 450. What does E[X|Y] mean as a random variable?

Tower Property

Tomorrow's weather forecast depends only on today's conditions - not on conditions from a month ago. Formally: if a conditional expectation is taken with respect to two variables Y and Z, then averaging again over Y erases the extra information Z.

Three-step proof

Numerical example

X = revenue (thousand USD), Y = day (weekday/weekend), Z = weather (sun/rain) P(Y=weekday) = 5/7, P(Y=weekend) = 2/7 P(Z=sun|Y=weekday) = 0.7, P(Z=rain|Y=weekday) = 0.3 P(Z=sun|Y=weekend) = 0.6, P(Z=rain|Y=weekend) = 0.4 E[X|Y=weekday, Z=sun] = 80 E[X|Y=weekday, Z=rain] = 50 E[X|Y=weekend, Z=sun] = 150 E[X|Y=weekend, Z=rain] = 90 E[X|Y=weekday] = 80*0.7 + 50*0.3 = 56 + 15 = 71 E[X|Y=weekend] = 150*0.6 + 90*0.4 = 90 + 36 = 126 E[X] = 71*(5/7) + 126*(2/7) = 50.7 + 36.0 = 86.7 thousand USD

**Hierarchical models in Bayesian ML** rely on the tower property constantly. If theta ~ Prior and X|theta ~ Likelihood, then E[X] = E[E[X|theta]] = E[mu(theta)] - the mean over parameters of conditional means. Stan and PyMC3 compute such hierarchical expectations via MCMC.

Empirical verification in numpy

A visitor lands on a website. P(purchase) is unknown. Given: if the visitor saw the ad Y=1, then E[purchase|Y=1]=0.15; if not Y=0, then E[purchase|Y=0]=0.04. P(Y=1)=0.3. What is E[purchase]?

Law of Total Variance

An A/B test is running: variant A converts at 5%, variant B at 8%. Some users see A, others see B. The total variance of conversion consists of two parts: variance within each group (within-group) plus variance between group means (between-group). This is the law of total variance.

Proof via decomposition

A/B test: numerical example

X = conversion (0 or 1), Y = variant (A or B) P(Y=A) = 0.5, P(Y=B) = 0.5 p_A = E[X|Y=A] = 0.05 (conversion rate in group A) p_B = E[X|Y=B] = 0.08 (conversion rate in group B) Var(X|Y=A) = 0.05*0.95 = 0.0475 (Bernoulli) Var(X|Y=B) = 0.08*0.92 = 0.0736 Within-group: E[Var(X|Y)] = 0.0475*0.5 + 0.0736*0.5 = 0.0606 E[X] = 0.05*0.5 + 0.08*0.5 = 0.065 E[X|Y] takes values {0.05, 0.08} with prob {0.5, 0.5} Var(E[X|Y]) = (0.05-0.065)^2*0.5 + (0.08-0.065)^2*0.5 = 0.000225*0.5 + 0.000225*0.5 = 0.000225 Between-group: Var(E[X|Y]) = 0.000225 Var(X) = 0.0606 + 0.000225 = 0.0608 Check: E[X]*(1-E[X]) = 0.065*0.935 = 0.0608 Between / Total = 0.000225 / 0.0608 = 0.37% - treatment effect is tiny Within / Total = 99.63% - Bernoulli noise dominates

This is the probabilistic foundation of **ANOVA** (Analysis of Variance): SS_total = SS_within + SS_between. In sklearn: the R^2 metric equals 1 - Var(residuals)/Var(y) = Between/(Between+Within) under perfect grouping.

Var(X) = 100. After observing Y it turns out Var(E[X|Y]) = 60. What is E[Var(X|Y)]? What does this mean?

Bayes Optimal Predictor and ML Applications

**Gradient boosting on the Criteo dataset (11 billion rows, CTR prediction) won Kaggle 2014.** Each tree in GBDT computes E[residual|features] - the conditional expectation of residuals. This is not a coincidence: theory guarantees that E[Y|X=x] is the best predictor of Y from X in the MSE sense.

Proof of optimality

Gradient Boosting as iterative computation of E[X|Y]

GradientBoostingRegressor from scikit-learn builds at each step a tree predicting the negative gradient of MSE loss. Under MSE loss the gradient equals -(y - f(x)), i.e. the residual. Each tree approximates E[residual|features] - the conditional expectation of the residual given features. By summing trees the model converges toward E[Y|X].

**Irreducible error** in this example is the noise variance noise=30, i.e. 30^2 = 900. The achieved MSE=929 is close to the theoretical minimum E[Var(Y|X)] = 900. GBDT has nearly reached the Bayes optimum.

Interview: conditional expectation in ML interviews

**Q: A neural network is trained with MSE loss. What does the prediction approximate at convergence?** - At the global MSE minimum the network approximates E[Y|X=x]. This is the Bayes optimal predictor - the best achievable by any model. Irreducible error = E[Var(Y|X)] - noise that cannot be removed regardless of model complexity.
**Q: Explain the law of total variance in the context of A/B testing.** - Within-group: random Bernoulli noise in conversion within each variant. Between-group: difference in mean conversion rates between A and B. The t-statistic is related to the between-group component. User stratification reduces within-group noise.
**Q: E[X|Y] and E[Y|X] are different. When do they coincide?** - When X and Y are identically distributed and the joint distribution is symmetric about the diagonal. Example: X=Y gives E[X|Y=y] = y = E[Y|X=y]. In general they are different functions - confusing them is a common mistake in interviews.

Given: model MSE = 150, Var(Y) = 500. What is R^2? What does E[Var(Y|X)] = 100 mean in this context?