
MLE: Why PyTorch's Cross-Entropy Loss is Fisher's 1922 Formula

PyTorch's cross-entropy loss is the negative log-likelihood from Fisher's 1922 method of maximum likelihood. Every time you call optimizer.step() on such a loss (or model.fit() in Keras-style APIs), you are running maximum likelihood estimation - whether you know it or not.

PyTorch cross-entropy loss = MLE for categorical distribution in disguise
Econometrics: MLE for demand estimation in pricing models
Survival analysis: Cox proportional hazards estimated via partial likelihood
scikit-learn GLMs: LogisticRegression and PoissonRegressor are fit by maximizing a likelihood (penalized by default)
EM algorithm: MLE for latent variable models (GMMs, HMMs)
Bayesian MAP: MLE with a prior - regularized maximum likelihood

Prerequisites

(no prerequisites)

Estimation: The $1.5B Mistake of the Hubble Telescope


**2022. DeepMind and EMBL-EBI expand the AlphaFold database to more than 200 million predicted protein structures**, covering nearly every protein catalogued in UniProt. Fifty years earlier, biologists knew the structure of only a handful of proteins and spent years on X-ray crystallography to solve a single one. How does AlphaFold2 train? Alongside its structural loss, the network is trained with auxiliary cross-entropy terms, for example on binned distances between pairs of amino acids. **Cross-entropy loss = negative log-likelihood = exactly what Ronald Fisher described in 1922 as the "method of maximum likelihood".** Each backpropagation step through those terms is an iteration of an idea that is 100 years old. Fisher just did not know about GPUs and transformers.

**What this lesson actually teaches**: not "how to differentiate a logarithm", but why **MLE is the unified theory behind all of ML**. Logistic regression, neural networks, language models, Gaussian mixtures - all are trained by maximizing likelihood. After 35 minutes it will be clear why cross-entropy loss and negative log-likelihood are the same thing, and why MLE asymptotically attains the lowest variance an unbiased estimator can have (the Cramér-Rao bound).

Likelihood: Flipping the Question

There is probability and there is likelihood. They sound similar, but face opposite directions. **Probability** looks forward: given a distribution, what data should be expected? **Likelihood** looks backward: here is the data - what distribution generated it?

| | Probability P(X\|θ) | Likelihood L(θ\|X) |
|---|---|---|
| What is fixed | Parameter θ | Data X |
| What varies | Data X | Parameter θ |
| Question | What data to expect given θ? | For which θ is the data most probable? |
| Mathematically | Function of X | Function of θ |
| Example | P(7 heads out of 10 \| p=0.5) = ? | L(p \| 7 heads out of 10) = ? |

A Coin: 7 Heads Out of 10 Flips

Intuition for likelihood

Observation: 7 heads out of 10 flips. MLE question: which p (probability of heads) is most likely?

Likelihood function: L(p) = C(10,7) * p^7 * (1-p)^3

L(0.5) = 120 * 0.5^10 ≈ 0.117
L(0.7) = 120 * 0.7^7 * 0.3^3 ≈ 0.267 <- maximum!
L(0.9) = 120 * 0.9^7 * 0.1^3 ≈ 0.057

MLE answer: p_hat = 7/10 = 0.7

Intuition: "choose the parameter under which the observed data would have been most probable". That is the entire idea of MLE.
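The three likelihood values above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `coin_likelihood` and the grid of candidate values are illustrative choices, not part of the lesson.

```python
from math import comb

def coin_likelihood(p, heads=7, flips=10):
    """Binomial likelihood L(p) = C(flips, heads) * p^heads * (1-p)^(flips-heads)."""
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)

# Evaluate L(p) on a grid of candidates p = 0.01, 0.02, ..., 0.99
# and keep the one under which the observed data are most probable.
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=coin_likelihood)

for p in (0.5, 0.7, 0.9):
    print(f"L({p}) = {coin_likelihood(p):.3f}")   # 0.117, 0.267, 0.057
print("grid argmax:", p_hat)                      # 0.7
```

A grid search only works for one or two parameters; it is used here because it makes the "try every p and keep the best" idea explicit.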

**The key reversal**: probability is a function of data at fixed θ. Likelihood is a function of θ at fixed data. The same formula, a different perspective. MLE finds the maximum over θ.

What is the key difference between probability P(X|θ) and likelihood L(θ|X)?

Log-Likelihood: Product into Sum

For n independent observations, likelihood is the **product** of individual probabilities. Products are hard to optimize: numerically unstable (underflow at large n), difficult to differentiate. The logarithm solves both problems: it turns a product into a sum.
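The underflow problem is easy to see in practice. The sketch below assumes a hypothetical sample of 2000 observations, each assigned probability 0.5 by the model: the raw product collapses to zero in double precision, while the sum of logarithms stays finite.

```python
import math

# Hypothetical sample: 2000 observations, each with probability 0.5 under the model.
probs = [0.5] * 2000

likelihood = math.prod(probs)                      # 0.5**2000 is below the smallest double
log_likelihood = sum(math.log(q) for q in probs)   # 2000 * ln(0.5), perfectly representable

print(likelihood)       # 0.0
print(log_likelihood)   # about -1386.29
```

Once the product has underflowed to 0.0, every candidate parameter looks equally bad, so no optimizer can make progress. The log-likelihood does not have this problem.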

For an i.i.d. sample X1, ..., Xn from distribution f(x|θ):

L(θ) = f(X1|θ) * f(X2|θ) * ... * f(Xn|θ) = ∏ f(Xi|θ)

Log-likelihood (the logarithm is monotone -> maximum at the same point):

ℓ(θ) = ln L(θ) = Σ ln f(Xi|θ)

MLE algorithm:
1. Write L(θ) = ∏ f(Xi|θ)
2. Take the logarithm: ℓ(θ) = Σ ln f(Xi|θ)
3. Differentiate and set to zero: dℓ/dθ = 0 (score equation)
4. Solve, verify it is a maximum (d²ℓ/dθ² < 0)

If no closed-form solution exists - gradient ascent on ℓ(θ). This is what SGD does in neural networks when the loss is a negative log-likelihood: descent on -ℓ(θ) is ascent on ℓ(θ).
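The gradient-ascent fallback can be sketched on the coin example, where the closed-form answer k/n is known and can serve as a check. The step size, starting point, and iteration count below are arbitrary illustrative choices.

```python
def score(p, k, n):
    """Derivative of the mean Bernoulli log-likelihood: (k/p - (n-k)/(1-p)) / n."""
    return (k / p - (n - k) / (1 - p)) / n

def mle_by_gradient_ascent(k, n, p=0.5, lr=0.05, steps=500):
    """Climb the log-likelihood from the starting guess p."""
    for _ in range(steps):
        p += lr * score(p, k, n)   # ascent: step along the gradient of ℓ
    return p

p_hat = mle_by_gradient_ascent(k=7, n=10)
print(round(p_hat, 6))   # 0.7, matching the closed form k/n
```

Flipping the sign of both the objective and the update gives gradient descent on the negative log-likelihood, which is the form used in deep learning frameworks.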

Why work with the log-likelihood ℓ(θ) = ln L(θ) instead of L(θ) itself?

Three Canonical Examples

X1,...,Xn ~ N(μ, σ²). Find μ̂ and σ̂².

ℓ(μ,σ²) = -n/2*ln(2πσ²) - 1/(2σ²)*Σ(Xi-μ)²

With respect to μ: ∂ℓ/∂μ = 1/σ²*Σ(Xi-μ) = 0 -> μ̂_MLE = X̄

With respect to σ²: ∂ℓ/∂σ² = -n/(2σ²) + Σ(Xi-μ)²/(2σ⁴) = 0 -> σ̂²_MLE = (1/n)*Σ(Xi-X̄)² <- divides by n, not n-1!

MLE variance is biased: E[σ̂²_MLE] = (n-1)/n * σ² (previous lesson!). MLE property: asymptotically unbiased (bias -> 0 as n -> infinity).
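A quick numerical check of the Gaussian result, using a small made-up sample. The sketch compares the divide-by-n estimate with the familiar divide-by-(n-1) one and confirms that the former gives the higher log-likelihood.

```python
import math
import statistics

def gaussian_loglik(xs, mu, var):
    """ℓ(μ, σ²) = -n/2 * ln(2πσ²) - Σ(x - μ)² / (2σ²)."""
    n = len(xs)
    return -n / 2 * math.log(2 * math.pi * var) - sum((x - mu) ** 2 for x in xs) / (2 * var)

xs = [2.1, 1.9, 2.4, 2.0, 1.6]                           # hypothetical sample
mu_hat = statistics.fmean(xs)                            # MLE for μ: the sample mean
var_mle = sum((x - mu_hat) ** 2 for x in xs) / len(xs)   # MLE for σ²: divide by n
var_unbiased = statistics.variance(xs)                   # unbiased estimate: divide by n-1

print(mu_hat, var_mle, var_unbiased)
print(gaussian_loglik(xs, mu_hat, var_mle) > gaussian_loglik(xs, mu_hat, var_unbiased))  # True
```

The unbiased estimate is a perfectly good estimator, but it is not the maximizer of the likelihood.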

X1,...,Xn ~ Bernoulli(p). Observe k ones out of n.

L(p) = p^k * (1-p)^(n-k)
ℓ(p) = k*ln(p) + (n-k)*ln(1-p)
dℓ/dp = k/p - (n-k)/(1-p) = 0 -> k(1-p) = (n-k)p -> k - kp = np - kp -> p̂_MLE = k/n (fraction of ones in the sample)

Logistic regression maximizes this same likelihood, but with p_i = σ(w^T x_i). Binary cross-entropy:

BCE = -1/n*Σ[yi*ln(p̂i) + (1-yi)*ln(1-p̂i)] = -1/n*ℓ(w) <- this is negative log-likelihood!

Minimize BCE = Maximize ℓ(w). Minimizing PyTorch's BCELoss is MLE.
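The identity BCE = -ℓ/n can be verified directly. The sketch below uses hypothetical labels and predicted probabilities, computes the likelihood as a plain product, and compares its negative mean logarithm with the BCE formula above. It is written in plain Python rather than PyTorch so that it runs without extra dependencies.

```python
import math

def bernoulli_likelihood(ys, ps):
    """Product of p_i^y_i * (1 - p_i)^(1 - y_i) over the sample."""
    return math.prod(p if y == 1 else 1 - p for y, p in zip(ys, ps))

def bce(ys, ps):
    """Mean binary cross-entropy, as in the formula above."""
    terms = (y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(ys, ps))
    return -sum(terms) / len(ys)

ys = [1, 0, 1, 1]              # hypothetical labels
ps = [0.9, 0.2, 0.6, 0.7]      # hypothetical predicted probabilities

nll_per_sample = -math.log(bernoulli_likelihood(ys, ps)) / len(ys)
print(math.isclose(bce(ys, ps), nll_per_sample))   # True

# Among constant predictions, BCE is smallest at the fraction of ones k/n = 3/4.
grid = [i / 100 for i in range(1, 100)]
best_constant = min(grid, key=lambda p: bce(ys, [p] * len(ys)))
print(best_constant)   # 0.75
```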

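X1,...,Xn ~ Poisson(λ). Counts of clicks, calls, defects.

f(k|λ) = λ^k * e^(-λ) / k!
ℓ(λ) = (Σ Xi)*ln(λ) - n*λ - Σ ln(Xi!)
dℓ/dλ = (Σ Xi)/λ - n = 0 -> λ̂_MLE = (Σ Xi)/n = X̄

Clicks per hour, requests per second, defects per batch: wherever a model for count data is needed, minimizing PyTorch's PoissonNLLLoss maximizes this likelihood (by default it expects the log of the rate as input and drops the Σ ln(Xi!) term, which does not depend on λ).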
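As a sanity check on the Poisson result, the log-likelihood can be evaluated at the sample mean and at nearby values. The counts below are made up for illustration.

```python
import math

def poisson_loglik(lam, xs):
    """ℓ(λ) = (Σ x_i) ln λ - n λ - Σ ln(x_i!), using lgamma(x + 1) = ln(x!)."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

xs = [3, 0, 2, 4, 1]             # hypothetical counts
lam_hat = sum(xs) / len(xs)      # closed-form MLE: the sample mean, here 2.0

for lam in (lam_hat - 0.1, lam_hat, lam_hat + 0.1):
    print(f"loglik({lam:.1f}) = {poisson_loglik(lam, xs):.4f}")
```

The middle value is the largest of the three, as the score equation predicts.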

**The unified theory behind ML**: Gaussian NLL with fixed variance -> MSE loss (regression). Bernoulli NLL -> Binary Cross-Entropy (classification). Categorical NLL -> Cross-Entropy (softmax). Laplace NLL with fixed scale -> MAE loss (robust regression). Every time a loss function is chosen, a distributional assumption about noise is implicitly chosen as well.
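The "Categorical NLL -> Cross-Entropy (softmax)" line can be checked the same way. In this sketch, the cross-entropy against a one-hot target and the negative log-probability of the true class are computed separately and then compared. The logits are arbitrary illustrative numbers.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy between a one-hot target and the softmax output."""
    probs = softmax(logits)
    one_hot = [1.0 if j == target else 0.0 for j in range(len(logits))]
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def categorical_nll(logits, target):
    """Negative log-likelihood of the observed class under the categorical model."""
    return -math.log(softmax(logits)[target])

logits, target = [2.0, -1.0, 0.5], 0   # hypothetical model output and true class
print(math.isclose(cross_entropy(logits, target), categorical_nll(logits, target)))  # True
```

With a one-hot target, every term of the cross-entropy sum vanishes except the one for the true class, which is why the two quantities coincide.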

Which neural-network loss is equivalent to maximizing the Bernoulli likelihood (binary classification)?

Properties of MLE: Why It Works So Well

MLE has a set of properties that make it the default choice for parameter estimation. The first three are asymptotic - they hold under regularity conditions for sufficiently large n - but in most practical tasks n is large enough. Invariance holds for any n.

| Property | Meaning | Practical consequence |
|---|---|---|
| Consistency | θ̂_MLE ->_P θ as n->∞ | Given enough data, MLE converges to the true parameter |
| Asymptotic normality | √n(θ̂-θ) ->_d N(0, I⁻¹(θ)) | Confidence intervals follow from closed-form formulas |
| Asymptotic efficiency | Asymptotically attains the Cramér-Rao bound | For large n, no unbiased estimator has lower variance |
| Invariance | g(θ̂) is the MLE for g(θ) | MLE for σ² gives the MLE for σ; MLE for μ gives the MLE for e^μ |
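Consistency and the variance predicted by asymptotic normality can be checked by simulation. The sketch below repeats a Bernoulli experiment many times and compares the spread of p̂ = k/n with I⁻¹(p)/n, where I(p) = 1/(p(1-p)) is the Fisher information of one observation. The true parameter, sample size, and number of repetitions are illustrative assumptions.

```python
import random
import statistics

def simulate_p_hat(p_true, n, reps, seed=0):
    """Return `reps` independent Bernoulli MLEs p_hat = k/n, each from n draws."""
    rng = random.Random(seed)
    return [sum(rng.random() < p_true for _ in range(n)) / n for _ in range(reps)]

p_true, n, reps = 0.3, 400, 2000              # illustrative values
estimates = simulate_p_hat(p_true, n, reps)

empirical_var = statistics.pvariance(estimates)
asymptotic_var = p_true * (1 - p_true) / n    # I(p)^-1 / n with I(p) = 1 / (p(1-p))

print("mean of p_hat: ", statistics.fmean(estimates))
print("empirical var: ", empirical_var)
print("asymptotic var:", asymptotic_var)
```

For the Bernoulli model the variance formula happens to be exact at every n. For most other models it is only a large-sample approximation.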

Invariance: The Most Useful Property

No need to solve the problem from scratch

If θ̂_MLE = X̄ for μ of the normal distribution and σ̂ is the MLE for σ, then:

MLE for e^μ = e^(X̄) (no new computation!)
MLE for 1/σ = 1/σ̂ (same principle)
MLE for μ/σ = X̄/σ̂ (standardized mean)

Example from ML: if logit_hat is the MLE for logit(p) = log(p/(1-p)), then the MLE for p is sigmoid(logit_hat). This is why logistic regression, which fits the logit, gives the MLE for probabilities directly.
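Invariance can be seen numerically in the logit example. The sketch below maximizes the coin likelihood in the logit parameter z instead of p, then maps the result back through the sigmoid. The function `fit_logit` and its settings are illustrative, not a library API.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_logit(k, n, z=0.0, lr=1.0, steps=500):
    """Gradient ascent on the mean log-likelihood in the logit parameter z.

    ℓ(z)/n = (k/n) * z - ln(1 + e^z), so the gradient is k/n - sigmoid(z).
    """
    for _ in range(steps):
        z += lr * (k / n - sigmoid(z))
    return z

z_hat = fit_logit(k=7, n=10)
print(round(z_hat, 6), round(math.log(0.7 / 0.3), 6))   # both are logit(0.7)
print(round(sigmoid(z_hat), 6))                         # 0.7, i.e. k/n
```

Optimizing in z and transforming back lands on the same p̂ = k/n as optimizing in p directly, which is what invariance promises.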

Which property of MLE lets us get the MLE for any function g(μ) once we have the MLE for μ - without new optimization?

Where MLE Lives Right Now

**Final frame**: when PyTorch runs loss.backward() with cross-entropy loss, it computes the gradient of the negative log-likelihood by automatic differentiation, and optimizer.step() then moves the parameters toward higher likelihood. This is Fisher's 1922 idea, executed numerically on a GPU. Gradient descent on a negative log-likelihood is numerical MLE. Choosing a loss function is choosing a distribution. Understanding this connection shifts the intuition about ML: not "we minimize error", but "we maximize the likelihood of the data under the model".
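What the backward pass computes for a softmax cross-entropy loss can be written down by hand: the gradient with respect to the logits is softmax(z) minus the one-hot target. The sketch below checks this formula against finite differences in plain Python. It illustrates the quantity involved and is not a description of PyTorch internals.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits, target):
    """Cross-entropy loss for one example: -ln softmax(logits)[target]."""
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    """Gradient of the loss w.r.t. the logits: softmax(z) - one_hot(target)."""
    return [p - (1.0 if j == target else 0.0) for j, p in enumerate(softmax(logits))]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for j in range(len(logits)):
        up, down = list(logits), list(logits)
        up[j] += eps
        down[j] -= eps
        grad.append((nll(up, target) - nll(down, target)) / (2 * eps))
    return grad

logits, target = [1.0, -0.5, 0.3], 2   # hypothetical logits and true class
g_analytic = analytic_grad(logits, target)
g_numeric = numeric_grad(logits, target)
print(all(abs(a - b) < 1e-6 for a, b in zip(g_analytic, g_numeric)))   # True
```

The gradient is the negative score of the categorical log-likelihood: it pushes probability toward the observed class and away from the others.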

From a statistical viewpoint, what does `loss.backward()` in PyTorch do when the loss is cross-entropy?

Practice: MLE for the Gaussian from Scratch

For a normal sample X₁,...,Xₙ ~ N(μ, σ²), what is the MLE for σ²?

Key Takeaways

**Likelihood L(θ|X)** is the probability of data as a function of the parameter: data are fixed, θ varies
**MLE = argmax L(θ)**: choose the parameter under which the observed data are most probable
**Algorithm**: L(θ) -> log L(θ) = Σ ln f(Xi|θ) -> dℓ/dθ = 0 -> verify it is a maximum
**Cross-entropy = NLL**: minimizing BCE in PyTorch = maximizing Bernoulli log-likelihood = MLE
**Properties**: consistent, asymptotically normal, asymptotically efficient, invariant
**Unified theory**: regression (MSE = Gaussian NLL), classification (BCE = Bernoulli NLL), LLM (CE = Categorical NLL) - MLE everywhere

What's Next

MLE gives a point estimate. Next - how to express uncertainty.

Confidence Intervals — Asymptotic normality of MLE directly yields confidence intervals via Fisher information
Bayesian Approach — MLE = MAP with a uniform prior. Bayesian inference adds a prior - a step from MLE to the posterior
EM Algorithm — MLE for models with latent variables (GMM, HMM) via iterative E and M steps
Logistic Regression — MLE for Bernoulli with parameter p = σ(w^T x) - the baseline classifier in ML
