MLE: Why PyTorch's Cross-Entropy Loss is Fisher's 1922 Formula
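PyTorch's cross-entropy loss is the negative log-likelihood of a categorical model, so minimizing it is exactly the maximum likelihood estimation (MLE) that Fisher formalized in 1922. Every time you call `model.fit()` (Keras, scikit-learn) or `optimizer.step()` (PyTorch) on a likelihood-based loss in deep learning, you are running maximum likelihood estimation, whether you know it or not.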
- PyTorch cross-entropy loss = MLE for categorical distribution in disguise
- Econometrics: MLE for demand estimation in pricing models
- Survival analysis: Cox proportional hazards estimated via partial likelihood
- Sklearn GLMs: LogisticRegression and PoissonRegressor are both fit by (penalized) maximum likelihood
- EM algorithm: MLE for latent variable models (GMMs, HMMs)
- Bayesian MAP: MLE with a prior - regularized maximum likelihood
Prerequisites
- (no prerequisites)
Likelihood: Flipping the Question
**2022. DeepMind and EMBL-EBI publish AlphaFold2 predictions for more than 200 million proteins**, nearly every protein sequence catalogued at the time. Fifty years earlier, biologists knew the structure of almost no proteins, and solving a single one by X-ray crystallography could take years. How does AlphaFold2 train? Among other terms, its training objective includes cross-entropy losses, for example on binned distances between pairs of residues. **Cross-entropy loss = negative log-likelihood = exactly what Ronald Fisher described in 1922 as the "method of maximum likelihood".** Each backpropagation step on those terms is an iteration of an idea that is 100 years old. Fisher just did not know about GPUs and transformers.
**What this lesson actually teaches**: not "how to differentiate a logarithm", but why **MLE is the unified theory behind much of ML**. Logistic regression, neural networks, language models, Gaussian mixtures: all are trained by maximizing a likelihood. After 35 minutes it will be clear why cross-entropy loss and negative log-likelihood are the same thing, and why, for large samples, no unbiased estimator has lower variance than MLE.
There is probability and there is likelihood. They sound similar, but face opposite directions. **Probability** looks forward: given a distribution, what data should be expected? **Likelihood** looks backward: here is the data - what distribution generated it?
| | Probability P(X \| θ) | Likelihood L(θ \| X) |
|---|---|---|
| What is fixed | Parameter θ | Data X |
| What varies | Data X | Parameter θ |
| Question | What data to expect given θ? | For which θ is the data most probable? |
| Mathematically | Function of X | Function of θ |
| Example | P(7 heads out of 10 \| p = 0.5) = ? | L(p \| 7 heads out of 10) = ? |
A Coin: 7 Heads Out of 10 Flips
Intuition for likelihood
Observation: 7 heads out of 10 flips. MLE question: which p (probability of heads) is most likely?

Likelihood function: L(p) = C(10,7) * p^7 * (1-p)^3

- L(0.5) = 120 * 0.5^10 ≈ 0.117
- L(0.7) = 120 * 0.7^7 * 0.3^3 ≈ 0.267 <- maximum!
- L(0.9) = 120 * 0.9^7 * 0.1^3 ≈ 0.057

MLE answer: p_hat = 7/10 = 0.7

Intuition: "choose the parameter under which the observed data would have been most probable". That is the entire idea of MLE.
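The coin numbers above can be checked with a few lines of Python. This is only a sketch: the function name `binomial_likelihood` and the 0.01 grid step are illustrative choices, and a brute-force grid search stands in for the calculus.

```python
from math import comb

def binomial_likelihood(p, heads=7, flips=10):
    """L(p) = C(flips, heads) * p^heads * (1 - p)^(flips - heads)."""
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)

# Evaluate L(p) on the grid 0.00, 0.01, ..., 1.00 and keep the best candidate.
grid = [i / 100 for i in range(101)]
p_hat = max(grid, key=binomial_likelihood)

for p in (0.5, 0.7, 0.9):
    print(f"L({p}) = {binomial_likelihood(p):.3f}")
print("grid-search MLE:", p_hat)
```

The grid search lands on the same value as the closed-form answer k/n, which is the point: MLE is just "try every θ, keep the one that scores highest".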
**The key reversal**: probability is a function of data at fixed θ. Likelihood is a function of θ at fixed data. The same formula, a different perspective. MLE finds the maximum over θ.
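One way to see that the same formula behaves differently from the two perspectives: summed over the data it is a probability distribution, but integrated over the parameter it is not. A minimal sketch, where the helper `binom` and the midpoint-rule integration with 10,000 slices are assumptions made for illustration:

```python
from math import comb

def binom(k, p, n=10):
    """Binomial formula C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability view: fix p = 0.5, vary the data k = 0..10.
total_over_data = sum(binom(k, 0.5) for k in range(11))

# Likelihood view: fix the data k = 7, vary p over (0, 1).
# The midpoint rule with `steps` slices approximates the integral over p.
steps = 10_000
area_over_p = sum(binom(7, (i + 0.5) / steps) for i in range(steps)) / steps

print(total_over_data)  # sums to 1: a probability distribution over k
print(area_over_p)      # close to 1/11: not a distribution over p
```

This is why L(θ|X) is called a likelihood rather than a probability of θ: it ranks parameter values, it does not distribute probability mass over them.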
What is the key difference between probability P(X|θ) and likelihood L(θ|X)?
Log-Likelihood: Product into Sum
For n independent observations, likelihood is the **product** of individual probabilities. Products are hard to optimize: numerically unstable (underflow at large n), difficult to differentiate. The logarithm solves both problems: it turns a product into a sum.
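The underflow problem is easy to reproduce. The sketch below assumes a hypothetical model that assigns probability 0.5 to each of 2000 observations; the numbers are made up purely to trigger the effect.

```python
import math

# 2000 hypothetical observations, each assigned probability 0.5 by the model.
probs = [0.5] * 2000

likelihood = math.prod(probs)                     # 0.5**2000 underflows to 0.0
log_likelihood = sum(math.log(p) for p in probs)  # 2000 * ln(0.5), a finite number

print(likelihood, log_likelihood)
```

The product is indistinguishable from zero in double precision, so it cannot be compared across parameter values, while the sum of logs stays well within range.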
For an i.i.d. sample X1, ..., Xn from distribution f(x|θ):

L(θ) = f(X1|θ) * f(X2|θ) * ... * f(Xn|θ) = ∏ f(Xi|θ)

Log-likelihood (the logarithm is monotone, so the maximum is at the same point):

ℓ(θ) = ln L(θ) = Σ ln f(Xi|θ)

MLE algorithm:

1. Write L(θ) = ∏ f(Xi|θ)
2. Take the logarithm: ℓ(θ) = Σ ln f(Xi|θ)
3. Differentiate and set to zero: dℓ/dθ = 0 (score equation)
4. Solve, and verify it is a maximum (d²ℓ/dθ² < 0)

If no closed-form solution exists, use gradient ascent on ℓ(θ). This is literally what SGD does in neural networks, with a minus sign: it descends on -ℓ(θ).
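When step 4 has no closed form, the score can be followed numerically. The sketch below runs plain gradient ascent on the Bernoulli log-likelihood for the 7-out-of-10 coin; the starting point, the learning rate of 0.01 and the 200 iterations are assumed values that happen to be stable for this problem, not general recommendations.

```python
def score(p, k=7, n=10):
    """Derivative of the Bernoulli log-likelihood: k/p - (n - k)/(1 - p)."""
    return k / p - (n - k) / (1 - p)

p = 0.5                # arbitrary starting guess
learning_rate = 0.01   # assumed step size; small enough to be stable here
for _ in range(200):
    p += learning_rate * score(p)   # ascent: step along the gradient of the log-likelihood

print(round(p, 4))
```

The iterate settles at the point where the score is zero, which is the same k/n that the score equation gives analytically.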
Why work with the log-likelihood ℓ(θ) = ln L(θ) instead of L(θ) itself?
Three Canonical Examples
X1,...,Xn ~ N(μ, σ²). Find μ̂ and σ̂².

ℓ(μ,σ²) = -n/2*ln(2πσ²) - 1/(2σ²)*Σ(Xi-μ)²

With respect to μ: ∂ℓ/∂μ = 1/σ²*Σ(Xi-μ) = 0 -> μ̂_MLE = X̄

With respect to σ²: ∂ℓ/∂σ² = -n/(2σ²) + Σ(Xi-μ)²/(2σ⁴) = 0 -> σ̂²_MLE = (1/n)*Σ(Xi-X̄)² <- divides by n, not n-1!

The MLE of the variance is biased (previous lesson!): E[σ̂²_MLE] = (n-1)/n * σ². It is asymptotically unbiased, though: the bias -> 0 as n -> infinity.
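A small numerical check of the two Gaussian estimators, on a made-up five-point sample. The variable names are illustrative; the standard library's `statistics.pvariance` (divide by n) and `statistics.variance` (divide by n-1) are used only as a cross-check.

```python
import statistics

sample = [4.1, 5.3, 6.0, 4.7, 5.9]   # made-up measurements
n = len(sample)

mu_hat = sum(sample) / n                                  # MLE of mu: the sample mean
sigma2_mle = sum((x - mu_hat) ** 2 for x in sample) / n   # MLE of sigma^2: divide by n
sigma2_unbiased = sigma2_mle * n / (n - 1)                # Bessel-corrected: divide by n - 1

print(mu_hat, sigma2_mle, statistics.pvariance(sample))
print(sigma2_unbiased, statistics.variance(sample))
```

The MLE is always the smaller of the two variance estimates, by exactly the factor (n-1)/n.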
X1,...,Xn ~ Bernoulli(p). Observe k ones out of n.

L(p) = p^k * (1-p)^(n-k)

ℓ(p) = k*ln(p) + (n-k)*ln(1-p)

dℓ/dp = k/p - (n-k)/(1-p) = 0 -> k(1-p) = (n-k)p -> k - kp = np - kp -> p̂_MLE = k/n (fraction of ones in the sample)

Logistic regression maximizes this same likelihood, but with p_i = σ(w^T x_i). Binary cross-entropy:

BCE = -1/n*Σ[yi*ln(p̂i) + (1-yi)*ln(1-p̂i)] = -1/n*ℓ(w) <- this is negative log-likelihood!

Minimize BCE = Maximize ℓ(w). Minimizing PyTorch's BCELoss is MLE.
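The identity BCE = -ℓ/n can be verified directly. The sketch below uses made-up labels and the simplest possible "model", one shared prediction p for every example; `bce` and `log_likelihood` are illustrative helper names, not library functions.

```python
import math

y = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # made-up labels: k = 7 ones out of n = 10
n, k = len(y), sum(y)

def log_likelihood(p):
    """Bernoulli log-likelihood: k*ln(p) + (n - k)*ln(1 - p)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def bce(p):
    """Mean binary cross-entropy when every example gets the same prediction p."""
    return -sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y) / n

for p in (0.5, 0.7, 0.9):
    # The last two columns agree, and the loss is lowest at p = k/n.
    print(p, round(bce(p), 4), round(-log_likelihood(p) / n, 4))
```

With per-example predictions p_i = σ(w^T x_i) the bookkeeping is the same; only the way p is produced changes.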
X1,...,Xn ~ Poisson(λ). Counts of clicks, calls, defects.

f(k|λ) = λ^k * e^(-λ) / k!

ℓ(λ) = (Σ Xi)*ln(λ) - n*λ - Σ ln(Xi!)

dℓ/dλ = (Σ Xi)/λ - n = 0 -> λ̂_MLE = (Σ Xi)/n = X̄

Clicks per session, requests per second, defects per batch: wherever a model for count data is needed, minimizing PyTorch's PoissonNLLLoss maximizes this likelihood.
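The Poisson result can be checked the same way. In this sketch the counts are invented, and `math.lgamma(k + 1)` is used for ln(k!); the helper name `poisson_log_likelihood` is an illustrative choice.

```python
import math

counts = [3, 0, 2, 5, 1, 4, 2, 3]   # made-up counts per interval
n = len(counts)

def poisson_log_likelihood(lam):
    """Sum over the sample of ln f(Xi | lam), with ln(k!) computed as lgamma(k + 1)."""
    return (sum(counts) * math.log(lam) - n * lam
            - sum(math.lgamma(k + 1) for k in counts))

lam_hat = sum(counts) / n   # closed-form MLE: the sample mean

# Nudging lambda away from the sample mean in either direction lowers the log-likelihood.
for lam in (lam_hat - 0.5, lam_hat, lam_hat + 0.5):
    print(round(lam, 2), round(poisson_log_likelihood(lam), 4))
```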
**The unified theory behind ML**: Gaussian NLL with fixed variance -> MSE loss (regression). Bernoulli NLL -> Binary Cross-Entropy (classification). Categorical NLL -> Cross-Entropy (softmax). Laplace NLL with fixed scale -> MAE loss (robust regression). Every time a loss function is chosen, a distributional assumption about noise is implicitly chosen as well.
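For the categorical case, the correspondence can be written out by hand. The sketch below computes the cross-entropy of one example from raw logits and compares it with the negative log of the softmax probability of the true class. The function `cross_entropy` here is a hand-written illustration, not PyTorch's implementation, although PyTorch's documentation describes `CrossEntropyLoss` as a log-softmax followed by a negative log-likelihood loss, which is the same quantity.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability that softmax(logits) assigns to class `target`."""
    m = max(logits)   # subtract the max before exponentiating for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target]   # equals -ln softmax(logits)[target]

logits = [2.0, 0.5, -1.0]   # made-up scores for 3 classes
softmax = [math.exp(z) / sum(math.exp(v) for v in logits) for z in logits]

print(cross_entropy(logits, target=0))
print(-math.log(softmax[0]))   # same number: the categorical negative log-likelihood
```

Averaging this quantity over a batch gives the mean negative log-likelihood of the labels under the model, so lowering it raises the likelihood.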
Which neural-network loss is equivalent to maximizing the Bernoulli likelihood (binary classification)?
Properties of MLE: Why It Works So Well
MLE has a set of properties that make it the default choice for parameter estimation. The first three are asymptotic: they hold for sufficiently large n (under standard regularity conditions), and in most practical tasks n is large enough. Invariance holds for any n.
| Property | Meaning | Practical consequence |
|---|---|---|
| Consistency | θ̂_MLE ->_P θ as n->∞ | Given enough data, MLE converges to the correct parameter |
| Asymptotic normality | √n(θ̂-θ) ->_d N(0, I⁻¹(θ)), where I(θ) is the Fisher information of one observation | Approximate confidence intervals are built from formulas |
| Asymptotic efficiency | Asymptotically attains the Cramér-Rao bound | For large n, no unbiased estimator has lower variance |
| Invariance | g(θ̂) is MLE for g(θ) | MLE for σ² gives MLE for σ; MLE for μ gives MLE for e^μ, etc. |
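Asymptotic normality can be checked by simulation. The sketch below repeats a Bernoulli experiment many times and compares the spread of the MLE p̂ = k/n with the theoretical variance p(1-p)/n, the inverse Fisher information divided by n. The true parameter 0.3, the sample size 400, the 2000 repetitions and the seed are all arbitrary assumptions.

```python
import random
import statistics

random.seed(0)               # fixed seed so the run is repeatable
p_true, n, trials = 0.3, 400, 2000

# Repeat the experiment many times; each run yields one MLE p_hat = k / n.
estimates = [
    sum(random.random() < p_true for _ in range(n)) / n
    for _ in range(trials)
]

empirical_var = statistics.pvariance(estimates)
theory_var = p_true * (1 - p_true) / n   # inverse Fisher information divided by n

print(statistics.mean(estimates))   # close to p_true (consistency)
print(empirical_var, theory_var)    # the two should be close
```

The square root of `theory_var`, with p̂ plugged in for the unknown p, is the standard error used in the usual approximate confidence interval p̂ ± 1.96 * SE.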
Invariance: The Most Useful Property
No need to solve the problem from scratch
If θ̂_MLE = X̄ for μ of the normal distribution, then:

- MLE for e^μ = e^(X̄) (no new computation!)
- MLE for 1/σ = 1/σ̂ (same principle)
- MLE for μ/σ = X̄/σ̂ (standardized mean)

Example from ML: if logit_hat is the MLE for the logit, logit(p) = log(p/(1-p)), then the MLE for p is sigmoid(logit_hat). This is why logistic regression gives MLE for probabilities directly.
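Invariance can be seen numerically by fitting the coin in the logit parameterization and mapping the result back. This is a sketch under assumed settings (step size 0.1, 500 iterations, start at z = 0); the gradient k - n*sigmoid(z) follows from differentiating the Bernoulli log-likelihood with respect to the logit.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

k, n = 7, 10        # 7 heads out of 10 flips
z = 0.0             # logit parameter, z = ln(p / (1 - p))
for _ in range(500):
    z += 0.1 * (k - n * sigmoid(z))   # gradient ascent on the log-likelihood in logit form

print(z, math.log(0.7 / 0.3))   # MLE of the logit vs logit(k/n)
print(sigmoid(z))               # mapped back through the sigmoid: the MLE of p
```

Optimizing over z and then applying the sigmoid gives the same answer as optimizing over p directly, so the choice of parameterization is a matter of convenience.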
Which property of MLE lets us get the MLE for any function g(μ) once we have the MLE for μ - without new optimization?
Where MLE Lives Right Now
**Final frame**: when PyTorch runs loss.backward() with cross-entropy loss, automatic differentiation computes the gradient of the negative log-likelihood (typically on a GPU), and the optimizer step that follows moves the parameters toward higher likelihood. That is Fisher's 1922 method, solved numerically: gradient descent over parameters is numerical MLE. Choosing a loss function is choosing a distribution. Understanding this connection shifts the intuition about ML: not "we minimize error", but "we maximize the likelihood of the data under the model".
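To make the last point concrete without any framework, here is a pure-Python sketch of logistic regression on a made-up one-dimensional dataset. The gradient is derived by hand, and an automatic-differentiation backward pass should produce the same values. The dataset, the step size of 0.5 and the 500 iterations are assumptions chosen for illustration.

```python
import math

# Made-up 1-D dataset: feature x, binary label y (classes overlap, so the MLE is finite).
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]

def predict(w, b, x):
    return 1 / (1 + math.exp(-(w * x + b)))

def nll(w, b):
    """Mean binary cross-entropy = negative Bernoulli log-likelihood divided by n."""
    return -sum(
        y * math.log(predict(w, b, x)) + (1 - y) * math.log(1 - predict(w, b, x))
        for x, y in data
    ) / len(data)

def grad(w, b):
    """Hand-derived gradient of nll with respect to (w, b)."""
    gw = sum((predict(w, b, x) - y) * x for x, y in data) / len(data)
    gb = sum((predict(w, b, x) - y) for x, y in data) / len(data)
    return gw, gb

w, b = 0.0, 0.0
start = nll(w, b)
for _ in range(500):
    gw, gb = grad(w, b)
    w, b = w - 0.5 * gw, b - 0.5 * gb   # descent on NLL = ascent on the log-likelihood

print(start, nll(w, b))   # the loss falls as the likelihood rises
print(grad(w, b))         # near zero: the score equation is (approximately) solved
```

In a framework, `grad` is what the backward pass supplies and the update line is what the optimizer does; the statistical content is unchanged.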
From a statistical viewpoint, what does `loss.backward()` in PyTorch do when the loss is cross-entropy?
Practice: MLE for the Gaussian from Scratch
For a normal sample X₁,...,Xₙ ~ N(μ, σ²), what is the MLE for σ²?
Key Takeaways
- **Likelihood L(θ|X)** is the probability (or density) of the observed data, viewed as a function of the parameter: data are fixed, θ varies
- **MLE = argmax L(θ)**: choose the parameter under which the observed data are most probable
- **Algorithm**: L(θ) -> log L(θ) = Σ ln f(Xi|θ) -> dℓ/dθ = 0 -> verify it is a maximum
- **Cross-entropy = NLL**: minimizing BCE in PyTorch = maximizing Bernoulli log-likelihood = MLE
- **Properties**: consistent, asymptotically normal, asymptotically efficient, invariant
- **Unified theory**: regression (MSE = Gaussian NLL), classification (BCE = Bernoulli NLL), LLM (CE = Categorical NLL) - MLE everywhere
What's Next
MLE gives a point estimate. Next - how to express uncertainty.
- Confidence Intervals — Asymptotic normality of MLE directly yields confidence intervals via Fisher information
- Bayesian Approach — MLE = MAP with a uniform prior. Bayesian inference adds a prior - a step from MLE to the posterior
- EM Algorithm — MLE for models with latent variables (GMM, HMM) via iterative E and M steps
- Logistic Regression — MLE for Bernoulli with parameter p = σ(w^T x) - the baseline classifier in ML