Logistic Regression: Probability as a Curved Line
A logistic regression fitted to all 23 prior launch records could have flagged the Challenger launch in 1986: the fitted curve puts P(failure) above 0.99 at 28°F. The tool existed. No one used it. Today that same tool powers credit scoring, spam filters, and medical diagnostics worldwide.
- Credit scoring: FICO and bank scorecards use logistic regression as the baseline
- Spam filters: logistic regression on 100+ features as the first layer (Gmail)
- CTR prediction: logistic regression at Facebook Ads, trained on billions of examples in real time
- Medical diagnostics: disease probability from biomarker values
- A/B testing: logistic regression for conversion rate with covariate adjustment
- scikit-learn LogisticRegression: several solvers (lbfgs, liblinear, newton-cg, newton-cholesky, sag, saga), with L1/L2/ElasticNet regularization built in
**January 28, 1986. Cape Canaveral. 28°F (-2°C).** Challenger disintegrates 73 seconds after liftoff. Seven astronauts die. The Presidential Commission will later establish: the night before launch, Morton Thiokol engineers had field data on O-ring behavior at various temperatures. They plotted a scatter chart - but only for flights that had experienced anomalies. Data from incident-free launches was excluded from the analysis. Had they fit a logistic regression P(failure | temperature) to all 23 prior launches - the S-curve would have shown P > 0.99 at -2°C. **The tool existed. No one reached for it.** Logistic regression could have saved seven lives.
**What this lesson actually teaches**: not yet another classification formula, but why binary outcomes require a dedicated framework - and how the logit link solves the problem OLS fundamentally cannot. After this lesson the mental map reads: logit -> odds ratio -> MLE -> ROC/AUC -> regularization. This is the foundation behind credit scoring, spam filtering, medical diagnostics, and CTR prediction.
The Problem: Why OLS Fails for Probabilities
**Ordinary Least Squares (OLS)** is designed for a continuous response Y ∈ (-∞, +∞). When Y is binary - 0 or 1 - a fundamental contradiction emerges. The model Ŷ = Xβ can predict any real number, including negative values and values above 1. For a probability, this is meaningless.
**LPM (Linear Probability Model)** is sometimes used in econometrics deliberately - it applies OLS to binary Y for interpretability when effects are small and probabilities stay near the center. This is a conscious trade-off, not a recommendation. Near P = 0 or P = 1, LPM produces out-of-range predictions and structurally incorrect confidence intervals.
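
The out-of-range behaviour of the LPM is easy to reproduce. The sketch below fits a one-variable OLS line to synthetic binary data; the data-generating process (a sigmoid of 3x), the sample size, and the helper name `fit_ols` are assumptions chosen for illustration, not part of the lesson's dataset.

```python
import math
import random

def fit_ols(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic data (assumption): the true P(Y=1 | x) is sigmoid(3x), x uniform on [-3, 3].
rng = random.Random(0)
xs = [rng.uniform(-3.0, 3.0) for _ in range(2000)]
ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-3.0 * x)) else 0 for x in xs]

intercept, slope = fit_ols(xs, ys)
lpm_low = intercept + slope * (-3.0)   # LPM "probability" at the left edge
lpm_high = intercept + slope * 3.0     # LPM "probability" at the right edge
print(f"LPM prediction at x=-3: {lpm_low:.3f}, at x=+3: {lpm_high:.3f}")
```

Both edge predictions fall outside [0, 1], even though every observed outcome is 0 or 1 and the true probabilities never leave that interval.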
The Logit Link: An Elegant Solution
What is needed is a link function that maps the linear predictor (-∞, +∞) into a probability (0, 1). The **logistic function (sigmoid)** σ(z) = 1/(1+e^(-z)) does exactly that: monotonically increasing, strictly bounded between 0 and 1, and symmetric in the sense that σ(-z) = 1 - σ(z). Its inverse - the **logit** - maps a probability back to log-odds, which live on the entire real line.
**Sigmoid intuition**: at z = 0 the model is maximally uncertain (p = 0.5). The larger |z|, the more confident the prediction. At z = +4.6, p ≈ 0.99. At z = -4.6, p ≈ 0.01. Gradient descent moves the weights so that sigmoid outputs approach 1 for positive examples and 0 for negative ones.
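
The sigmoid and logit can be checked numerically with a few lines of standard-library Python. This is a minimal sketch; the function names `sigmoid` and `logit` are my own, and the two-branch form of `sigmoid` is just one common way to avoid overflow for large negative z.

```python
import math

def sigmoid(z):
    """Map log-odds z in (-inf, +inf) to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)            # safe: z < 0, so exp(z) cannot overflow
    return e / (1.0 + e)

def logit(p):
    """Inverse of the sigmoid: map a probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))

for p in (0.01, 0.25, 0.50, 0.75, 0.99):
    odds = p / (1.0 - p)
    z = logit(p)
    # The round trip sigmoid(logit(p)) should return the original p.
    print(f"p={p:.2f}  odds={odds:.4f}  logit={z:+.3f}  sigmoid(logit)={sigmoid(z):.2f}")
```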
Interpreting Coefficients via Odds Ratios
Coefficient β_j in logistic regression means: a one-unit increase in x_j (all others fixed) changes log-odds by β_j. This is not intuitive. The more natural interpretation is the **Odds Ratio (OR)**: exp(β_j) tells how many times the odds are multiplied per unit change in the predictor.
**OR ≠ Relative Risk (RR).** Whenever an effect exists (OR ≠ 1), the Odds Ratio is further from 1 than the Relative Risk, and the gap matters substantially when baseline probability exceeds 0.10. A classic error in medical literature: interpreting OR as RR. Example: at baseline p = 0.5 the odds are 1; OR = 3.0 raises them to 3, so p becomes 0.75 and RR = 0.75/0.5 = 1.5, not 3.0. Confidence intervals for OR via the delta method or bootstrap are standard in clinical reporting.
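
The OR-to-RR conversion is plain arithmetic on odds, and a small helper makes the gap visible. This is a sketch; `relative_risk_from_or` is a hypothetical helper name, not a library function.

```python
def relative_risk_from_or(odds_ratio, baseline_p):
    """Convert an odds ratio into a relative risk at a given baseline probability."""
    baseline_odds = baseline_p / (1.0 - baseline_p)
    new_odds = baseline_odds * odds_ratio      # the OR acts multiplicatively on odds
    new_p = new_odds / (1.0 + new_odds)        # back from odds to probability
    return new_p / baseline_p

for p0 in (0.01, 0.10, 0.50):
    rr = relative_risk_from_or(3.0, p0)
    print(f"baseline p={p0:.2f}: OR=3.0 corresponds to RR={rr:.3f}")
```

At a rare baseline (p = 0.01) the two measures nearly coincide; at p = 0.5 the same OR of 3.0 corresponds to an RR of only 1.5.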
MLE and Binary Cross-Entropy: One Function, Two Names
Logistic regression estimates parameters through **Maximum Likelihood Estimation (MLE)**. The likelihood for binary data is the product of p_i^y_i * (1-p_i)^(1-y_i) across all observations. The log-likelihood is a sum, and that is what gets maximized. Negated and normalized, it equals **binary cross-entropy** - the same loss function minimized by neural networks in binary classification.
**Connection to neural networks**: sklearn LogisticRegression uses L-BFGS (quasi-Newton) by default. PyTorch/JAX models are usually trained with SGD or Adam. For n < 10K samples L-BFGS typically converges faster and more precisely. At n > 100K mini-batch SGD is often preferable because it never needs the full dataset in memory. The same scaling argument is why deep learning frameworks rely on Adam rather than L-BFGS.
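
To make the optimization view concrete, here is a minimal sketch of full-batch gradient descent on binary cross-entropy for a one-feature model. The six data points, the learning rate, and the step count are made-up values for illustration; it uses plain gradient descent, not L-BFGS or Adam, and applies no regularization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(b0, b1, xs, ys):
    """Mean binary cross-entropy of the model p = sigmoid(b0 + b1 * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(xs)

def fit_gd(xs, ys, lr=0.1, steps=20000):
    """Full-batch gradient descent; the BCE gradient is mean((p - y) * [1, x])."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Toy data (assumption): overlapping classes, so the MLE is finite.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]

b0, b1 = fit_gd(xs, ys)
print(f"b0={b0:.3f}, b1={b1:.3f}, BCE={bce(b0, b1, xs, ys):.4f}")
```

At the maximum-likelihood solution the residuals p_i - y_i sum to zero, which is a quick sanity check that the descent has converged.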
Evaluating Quality: Beyond Accuracy
**Accuracy** is the most misleading metric under **class imbalance**. If 99% of samples are negative (fraud detection: 99% of transactions are legitimate), a model that always predicts 0 achieves 99% accuracy and zero utility. The right evaluation tools are built on the confusion matrix and probabilistic thresholds.
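
The accuracy paradox can be reproduced with a constant predictor. The sketch below uses a made-up dataset of 1000 transactions with 10 fraud cases; the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Illustrative data: 1% fraud, and a "model" that always predicts 0.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, fraud cases missed={fn}")
```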
**Calibration vs Discrimination.** AUC-ROC measures ranking ability (discrimination) - whether positives score higher than negatives. Calibration measures whether p=0.7 actually means a 70% empirical probability. Models can have high AUC but poor calibration (neural networks are notorious for this). Platt scaling and isotonic regression are the two classical post-training calibration methods. In credit scoring, calibration is critical: p=0.15 should mean a 15% real default rate.
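
A reliability table is the simplest calibration check: bin the predictions and compare the mean predicted probability with the observed positive rate in each bin. The sketch below uses made-up scores for two hypothetical models that rank the examples identically (same AUC) but differ in calibration; `reliability_table` and `max_gap` are illustrative helper names.

```python
def reliability_table(probs, labels, n_bins=5):
    """Equal-width bins: (mean predicted p, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table.append((mean_pred, observed, len(members)))
    return table

def max_gap(table):
    """Largest |predicted - observed| across bins."""
    return max(abs(pred - obs) for pred, obs, _ in table)

# Toy data (assumption): 1 positive among the first 10 cases, 7 among the last 10.
labels = [1] * 1 + [0] * 9 + [1] * 7 + [0] * 3
calibrated = [0.1] * 10 + [0.7] * 10        # scores match the observed rates
overconfident = [0.02] * 10 + [0.95] * 10   # same ranking, pushed toward 0 and 1

print("calibrated gap:   ", round(max_gap(reliability_table(calibrated, labels)), 3))
print("overconfident gap:", round(max_gap(reliability_table(overconfident, labels)), 3))
```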
Regularization: Ridge (L2) and Lasso (L1)
Without regularization, logistic regression overfits when features are many - especially under multicollinearity. In sklearn **C = 1/λ**: smaller C means stronger penalty on large coefficients. Default C=1.0 is moderate L2 regularization. The choice of regularization type controls model behavior on sparse data.
**In production** regularization is almost always necessary, especially with:

1. n < 1000 and many features
2. high-cardinality categoricals after one-hot encoding
3. NLP features where vocabulary size >> n

Example: spam filter with 50K word features and 10K emails - without L1, the model overfits and half the coefficients are statistically unreliable.
Where Logistic Regression Lives Right Now
**Logistic Regression in Production (2026)**

Despite the deep learning boom, logistic regression remains the core of many critical production systems because of interpretability, speed, and regulatory requirements.

- **Credit scoring (FICO, banks)** - Predicting P(default) from applicant financial history. Regulators (Basel III, GDPR) require explainability, and odds ratios give a direct way to explain a rejection to a customer. In the US, the Equal Credit Opportunity Act and the Fair Credit Reporting Act require lenders to state the reasons for adverse action.
- **Spam detection (Gmail baseline)** - P(spam | features) on header + body features. The first generation of Gmail spam filter was logistic regression with L1 on bag-of-words features. It is still used as a fast baseline and interpretability layer above BERT-based systems.
- **Medical diagnosis** - P(cancer | imaging features) or P(sepsis | vital signs). Interpretable models (LR, decision trees) remain common among clinical risk models. ICU severity scores such as SAPS II and APACHE II convert their point totals into predicted mortality through a logistic regression equation.
- **Ad CTR prediction (feature engineering layer)** - P(click | user, ad, context) in real-time bidding. Google, Meta - LR over billions of binary features with L1 regularization. Neural networks are added on top as the deep component in Wide & Deep style architectures, while the linear (LR) part remains the memorization layer.
- **Churn prediction (SaaS)** - P(cancellation | usage patterns) from product metrics. Salesforce Einstein, HubSpot - LR with feature importance for Customer Success teams. The OR per feature explains to the team why a specific customer is at risk.
- **A/B tests with binary outcome** - Conversion rate analysis with covariate control. Logistic regression with treatment as a predictor is the natural approach when an A/B analysis on a binary outcome needs covariate adjustment, which a plain chi-square test cannot provide. CUPED for variance reduction builds on the same covariate-adjustment idea.
**Final frame**: in 2026, when every startup is adding an LLM, a large share of production binary classification decisions in regulated industries are still made by logistic regression. Not because neural networks are inferior, but because regulators require each decision to be explained. OR = 0.30 for 'has_default' explains a loan rejection to a customer. A 128-layer transformer cannot.
Practice: The Challenger Disaster on Real Data
Interview prep - key logistic regression questions:

**Q: Dataset: 99.5% of transactions are legitimate, 0.5% are fraud. A model that always predicts 0 achieves 99.5% accuracy. The stakeholder says: 'Great result!' How would the problem be explained, and which metrics should be used instead?**

Key points:
- Accuracy under severe imbalance is trivially high for a degenerate model. 99.5% accuracy = 0% recall for fraud. Completely useless.
- Precision = TP/(TP+FP) = 0/0 (undefined when zero positives are predicted). Recall = 0. F1 = 0.

**Q: Why does binary cross-entropy in neural networks mathematically equal the log-likelihood of logistic regression? What does this imply about a neural network with a sigmoid output?**

Key points:
- BCE = -1/n * sum[y*log(p) + (1-y)*log(1-p)]. Log-likelihood = sum[y*log(p) + (1-y)*log(1-p)]. So BCE = -1/n * log-likelihood, and minimizing BCE = maximizing log-likelihood.
- A neural network with one linear layer + sigmoid + BCE loss is exactly logistic regression. SGD on it converges to the same solution as L-BFGS in sklearn, given a small enough learning rate, enough epochs, and the same regularization (sklearn applies L2 with C=1.0 by default).

**Q: Feature 'income' after StandardScaler has coefficient beta = 0.8. How should this coefficient be interpreted? How does the interpretation change compared to the unstandardized version?**

Key points:
- After StandardScaler, one unit of x_std corresponds to one standard deviation of the original feature. Beta = 0.8 means: income increasing by 1 standard deviation increases log-odds by 0.8, OR = exp(0.8) = 2.23.
- In original units: beta_original = beta_std / std(x). If std(income) = $50K, then beta_original = 0.8/50000 = 0.000016 per dollar.

**Q: How does multinomial logistic regression work for K > 2 classes? What are the differences between the softmax (multinomial) and One-vs-Rest (OvR) strategies?**

Key points:
- Softmax (multinomial): P(y=k|x) = exp(x @ beta_k) / sum_j exp(x @ beta_j). One model, K sets of coefficients. Probabilities sum to 1 across classes. Equivalent to K-1 free parameter sets (one class is the reference).
- One-vs-Rest (OvR): K independent binary logistic regressions, each trained '1 class vs all others'. Probabilities from the K models do not sum to 1 - normalization or isotonic calibration is needed.
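
The softmax versus One-vs-Rest contrast from the last question can be seen on three class scores. This is a sketch under a simplifying assumption: the same hypothetical linear predictors are fed to both schemes, whereas in practice the OvR scores would come from K separately trained binary models.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

scores = [2.0, 1.0, -1.0]   # hypothetical x @ beta_k for K = 3 classes

multinomial = softmax(scores)                    # sums to 1 by construction
ovr_raw = [sigmoid(s) for s in scores]           # K independent probabilities
ovr_total = sum(ovr_raw)
ovr_normalized = [p / ovr_total for p in ovr_raw]

# Shifting every score by the same constant leaves softmax unchanged, which is
# why only K-1 coefficient sets are free (one class can serve as the reference).
shifted = softmax([s - scores[-1] for s in scores])

print("softmax:       ", [round(p, 3) for p in multinomial])
print("OvR raw sum:   ", round(ovr_total, 3))
print("OvR normalized:", [round(p, 3) for p in ovr_normalized])
```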
The logit link: why OLS fails for probabilities
OLS on binary Y can predict probabilities outside [0,1] - mathematically meaningless. The logit link g(mu) = log(mu/(1-mu)) maps (0,1) to (-inf,+inf), allowing a linear predictor while keeping predictions as valid probabilities.
The S-curve shape of sigma(eta) = 1/(1+exp(-eta)) means the effect of a predictor is largest near P=0.5 and diminishes at extremes - reflecting the diminishing returns of additional evidence when already near certainty.
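
The shrinking effect at the extremes follows from the derivative of the sigmoid: for a linear predictor eta, dP/dx = beta * P * (1 - P). A minimal sketch, with `marginal_effect` as a hypothetical helper name and beta = 1 chosen for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(beta, eta):
    """Change in probability per unit of x at linear predictor eta: beta * p * (1 - p)."""
    p = sigmoid(eta)
    return beta * p * (1.0 - p)

for eta in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(f"eta={eta:+.1f}  P={sigmoid(eta):.3f}  dP/dx={marginal_effect(1.0, eta):.4f}")
```

The effect peaks at beta/4 when P = 0.5 and falls toward zero as P approaches 0 or 1.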
A logistic regression gives eta = -2.5 for a customer. What is the predicted default probability?
Odds ratios and coefficient interpretation
Logistic regression coefficients are log-odds. exp(beta) is the odds ratio: for a one-unit increase in x, the odds of Y=1 are multiplied by exp(beta). This multiplicative interpretation requires care - unlike OLS additive effects.
Odds ratios are not risk ratios. OR=2 does not mean the probability doubled. For rare events (P<10%) OR ≈ RR, but for common outcomes the difference can be substantial.
Logistic regression for churn: beta_age = -0.05. What does this mean?
ROC/AUC, calibration, and imbalanced classes
ROC-AUC measures ranking ability - P(score(positive) > score(negative)). It is threshold-independent. Calibration measures whether predicted P=0.7 corresponds to 70% actual positives. Both are needed: a well-ranked but poorly calibrated model misleads decision-making.
Under severe imbalance (99:1), accuracy is deceptive - a constant-0 predictor achieves 99% accuracy. Use PR-AUC (precision-recall), which focuses on the minority class. Decision threshold must be adjusted based on the cost ratio FN/FP.
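
The link between cost ratio and threshold can be written down directly: flag a case when the expected cost of missing it, p * cost_fn, exceeds the expected cost of a false alarm, (1 - p) * cost_fp, which gives the threshold cost_fp / (cost_fp + cost_fn). The sketch below assumes the predicted probabilities are calibrated; the cost values are made up.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Threshold on a calibrated probability that minimizes expected misclassification cost."""
    return cost_fp / (cost_fp + cost_fn)

def expected_cost(p, predict_positive, cost_fp, cost_fn):
    """Expected cost of a decision for a case whose true positive probability is p."""
    return (1.0 - p) * cost_fp if predict_positive else p * cost_fn

# Illustrative costs: a missed fraud costs 50x more than a false alarm.
threshold = cost_optimal_threshold(cost_fp=1.0, cost_fn=50.0)
print(f"flag when p >= {threshold:.4f}")

p = 0.05  # far below 0.5, yet flagging is still the cheaper action
print("cost if flagged:    ", expected_cost(p, True, 1.0, 50.0))
print("cost if not flagged:", expected_cost(p, False, 1.0, 50.0))
```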
A fraud detection model achieves 99.5% accuracy but recalls 0% of fraud. What is wrong?
Regularization: L1, L2, and production deployment
L2 (Ridge) shrinks all coefficients toward zero - reduces variance, keeps all features. L1 (Lasso) produces exact zeros - automatic feature selection. In sklearn LogisticRegression, C = 1/lambda (smaller C = stronger regularization).
Logistic regression is still a dominant production model for credit scoring (regulatory interpretability required), medical diagnostics (interpretable models are easier to validate and justify clinically), and ad CTR prediction (billions of sparse binary features with FTRL-L1).
You want logistic regression to automatically select relevant features from 500 candidates. Which regularization should you use?
Task: predict P(loan approved) from applicant age.

OLS fit on bank data: P(approved | age) = -0.005 * age + 0.80

Predictions:
- age = 25 -> P = -0.005*25 + 0.80 = 0.675 (OK)
- age = 55 -> P = -0.005*55 + 0.80 = 0.525 (OK)
- age = 5 -> P = -0.005*5 + 0.80 = 0.775 (in range, but a 5-year-old applying for credit?)
- age = 80 -> P = -0.005*80 + 0.80 = 0.400 (OK)
- age = 200 -> P = -0.005*200 + 0.80 = -0.20 (invalid: negative probability)

At extreme X values, the predicted probability escapes [0, 1]. Additional failures: heteroscedasticity (Var(Y | X) = p(1-p) depends on X) and non-normal residuals. OLS on binary Y is broken by design.
- Odds - the ratio of success probability to failure probability: odds = p / (1 - p)
- Logit - the natural log of odds: logit(p) = log(p / (1-p))
- Sigmoid - the inverse transformation from logit to p: sigma(z) = 1 / (1 + e^(-z))

Examples:

| p | odds | logit |
|---|---|---|
| 0.01 | 0.0101 | -4.595 |
| 0.25 | 0.333 | -1.099 |
| 0.50 | 1.000 | 0.000 |
| 0.75 | 3.000 | +1.099 |
| 0.99 | 99.0 | +4.595 |

The logistic regression model:

log(p_i / (1-p_i)) = beta_0 + beta_1*x_1 + ... + beta_k*x_k

equivalently:

p_i = sigma(beta_0 + beta_1*x_1 + ... + beta_k*x_k) = 1 / (1 + exp(-(beta_0 + beta_1*x_1 + ... + beta_k*x_k)))

On the log-odds scale the model is linear - this is the GLM with logit link.
| Feature | Coefficient β | Odds Ratio exp(β) | Interpretation |
|---|---|---|---|
| Income (+$1K) | +0.30 | 1.35 | Each $1K of income increases approval odds by 35% |
| Prior defaults (+1) | -1.20 | 0.30 | Each prior default reduces approval odds by 70% |
| Loan term (+1 year) | -0.08 | 0.92 | Each additional year of term reduces odds by 8% |
| Collateral (0/1) | +1.50 | 4.48 | Having collateral multiplies approval odds by 4.5x |
| Age (+1 year) | +0.02 | 1.02 | Each year of age adds 2% to odds (weak effect) |
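
The odds-ratio column of the table can be reproduced by exponentiating each coefficient. The sketch below uses the coefficients exactly as listed; the dictionary keys are shortened labels of my own.

```python
import math

# Coefficients copied from the table above (log-odds per unit change).
coefficients = {
    "income_per_1k": 0.30,
    "prior_defaults": -1.20,
    "loan_term_years": -0.08,
    "collateral": 1.50,
    "age_years": 0.02,
}

odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, oratio in odds_ratios.items():
    pct = (oratio - 1.0) * 100.0   # percent change in odds per unit increase
    print(f"{name:16s} OR={oratio:.2f}  odds change={pct:+.0f}%")

# Coefficients add on the log-odds scale, so odds ratios multiply:
# an applicant with collateral and one prior default has combined OR exp(1.50 - 1.20).
combined = odds_ratios["collateral"] * odds_ratios["prior_defaults"]
print(f"collateral + one prior default: combined OR={combined:.2f}")
```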
Log-likelihood of logistic regression:

l(beta) = sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ], where p_i = sigma(X_i @ beta)

Binary Cross-Entropy (BCE) in PyTorch/TensorFlow:

BCE = -1/n * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

The only difference is sign and normalization by n: l(beta) = -n * BCE. Maximizing log-likelihood = minimizing BCE.

PyTorch's BCEWithLogitsLoss does exactly this - and also ensures numerical stability by fusing sigmoid + log via the log-sum-exp trick.

Key implication: a neural network consisting of a single linear layer with a sigmoid output and BCE loss is exactly logistic regression. GLMs and neural networks are one family.
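
The identity l(beta) = -n * BCE, and the benefit of computing the loss from logits, can both be checked numerically. The sketch below uses the standard rearrangement max(z, 0) - z*y + log(1 + exp(-|z|)); it illustrates the idea behind fused sigmoid-plus-log losses and is not a copy of any framework's implementation. The logits and labels are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(ps, ys):
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for p, y in zip(ps, ys))

def bce(ps, ys):
    return -log_likelihood(ps, ys) / len(ys)

def bce_from_logits(zs, ys):
    """Stable BCE computed from logits: never evaluates log(0) or overflows exp."""
    total = 0.0
    for z, y in zip(zs, ys):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(ys)

zs = [2.0, -1.0, 0.5, -0.3, 1.2]   # illustrative logits
ys = [1, 0, 1, 1, 0]
ps = [sigmoid(z) for z in zs]

print("log-likelihood :", round(log_likelihood(ps, ys), 6))
print("-n * BCE       :", round(-len(ys) * bce(ps, ys), 6))
print("BCE from logits:", round(bce_from_logits(zs, ys), 6))

# With extreme logits the naive route would hit log(0); the logit form stays finite.
print("extreme logits :", bce_from_logits([1000.0, -1000.0], [0, 1]))
```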
Confusion matrix at threshold t (predict 1 if p >= t, else 0):

| | Actual 0 | Actual 1 |
|---|---|---|
| Pred 0 | TN | FN |
| Pred 1 | FP | TP |

- TPR (Recall/Sensitivity) = TP / (TP + FN) - of all actual 1s, fraction caught
- FPR (Fall-out) = FP / (FP + TN) - of all actual 0s, fraction falsely flagged

ROC curve: TPR(t) vs FPR(t) for all t in [0,1].

AUC-ROC = P(score(pos) > score(neg)) = probability that the model ranks a random positive higher than a random negative.

Numerical example: two models, AUC_A = 0.72 and AUC_B = 0.91. Model B: given a random (fraud, legitimate) pair, 91% of the time it correctly scores fraud above legitimate.

AUC-PR (Precision-Recall) - better under severe imbalance:
- Precision = TP / (TP + FP) - of predicted 1s, fraction truly 1
- Recall = TP / (TP + FN)
- PR-AUC = mean precision at varying recall levels
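
The ranking definition of AUC-ROC translates directly into code: count, over all (positive, negative) pairs, how often the positive receives the higher score. The sketch below is a quadratic-time illustration on six made-up scores, with ties counted as half a win; production code would normally use a sort-based implementation instead.

```python
def roc_auc(scores, labels):
    """AUC as P(score(pos) > score(neg)), ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def tpr_fpr(scores, labels, threshold):
    """One point of the ROC curve: (TPR, FPR) when predicting 1 for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg

# Toy scores (assumption): one positive (0.35) is ranked below one negative (0.60).
scores = [0.90, 0.80, 0.35, 0.60, 0.30, 0.10]
labels = [1, 1, 1, 0, 0, 0]

auc = roc_auc(scores, labels)
tpr, fpr = tpr_fpr(scores, labels, threshold=0.5)
print(f"AUC={auc:.4f}  TPR@0.5={tpr:.3f}  FPR@0.5={fpr:.3f}")
```

Eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9.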
| Metric | Formula | When to use | Key pitfall |
|---|---|---|---|
| Accuracy | (TP+TN) / total | Balanced classes only | 99% neg -> 99% accuracy with zero utility |
| AUC-ROC | P(score_pos > score_neg) | General case, model comparison | Does not measure calibration |
| PR-AUC | Integral of P-R curve | Severe class imbalance (fraud, rare disease) | Sensitive to class prevalence |
| Log-loss | -mean[y*log(p) + (1-y)*log(1-p)] | Calibration check, production monitoring | Sensitive to extreme probabilities |
| F1-score | 2*P*R / (P+R) | Single-number imbalanced classification | beta=1 is not always the right tradeoff |
Penalized log-likelihood:
- L2 (Ridge): maximize l(beta) - lambda * sum_j(beta_j^2)
- L1 (Lasso): maximize l(beta) - lambda * sum_j(|beta_j|)

**L2 (sklearn: penalty='l2', C=...) - Ridge:**
- Shrinks all coefficients toward zero without zeroing any
- Handles multicollinearity well (distributes weight across correlated features)
- Smooth (differentiable everywhere) - compatible with L-BFGS

**L1 (sklearn: penalty='l1', solver='saga') - Lasso:**
- Zeros out coefficients for irrelevant features
- Automatic feature selection
- Better for sparse data: NLP bag-of-words, ad click logs

**Elastic Net (penalty='elasticnet', solver='saga', l1_ratio=0.5):**
- Combination: lambda1*|beta| + lambda2*beta^2
- Selects among correlated groups AND zeros out irrelevant ones

**Practice:**
- C=10: weak regularization, trust the data
- C=1.0: default, moderate
- C=0.1: strong, use when overfitting or small n
- C=0.01: very strong, when features >> n
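
The difference between the two penalties shows up clearly on synthetic data where only a few features matter. The sketch below assumes NumPy and scikit-learn are installed; the data-generating process (30 features, 3 informative), the choice C=0.1, and the use of the liblinear solver for L1 are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 30 standard-normal features, only the first 3 drive the outcome.
rng = np.random.default_rng(0)
n, d = 300, 30
X = rng.normal(size=(n, d))
true_beta = np.zeros(d)
true_beta[:3] = [2.0, -2.0, 1.5]
p = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = (rng.random(n) < p).astype(int)

# Same strength (C = 1/lambda = 0.1), different penalty type.
l2_model = LogisticRegression(penalty="l2", C=0.1, solver="lbfgs", max_iter=1000).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
print(f"exact-zero coefficients: L2={n_zero_l2}, L1={n_zero_l1} (out of {d})")
```

L2 keeps every feature with a shrunken weight, while L1 sets many of the uninformative coefficients exactly to zero.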
Key takeaways
- **OLS is broken for binary Y**: it can produce predictions outside [0,1] and violates homoscedasticity. The logit link solves this elegantly.
- **Logit = log(odds)**: logistic regression is linear in log-odds space, nonlinear in probability space. The key equation: p = sigmoid(X @ beta).
- **Odds Ratio = exp(beta_j)**: a one-unit increase in x_j multiplies odds by OR. OR < 1 decreases odds, OR > 1 increases them. The primary interpretation tool in medical statistics and credit scoring.
- **Binary cross-entropy = -log-likelihood** (up to the 1/n factor): a single linear layer with sigmoid output and BCE loss is logistic regression. Depth adds nonlinearity to feature representation without changing the loss structure.
- **Accuracy is useless under imbalance**: use AUC-ROC (ranking quality), PR-AUC (precision-recall under severe imbalance), calibrated probabilities for risk scoring.
- **C = 1/lambda**: regularization is mandatory in production. L2 shrinks all coefficients; L1 zeros irrelevant ones - automatic feature selection with sparse data.
- **Challenger (1986)**: the data existed. The tool existed. Correct analysis on all available data would have predicted P(failure) > 0.99. Statistics is not an abstraction - it is an engineering discipline with real stakes.
Connections to other topics
Logistic regression sits at the intersection of GLM theory, decision theory, optimization, and modern ML. These connections explain why a model from 1958 remains relevant in 2026.
- GLM framework (stat-37) — Logistic regression is a special case of a GLM with Bernoulli distribution and logit link. Understanding GLM opens Poisson regression, Gamma regression, and their production applications.
- Hypothesis testing for coefficients (stat-05) — Wald test for H0: beta_j = 0, Likelihood Ratio Test for model comparison, z-scores for coefficient significance. Same ideas as OLS, different distribution.
- Model selection: AIC/BIC (stat-41) — How to choose between logistic regression models with different feature sets? AIC = -2*log-likelihood + 2k, BIC = -2*log-likelihood + k*log(n). Parsimony principle in action.
- Cross-validation (stat-42) — AUC estimation via k-fold CV is the standard pipeline. Nested CV for hyperparameter (C) tuning without data leakage.
- Multiple testing with many features (stat-19) — With 1000 features, p-values without Bonferroni / FDR correction produce spurious significant coefficients. LASSO regularization is an alternative approach to sparse feature selection.
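
The AIC and BIC formulas from the model selection item above are one-liners once the fitted log-likelihood is known. In this sketch the two log-likelihood values, parameter counts, and sample size are made-up numbers for two hypothetical models.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: -2 * log-likelihood + 2k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian information criterion: -2 * log-likelihood + k * log(n)."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical fits on n = 400 observations:
# model A has 3 parameters; model B adds 3 more for a slightly better fit.
n = 400
model_a = {"log_lik": -120.0, "k": 3}
model_b = {"log_lik": -118.5, "k": 6}

for name, m in (("A", model_a), ("B", model_b)):
    print(f"model {name}: AIC={aic(m['log_lik'], m['k']):.2f}  "
          f"BIC={bic(m['log_lik'], m['k'], n):.2f}")
```

Lower is better for both criteria; here the smaller model wins on each, because the gain in log-likelihood does not pay for the extra parameters.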