Machine Learning
Logistic Regression
From population curves to the logit
The S-shaped curve at the heart of logistic regression was born in demography, not machine learning. In 1838 the Belgian mathematician Pierre François Verhulst introduced the logistic function to model how populations grow quickly at first and then level off against limited resources. A century later, in 1944, the physician and statistician Joseph Berkson coined the term "logit" and championed using the logistic model for bioassay, against rivals who preferred the probit. In 1958 the British statistician David Cox formalized logistic regression as a general tool for modeling binary outcomes, giving us the method used today for classification across medicine, finance, and beyond.
Every time Gmail sends an email to spam, a bank blocks a suspicious transaction, or a doctor receives a preliminary AI diagnosis from a blood test - behind it is an algorithm that makes a decision: yes or no, ill or healthy, fraud or not. But how do you turn a set of numbers (patient features, transaction parameters) into a clear answer of "class 0" or "class 1"? And why does an algorithm with the misleading name "regression" actually solve a classification task?
- **Medical diagnostics:** logistic regression is used for preliminary disease screening from blood tests, mammography and symptoms - the model outputs the probability of the disease and the doctor chooses the threshold for further testing
- **Credit scoring:** banks assess the probability of loan default based on dozens of features (income, credit history, age) - logistic regression is popular precisely because its decisions are easy to explain to regulators and clients
- **Recommendation systems:** when Spotify decides whether to show you a playlist and Amazon whether to suggest a product, they estimate click probability (click-through rate) from hundreds of user and content features - a task for which logistic regression is a classic baseline
Prerequisites
Sigmoid function: from a number to a probability
In the linear regression lesson we learned to predict **continuous numbers**: house price, temperature, salary. But what if the task is different - to determine whether an email is **spam** or not? Whether a patient is **ill** or healthy? Whether to **approve** a loan or refuse? The answer here is not a number but a **class**: 0 or 1, yes or no. This is a **classification** task, and linear regression cannot handle it directly.
The problem is that linear regression outputs numbers from minus infinity to plus infinity: -3.7, 0.2, 158.4 - anything. But we need a **probability** between 0 and 1: "with probability 0.92 this is spam". We need a function that "squeezes" any number into the interval (0, 1). Such a function exists - it is called **sigmoid**: sigma(z) = 1 / (1 + e^(-z)). Large positive z gives values close to 1, large negative z gives values close to 0, and z = 0 gives exactly 0.5.
The idea of logistic regression is simple: we take the linear combination **z = w*x + b** (as in linear regression) and pass it through sigmoid. The result is the probability of belonging to class 1. With multiple features: **z = w1*x1 + w2*x2 + ... + wn*xn + b**. The weights w and bias b are parameters the model tunes during training.
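As a minimal sketch of the idea above, the following pure-Python snippet computes the linear combination z and passes it through sigmoid. The function names and the example weights are illustrative assumptions, not values from a trained model.

```python
import math

def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, w, b):
    """Probability of class 1 for one example x, given weights w and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights and features, chosen only for illustration.
w = [0.8, -0.5]
b = 0.1
x = [2.0, 1.0]
print(predict_proba(x, w, b))  # sigmoid(0.8*2.0 - 0.5*1.0 + 0.1) = sigmoid(1.2)
```

Training consists of adjusting `w` and `b` so that these probabilities match the observed labels as closely as possible.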
**Why can't MSE be used for logistic regression?** MSE (Mean Squared Error) works great for linear regression, but combined with sigmoid it produces a **non-convex** loss function, which can have flat regions and local minima where gradient descent stalls. That is why **Binary Cross-Entropy** is used: Loss = -[y * log(p) + (1 - y) * log(1 - p)], where log is the natural logarithm. If y=1, we penalize for a small p; if y=0, we penalize for a large p. This function is **convex** in the weights: any minimum it has is a global one, so gradient descent with a suitable learning rate cannot get trapped in a bad local minimum.
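The loss formula can be sketched in a few lines of pure Python. The clipping constant `eps` is an assumption added here to keep `log` away from zero; it is not part of the formula itself.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one example: y is the true label (0 or 1), p the predicted probability."""
    # Clip p away from 0 and 1 so log() never receives 0.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_bce(ys, ps):
    """Average loss over a dataset."""
    return sum(binary_cross_entropy(y, p) for y, p in zip(ys, ps)) / len(ys)

# Confident and correct gives a small loss; confident and wrong gives a large one.
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(1, 0.1))
print(binary_cross_entropy(0, 0.1))
```

The first and third calls return the same value, because predicting 0.9 for a true 1 is as good as predicting 0.1 for a true 0.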
Why is the linear combination z = w*x + b passed through the sigmoid function in logistic regression?
Decision Boundary
Sigmoid outputs a probability, but the final decision is binary: spam or not spam, approve or refuse. A **threshold** is needed: if probability >= 0.5, predict class 1, otherwise - class 0. The line (or surface) where the probability is exactly 0.5 is called the **decision boundary**. Everything on one side is class 1, on the other side is class 0.
When probability p = 0.5, sigmoid(z) = 0.5, which means z = 0. That is, the decision boundary is the set of points where **w*x + b = 0**. For two features: w1*x1 + w2*x2 + b = 0 - this is the equation of a **line** in the plane. For three features - a plane in 3D space. The decision boundary of logistic regression is always **linear** - this is both the strength and the limitation of the algorithm.
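For two features, the boundary equation can be solved for x2 to get a line you can plot or test points against. The sketch below assumes w2 is not zero, and the parameter values are hypothetical.

```python
def boundary_x2(x1, w1, w2, b):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (requires w2 != 0)."""
    return -(w1 * x1 + b) / w2

def predict_class(x1, x2, w1, w2, b):
    """Class 1 when z >= 0, which is the same condition as p >= 0.5."""
    z = w1 * x1 + w2 * x2 + b
    return 1 if z >= 0 else 0

# Hypothetical parameters for illustration.
w1, w2, b = 1.0, 2.0, -4.0
x1 = 2.0
x2_on_line = boundary_x2(x1, w1, w2, b)  # point exactly on the boundary
print(x2_on_line)
print(predict_class(x1, x2_on_line + 1.0, w1, w2, b))  # one side of the line
print(predict_class(x1, x2_on_line - 1.0, w1, w2, b))  # the other side
```

Which side is class 1 depends on the signs of the weights; with a positive w2, points above the line get z > 0.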
**Distance from boundary = model confidence.** A point far to the right of the boundary: z = +5, p = 0.993 - the model is 99.3% confident it is class 1. A point close to the boundary: z = +0.1, p = 0.525 - the model is barely confident (52.5%), the prediction is unreliable. This is exactly why in critical tasks (medicine, finance) one looks not just at the predicted class but at the **probability** - the degree of the model's confidence.
A threshold of 0.5 is not the only option. It can be adjusted depending on the task. In medical cancer screening, missing a sick patient (false negative) is more dangerous than a false alarm (false positive). So the threshold is lowered to **0.3**: the model will more often say "ill", increasing **recall** at the cost of some drop in **precision**. For a spam filter it's the opposite: it's better to miss spam than to send an important email to the trash - the threshold is raised to **0.7** or higher.
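The trade-off can be seen on a tiny example. The labels and probabilities below are made up for illustration; the function simply counts true positives, false positives, and false negatives at a given threshold.

```python
def precision_recall(y_true, probs, threshold):
    """Precision and recall when predicting class 1 for p >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical data: 4 positive and 6 negative examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.35, 0.65, 0.45, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall(y_true, probs, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

On this data, lowering the threshold to 0.3 catches every positive example at the price of more false alarms, while raising it to 0.7 removes the false alarms but misses most positives.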
**Limitation: linear decision boundary.** Logistic regression can only separate classes with a straight line (or hyperplane) in the space of the features it is given. If the data is separated nonlinearly - for example, one class inside a ring and the other outside - logistic regression on the raw features will fail. For such cases nonlinear methods are needed: SVM with kernels, decision trees, or neural networks.
A doctor uses logistic regression to diagnose a serious disease. What is best to do with the threshold?
Multi-class classification
So far we have separated data into **two** classes: spam/not spam, ill/healthy. But in reality there are often more classes: recognizing digits from 0 to 9 (10 classes), identifying the language of a text (200+ languages), classifying tumor type (dozens of variants). How do we adapt a binary classifier to a **multi-class** task?
The first approach is **One-vs-Rest (OvR)**, also known as One-vs-All. For N classes we train **N binary classifiers**: each separates one specific class from all the others. For digit recognition: classifier #0 answers the question "is this a 0 or not?", classifier #1 - "is this a 1 or not?", and so on. When predicting, we run all N classifiers and choose the class with the highest probability.
The second approach is **One-vs-One (OvO)**. We train a classifier for **each pair** of classes: "0 vs 1", "0 vs 2", "1 vs 2". For N classes that is **N*(N-1)/2** classifiers. For 10 digits: 10*9/2 = 45 classifiers. That sounds like a lot, but each is trained on less data (only examples of two classes), which can be faster. When predicting, each classifier "votes" for one of its two classes, and the class with the most votes wins.
**OvR vs OvO: when to use which?**
- **OvR** - faster (N classifiers instead of N*(N-1)/2), simpler, works well for most tasks.
- **OvO** - can be more accurate when classes are strongly imbalanced or boundaries between them are complex. Each classifier sees only 2 classes, which simplifies the task.
- **In practice:** older sklearn versions used the 'ovr' strategy for LogisticRegression by default. Since version 0.22 the default is 'auto', which selects 'multinomial' (softmax, covered in the next concept) for multi-class data unless the liblinear solver is used. OvR and OvO remain available explicitly through the OneVsRestClassifier and OneVsOneClassifier wrappers.
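The bookkeeping behind both strategies fits in a few lines. This sketch does not train anything: the per-class probabilities and pairwise winners are hypothetical classifier outputs for a single input, and the function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def n_classifiers(n_classes):
    """Number of binary classifiers each strategy trains: (OvR, OvO)."""
    ovr = n_classes
    ovo = n_classes * (n_classes - 1) // 2
    return ovr, ovo

def ovr_predict(class_probs):
    """OvR: pick the class whose 'this class vs the rest' probability is highest."""
    return max(class_probs, key=class_probs.get)

def ovo_predict(pairwise_winners):
    """OvO: each pairwise classifier votes for one class; the most-voted class wins."""
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]

print(n_classifiers(10))

# Every unordered pair of classes gets its own OvO classifier.
print(list(combinations(["cat", "dog", "fox"], 2)))

# Hypothetical classifier outputs for one input.
print(ovr_predict({"cat": 0.2, "dog": 0.7, "fox": 0.4}))
print(ovo_predict({("cat", "dog"): "dog", ("cat", "fox"): "fox", ("dog", "fox"): "dog"}))
```

Note that OvR probabilities come from independent classifiers and need not sum to 1, and that OvO votes can tie, in which case a real implementation needs an explicit tie-break rule.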
You are training a logistic regression to recognize 10 types of animals in photos. Using OvO (One-vs-One), how many binary classifiers need to be trained?
Softmax: probabilities for N classes
OvR and OvO are "wrappers" around binary classifiers. But there is an elegant solution that **directly** models the probabilities of N classes simultaneously - the **softmax** function. Instead of N separate sigmoids, softmax takes N numbers z_1, ..., z_N (one logit for each class) and turns them into a **probability distribution**: softmax(z)_i = e^(z_i) / (e^(z_1) + ... + e^(z_N)). All values are between 0 and 1 and sum to exactly 1.0.
Softmax is a **generalization of sigmoid** to N > 2 classes. For two classes softmax is mathematically equivalent to sigmoid: with N=2, the probability of the first class simplifies to sigma(z) = 1 / (1 + e^(-z)), where z = z_1 - z_2 is the difference between the two logits. This is not a coincidence - sigmoid is a special case of softmax. That is why sklearn's multinomial mode (multi_class='multinomial' in versions that expose this parameter) applies softmax instead of N separate sigmoids.
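The two-class equivalence is easy to check numerically. This is a bare-bones sketch with hypothetical logits; it omits overflow protection, so it is only suitable for moderate values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Plain softmax: exponentiate each logit and normalize by the sum."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical logits, one per class.
z1, z2 = 2.0, 0.5
p_softmax = softmax([z1, z2])[0]  # probability of the first class
p_sigmoid = sigmoid(z1 - z2)      # sigmoid of the logit difference
print(p_softmax, p_sigmoid)
```

Both numbers agree up to floating-point rounding, because dividing the numerator and denominator of the two-class softmax by e^(z1) leaves 1 / (1 + e^(-(z1 - z2))).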
**Temperature scaling in practice.** The temperature T divides the logits before softmax is applied: softmax(z / T).
- **T = 1.0** - standard softmax, used during training
- **T < 1.0** - makes the distribution sharper: the model more confidently picks one class. Used at inference when more deterministic output is wanted.
- **T > 1.0** - makes the distribution "softer": probabilities closer to uniform. Used in **knowledge distillation** (transferring knowledge from a large model to a small one) and to make text generation in LLMs more varied.

When you set "temperature = 0.2" for an LLM such as ChatGPT, you are changing T in the softmax over next-token scores, making the model more predictable.
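A small sketch makes the effect of T visible. It assumes temperature is applied by dividing the logits by T before softmax; the logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T. Smaller T sharpens, larger T flattens the distribution."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    # Subtracting the maximum does not change the result and avoids overflow.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes.
logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 5.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

The ranking of the classes never changes with T; only the gap between the top probability and the rest does.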
Softmax uses **Cross-Entropy Loss** - a generalization of Binary Cross-Entropy to N classes: Loss = -sum(y_i * log(p_i)) over all classes. Since y is a one-hot vector (only one element equals 1), in practice this reduces to -log(p) for the correct class. If the model predicts the correct class with probability 0.95 - the penalty is small (-log(0.95) = 0.05). With probability 0.01 - the penalty is huge (-log(0.01) = 4.6).
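The two penalties quoted above can be reproduced directly. In this sketch the probability vectors are hypothetical, and the `eps` floor is an added assumption to avoid log(0).

```python
import math

def cross_entropy(y_onehot, probs, eps=1e-12):
    """-sum(y_i * log(p_i)); with a one-hot y this is -log(p) of the correct class."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, probs))

y = [0, 1, 0]  # the correct class is the second one

print(cross_entropy(y, [0.03, 0.95, 0.02]))  # correct class predicted with p = 0.95
print(cross_entropy(y, [0.98, 0.01, 0.01]))  # correct class predicted with p = 0.01
```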
**Numerical instability of softmax.** Computing e^z for large z leads to overflow: in floating point, e^1000 evaluates to infinity (or raises an error). The fix - subtract max(z) from all logits before exponentiation: softmax(z - max(z)) gives the same result without overflow, because the common factor e^(-max(z)) cancels in the ratio. Library implementations (for example in PyTorch, TensorFlow, and SciPy) handle this for you; NumPy has no built-in softmax, so when implementing it by hand you need to keep this in mind.
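The difference between the naive and the shifted computation can be demonstrated with Python's `math` module, which raises `OverflowError` instead of returning infinity. Both function names are illustrative.

```python
import math

def softmax_naive(logits):
    """Direct formula; fails when a logit is large."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    """Same result, but the largest exponent is always e^0 = 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 999.0, 998.0]

try:
    softmax_naive(logits)
except OverflowError as err:
    print("naive softmax failed:", err)

print(softmax_stable(logits))
```

The stable version returns the same probabilities as softmax of [2, 1, 0], since only the differences between logits matter.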
**Misconception:** logistic regression is a type of regression, not classification, because the word "regression" is in the name
**Reality:** despite the name, logistic regression is used as a classification algorithm: it predicts the probability of belonging to a class, and a threshold turns that probability into a class label
The name "regression" is historical: the model fits ("regresses") the parameters of a linear function, and that function models the log-odds of class 1. Sigmoid converts the linear output into a probability, and the threshold converts the probability into a binary decision. The final result is a class, not a continuous quantity, which makes this classification. Similarly, softmax extends the approach to N classes - this is also classification, even though a linear model works inside.
Softmax of logits [3.0, 1.0, 1.0] gives probabilities of approximately [0.79, 0.11, 0.11]. What happens with temperature T = 0.1?
Key ideas
- **Sigmoid** converts the linear combination z = w*x + b into a probability from 0 to 1: sigma(z) = 1 / (1 + e^(-z)), trained via Binary Cross-Entropy
- **Decision boundary** - the line where p = 0.5 (z = 0), separating classes; the threshold can be shifted to balance precision/recall depending on the cost of errors
- **Multi-class classification:** OvR trains N binary classifiers (linear complexity), OvO trains N*(N-1)/2 (quadratic), the choice depends on the task
- **Softmax** generalizes sigmoid to N classes, normalizing logits into a probability distribution summing to 1.0; temperature scaling controls how sharp that distribution is. Together, these ideas are what lets a bank decide to block a transaction and Gmail send an email to spam, as discussed at the start
Related topics
Logistic regression is a bridge between linear methods and more complex classification algorithms. Here is where to go next:
- Linear Regression — Logistic regression uses the same linear combination w*x + b, but adds sigmoid for classification. Understanding linear regression is the foundation for understanding logistic regression.
- Model Evaluation Metrics — Precision, recall, F1-score, ROC-AUC - metrics without which it is impossible to choose the right threshold for the decision boundary and evaluate classifier quality
- Decision Trees — Unlike logistic regression, trees create nonlinear decision boundaries, dividing the space into rectangular regions - the next step in classification
- SVM (Support Vector Machines) — SVM also builds a linear boundary, but maximizes the margin between classes, and with the kernel trick solves nonlinear tasks that logistic regression cannot handle
Questions for reflection
- Logistic regression creates a linear decision boundary. Give an example of a real-world task where a linear boundary would work well, and one where it fundamentally fails.
- In medical screening we lower the threshold to 0.3 to increase recall. But what if there are too many false positives - overloaded doctors start ignoring warnings? Where is the balance between safety and practicality?
- Softmax with temperature = 0 would give probability 1.0 for the class with the highest logit and 0.0 for the rest (argmax). In which tasks is this useful, and in which is it dangerous to lose information about the model's uncertainty?
Related lessons
- ml-06-linear-regression — Logistic regression is a linear model with a non-linear activation
- ml-05-evaluation — Precision/recall/AUC are the key metrics for binary classification
- ml-09-gradient-descent — Cross-entropy is minimized via gradient descent
- stat-38-logistic-regression
- prob-11-normal