Machine Learning

Linear Regression

When an insurance company estimates a policy's value, when an economist predicts GDP growth, when a data scientist forecasts next quarter's revenue - behind all of this is the same algorithm, now more than 200 years old. Carl Friedrich Gauss, who applied it to asteroid orbits, published it in 1809. Today it is often the first algorithm taught in an ML course. One formula, one straight line - and high-stakes decisions ride on it. The question: how does a line get drawn through a cloud of points to minimize error?

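  • **Home valuation**: automated estimators such as the Zillow Zestimate, which covers 100+ million homes in the US, predict price from features like area, neighborhood, schools and crime. Linear regression is the classic baseline for this task, although production systems typically layer more complex models on top - and real transactions worth trillions depend on such estimates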
  • **Pharmaceutics** uses linear regression to determine the dependence of drug dosage on patient parameters (weight, age, kidney function) - an error in prediction can cost a life
  • **Climatology** models the relationship between CO2 concentration and average planetary temperature with linear regression - this model was the basis for the first climate change warnings in the 1960s

Prerequisites

  • Math for ML

Least squares and the name "regression"

Linear regression has two separate origins. The method of least squares first appeared in print in 1805, when Adrien-Marie Legendre published it as a way to fit orbits to astronomical observations. Carl Friedrich Gauss claimed he had been using it since 1795 and published his own version in 1809, sparking a long priority dispute that historians still debate. The word "regression" came much later, from Francis Galton in 1886. Studying heredity, he noticed that the sons of very tall fathers tended to be tall but closer to average, a pull he called "regression towards mediocrity" and that is now known as regression to the mean. The statistical line he drew to describe it kept the name, which is why a method about fitting lines is called regression at all.

Hypothesis: y = wx + b

A real estate agent wants to predict apartment prices based on area. The data: 30 sq m - $3M, 50 sq m - $5M, 80 sq m - $7.5M. Plotted on a graph, these points fall **roughly along a straight line**. Linear regression is the algorithm that finds this line automatically.

Mathematically, a straight line is described by the equation **y = wx + b**, where **w** (weight) is the slope and **b** (bias) is the y-intercept. In ML this equation is called the **hypothesis** - our assumption about how input x relates to output y. The goal of training: find w and b such that the line passes as close as possible to all data points.

When there are multiple features (area, floor, distance to subway), the equation expands: **y = w1*x1 + w2*x2 + ... + wn*xn + b**. Each feature gets its own weight wi, determining its contribution to the prediction. In matrix notation this is compact: **y = X * W + b**, where X is the data matrix and W is the weight vector.
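The matrix form of the hypothesis can be sketched in a few lines of NumPy. The feature values, weights, and bias below are made up purely for illustration; only the formula y = X * W + b comes from the text.

```python
import numpy as np

# Hypothetical data: 3 apartments, 2 features each (area in sq m, floor).
X = np.array([
    [30.0, 2.0],
    [50.0, 5.0],
    [80.0, 9.0],
])

# Hypothetical parameters, chosen by hand rather than learned.
W = np.array([0.09, 0.05])  # one weight per feature
b = 0.3                     # bias

def predict(X, W, b):
    """Hypothesis y = X * W + b: one prediction per row of X."""
    return X @ W + b

y_pred = predict(X, W, b)

# The same prediction for the first apartment, written out term by term:
# w1*x1 + w2*x2 + b
first_by_hand = W[0] * X[0, 0] + W[1] * X[0, 1] + b
```

The matrix product computes every row's weighted sum at once, which is why the matrix notation is preferred once there is more than one feature.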

**Why is it called 'linear' regression?** Because the prediction is a *linear combination* of input features: each xi is multiplied by its weight wi and summed. Graphically in 2D this is a straight line, in 3D - a plane, in higher dimensions - a hyperplane. The model cannot learn curved relationships (parabola, sine) - polynomial regression is required for that.

In the hypothesis equation y = wx + b, what happens to the prediction if weight w is increased while b is fixed?

Cost function: MSE

The goal is to find the 'best' w and b. But how can one formally define that one line is better than another? An **error metric** is needed - a number that shows how far our predictions are from the actual values. The smaller this number, the better the model. Such a metric is called the **cost function** or **loss function**.

The most common cost function for regression is **Mean Squared Error (MSE)**: the average of squared deviations of predictions from true values. Formula: **J(w, b) = (1/n) * sum((y_pred_i - y_actual_i)^2)**, where n is the number of data points, y_pred is the model's prediction, y_actual is the real value.
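As a small illustration, the MSE formula can be used to compare two candidate lines on the three apartments from the example above (prices in $M). The two (w, b) pairs are arbitrary guesses, not fitted values.

```python
import numpy as np

def mse(y_pred, y_actual):
    """Mean Squared Error: average of squared prediction errors."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_actual = np.asarray(y_actual, dtype=float)
    return np.mean((y_pred - y_actual) ** 2)

# Apartment data from the text: area in sq m, price in $M.
area = np.array([30.0, 50.0, 80.0])
price = np.array([3.0, 5.0, 7.5])

# Two hand-picked candidate lines (assumed values for illustration).
loss_a = mse(0.10 * area + 0.00, price)   # w = 0.10, b = 0.00
loss_b = mse(0.09 * area + 0.35, price)   # w = 0.09, b = 0.35

# The line with the smaller cost fits these three points better.
better = "a" if loss_a < loss_b else "b"
```

Line a passes exactly through the first two points but misses the third by 0.5, while line b misses all three by a little; MSE prefers line b because the single larger miss is squared.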

**Why use the squared error, not the absolute value?** Three reasons:

  1. **Penalty for large errors** - an error of 10 contributes 100 to the sum, not 10, because the square grows faster than the error itself. Big misses count for more than small ones.
  2. **Never negative** - the square of any number is >= 0, so positive and negative errors don't cancel each other out.
  3. **Differentiable** - the derivative of x^2 is 2x, a smooth function. The absolute value |x| has a kink at zero, which makes optimization harder.

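For linear regression, MSE is a **convex** function of w and b: plotted against the parameters, it looks like a bowl. This is the key property of the model. Unlike neural networks, where the cost function can have many local minima, here every minimum is the global one, and it is **exactly one** point as long as the features are not linearly dependent. Because the derivative of this bowl is linear in the parameters, the minimum can be found with an exact analytical solution rather than by tuning parameters iteratively.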

**MSE is sensitive to outliers!** If one apartment of 30 sq m costs $30M (a data error or a penthouse with a view), the squared error will be enormous and 'pull' the line toward it. For data with outliers, use MAE (Mean Absolute Error) or Huber loss, which penalize large errors more gently.
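A quick way to see this effect is to fit a least-squares line with and without a single outlier. The sketch below uses made-up prices that lie exactly on y = 0.1 * x, then adds the $30M, 30 sq m apartment from the example. It only demonstrates the sensitivity of the squared-error fit; it does not implement MAE or Huber loss.

```python
import numpy as np

# Synthetic clean data: price ($M) is exactly 0.1 * area (sq m).
area = np.array([30.0, 40.0, 50.0, 60.0, 70.0, 80.0])
price = 0.1 * area

# Least-squares line on clean data; np.polyfit returns [slope, intercept].
w_clean, b_clean = np.polyfit(area, price, 1)

# Add one outlier: a 30 sq m apartment listed at $30M.
area_out = np.append(area, 30.0)
price_out = np.append(price, 30.0)
w_out, b_out = np.polyfit(area_out, price_out, 1)

# Prediction for an ordinary 30 sq m apartment under each fit.
pred_clean = w_clean * 30.0 + b_clean
pred_out = w_out * 30.0 + b_out
```

With these numbers a single bad point out of seven is enough to flip the sign of the slope, so the fitted line claims that larger apartments are cheaper.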

Why does MSE use the squared difference, not just the difference (y_pred - y_actual)?

Normal Equation: the analytical solution

We've defined the MSE cost function and know we need to find its minimum. From calculus: **the minimum of a smooth convex function is where the derivative equals zero**. For the MSE of linear regression, this condition can be solved analytically - giving a formula that yields the optimal weights W in **one step**, without iterations: **W = (X^T * X)^(-1) * X^T * y**. This formula is called the **Normal Equation**.
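The step from "derivative equals zero" to the formula can be written out as a short derivation. This is a sketch under one assumption: the bias b is folded into the weight vector W by giving X an extra column of ones, so the prediction is simply XW.

```latex
\begin{aligned}
J(W) &= \frac{1}{n}\,\lVert XW - y \rVert^2 \\
\nabla_W J(W) &= \frac{2}{n}\, X^\top (XW - y) = 0 \\
X^\top X\, W &= X^\top y \\
W &= (X^\top X)^{-1} X^\top y
\end{aligned}
```

The last step requires that X^T X be invertible.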

The beauty of this formula is that it gives an **exact** solution in a single operation. No need to tune a learning rate, no need to wait for convergence, no hyperparameters. Plug in data - get optimal weights. That's why linear regression is the algorithm that starts ML education: it has a **closed-form analytical solution**, unlike most other models.

**Why add a column of ones?** To include the bias b in the matrix multiplication. Without the ones column: y = w*x. With the ones column: y = [x, 1] * [w, b]^T = w*x + b*1 = w*x + b. This is a standard trick that consolidates all parameters (weights and bias) into a single vector W.
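Putting the ones-column trick and the Normal Equation together on the three apartments from the earlier example gives a minimal sketch like the one below. It uses `np.linalg.solve` on the system (X^T X) W = X^T y instead of forming the inverse explicitly; that is a numerical-stability choice made here, not something the formula requires.

```python
import numpy as np

# Apartment data from the text: area in sq m, price in $M.
area = np.array([30.0, 50.0, 80.0])
price = np.array([3.0, 5.0, 7.5])

# Design matrix [x, 1]: the ones column carries the bias b.
X = np.column_stack([area, np.ones_like(area)])

# Normal Equation, solved as a linear system (X^T X) W = X^T y.
W = np.linalg.solve(X.T @ X, X.T @ price)
w, b = W  # slope and intercept, in the column order of X

# Predicted price for a hypothetical 60 sq m apartment.
pred_60 = w * 60.0 + b
```

For these three points the result is w = 17/190 (about 0.0895) and b = 15/38 (about 0.395), so the fitted line prices each additional square meter at roughly $0.09M.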

**When the Normal Equation doesn't work:** The matrix X^T * X must be invertible (non-singular). It becomes non-invertible if:

  1. features are **linearly dependent** - e.g., area in sq m and area in sq ft (one = the other * 10.764)
  2. there are **fewer examples than features** (n < m, with n examples and m features)

In practice, use the **pseudoinverse** (np.linalg.pinv), which still returns a least-squares solution (the one with the smallest weights) in these cases.
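The sq m / sq ft case can be reproduced directly. In the sketch below the prices are invented, and the explicit `tol` and `rcond` thresholds are choices made here so that floating-point rounding in the sq ft column cannot mask the dependence.

```python
import numpy as np

# Hypothetical data: area in sq m and price in $M.
area_sqm = np.array([30.0, 50.0, 80.0, 100.0])
price = np.array([3.0, 5.0, 7.5, 9.8])

# Second feature is the same area in sq ft: an exact multiple of the first.
area_sqft = area_sqm * 10.764

# Three columns (sq m, sq ft, ones), but only two independent directions.
X = np.column_stack([area_sqm, area_sqft, np.ones_like(area_sqm)])
rank = np.linalg.matrix_rank(X, tol=1e-8)

# The pseudoinverse still gives a least-squares solution.
W = np.linalg.pinv(X, rcond=1e-10) @ price

# Reference: the same fit using only the sq m column and the ones column.
X_reduced = np.column_stack([area_sqm, np.ones_like(area_sqm)])
W_reduced = np.linalg.pinv(X_reduced) @ price

# The redundant feature changes the weights, not the predictions.
same_predictions = np.allclose(X @ W, X_reduced @ W_reduced)
```

The weights in W are split between the two area columns and are harder to interpret, which is one reason to drop duplicated features before fitting.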

Why isn't the Normal Equation W = (X^T * X)^(-1) * X^T * y used for models with millions of features?

Linear regression implementation

Time to put all the pieces together: from loading data to evaluating the model. In practice, linear regression is usually run through the **scikit-learn** library, which solves the same least-squares problem behind a convenient API. But before using a ready-made tool, it's worth understanding how to evaluate model quality and what to pay attention to.
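A minimal end-to-end sketch with scikit-learn might look like this. The dataset is synthetic (generated from an assumed line with w = 0.09, b = 0.4 plus Gaussian noise), and the 75/25 split and random seeds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: price ($M) depends linearly on area (sq m), plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(25, 120, size=200)
price = 0.09 * area + 0.4 + rng.normal(0, 0.3, size=200)

# scikit-learn expects a 2D feature matrix: one row per example.
X = area.reshape(-1, 1)

# Hold out a test set so quality is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

w = model.coef_[0]      # learned slope
b = model.intercept_    # learned bias (no ones column needed)

test_mse = mean_squared_error(y_test, model.predict(X_test))
test_r2 = model.score(X_test, y_test)  # R^2 on the test set
```

Comparing `model.score` on the training and test sets is a quick first check for overfitting: a large gap between the two is a warning sign.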

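Linear regression makes several **assumptions** about data. The classical ones are a linear relationship between features and target, independent errors, constant error variance, and no exact linear dependence between features. If they're violated, the model may give unreliable predictions even if training metrics look good.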

**Linear regression as a baseline.** In the ML industry there's a rule: always start with a simple model. Linear regression trains in milliseconds, is easy to interpret (each feature's weight is visible), and often gives 'good enough' results. If linear regression achieves R^2 = 0.85, it is worth considering carefully whether a neural network with R^2 = 0.87 that is 1000x slower and more complex is warranted.

**Misconception:** Linear regression works for any data - just draw a line through the points.

**Reality:** Linear regression assumes a linear relationship between features and the target variable. For nonlinear relationships (parabolic, exponential), polynomial regression or other models are needed.

If the relationship between X and y is nonlinear (e.g., price grows exponentially), a straight line systematically errs: it underestimates at the extremes and overestimates in the center. High R^2 on train doesn't guarantee the model correctly describes the real relationship - check the residual plot and assumptions.
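The pattern described here is easy to reproduce numerically. The sketch below fits a straight line to noise-free exponential data (an assumed toy relationship, y = e^x on [0, 2]) and inspects the residuals instead of plotting them.

```python
import numpy as np

# Toy nonlinear relationship: y grows exponentially with x.
x = np.linspace(0.0, 2.0, 41)
y = np.exp(x)

# Least-squares straight line; np.polyfit returns [slope, intercept].
w, b = np.polyfit(x, y, 1)

# Residual = actual - predicted. Positive means the line underestimates.
residuals = y - (w * x + b)

# R^2 computed by hand: 1 - SS_res / SS_tot.
r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Signs at the two ends and in the middle of the range.
left, middle, right = residuals[0], residuals[20], residuals[-1]
```

Here R^2 comes out above 0.9, yet the residuals are positive at both ends and negative in the middle. Residuals from a well-specified model should look like noise scattered around zero, with no such curve.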

A linear regression model shows R^2 = 0.98 on the training set, but R^2 = 0.45 on the test set. What happened?

Key ideas

  • **Hypothesis** of linear regression: y = wx + b - a straight line where w (weight) determines the slope and b (bias) the y-intercept. For multiple features: y = X * W + b
  • **MSE (Mean Squared Error)** measures model quality: the smaller the mean of squared deviations, the better the line fits the data. MSE for linear regression is a convex function with a single minimum
  • **Normal Equation** W = (X^T * X)^(-1) * X^T * y finds optimal weights in one step, but inverting X^T * X has O(m^3) complexity in the number of features m - when there are very many features, gradient descent is used instead
  • **In practice**, linear regression is the baseline model and a natural starting point for an ML project. It's simple, interpretable and often competitive - just as Gauss fitted asteroid orbits with it, modern real estate valuation models build on the same principle of minimizing squared errors

Related topics

Linear regression is the foundation on which more advanced methods are built. Each extension addresses a specific limitation of the basic model:

  • Mathematical foundations of ML — Linear algebra (matrix multiplication, matrix inversion) and derivatives - the mathematical foundation needed to understand the normal equation and MSE
  • Polynomial Regression — Extension of linear regression for nonlinear relationships: we add x^2, x^3 as new features, preserving linearity in parameters
  • Regularization (L1/L2) — Solves the overfitting problem in linear regression: adds a penalty for large weights to the cost function (Ridge, Lasso, ElasticNet)
  • Gradient Descent — Alternative to the normal equation for finding the MSE minimum: an iterative method that works for any number of features and nonlinear models

Questions for reflection

  • If 100 features are added (area, floor, neighborhood, wall color, weekday of the listing) - does prediction become more accurate? When do more features hurt rather than help?
  • Linear regression assumes the relationship between apartment price and area is linear. In which ranges might this be violated and why?
  • Why do practitioners often start with linear regression even when they know the data is nonlinear? What value does a knowingly simplified model provide?

Related lessons

  • ml-05-evaluation — MSE and R^2 are the standard metrics for evaluating regression
  • ml-07-polynomial-regression — Polynomial regression is the non-linear extension of linear regression
  • ml-08-regularization — Regularization (Ridge/Lasso) is applied to linear regression
  • stat-09-regression
  • la-06-gauss