Statistics

Linear Regression

Lesson Objectives

Understand the linear regression model
Find coefficients using ordinary least squares
Interpret R² and evaluate model quality
Make predictions and construct confidence intervals

Prerequisites

Correlation
Maximum likelihood estimation

Given a person's height, can weight be predicted? Given an apartment's area, can price be estimated? Regression builds a "prediction formula": Y = a + bX. This is the foundation of machine learning and data analysis.

Real estate: price from area and neighborhood
Economics: forecasting GDP from indicators
Medicine: drug dosage from patient weight
Marketing: sales from advertising budget
ML: neural networks are generalized regression!

The Linear Regression Model

The relationship between X (predictor) and Y (response):

$Y = \beta_0 + \beta_1 X + \varepsilon$

$\beta_0$ - intercept (expected value of Y when X = 0)
$\beta_1$ - slope (how much Y changes per unit increase in X)
$\varepsilon \sim N(0, \sigma^2)$ - random error

The assumption is that the true relationship is linear, and deviations from the line are random noise.
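To make the model concrete, here is a minimal simulation sketch. The "true" values of β₀, β₁ and σ, the sample size, and the range of X are arbitrary assumptions chosen for illustration. The data are generated from a line plus normal noise, and the line is then recovered with the least-squares estimates covered in the OLS section.

```python
import random

# Hypothetical "true" parameters, chosen only for illustration.
BETA0, BETA1, SIGMA = 5.0, 2.0, 1.0

rng = random.Random(0)  # fixed seed so the run is repeatable

# Draw X uniformly on [0, 10] and generate Y = beta0 + beta1 * X + noise.
xs = [rng.uniform(0.0, 10.0) for _ in range(1000)]
ys = [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA) for x in xs]

# Recover the line with the closed-form least-squares estimates.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

print(f"estimated intercept: {b0:.3f} (true {BETA0})")
print(f"estimated slope:     {b1:.3f} (true {BETA1})")
```

The estimates should land close to, but not exactly on, the true values: the gap is the effect of the random error ε.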

What are the key assumptions of classical linear regression Y = β₀ + β₁X + ε?

Classical assumptions: (1) linearity, E[Y|X] = β₀ + β₁X; (2) independent (at minimum uncorrelated) errors ε_i; (3) homoscedasticity, Var(ε|X) = σ², the same for all X; (4) exogeneity, E[ε|X] = 0; and, for exact t-tests in small samples, (5) normality of ε. Under (1)-(4) OLS is BLUE (Best Linear Unbiased Estimator) by the Gauss-Markov theorem; normality is not needed for that result.

Ordinary Least Squares (OLS)

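Find the line that minimizes the sum of squared errors:

$SSE(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 \to \min$

Solution:

$\hat{\beta}_1 = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2}$, $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$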

Apartment Area and Price

Simple regression

Data (sq m, $thousands): {(30, 300), (40, 450), (50, 500), (60, 650), (70, 700)}

$\bar{X} = 50$, $\bar{Y} = 520$
$\sum(X_i - 50)(Y_i - 520) = 4400 + 700 + 0 + 1300 + 3600 = 10000$
$\sum(X_i - 50)^2 = 400 + 100 + 0 + 100 + 400 = 1000$
$\hat{\beta}_1 = 10000/1000 = 10$ (thousand dollars per m²)
$\hat{\beta}_0 = 520 - 10 \cdot 50 = 20$

**Model:** $\hat{Y} = 20 + 10X$

Every additional square meter adds $10,000 to the price.
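As a check on the hand arithmetic, the sums can be recomputed directly from the five data points. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
# Apartment data from the example: area in sq m, price in $thousands.
areas = [30, 40, 50, 60, 70]
prices = [300, 450, 500, 650, 700]

n = len(areas)
x_bar = sum(areas) / n
y_bar = sum(prices) / n

# Cross-deviation sum and squared-deviation sum used by the OLS formulas.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(areas, prices))
sxx = sum((x - x_bar) ** 2 for x in areas)

slope = sxy / sxx
intercept = y_bar - slope * x_bar

print(x_bar, y_bar)      # 50.0 520.0
print(sxy, sxx)          # 10000.0 1000.0
print(slope, intercept)  # 10.0 20.0
```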

In the regression Y = 10 + 2X, what does the coefficient 2 represent?

The slope β₁ = 2 means that for each unit increase in X, the expected value of Y increases by 2 units.

Coefficient of Determination R²

**R²** - the proportion of Y's variance explained by the model:

$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}$

$R^2 = 0$: the model explains none of Y's variation (horizontal line)
$R^2 = 1$: the model perfectly explains Y (all points on the line)
$R^2 = r^2$ for simple regression!

Interpreting R²

R² = 0.81

The model explains 81% of the variation in Y. The remaining 19% reflects the influence of other factors and randomness. In the apartment example: area explains most of the price, but not all (neighborhood, renovation condition...).

What does R² = 0.65 mean in linear regression?

R² = 0.65 means the model explains 65% of the variance of Y around its mean. Formally, R² = 1 - RSS/TSS = SSReg/TSS, where TSS = Σ(Y_i - Ȳ)² and RSS = Σ(Y_i - Ŷ_i)². For simple regression R² = r² (squared correlation). R² is not a 'goodness' criterion on its own: adding predictors, even irrelevant ones, never lowers R², so a high value can come from overfitting, while a low R² can still be meaningful in noisy domains (psychology, biomedicine). Adjusted R² penalises for the number of predictors.
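The definitions of RSS and TSS can be applied to the five apartment data points from the earlier example. The sketch below refits the line and computes R² both ways, as 1 - RSS/TSS and as the squared correlation, to illustrate that they agree in simple regression.

```python
import math

# Apartment data: area in sq m, price in $thousands.
areas = [30, 40, 50, 60, 70]
prices = [300, 450, 500, 650, 700]
n = len(areas)

x_bar = sum(areas) / n
y_bar = sum(prices) / n
sxx = sum((x - x_bar) ** 2 for x in areas)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(areas, prices))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual and total sums of squares.
fitted = [intercept + slope * x for x in areas]
rss = sum((y - f) ** 2 for y, f in zip(prices, fitted))
tss = sum((y - y_bar) ** 2 for y in prices)
r_squared = 1 - rss / tss

# Pearson correlation between area and price.
r = sxy / math.sqrt(sxx * tss)

print(rss, tss)                               # 3000.0 103000.0
print(round(r_squared, 4), round(r * r, 4))   # 0.9709 0.9709
```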

Significance Testing

H₀: $\beta_1 = 0$ (X has no linear effect on Y)

$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$

where $SE(\hat{\beta}_1) = \frac{S}{\sqrt{\sum(X_i - \bar{X})^2}}$ and $S = \sqrt{RSS/(n-2)}$ is the residual standard error.

In simple regression, the t-test for β₁ is equivalent to the significance test for correlation!

How is the significance of coefficient β₁ in linear regression tested?

Under H_0: β₁ = 0 (no association) and standard assumptions, t = β̂₁ / SE(β̂₁) ~ t_{n-2}, where SE(β̂₁) = σ̂ / √Σ(X_i - X̄)² and σ̂ is the residual standard error. If |t| exceeds the critical value t_{α/2, n-2}, the estimated coefficient differs significantly from zero at level α. The F-test checks joint significance of all predictors; for a single predictor it is equivalent to the t-test (F = t²).
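The test can be carried out step by step on the five apartment data points. The sketch below refits the line, estimates σ̂ from the residuals, and forms the t statistic. The critical value 3.182 (two-sided, α = 0.05, 3 degrees of freedom) is taken from a standard t table rather than computed, since the standard library has no t distribution.

```python
import math

# Apartment data: area in sq m, price in $thousands.
areas = [30, 40, 50, 60, 70]
prices = [300, 450, 500, 650, 700]
n = len(areas)

x_bar = sum(areas) / n
y_bar = sum(prices) / n
sxx = sum((x - x_bar) ** 2 for x in areas)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(areas, prices))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual standard error with n - 2 degrees of freedom.
rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(areas, prices))
sigma_hat = math.sqrt(rss / (n - 2))

se_slope = sigma_hat / math.sqrt(sxx)
t_stat = slope / se_slope

T_CRIT = 3.182  # table value: t_{0.025, 3}

print(round(se_slope, 3), round(t_stat, 3))  # 1.0 10.0
print(abs(t_stat) > T_CRIT)                  # True
```

With only five observations the test has 3 degrees of freedom, so the critical value is much larger than the familiar 1.96.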

Prediction

For a new value X₀, the predicted response is:

$\hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 X_0$

**Beware of extrapolation!** The model is reliable only within the range of the original X values. Predictions far beyond the data range are risky.

Predicting the Price

Apartment of 55 sq m

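$\hat{Y} = 20 + 10 \cdot 55 = 570$, about 570 thousand dollars, using the least-squares fit to the five apartments ($\hat{\beta}_1 = 10000/1000 = 10$, $\hat{\beta}_0 = 520 - 10 \cdot 50 = 20$).
Confidence interval for the mean price of 55 sq m apartments: narrower.
Prediction interval for a specific apartment: wider (it also includes the variance σ² of a single observation).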

How does a confidence interval for E[Y|X*] differ from a prediction interval for a new Y* at a given X*?

The standard error for E[Y|X*] is σ̂·√(1/n + (X*-X̄)²/Σ(X_i-X̄)²). For a new observation Y* an extra 1 goes under the root, σ̂·√(1 + 1/n + (X*-X̄)²/Σ(X_i-X̄)²), because the variance σ² of a single observation is added. At X* = X̄ the prediction interval is therefore wider by a factor of √(n+1) ≈ √n. Confusing these two intervals is a common mistake: one says 'where the mean is', the other 'where a new point will lie'.
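Both standard errors can be evaluated for a 55 sq m apartment using the five apartment data points. This is a sketch under the same assumptions as the formulas above. The critical value 3.182 (t with 3 degrees of freedom, two-sided 5%) is a table value, not computed here.

```python
import math

# Apartment data: area in sq m, price in $thousands.
areas = [30, 40, 50, 60, 70]
prices = [300, 450, 500, 650, 700]
n = len(areas)

x_bar = sum(areas) / n
y_bar = sum(prices) / n
sxx = sum((x - x_bar) ** 2 for x in areas)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(areas, prices))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(areas, prices))
sigma_hat = math.sqrt(rss / (n - 2))

x_star = 55
y_hat = intercept + slope * x_star

# Term shared by both standard errors: 1/n + (x* - x_bar)^2 / Sxx.
leverage = 1 / n + (x_star - x_bar) ** 2 / sxx
se_mean = sigma_hat * math.sqrt(leverage)      # for E[Y | X*]
se_pred = sigma_hat * math.sqrt(1 + leverage)  # for a new Y*

T_CRIT = 3.182  # table value: t_{0.025, 3}

conf_int = (y_hat - T_CRIT * se_mean, y_hat + T_CRIT * se_mean)
pred_int = (y_hat - T_CRIT * se_pred, y_hat + T_CRIT * se_pred)

print(f"prediction: {y_hat:.1f}")                                           # 570.0
print(f"95% CI for the mean: ({conf_int[0]:.1f}, {conf_int[1]:.1f})")       # (522.3, 617.7)
print(f"95% PI for one apartment: ({pred_int[0]:.1f}, {pred_int[1]:.1f})")  # (458.6, 681.4)
```

The prediction interval is more than twice as wide here because the single-observation variance dominates the uncertainty in the fitted mean.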

Practice

Data from 10 stores yield the regression: Sales = 50 + 8×Advertising ($thousands). R² = 0.64. What are expected sales when advertising spend is $20,000? What does R² mean here?

Sales = 50 + 8×20 = 210, i.e. $210,000.
R² = 0.64: advertising explains 64% of the variation in sales across stores. The remaining 36% reflects other factors (location, assortment...).

In a regression with n = 30, R² = 0.25 is obtained. Is the association significant (α = 0.05)?

$|r| = \sqrt{0.25} = 0.5$
$t = |r| \sqrt{\frac{n-2}{1-r^2}} = 0.5 \sqrt{\frac{28}{0.75}} = 0.5 \cdot 6.11 = 3.06$
$t_{0.025, 28} \approx 2.05$
$3.06 > 2.05$ → the association is significant!
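The same calculation as a small reusable function. This is a sketch: `t_from_r` is an illustrative name, and the critical value 2.048 for 28 degrees of freedom is taken from a t table.

```python
import math


def t_from_r(r: float, n: int) -> float:
    """t statistic for H0: no linear association, given correlation r and sample size n."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


r = math.sqrt(0.25)  # |r| recovered from R^2 = 0.25
t_stat = t_from_r(r, 30)

T_CRIT = 2.048  # table value: t_{0.025, 28}

print(round(t_stat, 3))      # 3.055
print(abs(t_stat) > T_CRIT)  # True
```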

Which residual diagnostic is most important for checking homoscedasticity?

The residuals-versus-fitted plot: a funnel shape there signals heteroscedasticity, which invalidates the usual SE(β̂) and t-tests. Formal tests: Breusch-Pagan, White. Fixes: Huber-White robust SE, WLS (weighted least squares), transformations of Y (log, √). A QQ-plot diagnoses normality of residuals (matters for t/F tests at small n). Cook's distance flags influential observations.
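A crude numeric stand-in for the residuals-versus-fitted plot is the correlation between the absolute residuals and the fitted values. This is not the Breusch-Pagan or White test, only an illustration of what a "funnel" means. The data below are synthetic and deterministic, an assumption made so the example is repeatable: the noise term grows in proportion to x.

```python
import math


def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def pearson(us, vs):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    suu = sum((u - u_bar) ** 2 for u in us)
    svv = sum((v - v_bar) ** 2 for v in vs)
    return suv / math.sqrt(suu * svv)


# Synthetic data: the disturbance 0.5 * x * (+1 or -1) has a spread that
# grows with x, so the error variance is not constant.
xs = list(range(1, 21))
ys = [2 + 3 * x + 0.5 * x * (-1) ** x for x in xs]

b0, b1 = ols(xs, ys)
fitted = [b0 + b1 * x for x in xs]
abs_resid = [abs(y - f) for y, f in zip(ys, fitted)]

funnel = pearson(fitted, abs_resid)
print(f"corr(|residual|, fitted) = {funnel:.2f}")
```

A value near zero is what constant variance would produce; a clearly positive value means the residual spread widens as the fitted value grows.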

Regression - The Foundation of ML

From a simple line to neural networks.

Multiple Regression — Many predictors X₁, X₂, ...
Logistic Regression — For classification (0/1)
Neural Networks — Many layers of nonlinear regression
Gradient Descent — Numerical minimization of SSE
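To illustrate the last item, here is a minimal gradient-descent sketch that minimizes the mean squared error (SSE divided by n, which has the same minimizer) on the five apartment data points. Standardizing X, the step size 0.1, and the 500 iterations are choices made for this sketch so that the iteration converges quickly; they are not part of the lesson.

```python
import math

# Apartment data: area in sq m, price in $thousands.
areas = [30, 40, 50, 60, 70]
prices = [300, 450, 500, 650, 700]
n = len(areas)

# Standardize X so both parameters see the same curvature.
x_bar = sum(areas) / n
x_std = math.sqrt(sum((x - x_bar) ** 2 for x in areas) / n)
zs = [(x - x_bar) / x_std for x in areas]

a, b = 0.0, 0.0      # parameters of the line y = a + b * z
learning_rate = 0.1  # hypothetical step size
for _ in range(500):
    residuals = [y - (a + b * z) for y, z in zip(prices, zs)]
    # Gradient of the mean squared error with respect to a and b.
    grad_a = -2.0 * sum(residuals) / n
    grad_b = -2.0 * sum(r * z for r, z in zip(residuals, zs)) / n
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

# Convert back to the original units of X.
slope = b / x_std
intercept = a - slope * x_bar

print(round(slope, 6), round(intercept, 6))  # 10.0 20.0
```

The iteration arrives at the same line as the closed-form OLS solution. For models with no closed form, such as neural networks, this kind of numerical minimization is the only practical route.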

Summary

**Model:** $Y = \beta_0 + \beta_1 X + \varepsilon$
**OLS:** minimize $\sum(Y - \hat{Y})^2$
**Formulas:** $\hat{\beta}_1 = r \cdot S_Y/S_X$, $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
**R²:** proportion of explained variance, $R^2 = r^2$
Regression is the foundation of ML!

Questions for Reflection

How does regression differ from correlation?
Why is it dangerous to extrapolate far beyond the data range?
How is regression related to neural networks?
