Statistics
Linear Regression
Lesson Objectives
- Understand the linear regression model
- Find coefficients using ordinary least squares
- Interpret R² and evaluate model quality
- Make predictions and construct confidence intervals
Prerequisites
- Correlation
- Maximum likelihood estimation
Given a person's height, can weight be predicted? Given an apartment's area, can price be estimated? Regression builds a "prediction formula": Y = a + bX. This is the foundation of machine learning and data analysis.
- Real estate: price from area and neighborhood
- Economics: forecasting GDP from indicators
- Medicine: drug dosage from patient weight
- Marketing: sales from advertising budget
- ML: neural networks are generalized regression!
The Linear Regression Model
The relationship between X (predictor) and Y (response):
$Y = \beta_0 + \beta_1 X + \varepsilon$
- $\beta_0$ - intercept (value of Y when X = 0)
- $\beta_1$ - slope (how much Y changes per unit increase in X)
- $\varepsilon \sim N(0, \sigma^2)$ - random error
The assumption is that the true relationship is linear, and deviations from the line are random noise.
What are the key assumptions of classical linear regression Y = β₀ + β₁X + ε?
Gauss-Markov assumptions: (1) linearity E[Y|X] = β₀ + β₁X; (2) independence of the errors ε_i; (3) homoscedasticity Var(ε|X) = σ² (same for all X); (4) exogeneity E[ε|X] = 0; for t-tests also (5) normality of ε. Under assumptions (1)-(4), OLS is BLUE (Best Linear Unbiased Estimator) by the Gauss-Markov theorem.
Ordinary Least Squares (OLS)
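Find the line that minimizes the sum of squared errors:
$SSE(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 \to \min$
Solution:
$\hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$, $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$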
Apartment Area and Price
Simple regression
Data (sq m, $ thousands): {(30, 300), (40, 450), (50, 500), (60, 650), (70, 700)}
$\bar{X} = 50$, $\bar{Y} = 520$
$\sum(X_i - 50)(Y_i - 520) = 4400 + 700 + 0 + 1300 + 3600 = 10000$
$\sum(X_i - 50)^2 = 1000$
$\hat{\beta}_1 = 10000/1000 = 10$ (thousand $ per m²)
$\hat{\beta}_0 = 520 - 10 \cdot 50 = 20$
**Model:** $\hat{Y} = 20 + 10X$
Every additional square meter adds $10,000 to the price.
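The sums in this example are easy to get wrong by hand, so it can help to recompute them. The following is a minimal Python sketch, not part of the original lesson; the helper name `fit_ols` is an illustrative assumption. It applies the closed-form OLS formulas to the five apartments.

```python
# Apartment data from the example: area in sq m, price in $ thousands.
areas = [30, 40, 50, 60, 70]
prices = [300, 450, 500, 650, 700]


def fit_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_ols(areas, prices)
print(f"intercept = {intercept:.1f}, slope = {slope:.1f}")
```

Both coefficients are in $ thousands, so the slope reads as thousands of dollars per additional square meter.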
In the regression Y = 10 + 2X, what does the coefficient 2 represent?
The slope β₁ = 2 means that for each unit increase in X, the expected value of Y increases by 2 units.
Coefficient of Determination R²
**R²** - the proportion of Y's variance explained by the model:
$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}$
- $R^2 = 0$: the model explains none of Y's variation (horizontal line)
- $R^2 = 1$: the model perfectly explains Y (all points on the line)
- $R^2 = r^2$ for simple regression!
Interpreting R²
R² = 0.81
The model explains 81% of the variation in Y. The remaining 19% reflects the influence of other factors and randomness. In the apartment example: area explains most of the price, but not all (neighborhood, renovation condition...).
What does R² = 0.65 mean in linear regression?
R² = 0.65 means the model explains 65% of the variance in Y. R² = 1 - RSS/TSS = SSReg/TSS, where TSS = Σ(Y_i - Ȳ)², RSS = Σ(Y_i - Ŷ_i)². For simple regression R² = r² (squared correlation). R² is not a 'goodness' criterion: a high R² can come from overfitting with irrelevant predictors, and a low R² can still be meaningful in noisy domains (psychology, biomedicine). Adjusted R² penalises for the number of predictors.
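As an illustration of the RSS/TSS definition, here is a short Python sketch (an addition, using the apartment data from the earlier example) that computes R² from the residuals and, separately, the squared Pearson correlation. Variable names are illustrative.

```python
import math

# Apartment data: area in sq m, price in $ thousands.
areas = [30, 40, 50, 60, 70]
prices = [300, 450, 500, 650, 700]

n = len(areas)
x_bar = sum(areas) / n
y_bar = sum(prices) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(areas, prices))
s_xx = sum((x - x_bar) ** 2 for x in areas)

# Least-squares fit.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in areas]

# R^2 from the residual and total sums of squares.
rss = sum((y - f) ** 2 for y, f in zip(prices, fitted))
tss = sum((y - y_bar) ** 2 for y in prices)
r_squared = 1 - rss / tss

# Pearson correlation, computed independently of the fit.
r = s_xy / math.sqrt(s_xx * tss)

print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

The two printed numbers should agree, which is the simple-regression identity R² = r².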
Significance Testing
H₀: $\beta_1 = 0$ (X has no effect on Y)
$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$
Where $SE(\hat{\beta}_1) = \frac{S}{\sqrt{\sum(X_i - \bar{X})^2}}$ and $S$ is the residual standard error.
In simple regression, the t-test for β₁ is equivalent to the significance test for correlation!
How is the significance of coefficient β₁ in linear regression tested?
Under H_0: β₁ = 0 (no association) and standard assumptions, t = β̂₁ / SE(β̂₁) ~ t_{n-2}. SE(β̂₁) = σ̂ / √Σ(X_i - X̄)². Large |t| means the estimated coefficient differs significantly from zero. The F-test checks joint significance of all predictors; for a single predictor it is equivalent to the t-test.
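The test can be traced step by step in code. The sketch below is an illustrative addition that computes the residual standard error, SE(β̂₁) and the t statistic for the apartment data. To stay within the standard library, the critical value is hard-coded from a t table (two-sided α = 0.05, 3 degrees of freedom); that constant is an assumption of the sketch rather than something computed here.

```python
import math

# Apartment data: area in sq m, price in $ thousands.
areas = [30, 40, 50, 60, 70]
prices = [300, 450, 500, 650, 700]

n = len(areas)
x_bar = sum(areas) / n
y_bar = sum(prices) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(areas, prices))
s_xx = sum((x - x_bar) ** 2 for x in areas)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual standard error S, with n - 2 degrees of freedom.
rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(areas, prices))
s = math.sqrt(rss / (n - 2))

# Standard error of the slope and the t statistic for H0: beta_1 = 0.
se_slope = s / math.sqrt(s_xx)
t_stat = slope / se_slope

# Assumption: two-sided 5% critical value for df = n - 2 = 3, from a t table.
T_CRIT_DF3 = 3.182

print(f"SE(slope) = {se_slope:.3f}, t = {t_stat:.2f}")
print("reject H0" if abs(t_stat) > T_CRIT_DF3 else "fail to reject H0")
```

With other data the degrees of freedom change, so the hard-coded critical value would need to be replaced by the table value for n - 2.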
Prediction
For a new value X₀, the predicted response is:
$\hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 X_0$
**Beware of extrapolation!** The model is reliable only within the range of the original X values. Predictions far beyond the data range are risky.
Predicting the Price
Apartment of 55 sq m
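$\hat{Y} = 20 + 10 \cdot 55 = 570$ thousand dollars, i.e. $570,000 (using the least-squares fit to the five apartments: $\hat{\beta}_1 = 10000/1000 = 10$, $\hat{\beta}_0 = 520 - 10 \cdot 50 = 20$).
Confidence interval for the mean: narrower.
Prediction interval for a specific apartment: wider (it also includes the variance σ² of a single observation).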
How does a confidence interval for E[Y|X*] differ from a prediction interval for a new Y* at a given X*?
SE for E[Y|X*]: σ̂·√(1/n + (X*-X̄)²/Σ(X_i-X̄)²). SE for a new Y* adds 1 under the square root, σ̂·√(1 + 1/n + (X*-X̄)²/Σ(X_i-X̄)²), because it includes the σ² of a single observation. At X* = X̄ the prediction interval is wider by a factor of √(n+1) ≈ √n. Confusing these two intervals is a common mistake: one says 'where the mean is', the other 'where a new point will lie'.
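To make the difference concrete, the following Python sketch (an illustrative addition) computes both standard errors and both 95% intervals for a 55 sq m apartment using the apartment data. The t critical value for 3 degrees of freedom is hard-coded from a table, which is an assumption of the sketch.

```python
import math

# Apartment data: area in sq m, price in $ thousands.
areas = [30, 40, 50, 60, 70]
prices = [300, 450, 500, 650, 700]

n = len(areas)
x_bar = sum(areas) / n
y_bar = sum(prices) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(areas, prices))
s_xx = sum((x - x_bar) ** 2 for x in areas)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual standard error (sigma-hat), n - 2 degrees of freedom.
rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(areas, prices))
s = math.sqrt(rss / (n - 2))

x0 = 55  # new apartment, sq m
y0_hat = intercept + slope * x0

# Shared term 1/n + (x0 - x_bar)^2 / S_xx.
h0 = 1 / n + (x0 - x_bar) ** 2 / s_xx
se_mean = s * math.sqrt(h0)      # SE for the mean response E[Y | x0]
se_new = s * math.sqrt(1 + h0)   # SE for a single new observation at x0

# Assumption: two-sided 5% critical value for df = 3, from a t table.
T_CRIT_DF3 = 3.182
ci_mean = (y0_hat - T_CRIT_DF3 * se_mean, y0_hat + T_CRIT_DF3 * se_mean)
pi_new = (y0_hat - T_CRIT_DF3 * se_new, y0_hat + T_CRIT_DF3 * se_new)

print(f"prediction at {x0} sq m: {y0_hat:.1f}")
print(f"95% CI for the mean:  ({ci_mean[0]:.1f}, {ci_mean[1]:.1f})")
print(f"95% prediction interval: ({pi_new[0]:.1f}, {pi_new[1]:.1f})")
```

The only difference between the two standard errors is the extra 1 under the square root, which is why the prediction interval is always the wider one.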
Practice
Data from 10 stores yield the regression: Sales = 50 + 8×Advertising ($thousands). R² = 0.64. What are expected sales when advertising spend is $20,000? What does R² mean here?
Sales = 50 + 8×20 = 210, i.e. $210,000.
R² = 0.64: advertising explains 64% of the variation in sales across stores. The remaining 36% reflects other factors (location, assortment...).
In a regression with n = 30, R² = 0.25 is obtained. Is the association significant (α = 0.05)?
$|r| = \sqrt{0.25} = 0.5$
$t = r\sqrt{\frac{n-2}{1-r^2}} = 0.5 \sqrt{\frac{28}{0.75}} = 0.5 \cdot 6.11 = 3.06$
$t_{0.025, 28} \approx 2.05$
$3.06 > 2.05$ → the association is significant!
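The same calculation can be written as a small reusable function. This is an illustrative Python sketch; the function name `t_from_r_squared` is hypothetical, and the critical value for 28 degrees of freedom is hard-coded from a t table.

```python
import math


def t_from_r_squared(r_squared, n):
    """t statistic for H0: beta_1 = 0 in simple regression, from R^2 and n."""
    r = math.sqrt(r_squared)  # magnitude of r; its sign does not change |t|
    return r * math.sqrt((n - 2) / (1 - r_squared))


t_stat = t_from_r_squared(0.25, 30)

# Assumption: two-sided 5% critical value for df = 28, from a t table.
T_CRIT_DF28 = 2.048

print(f"t = {t_stat:.2f}, significant: {t_stat > T_CRIT_DF28}")
# prints "t = 3.06, significant: True"
```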
Which residual diagnostic is most important for checking homoscedasticity?
The residuals-versus-fitted plot. Heteroscedasticity (a funnel shape in residuals vs fitted) invalidates SE(β̂) and the t-tests built on it. Tests: Breusch-Pagan, White. Fixes: Huber-White robust SE, WLS (weighted LS), Y transformations (log, √). A QQ-plot diagnoses normality of residuals (matters for t/F tests at small n). Cook's distance flags influential observations.
Regression - The Foundation of ML
From a simple line to neural networks.
- Multiple Regression — Many predictors X₁, X₂, ...
- Logistic Regression — For classification (0/1)
- Neural Networks — Many layers of nonlinear regression
- Gradient Descent — Numerical minimization of SSE
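As a rough illustration of the last item, the sketch below minimizes the squared error numerically with plain gradient descent on the apartment data. It is an addition to the lesson under stated assumptions: X is standardized first so that a single learning rate works for both parameters, the mean squared error is used instead of the raw SSE (same minimizer, smaller gradients), and the learning rate and iteration count are arbitrary illustrative choices.

```python
import math

# Apartment data: area in sq m, price in $ thousands.
areas = [30, 40, 50, 60, 70]
prices = [300, 450, 500, 650, 700]
n = len(areas)

# Standardize X so both parameters converge at a similar rate.
x_bar = sum(areas) / n
x_sd = math.sqrt(sum((x - x_bar) ** 2 for x in areas) / n)
zs = [(x - x_bar) / x_sd for x in areas]

a, b = 0.0, 0.0        # intercept and slope on the standardized scale
learning_rate = 0.1    # illustrative choice, not tuned
for _ in range(500):
    residuals = [y - (a + b * z) for z, y in zip(zs, prices)]
    # Gradient of the mean squared error with respect to a and b.
    grad_a = -2 / n * sum(residuals)
    grad_b = -2 / n * sum(res * z for res, z in zip(residuals, zs))
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

# Convert back to the original units of X.
slope = b / x_sd
intercept = a - slope * x_bar
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

For simple regression the closed-form OLS solution is cheaper and exact; gradient descent matters for models such as neural networks, where no closed form exists.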
Summary
- **Model:** $Y = \beta_0 + \beta_1 X + \varepsilon$
- **OLS:** minimize $\sum(Y - \hat{Y})^2$
- **Formulas:** $\hat{\beta}_1 = r \cdot S_Y/S_X$, $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
- **R²:** proportion of explained variance, $R^2 = r^2$
- Regression is the foundation of ML!
Questions for Reflection
- How does regression differ from correlation?
- Why is it dangerous to extrapolate far beyond the data range?
- How is regression related to neural networks?