Statistics
Correlation
Lesson objectives
- Understand the concept of correlation
- Compute the Pearson correlation coefficient
- Interpret the strength and direction of association
- Know the limitations of correlation
Prerequisites
- Expected value
- Variance
- Samples
Height and weight are related - taller people tend to be heavier. But by how much? Correlation gives a single number from -1 to +1 that captures the strength and direction of a linear association. This is the first step toward understanding relationships in data.
- Finance: correlation of assets in a portfolio
- Medicine: association between risk factors
- Psychology: test score correlations
- Marketing: association between advertising and sales
- ML: feature selection
Covariance
**Covariance** - a measure of the joint variability of two variables:
$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$
- $Cov > 0$: when X increases, Y tends to increase
- $Cov < 0$: when X increases, Y tends to decrease
- $Cov = 0$: no linear association
Covariance depends on scale! $Cov(\text{height in cm}, \text{weight}) \neq Cov(\text{height in m}, \text{weight})$
What does the covariance Cov(X, Y) = E[(X - μ_X)(Y - μ_Y)] measure?
Cov > 0 means variables move together; Cov < 0 means one rises while the other falls; Cov = 0 means no linear relation (nonlinear ones may exist). Units are the product of the units of X and Y (e.g., kg·cm), which hampers interpretation. Normalising by σ_X·σ_Y yields the dimensionless Pearson correlation in [-1, 1].
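The scale dependence is easy to see numerically. The sketch below is a minimal illustration, not a library routine: `covariance` is a helper written for this example, it divides by n (the population form matching the expectation definition), and the units (cm, m, kg) are assumed for the lesson's height and weight sample.

```python
# Population covariance: the mean product of deviations from the means.
def covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Height and weight sample from this lesson (units assumed: cm and kg).
height_cm = [160, 170, 175, 180, 185]
weight = [55, 65, 70, 75, 85]

# The same heights expressed in metres.
height_m = [h / 100 for h in height_cm]

cov_cm = covariance(height_cm, weight)
cov_m = covariance(height_m, weight)

# Changing the unit of X by a factor of 100 changes the covariance by 100.
print(cov_cm, cov_m)
```

The data are identical in both calls; only the unit of height changes, yet the two covariances differ by a factor of 100. This is why a normalised measure is needed.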
Pearson Correlation Coefficient
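**Correlation** - covariance normalized to be dimensionless:
$r = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$
For a sample: $r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
Properties: $-1 \leq r \leq 1$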
| r | Interpretation |
|---|---|
| 0.9 to 1.0 | Very strong positive |
| 0.7 to 0.9 | Strong positive |
| 0.4 to 0.7 | Moderate positive |
| 0.2 to 0.4 | Weak positive |
| 0 to 0.2 | Very weak or none |
| -1 to 0 | Same bands by absolute value, but negative |
Height and Weight
5 people
Height X: {160, 170, 175, 180, 185}
Weight Y: {55, 65, 70, 75, 85}
$\bar{X} = 174$, $\bar{Y} = 70$
$\sum(X_i - \bar{X})(Y_i - \bar{Y}) = 425$
$\sum(X_i - \bar{X})^2 = 370$
$\sum(Y_i - \bar{Y})^2 = 500$
$r = \frac{425}{\sqrt{370 \cdot 500}} = \frac{425}{430.1} \approx 0.99$
Very strong positive correlation!
The correlation between X and Y is 0.8. What is the correlation between Y and X?
Correlation is symmetric: $r(X, Y) = r(Y, X)$. The formula is unchanged when X and Y are swapped.
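Symmetry, and the fact that correlation (unlike covariance) ignores the choice of units, can both be checked directly. The following is a rough sketch using the sample formula with sums of deviations; `pearson_r` is a helper written for this illustration, and the data are the lesson's height and weight sample.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation from sums of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

height = [160, 170, 175, 180, 185]
weight = [55, 65, 70, 75, 85]

r_xy = pearson_r(height, weight)                           # r(X, Y)
r_yx = pearson_r(weight, height)                           # r(Y, X)
r_rescaled = pearson_r([h / 100 for h in height], weight)  # X in other units

print(round(r_xy, 3), round(r_yx, 3), round(r_rescaled, 3))
```

All three values agree: swapping the arguments or rescaling one variable leaves r unchanged.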
Correlation ≠ Causation!
Myth: if X and Y are correlated, then X causes Y
Fact: correlation can arise from a third variable or be purely coincidental
Ice cream sales and drownings are correlated. Ice cream does not cause drownings - both depend on hot weather!
Spurious Correlations
Amusing examples
- Per capita cheese consumption correlates with deaths by bedsheet tangling (r ≈ 0.95!)
- The age of Miss America correlates with steam-related deaths
- Nicolas Cage movies correlate with swimming pool drownings
These are coincidental patterns, not causal relationships!
Which statement about correlation and causation is correct?
A classic example: ice-cream sales and drowning counts correlate (~0.8), yet neither causes the other; a common cause, hot weather, drives both. An observed correlation between X and Y has four possible explanations: (1) X → Y, (2) Y → X (reverse causation), (3) Z → X and Z → Y (confounder), (4) random coincidence. Establishing causation requires RCTs, instrumental variables, or counterfactual analysis.
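The confounder case (3) can be illustrated with a small simulation. This is a toy sketch under stated assumptions: the model (a standard-normal "temperature" z added to independent noise) is hypothetical, the seed and sample size are arbitrary, and the adjustment step works only because the true effect of z is known by construction.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation from sums of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical model: z (hot weather) drives both series;
# x and y never influence each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]  # stands in for ice-cream sales
y = [zi + rng.gauss(0, 1) for zi in z]  # stands in for drownings

r_raw = pearson_r(x, y)

# Subtract the confounder's contribution. Its coefficient (1) is known
# here only because we generated the data ourselves.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_adjusted = pearson_r(x_resid, y_resid)

print(round(r_raw, 2), round(r_adjusted, 2))
```

In this setup the raw correlation is clearly positive even though x has no effect on y, and it essentially vanishes once the common cause is removed. With real data the confounder's effect is unknown and has to be estimated, which is where the causal-inference methods named above come in.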
Limitations of Pearson Correlation
- Measures only **linear** association (not curvilinear)
- Sensitive to **outliers**
- Requires **normality** for significance tests
- r = 0 does not imply independence!
For nonlinear or ordinal data, use **Spearman's correlation** (rank-based).
When can the Pearson correlation be unreliable or misleading?
Anscombe's quartet (1973): four datasets with nearly identical means, variances, and r = 0.816, yet they look completely different when plotted (a linear trend, a curve, a line with one outlier, and so on). Pearson captures only the linear component and is sensitive to outliers: a single point can shift r from 0.9 to 0.1. Alternatives: Spearman's rank correlation, Kendall's τ, and distance correlation for nonlinear dependence.
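Three of these limitations can be reproduced on tiny made-up datasets. The sketch below is illustrative only: all data are invented for the example, `pearson_r`, `ranks`, and `spearman_rho` are helpers written here, and the rank helper assumes there are no tied values.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation from sums of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank 1 = smallest value. Assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# 1. Perfect but nonlinear dependence: y = x^2 on a symmetric range.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_sq = [x ** 2 for x in x_sym]
r_parabola = pearson_r(x_sym, y_sq)

# 2. One extreme point added to a perfectly linear sample.
x_line = list(range(1, 11))
y_line = list(range(1, 11))
r_clean = pearson_r(x_line, y_line)
r_outlier = pearson_r(x_line + [20], y_line + [-20])

# 3. Monotone but curved relation: y = x^3.
x_mono = [1, 2, 3, 4, 5, 6]
y_cubic = [x ** 3 for x in x_mono]
r_cubic = pearson_r(x_mono, y_cubic)
rho_cubic = spearman_rho(x_mono, y_cubic)

print(r_parabola, r_clean, r_outlier, r_cubic, rho_cubic)
```

In the first case y is fully determined by x, yet Pearson's r is zero. In the second, a single point flips a perfect positive correlation to a negative one. In the third, Spearman's coefficient reports a perfect monotone relation while Pearson's r falls short of 1.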
Significance Testing
H₀: $\rho = 0$ (no correlation in the population)
Test statistic: $t = r \sqrt{\frac{n-2}{1-r^2}}$, with $df = n - 2$
Is r = 0.6 Significant at n = 20?
Significance test
$t = 0.6 \sqrt{\frac{18}{1-0.36}} = 0.6 \sqrt{28.125} \approx 3.18$
$df = 18$, $t_{0.025, 18} \approx 2.1$
$3.18 > 2.1$ → the correlation is significant!
How do you test the statistical significance of an observed correlation coefficient r?
Under H_0: ρ = 0 and joint normality, (X, Y) ~ BVN, the statistic t = r·√((n-2)/(1-r²)) follows Student's t distribution with n-2 degrees of freedom. For confidence intervals use Fisher's z-transform: z = 0.5·log((1+r)/(1-r)) = arctanh(r) is approximately distributed as N(arctanh(ρ), 1/(n-3)), where 1/(n-3) is the variance. At large n even a tiny r (0.05) becomes statistically significant, so it is important to distinguish statistical from practical significance.
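Both formulas translate into a few lines of code. This is a minimal sketch using only the standard library: `t_statistic` and `fisher_ci` are names chosen for this illustration, the 95% level is an assumed default, and the critical t value still has to come from a table because the standard library has no Student's t distribution.

```python
import math
from statistics import NormalDist

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for rho via Fisher's z-transform."""
    z = math.atanh(r)                 # same as 0.5 * log((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)         # standard error on the z scale
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Transform the interval endpoints back to the correlation scale.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# The lesson's example: r = 0.6 with n = 20.
t_small = t_statistic(0.6, 20)
ci_low, ci_high = fisher_ci(0.6, 20)

# A tiny correlation in a very large sample.
t_large_n = t_statistic(0.05, 10_000)

print(round(t_small, 2), round(ci_low, 2), round(ci_high, 2), round(t_large_n, 2))
```

For r = 0.6 and n = 20 the interval excludes zero, which agrees with the t-test, but it is wide, so the population correlation is only loosely pinned down. The last value shows that r = 0.05 comfortably exceeds the usual critical value when n = 10,000, even though such an association is negligible in practice.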
Practice
Study hours X and exam score Y: r = 0.75, n = 25. Is the association significant (α = 0.05)?
$t = 0.75 \sqrt{\frac{23}{1-0.5625}} = 0.75 \sqrt{52.57} \approx 5.44$
$df = 23$, $t_{0.025, 23} \approx 2.07$
$5.44 > 2.07$ → significant positive correlation.
In practice you found r = 0.7 (p < 0.001) between marketing spend and sales. What can you conclude?
r = 0.7 is statistically significant, but that alone does not show that marketing drives sales. Possible explanations include: (1) reverse causation (successful sales fund more marketing); (2) a confounder (season: in Q4 sales and marketing grow together); (3) selection bias (the company spends more on marketing for strong products). R² = 0.49 means marketing 'explains' 49% of the variance in sales in the observed data, but that is not a causal effect.
Summary
- **Covariance:** $Cov(X,Y) = E[XY] - E[X]E[Y]$
- **Pearson correlation:** $r = Cov(X,Y)/(\sigma_X \sigma_Y)$, $-1 \leq r \leq 1$
- **r = 1:** perfect positive linear association
- **r = 0:** no linear association (but nonlinear association may exist!)
- **Correlation ≠ causation!**
Questions for reflection
- Why does r = 0 not imply independence?
- How does an outlier affect correlation?
- How does Spearman's correlation differ from Pearson's?