Statistics

Survival Analysis

'How long will a new customer stay?' 'How many months until an engine fails?' 'Which treatment prolongs life?' - all these questions require the same approach. Survival analysis is the foundation of clinical trials, product analytics, and industrial quality control.

Oncology: comparing survival under different cancer treatments (Kaplan-Meier (KM) curves are a staple of oncology trial papers)
SaaS products: forecasting time-to-churn, LTV modelling
Banking: time to loan default; survival models are a common choice for the probability-of-default estimates used in Basel III credit-risk calculations
HR analytics: predicting employee resignation, 'Employee Lifetime Value'
Machine reliability: determining the optimal preventive maintenance schedule (MTBF)

Prerequisites

Estimation: The $1.5B Mistake of the Hubble Telescope

The Censoring Problem: Why Ordinary Regression Fails

Airbnb's churn model uses Cox regression on 1M+ users: P(user leaves | survived 30 days). Spotify and Netflix run survival analysis in production - users who pass the 90-day mark have 4x lower churn rate. The key challenge is censoring: not all users reach the event before the observation window closes, so for those users we only know that their time-to-churn exceeds the time observed so far.
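To see why the two obvious workarounds for censoring mislead, here is a minimal simulation sketch. The exponential lifetimes, the 12-month true mean, and the 6-month observation window are illustrative assumptions, not figures from any of the companies mentioned above.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_MEAN = 12.0  # hypothetical true mean lifetime, in months
WINDOW = 6.0      # hypothetical observation window, in months

# Simulated true lifetimes; in real data these are unknown for censored users.
lifetimes = [random.expovariate(1.0 / TRUE_MEAN) for _ in range(10_000)]

# What is actually recorded: time on record and whether churn was observed.
observed = [(min(t, WINDOW), t <= WINDOW) for t in lifetimes]

# Naive approach 1: drop censored users and average the remaining times.
event_times = [time for time, event in observed if event]
mean_dropped = sum(event_times) / len(event_times)

# Naive approach 2: pretend censored users churned when the window closed.
mean_as_events = sum(time for time, _ in observed) / len(observed)

true_sample_mean = sum(lifetimes) / len(lifetimes)

print(f"true mean lifetime:          {true_sample_mean:.2f}")
print(f"censored users dropped:      {mean_dropped:.2f}")
print(f"censored treated as churned: {mean_as_events:.2f}")
```

Both naive estimates fall below the true mean, because neither can ever produce a lifetime longer than the observation window.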

**Survival function S(t)** = P(T > t) - the probability of 'surviving' beyond time t. **Hazard function h(t)** = the instantaneous event rate at time t, given survival up to t (a rate per unit time rather than a probability, so it can exceed 1). Relationship: h(t) = −d/dt[log S(t)], equivalently S(t) = exp(−∫₀ᵗ h(u) du). S(t) is non-increasing: it starts at 1 (at t=0) and typically approaches 0 as t grows.
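The relationship h(t) = −d/dt[log S(t)] can be checked numerically. The sketch below assumes a Weibull lifetime purely for illustration (the shape and scale values are made up) and compares the closed-form hazard with a finite-difference derivative of log S(t).

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape) for a Weibull lifetime."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """Closed-form Weibull hazard: (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def hazard_from_survival(survival, t, eps=1e-5):
    """Approximate h(t) = -d/dt log S(t) with a central difference."""
    return -(math.log(survival(t + eps)) - math.log(survival(t - eps))) / (2 * eps)

SHAPE, SCALE = 1.5, 10.0  # hypothetical parameters, time measured in months
t = 4.0

closed_form = weibull_hazard(t, SHAPE, SCALE)
numeric = hazard_from_survival(lambda u: weibull_survival(u, SHAPE, SCALE), t)

print(f"closed-form hazard at t={t}: {closed_form:.6f}")
print(f"-d/dt log S(t) at t={t}:     {numeric:.6f}")
```

The two values agree up to the error of the finite-difference approximation.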

Of 100 SaaS customers, 40 cancelled (event) and 60 are still active (censored). How should one correctly analyse time-to-churn?

Kaplan-Meier Estimator: Non-Parametric Survival Curves

The **Kaplan-Meier estimator** is a non-parametric method for estimating the survival function S(t). It makes no distributional assumptions. The result is a step function: it drops only at event times. Censored subjects cause no drop; they simply leave the risk set, which reduces n_i at later event times. Formula: Ŝ(t) = ∏_{i: t_i ≤ t} (1 − d_i/n_i), where d_i = events at time t_i and n_i = subjects at risk just before t_i.
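The product formula is short enough to implement directly. Below is a minimal pure-Python sketch on made-up toy data (the function name `kaplan_meier` and the eight observations are assumptions for illustration); in practice a tested library implementation is preferable.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t_i, S(t_i))] at each distinct event time.

    durations: observed times; events: 1 if the event occurred, 0 if censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # subjects leaving the risk set at each time
    n_at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Subjects censored at t still count as at risk at t.
            surv *= 1.0 - d / n_at_risk
            curve.append((t, surv))
        n_at_risk -= exits[t]  # events and censored subjects both leave
    return curve

# Toy data: months observed, and whether churn was seen (1) or censored (0).
durations = [2, 3, 3, 5, 6, 8, 8, 10]
events = [1, 1, 0, 1, 0, 1, 1, 0]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

For this toy data the curve steps down to 0.875, 0.75, 0.60 and 0.20 at months 2, 3, 5 and 8; the censored observations at months 3, 6 and 10 only shrink the risk set.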

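**Median survival vs mean:** when the longest follow-up time is censored, the Kaplan-Meier curve never reaches 0, so the mean survival time cannot be estimated without extrapolation (we don't know when the remaining events would have occurred). Median survival is the earliest time at which the estimated S(t) falls to 0.5 or below, that is, the time by which half of the subjects have experienced the event. The median is the standard metric reported in clinical trial publications.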
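Reading the median off a Kaplan-Meier curve is a simple scan for the first step at or below 0.5. The sketch below uses two hypothetical curves, given as (time, S(t)) pairs, to show that the median itself is undefined when heavy censoring keeps the curve above 0.5.

```python
def median_survival(curve):
    """Return the first time at which S(t) <= 0.5, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical KM curves as (time in months, estimated S(t)) pairs.
curve_reaches_half = [(2, 0.875), (3, 0.75), (5, 0.60), (8, 0.20)]
curve_heavily_censored = [(2, 0.95), (6, 0.85), (11, 0.70)]

print(median_survival(curve_reaches_half))      # first step with S(t) <= 0.5
print(median_survival(curve_heavily_censored))  # None: median not reached
```

When the function returns None, publications typically report the median as "not reached" rather than extrapolating.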

A Kaplan-Meier curve shows S(12 months) = 0.70. How does one interpret this?

Cox Regression: Proportional Hazards

The **Cox proportional hazards model** is a semi-parametric regression for analysing the effect of covariates on time to event. Model: h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ...). h₀(t) is the baseline hazard (non-parametric). The β estimates yield **hazard ratios** (HR = exp(β)): the multiplicative change in hazard for a one-unit increase in the covariate, holding the other covariates fixed. HR > 1 - increases event risk; HR < 1 - decreases risk (protective factor).
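To make the link between coefficients, hazard ratios, and survival concrete, here is a small arithmetic sketch. The coefficient values, the covariate names, and the baseline survival of 0.80 are hypothetical; the only model fact used is that, under proportional hazards, S(t|X) = S₀(t) raised to the power exp(β·X).

```python
import math

def hazard_ratio(beta):
    """HR for a one-unit increase in a covariate with coefficient beta."""
    return math.exp(beta)

def relative_hazard(betas, x):
    """exp(beta . x): hazard relative to the baseline (all covariates zero)."""
    return math.exp(sum(b * xi for b, xi in zip(betas, x)))

def survival_given_x(baseline_survival, betas, x):
    """Under proportional hazards, S(t|x) = S0(t) ** exp(beta . x)."""
    return baseline_survival ** relative_hazard(betas, x)

# Hypothetical fitted coefficients for [is_premium, support_tickets].
betas = [-0.5, 0.2]
baseline_s12 = 0.80  # hypothetical baseline survival at 12 months

hr_premium = hazard_ratio(betas[0])
s12_premium = survival_given_x(baseline_s12, betas, [1, 0])

print(f"HR for premium:              {hr_premium:.3f}")
print(f"S(12 | premium, no tickets): {s12_premium:.3f}")
```

A negative coefficient gives HR below 1, which pushes the survival curve above the baseline at every time point.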

**Proportional hazards assumption:** the Cox model assumes the hazard ratio (HR) between groups is constant over time. Verify with the Schoenfeld residuals test (a trend in the residuals over time signals a violation). If violated: stratify by the offending variable, add a time-interaction term, or use ML-based alternatives (Random Survival Forest, DeepHit).
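What "constant over time" means can be illustrated without fitting a model. The sketch below assumes Weibull hazards with made-up parameters: two groups that share a shape parameter have a constant hazard ratio, while groups with different shapes do not. It illustrates the assumption only; it is not an implementation of the Schoenfeld residuals test.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard: (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

times = [1, 5, 10, 20]  # months; hypothetical evaluation points

# Same shape, different scale: the ratio does not depend on t.
ratio_ph = [weibull_hazard(t, 1.5, 10) / weibull_hazard(t, 1.5, 15) for t in times]

# Different shapes: the ratio drifts over time and crosses 1.
ratio_non_ph = [weibull_hazard(t, 0.8, 10) / weibull_hazard(t, 1.5, 10) for t in times]

for t, a, b in zip(times, ratio_ph, ratio_non_ph):
    print(f"t={t:>2}  proportional HR={a:.3f}  non-proportional HR={b:.3f}")
```

In the second pair the group that starts with the higher hazard ends with the lower one, which is the situation where a single hazard ratio from a Cox model becomes misleading.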

In a Cox regression for customer churn: HR for 'premium_subscription' = 0.35, p < 0.001. What does this mean?

Key Ideas

Censoring is the defining feature of survival data: the event is not observed for all subjects
S(t) = P(T > t) - survival function; h(t) - hazard function
Kaplan-Meier: non-parametric estimate of S(t), assumes no distribution
Log-rank test: compares survival curves between groups
Cox regression: effect of covariates via hazard ratio (HR = exp(β))
HR > 1 - increases risk; HR < 1 - protective factor
Proportional hazards assumption is tested with the Schoenfeld test
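
The log-rank test listed above compares observed and expected event counts in one group at every event time. A minimal two-group sketch follows; the function name `logrank_test` and the toy data are assumptions for illustration, and the p-value uses the chi-square distribution with 1 degree of freedom.

```python
import math

def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(durations_a, events_a) if e}
        | {t for t, e in zip(durations_b, events_b) if e}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in durations_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(durations_a, events_a) if e and d == t)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n  # observed minus expected in A
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("no informative event times")
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value

# Toy data: group A churns early, group B late; all events observed.
chi2, p_value = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

Two groups with identical data give a statistic of 0 and a p-value of 1, which is a quick sanity check for any implementation.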

Connections to Other Methods

Survival analysis connects to regression (Cox extends regression to censored time data), Bayesian methods (Bayesian Cox, Weibull regression), and ML (Random Survival Forest, DeepHit).

Linear and Logistic Regression — Cox is regression adapted for censored time-to-event data
Non-Parametric Tests — Kaplan-Meier and the log-rank test are non-parametric methods

Questions for Reflection

Why can't one simply drop censored observations and apply ordinary regression? What bias does this introduce?
If the KM curves of two groups cross - what does that imply for the Cox model? How should one handle such data?
In the product, paid-tier users stay longer. How does one separate the true effect of the subscription from the fact that 'customers who pay are already more loyal' (confounding)?

Related Lessons

prob-06-random-vars
