Statistics
Survival Analysis
'How long will a new customer stay?' 'How many months until an engine fails?' 'Which treatment prolongs life?' - all these questions require the same approach. Survival analysis is the foundation of clinical trials, product analytics, and industrial quality control.
- Oncology: comparing survival under different cancer treatments (KM curves appear in every clinical paper)
- SaaS products: forecasting time-to-churn, LTV modelling
- Banking: time to loan default; survival models are a common choice for estimating probability of default over time in credit-risk work
- HR analytics: predicting employee resignation, 'Employee Lifetime Value'
- Machine reliability: determining optimal preventive maintenance schedule (MTBF)
Prerequisites
The Censoring Problem: Why Ordinary Regression Fails
Airbnb's churn model uses Cox regression on 1M+ users: P(user leaves | survived 30 days). Spotify and Netflix run survival analysis in production - users who pass the 90-day mark have 4x lower churn rate. The key challenge: censoring - not all users reach the event before the observation window closes.
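To see why censoring breaks naive summaries, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the exponential lifetimes, the 12-month true mean, the 6-month observation window, and the function name `simulate_censoring_bias`. It compares the true mean lifetime with two tempting shortcuts: dropping censored users, and treating the censoring time as if it were the churn time.

```python
import random


def simulate_censoring_bias(n=10_000, true_mean=12.0, window=6.0, seed=0):
    """Compare naive estimates of mean time-to-churn with the uncensored mean."""
    rng = random.Random(seed)
    # True lifetimes (in months); in real data most of these are never seen in full.
    lifetimes = [rng.expovariate(1.0 / true_mean) for _ in range(n)]
    # Each user is followed for at most `window` months.
    observed = [min(t, window) for t in lifetimes]
    event = [t <= window for t in lifetimes]  # False means censored

    sample_mean = sum(lifetimes) / n
    # Shortcut 1: drop censored users and average the rest.
    churned = [d for d, e in zip(observed, event) if e]
    mean_dropped = sum(churned) / len(churned)
    # Shortcut 2: pretend censored users churned when observation stopped.
    mean_as_events = sum(observed) / n
    censored_share = 1 - len(churned) / n
    return sample_mean, mean_dropped, mean_as_events, censored_share


sample_mean, mean_dropped, mean_as_events, censored_share = simulate_censoring_bias()
print(f"uncensored mean:          {sample_mean:.2f}")
print(f"censored users dropped:   {mean_dropped:.2f}")
print(f"censored treated as churn: {mean_as_events:.2f}")
print(f"share censored:           {censored_share:.2%}")
```

Both shortcuts underestimate the lifetime, because neither can ever report a value longer than the observation window.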
**Survival function S(t)** = P(T > t) - the probability of 'surviving' beyond time t. **Hazard function h(t)** = the instantaneous event rate at time t, given survival up to t. It is a rate, not a probability, so it can exceed 1. Relationship: h(t) = −d/dt[log S(t)]. S(t) starts at 1 (at t=0) and never increases; it tends toward 0 if every subject eventually experiences the event.
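The relationship h(t) = −d/dt[log S(t)] can be checked numerically. The sketch below assumes a Weibull lifetime distribution, chosen only because both S(t) and h(t) have closed forms; the scale and shape values and the function names are illustrative, not from the text.

```python
import math


def weibull_survival(t, scale=10.0, shape=1.5):
    """S(t) = exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, scale=10.0, shape=1.5):
    """Closed-form hazard: h(t) = (shape/scale) * (t/scale)^(shape-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def numeric_hazard(survival, t, eps=1e-5):
    """Approximate h(t) = -d/dt log S(t) with a central difference."""
    return -(math.log(survival(t + eps)) - math.log(survival(t - eps))) / (2 * eps)


for t in (1.0, 5.0, 20.0):
    print(t, weibull_hazard(t), numeric_hazard(weibull_survival, t))
```

The two hazard columns agree to many decimal places, and with shape > 1 the hazard rises over time while S(t) keeps falling.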
Of 100 SaaS customers, 40 cancelled (event) and 60 are still active (censored). How should one correctly analyse time-to-churn?
Kaplan-Meier Estimator: Non-Parametric Survival Curves
The **Kaplan-Meier estimator** is a non-parametric method for estimating the survival function S(t). It makes no distributional assumptions. The result is a step function: it drops only at event times. Formula: S(t) = ∏[t_i ≤ t] (1 − d_i/n_i), where d_i = events at time t_i and n_i = subjects at risk just before t_i.
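The product formula translates almost directly into code. The following is a minimal pure-Python sketch, not a replacement for a tested library; the function name `kaplan_meier` and the eight-subject data set are made up for illustration. Subjects censored at an event time are counted as still at risk at that time, which is the usual convention.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t_i, S(t_i))] for each distinct event time t_i.

    durations: observed time per subject (event time or censoring time)
    events:    True if the event was observed, False if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every subject leaves the risk set at their own time
    at_risk = len(durations)    # n_i: subjects at risk just before the current time
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)    # d_i: events at this time
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]     # remove both events and censored subjects
    return curve


# Hypothetical data: months observed, and whether churn was seen.
durations = [2, 3, 3, 5, 6, 8, 8, 10]
events = [True, True, False, True, False, True, True, False]
for t, s in kaplan_meier(durations, events):
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Note that censored subjects never cause a drop in the curve; they only shrink the risk set, which makes later drops larger.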
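**Median survival vs mean:** if the longest observed time is censored, the KM curve never reaches 0, so the mean survival time cannot be estimated without extrapolation (we don't know when those events would have occurred). Median survival is the earliest time at which the estimated S(t) falls to 0.5 or below, that is, when half of the subjects are estimated to have experienced the event; it is undefined if the curve never drops that far. The median is the standard metric reported in clinical trial publications.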
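As a sketch of how these summaries are read off a KM curve, the code below takes a step curve as a list of (time, S) pairs. The curve values and function names are hypothetical. The restricted mean (the area under the curve up to a chosen horizon) is not discussed in the text; it is included here as one common workaround when the plain mean is unavailable.

```python
def median_survival(curve):
    """Earliest time at which S(t) <= 0.5, or None if the curve never gets there."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


def restricted_mean(curve, horizon):
    """Area under the step function S(t) from 0 to `horizon`."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)  # S is flat between event times
        prev_t, prev_s = t, s
    return area + prev_s * (horizon - prev_t)


# Hypothetical KM curve: (event time, estimated survival just after it).
curve = [(2, 0.875), (3, 0.75), (5, 0.6), (8, 0.2)]
print("median:", median_survival(curve))
print("restricted mean to t=10:", restricted_mean(curve, horizon=10))
print("median of a curve that stays high:", median_survival([(2, 0.9), (6, 0.8)]))
```

The last line returns `None`: when fewer than half of the subjects have had the event by the end of follow-up, the median is simply not reached.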
A Kaplan-Meier curve shows S(12 months) = 0.70. How should one interpret this?
Cox Regression: Proportional Hazards
The **Cox proportional hazards model** is a semi-parametric regression for analysing the effect of covariates on time to event. Model: h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ...). h₀(t) is the baseline hazard (non-parametric). The β estimates yield **hazard ratios** (HR = exp(β)). HR > 1 - increases event risk; HR < 1 - decreases risk (protective factor).
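The arithmetic of the model can be sketched in a few lines. The coefficients below are invented for illustration (they are not fitted to any data), as are the covariate names `annual_plan` and `support_tickets`. The example converts β values to hazard ratios and shows that the ratio of two subjects' hazards does not depend on the baseline h₀(t).

```python
import math

# Hypothetical fitted coefficients (log hazard ratios).
COEFFICIENTS = {"annual_plan": -0.7, "support_tickets": 0.25}


def hazard_ratios(coefficients):
    """HR = exp(beta) for each covariate."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


def relative_hazard(coefficients, covariates):
    """exp(beta_1*x_1 + beta_2*x_2 + ...): hazard relative to the baseline."""
    return math.exp(sum(b * covariates.get(k, 0.0) for k, b in coefficients.items()))


def hazard(baseline_hazard, coefficients, covariates):
    """h(t|X) = h0(t) * exp(beta . X), for a given value of h0(t)."""
    return baseline_hazard * relative_hazard(coefficients, covariates)


print(hazard_ratios(COEFFICIENTS))

user_a = {"annual_plan": 1, "support_tickets": 2}
user_b = {"annual_plan": 0, "support_tickets": 2}
for h0 in (0.01, 0.5):  # any baseline value gives the same ratio
    print(h0, hazard(h0, COEFFICIENTS, user_a) / hazard(h0, COEFFICIENTS, user_b))
```

Because h₀(t) cancels in the ratio, the model can estimate the β values without ever specifying the baseline hazard, which is what makes it semi-parametric.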
**Proportional hazards assumption:** the Cox model assumes the hazard ratio (HR) between groups is constant over time. Check it with a test based on Schoenfeld residuals. If it is violated: stratify by the offending variable, add a time-interaction term, or use ML-based alternatives (Random Survival Forest, DeepHit).
In a Cox regression for customer churn: HR for 'premium_subscription' = 0.35, p < 0.001. What does this mean?
Key Ideas
- Censoring is the defining feature of survival data: the event is not observed for all subjects
- S(t) = P(T > t) - survival function; h(t) - hazard function
- Kaplan-Meier: non-parametric estimate of S(t), assumes no distribution
- Log-rank test: compares survival curves between groups
- Cox regression: effect of covariates via hazard ratio (HR = exp(β))
- HR > 1 - increases risk; HR < 1 - protective factor
- Proportional hazards assumption is checked with a test based on Schoenfeld residuals
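The log-rank test from the list above can be sketched as follows for two groups. This is a minimal pure-Python illustration under the standard formulation (observed versus expected events in one group at each event time, with a hypergeometric variance); the function name and the data are made up, and a statistics library should be preferred for real analyses.

```python
import math


def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(durations_a, events_a)]
    data += [(t, e, 1) for t, e in zip(durations_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(1 for d, _, _ in data if d >= t)                  # at risk, both groups
        n_a = sum(1 for d, _, g in data if d >= t and g == 0)     # at risk, group A
        d_all = sum(1 for d, e, _ in data if d == t and e)        # events at t
        d_a = sum(1 for d, e, g in data if d == t and e and g == 0)
        observed_a += d_a
        expected_a += d_all * n_a / n
        if n > 1:
            variance += d_all * (n_a / n) * (1 - n_a / n) * (n - d_all) / (n - 1)

    if variance == 0:
        return 0.0, 1.0  # no information to separate the groups
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail with 1 degree of freedom
    return chi2, p_value


# Hypothetical data: group A fails early, group B fails late; all events observed.
early = [1, 2, 3, 4, 5]
late = [6, 7, 8, 9, 10]
all_events = [True] * 5
print(logrank_test(early, all_events, late, all_events))
```

A small p-value indicates that the two survival curves differ by more than chance would explain under the null hypothesis of identical hazards.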
Connections to Other Methods
Survival analysis connects to regression (Cox extends regression to censored time data), Bayesian methods (Bayesian Cox, Weibull regression), and ML (Random Survival Forest, DeepHit).
- Linear and Logistic Regression — Cox is regression adapted for censored time-to-event data
- Non-Parametric Tests — Kaplan-Meier and the log-rank test are non-parametric methods
Questions for Reflection
- Why can't one simply drop censored observations and apply ordinary regression? What bias does this introduce?
- If the KM curves of two groups cross - what does that imply for the Cox model? How should one handle such data?
- In a product, paid-tier users stay longer. How does one separate the true effect of the subscription from the fact that 'customers who pay are already more loyal' (confounding)?