Data Science
Ensemble Methods
Netflix Prize 2009: $1,000,000 for a recommendation algorithm 10% better than Netflix. The winners BellKor's Pragmatic Chaos reached the target by combining more than 800 different models into one mega-ensemble. Not a single one of the 800 models was sufficient alone. This is not magic - it is mathematics: errors of different models are partially independent and cancel each other when averaged.
- **Kaggle competitions:** ensemble methods appear in winning solutions for virtually every tabular data competition
- **Financial risk models:** banks build ensembles of dozens of models for credit scoring - stability matters more than peak accuracy from one model
- **Medical diagnostics:** ensembles of models on radiology images reach accuracy comparable to a panel of physicians
Random Forest
2001: Leo Breiman publishes Random Forest. The idea - train 100-500 decision trees, each on a random data subsample (bagging) and a random feature subset, then average their predictions. No single tree knows the truth, but collective voting consistently beats single trees.
**Two sources of randomness:** Bootstrap sampling (each tree sees ~63% of the training data with replacement) and random feature selection (only sqrt(n_features) features are considered at each split). This creates diversity among trees - the key condition for an effective ensemble.
**Feature Importance:** Random Forest automatically computes feature importance via mean impurity decrease (Gini importance). Useful for feature selection, but with a known issue: high-cardinality features receive inflated importance scores.
In Random Forest, 'Random' refers to:
XGBoost: Gradient Boosting
2016: Tianqi Chen and Carlos Guestrin publish XGBoost. For the next several years it dominates tabular Kaggle competitions. The key difference from Random Forest: trees are built sequentially rather than independently - each new tree corrects the mistakes of the previous ones.
**Gradient Boosting:** at each step, build a tree that predicts not the target variable but the **residuals** (pseudo-gradients) of the current model. The final model is the sum of all trees weighted by the learning rate. XGBoost adds L1/L2 regularisation directly into the loss function and uses second-order optimisation (Newton's method).
**LightGBM vs XGBoost:** LightGBM grows trees leaf-wise rather than level-wise. On large datasets (>100k rows) LightGBM is 10-20x faster. CatBoost handles categorical features natively without manual encoding. All three are gradient boosting variants.
What does each subsequent tree predict in gradient boosting?
Stacking
In the final stage of the 2012 Kaggle click-prediction competition, the winning team gained 2% on top of an already strong ensemble using a technique called stacking: a meta-model learns from the predictions of base models as if they were new features.
**Problem with naive stacking:** if base models are trained on the same data as the meta-model, data leakage occurs - the meta-model sees 'too good' base predictions on training data. Solution: out-of-fold (OOF) predictions via k-fold cross-validation.
**When stacking helps:** base models must be as diverse as possible (different algorithms, different feature sets). Stacking two similar models gives minimal gain. Best results come from combining linear models, trees, and neural networks.
Why does stacking use OOF (out-of-fold) predictions rather than plain training predictions?
Blending and Final Ensembles
Blending is a simplified stacking variant: instead of k-fold OOF, a fixed hold-out set (typically 10-20% of data) is used. Base models train on 80%, their predictions on the remaining 20% become meta-features. Easier to implement, but loses some training data.
**Law of diminishing returns in ensembles:** first model gives, say, 0.85 AUC. Adding a second diverse model -> 0.87. Third -> 0.875. Tenth -> 0.877. Each additional layer adds less at greater complexity. Kaggle winners stack dozens of models for the last 0.001 AUC.
The more models in an ensemble, the better the result
An ensemble improves when diverse models are added, but after a point the gain is near zero while complexity and training time rise sharply
Law of diminishing returns: each new model adds information only if its errors correlate with the rest of the ensemble differently. After 5-10 diverse models, error correlations stop falling - adding more does not help.
The key difference between blending and stacking:
Ensemble Methods
- Random Forest: parallel ensemble with bagging and random feature selection - reduces variance, preserves bias
- XGBoost: sequential boosting - each tree corrects previous errors, reduces bias iteratively
- Stacking: meta-model trains on OOF predictions of base models - no data leakage
- Blending: simplified stacking via hold-out, loses some data but easier to implement
- Model diversity matters more than model count: correlated errors do not cancel each other
Related Topics
Ensemble methods build on top of base algorithms - understanding decision trees is a prerequisite.
- Decision Trees and Overfitting — Random Forest and XGBoost are built from trees - the fundamental building block
- Regularisation and Validation — Early stopping in XGBoost and OOF in stacking both combat ensemble overfitting
- Clustering and Segmentation — Next step: unsupervised learning - a different modelling paradigm
Вопросы для размышления
- Random Forest and XGBoost both build ensembles of trees - what is the fundamental difference in how they handle errors, and when is one preferable over the other?
- Why does stacking produce a gain when the meta-model is a simple logistic regression and the base models are far more complex?
- What is the practical limit on the number of models in an ensemble - and how should one decide when to stop?