Data Science

Ensemble Methods

Netflix Prize 2009: $1,000,000 for a recommendation algorithm 10% better than Netflix. The winners BellKor's Pragmatic Chaos reached the target by combining more than 800 different models into one mega-ensemble. Not a single one of the 800 models was sufficient alone. This is not magic - it is mathematics: errors of different models are partially independent and cancel each other when averaged.

**Kaggle competitions:** ensemble methods appear in winning solutions for virtually every tabular data competition
**Financial risk models:** banks build ensembles of dozens of models for credit scoring - stability matters more than peak accuracy from one model
**Medical diagnostics:** ensembles of models on radiology images reach accuracy comparable to a panel of physicians

Random Forest

2001: Leo Breiman publishes Random Forest. The idea - train 100-500 decision trees, each on a random data subsample (bagging) and a random feature subset, then average their predictions. No single tree knows the truth, but collective voting consistently beats single trees.

**Two sources of randomness:** Bootstrap sampling (each tree sees ~63% of the training data with replacement) and random feature selection (only sqrt(n_features) features are considered at each split). This creates diversity among trees - the key condition for an effective ensemble.

**Feature Importance:** Random Forest automatically computes feature importance via mean impurity decrease (Gini importance). Useful for feature selection, but with a known issue: high-cardinality features receive inflated importance scores.

In Random Forest, 'Random' refers to:

XGBoost: Gradient Boosting

2016: Tianqi Chen and Carlos Guestrin publish XGBoost. For the next several years it dominates tabular Kaggle competitions. The key difference from Random Forest: trees are built sequentially rather than independently - each new tree corrects the mistakes of the previous ones.

**Gradient Boosting:** at each step, build a tree that predicts not the target variable but the **residuals** (pseudo-gradients) of the current model. The final model is the sum of all trees weighted by the learning rate. XGBoost adds L1/L2 regularisation directly into the loss function and uses second-order optimisation (Newton's method).

**LightGBM vs XGBoost:** LightGBM grows trees leaf-wise rather than level-wise. On large datasets (>100k rows) LightGBM is 10-20x faster. CatBoost handles categorical features natively without manual encoding. All three are gradient boosting variants.

What does each subsequent tree predict in gradient boosting?

Stacking

In the final stage of the 2012 Kaggle click-prediction competition, the winning team gained 2% on top of an already strong ensemble using a technique called stacking: a meta-model learns from the predictions of base models as if they were new features.

**Problem with naive stacking:** if base models are trained on the same data as the meta-model, data leakage occurs - the meta-model sees 'too good' base predictions on training data. Solution: out-of-fold (OOF) predictions via k-fold cross-validation.

**When stacking helps:** base models must be as diverse as possible (different algorithms, different feature sets). Stacking two similar models gives minimal gain. Best results come from combining linear models, trees, and neural networks.

Why does stacking use OOF (out-of-fold) predictions rather than plain training predictions?

Blending and Final Ensembles

Blending is a simplified stacking variant: instead of k-fold OOF, a fixed hold-out set (typically 10-20% of data) is used. Base models train on 80%, their predictions on the remaining 20% become meta-features. Easier to implement, but loses some training data.

**Law of diminishing returns in ensembles:** first model gives, say, 0.85 AUC. Adding a second diverse model -> 0.87. Third -> 0.875. Tenth -> 0.877. Each additional layer adds less at greater complexity. Kaggle winners stack dozens of models for the last 0.001 AUC.

The more models in an ensemble, the better the result

An ensemble improves when diverse models are added, but after a point the gain is near zero while complexity and training time rise sharply

Law of diminishing returns: each new model adds information only if its errors correlate with the rest of the ensemble differently. After 5-10 diverse models, error correlations stop falling - adding more does not help.

The key difference between blending and stacking: