Machine Learning

Random Forest

Random subspaces, then Random Forests

In 1995, Tin Kam Ho at Bell Labs introduced the random subspace method: instead of training each tree on every feature, train it on a random subset of features. This decorrelated the trees and let an ensemble keep improving without overfitting, an idea that ran against the intuition of the time. In 2001, Leo Breiman combined that feature randomness with bootstrap sampling of the data and published the paper "Random Forests". The recipe was simple and remarkably robust, and it quickly became one of the most widely used algorithms in applied machine learning, a default first model on tabular data for two decades.

A single decision tree is an unstable classifier: remove a couple of observations from the training set and the tree completely restructures itself, giving entirely different predictions. What if instead of one tree we train 500 on different subsets and let them vote? This simple idea made Random Forest one of the most reliable algorithms in machine learning - it wins on Kaggle, works out of the box without complex tuning, and is still used in production at banks, hospitals, and tech companies.

**Bank credit scoring** - Random Forest evaluates creditworthiness by analyzing dozens of features (income, history, age), and feature importance explains the reason for rejection to the client as required by law (GDPR, FCRA)
**Medical diagnostics** - an ensemble of 1000 trees classifies tumors based on 30 biomarkers with 96%+ accuracy, and permutation importance shows the doctor which markers are decisive for a specific patient
**Recommendation systems** - Random Forest predicts whether a user will like a product, processing hundreds of features (purchase history, demographics, time of day) in parallel on all server cores in milliseconds

Предварительные знания

Decision Trees

Bagging: Bootstrap Aggregating

A single decision tree overfits easily: it memorizes noise in the data and produces unstable predictions. Remove one observation from the training set - the tree may completely restructure. This is high **variance**. The idea of **bagging** (Bootstrap AGGregatING) is simple: instead of one tree, train **N trees on different subsets** and take the average (regression) or majority vote (classification). Each tree makes its own mistakes, but on average the mistakes cancel out.

How do we get N different subsets from a single dataset? The **bootstrap** technique: from N observations we sample N observations **with replacement**. Some observations appear twice, three times, and some not at all. The probability of one observation NOT being selected: (1 - 1/N)^N. As N -> infinity this converges to 1/e = 0.368. So each bootstrap sample contains about **63.2% unique** observations and **36.8% duplicates** (while 36.8% of the original data doesn't appear at all).

Why does this work? The analogy: **wisdom of the crowd**. In 1906 at a fair, 787 people guessed the weight of a bull. Individual estimates varied wildly, but the *average* of all estimates was less than 1% off from the real weight. Each person made their own error (random noise), but on average the noise cancelled out. Bagging works the same way: each tree overfits on its own subset of data, but **aggregation reduces variance** without increasing bias.

**Bias-Variance trade-off and bagging:** - Single deep tree: **low bias, high variance** (memorizes noise) - Bagging N trees: **low bias, low variance** (noise is averaged out) Mathematically: if trees are uncorrelated and each has variance = v, then the ensemble variance = v/N. The more trees, the more stable the predictions. In practice, after 100-300 trees the improvement becomes minimal.

In a bootstrap sample of N observations (sampling with replacement from N) what fraction of the original data does NOT end up in the sample?

Random feature selection

Bagging reduces variance, but there is a problem: if the data contains one very strong feature (e.g., "age" when predicting mortality), then **all trees will split on it first**. The trees end up similar - **correlated**. Averaging correlated predictions gives a much smaller reduction in variance than averaging independent ones.

The Random Forest solution: **at every split** (not at every tree, but at every node!) randomly select a subset of m features and find the best split only among those. If there are p features in total, typical values are: **m = sqrt(p)** for classification and **m = p/3** for regression. This is the only - but fundamental - difference between Random Forest and plain bagging.

**Why exactly sqrt(p)?** This is an empirical rule found by Leo Breiman (creator of Random Forest, 2001). The intuition: - Too few features (m=1): trees are weak, high bias - Too many features (m=p): trees are correlated, high variance - **sqrt(p) - a balance** between the strength of each tree and the diversity of the ensemble With p=100 features each split sees only 10 random features. A strong feature appears in the subset with 10% probability, so most trees are forced to find alternative patterns.

The two sources of randomness in Random Forest create **double diversity**: 1. bootstrap - each tree sees different observations 2. feature sampling - each split sees different features. The combination of these two randomizations is what makes Random Forest one of the most reliable out-of-the-box algorithms - it rarely performs badly, even without hyperparameter tuning.

What is the key thing that distinguishes Random Forest from plain bagging of decision trees?

OOB-score: free validation

Recall: with bootstrap each tree trains on ~63.2% of the data. The remaining ~36.8% - **Out-of-Bag (OOB)** observations - the tree never saw. This is a natural test set! For each observation x_i we can collect predictions only from those trees whose bootstrap samples did NOT include x_i, and compute accuracy over those predictions. That is the **OOB-score**.

Each observation ends up in OOB for an average of **N/e = N * 0.368** trees. With 500 trees each observation is evaluated by ~184 trees that never saw it. That is enough for a reliable estimate. Research shows that **OOB-score is virtually identical to cross-validation** (difference is typically < 1%), but is computed for free - without additional training.

**When OOB-score is especially useful:** - **Large datasets** - cross-validation on millions of observations is expensive (5-fold = train 5 models), OOB-score is free - **Fast hyperparameter tuning** - compare n_estimators, max_depth, max_features by OOB without splitting into train/test - **Monitoring** - OOB-score grows with the number of trees and stabilizes, showing when adding more trees is pointless

**OOB-score is not a replacement for a test set in final evaluation.** OOB-score is excellent for hyperparameter tuning and monitoring, but final model evaluation still requires a separate held-out test set that was not involved in training or parameter selection. Otherwise there is a risk of information leakage through hyperparameter choices.

Why is OOB-score considered "free" validation?

Feature Importance

Random Forest not only classifies, but also shows **which features matter** for prediction. This is critically important in practice: a client wants to know not just "the customer will churn", but **why**. There are two main methods: **MDI (Mean Decrease Impurity)** - built into sklearn, and **Permutation Importance** - more reliable but slower.

**MDI (Mean Decrease Impurity):** for each feature we sum how much it decreased Gini impurity (or entropy) across all splits, across all trees. A feature that is frequently used for splits and strongly decreases impurity is considered important. Values are normalized so the sum = 1. This method is fast (computed during training), but has a serious drawback.

**MDI problem:** it is biased toward **high-cardinality features** - those with many unique values. A user ID (100,000 unique values) gets a high MDI even though it has no predictive power. Date, random noise with large spread - also inflated. For reliable conclusions, **Permutation Importance** is used.

**Permutation Importance:** take a trained model, randomly **shuffle** the values of one feature (destroying its relationship with the target) and measure how much the **accuracy drops**. If the feature is important - accuracy drops a lot. If not - it stays the same. This method does not depend on cardinality and works with any model, not just Random Forest.

**Random Forest: final checklist** **Pros:** - Works out of the box - few hyperparameters, defaults are almost always good - Robust to outliers and noise - Does not require feature normalization - Parallelism: n_jobs=-1 uses all CPU cores - OOB-score - free validation - Feature importance - data understanding **Cons:** - Slower than a single tree (training and prediction) - Less interpretable than a single tree (can't draw it) - Does not extrapolate: if prices were 100-500 in training, the model won't predict 1000 - Falls behind gradient boosting (XGBoost, LightGBM) on tabular data in competitions

The more trees in a Random Forest, the higher the risk of overfitting, so n_estimators needs to be limited

Increasing the number of trees in a Random Forest does NOT cause overfitting - OOB-score and test accuracy stabilize (plateau) but never get worse as n_estimators grows

Unlike boosting, trees in Random Forest are trained **independently** and averaged. Mathematically: ensemble variance = rho*v + (1-rho)*v/N, where N is the number of trees. As N grows the second term approaches zero, but the first (rho*v) remains - so quality stabilizes but does not worsen. Limiting n_estimators is only necessary for speed, not for quality.

Why can MDI (Mean Decrease Impurity) give incorrect feature importance estimates?

Key ideas

**Bagging (Bootstrap Aggregating):** train N trees on bootstrap samples (with replacement, ~63.2% unique), aggregate by voting/averaging - variance drops, bias stays low
**Feature sampling** - the key difference between RF and bagging: at each split only sqrt(p) random features are available, which decorrelates trees and strengthens the ensemble effect
**OOB-score** - free validation: ~36.8% of data not included in bootstrap serves as a test set for each tree, the result is comparable to cross-validation
**Feature importance:** MDI (fast, but biased toward high-cardinality features) vs Permutation Importance (reliable, but slower) - this is exactly why Random Forest is valued in banking and medicine, where the model must both predict and explain the decision, as discussed at the start

Вопросы для размышления

Random Forest cannot extrapolate - it only predicts within the range of values it saw during training. How does this limitation affect time series forecasting tasks (stock prices, temperature, sales)?
If OOB-score roughly equals cross-validation score, why use cross-validation at all? In which situations can OOB-score be unreliable?
Random Forest works well out of the box, while Gradient Boosting requires careful tuning but is usually more accurate. Which approach would you choose for the first model prototype and why?

Связанные уроки

ml-11-decision-trees — Random Forest is a bagging ensemble of decision trees
ml-21-bagging-boosting — Random Forest is specialised bagging with feature subsampling
ml-22-gradient-boosting — Gradient Boosting builds trees sequentially (vs in parallel for RF)
alg-13-dfs