Machine Learning

Decision Trees

Three lineages of the decision tree

Modern decision trees grew from three separate roots. In 1963 the social scientists James Morgan and John Sonquist built AID (Automatic Interaction Detection), an early program that split survey data into subgroups to study social patterns. In 1984 the statisticians Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone published CART (Classification and Regression Trees), giving the field a rigorous statistical foundation with binary splits and cost-complexity pruning. In parallel, the computer scientist Ross Quinlan came from the AI side: his ID3 algorithm (1986) used information gain to choose splits, and his 1993 successor C4.5 added handling of continuous features, missing values, and pruning. Together these lines define how trees are grown today.

When a doctor makes a diagnosis, they ask a series of questions: Is there a fever? If so, is it above 38 °C? Is there a cough? Dry or wet? Each answer narrows the list of possible diagnoses. Decision trees work in much the same way: they split the data with a series of questions, moving from general to specific. But how does the algorithm decide which question to ask first? Why is asking about temperature more useful than asking about eye color? The answer lies in Information Gain and Gini Impurity, metrics that measure how much each question reduces uncertainty.
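To make the temperature versus eye color comparison concrete, here is a minimal sketch that computes entropy, Gini impurity, and information gain by hand. The `patients` table and the function names are hypothetical and chosen only for illustration; the formulas are the standard definitions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: chance that two random draws have different labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, feature, target="diagnosis"):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    parent = [row[target] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - children


# Hypothetical toy data: fever separates the classes, eye color does not.
patients = [
    {"fever": "yes", "eyes": "brown", "diagnosis": "flu"},
    {"fever": "yes", "eyes": "blue", "diagnosis": "flu"},
    {"fever": "yes", "eyes": "brown", "diagnosis": "flu"},
    {"fever": "yes", "eyes": "blue", "diagnosis": "flu"},
    {"fever": "no", "eyes": "brown", "diagnosis": "healthy"},
    {"fever": "no", "eyes": "blue", "diagnosis": "healthy"},
    {"fever": "no", "eyes": "brown", "diagnosis": "healthy"},
    {"fever": "no", "eyes": "blue", "diagnosis": "healthy"},
]

gain_fever = information_gain(patients, "fever")
gain_eyes = information_gain(patients, "eyes")
print(f"gain(fever) = {gain_fever:.3f}, gain(eyes) = {gain_eyes:.3f}")
```

On this toy table the fever question removes all uncertainty (a gain of one bit), while the eye color question removes none, so a greedy tree builder would ask about fever first.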

- **Bank credit scoring:** banks use decision trees to help decide whether to approve a loan, and unlike neural networks, a tree can explain why: "rejected because income < 30k and late payments > 2 in a year". Regulators often require exactly this kind of transparency.
- **Medical diagnosis:** a tree classifier in the emergency room triages patients by urgency: if pulse > 120 and blood pressure < 90, the patient gets immediate care. A model this easy to interpret is valuable when there is no time to wait for a complex one.
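The explanation a tree gives is simply the path from the root to a leaf. The sketch below hand-codes the triage rule quoted above as a nested dictionary and returns both the decision and the questions that led to it. The thresholds come from the text; the dictionary layout, the `systolic_bp` field name, the "routine" label, and the choice to send a value exactly on a threshold down the "low" branch are all assumptions made for illustration.

```python
# A hand-written tree for the triage rule quoted above. The thresholds come
# from the text; the node layout and the "routine" label are hypothetical.
TRIAGE_TREE = {
    "feature": "pulse", "threshold": 120,
    "low": {"label": "routine"},
    "high": {
        "feature": "systolic_bp", "threshold": 90,
        "low": {"label": "immediate care"},
        "high": {"label": "routine"},
    },
}


def decide(tree, patient):
    """Walk from the root to a leaf, recording each answered question."""
    path = []
    node = tree
    while "label" not in node:
        value = patient[node["feature"]]
        if value > node["threshold"]:
            path.append(f"{node['feature']} > {node['threshold']}")
            node = node["high"]
        else:
            path.append(f"{node['feature']} <= {node['threshold']}")
            node = node["low"]
    return node["label"], path


label, path = decide(TRIAGE_TREE, {"pulse": 135, "systolic_bp": 85})
print(f"{label}: because {' and '.join(path)}")
```

The recorded path is what lets a tree justify a decision in the same form as the "rejected because ..." message in the credit scoring example.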

Pruning: fighting overfitting

If a decision tree is left unconstrained, it keeps growing until every leaf is pure, in the extreme case with a single example per leaf. Such a tree memorizes the training set with **100% accuracy** but can perform poorly on new data. This is classic **overfitting**: the model has learned noise instead of patterns. A tree of depth 30 grown on 1,000 examples can end up with a unique "rule" for almost every training example, including anomalies and data errors.

There are two approaches to pruning. **Pre-pruning** (early stopping) restricts tree growth *before* the tree is fully built, for example by capping its depth or requiring a minimum number of examples per leaf. **Post-pruning** first grows the full tree and then cuts branches that do not improve quality on a validation set.
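Both approaches can be tried side by side in scikit-learn. The sketch below is one possible setup, not a recommended recipe: the synthetic dataset, the `max_depth=4` and `min_samples_leaf=10` limits, and the validation split are assumptions. Post-pruning here uses scikit-learn's cost-complexity pruning (`ccp_alpha`), with the pruning strength chosen on the validation set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (flip_y) so a full tree has noise to memorize.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

# Baseline: no constraints, grows until the leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: cap growth up front (these limits are illustrative guesses).
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                             random_state=0).fit(X_train, y_train)

# Post-pruning: take the candidate alphas from the full tree's pruning path,
# refit once per alpha, and keep the tree that scores best on validation.
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
post, best_score = full, full.score(X_val, y_val)
for alpha in alphas:
    # Clip at zero: rounding can yield tiny negative alphas, which are rejected.
    candidate = DecisionTreeClassifier(ccp_alpha=max(float(alpha), 0.0),
                                       random_state=0).fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        post, best_score = candidate, score

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:12s} leaves={model.get_n_leaves():4d} "
          f"train={model.score(X_train, y_train):.2f} "
          f"val={model.score(X_val, y_val):.2f}")
```

Comparing the train and validation columns shows how much of the full tree's training accuracy is memorization. In a real project the final tree should be scored on a separate test set, since the validation set was used to pick the pruning strength.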

**Advantages of decision trees:**

- **Interpretability:** a tree can be visualized and explained to stakeholders: "the client gets the loan because income > 50k AND tenure > 2 years".
- **No normalization needed:** trees work with raw features, so no scaling is required.
- **Native handling of categorical features:** ID3 and C4.5 split on categories directly, with no one-hot encoding. The CART implementation in sklearn still requires categorical features to be encoded.
- **Feature importance for free:** the tree automatically ranks features by usefulness.

**Disadvantages of decision trees:**

- **Instability:** a small change in the data (removing a single example) can completely restructure the tree. This is a consequence of the greedy algorithm: changing the root split changes the entire tree.
- **Axis-aligned splits:** the tree divides the space only parallel to the axes (x1 < 5, x2 < 3). Diagonal boundaries are approximated by a "staircase" of many splits.
- **Bias toward features with many values:** Information Gain prefers features with many unique values.

It was precisely because of this instability that **Random Forest** (an ensemble of random trees) and **Gradient Boosting** were invented: they solve the main problems of a single tree.
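The bias toward many-valued features listed above is easy to reproduce. In the sketch below, a unique client identifier carries no predictive signal, yet it scores the maximum possible information gain because every group it creates contains exactly one example. The loan data and the column names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after


# Hypothetical loan data: eight clients, four approved and four rejected.
approved = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
high_income = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]  # real signal
client_id = list(range(8))  # unique per row, useless on new clients

gain_income = information_gain(high_income, approved)
gain_id = information_gain(client_id, approved)
print(f"gain(high_income) = {gain_income:.3f}")
print(f"gain(client_id)   = {gain_id:.3f}")
```

A tree that picks splits purely by information gain would choose `client_id` here, which is a rule that cannot generalize to any client outside the training set.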

**Myth:** Decision trees are a weak algorithm not worth studying, since neural networks are more accurate.

**Reality:** Decision trees are the foundation of the most powerful ensemble methods (Random Forest, XGBoost, LightGBM), which win many competitions on tabular data.

A single tree often does fall short of neural networks. But an ensemble of hundreds of trees (Random Forest) or a sequence of trees correcting each other's mistakes (Gradient Boosting) sets the bar for tabular data. XGBoost and LightGBM are so popular on Kaggle precisely because they use decision trees as the basic building block.
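The idea of "a sequence of trees correcting each other's mistakes" can be shown in a few lines. The following is a deliberately simplified sketch of gradient boosting for squared error, using one-split trees (stumps) on made-up 1-D data. It is not how XGBoost or LightGBM are implemented; the data, the learning rate of 0.5, and the 20 rounds are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Return the one-split regression tree (threshold, left mean, right mean)
    that minimizes squared error on the residuals."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    """Predict with a single stump."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical 1-D data with three levels, which a single split cannot fit.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 4.9, 5.1]

learning_rate = 0.5                      # illustrative value
preds = [sum(ys) / len(ys)] * len(xs)    # start from the mean prediction
history = [mse(ys, preds)]
stumps = []
for _ in range(20):
    residuals = [y - p for y, p in zip(ys, preds)]  # current mistakes
    stump = fit_stump(xs, residuals)                # next tree fits them
    stumps.append(stump)
    preds = [p + learning_rate * predict_stump(stump, x)
             for p, x in zip(preds, xs)]
    history.append(mse(ys, preds))

print(f"MSE with 0 trees: {history[0]:.3f}, with 20 trees: {history[-1]:.4f}")
```

Each stump on its own is a very weak model, but because every new stump is fitted to what the previous ones got wrong, the training error in `history` shrinks round after round.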

An unconstrained decision tree shows 100% accuracy on train and 58% on test. Which parameter will help THE MOST?

Prerequisites

Entropy: a measure of uncertainty

Information Gain: choosing the best question

Gini Impurity: an alternative to entropy

Pruning: fighting overfitting

Key ideas

Related topics

Questions for reflection

Related lessons