Machine Learning

Decision Trees

Three lineages of the decision tree

Modern decision trees grew from three separate roots. In 1963 the social scientists James Morgan and John Sonquist built AID (Automatic Interaction Detection), an early program that split survey data into subgroups to study social patterns. In 1984 the statisticians Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone published CART (Classification and Regression Trees), giving the field a rigorous statistical foundation with binary splits and cost-complexity pruning. In parallel, the computer scientist Ross Quinlan came from the AI side: his ID3 algorithm (1986) used information gain to choose splits, and his 1993 successor C4.5 added handling of continuous features, missing values, and pruning. Together these lines define how trees are grown today.

When a doctor makes a diagnosis, they ask a series of questions: is there a fever? If yes - above 38? Is there a cough? Dry or wet? Each answer narrows the list of possible diagnoses. Decision trees work exactly the same way - they split data with a series of questions, from general to specific. But how does the algorithm decide which question to ask first? Why is asking about temperature more useful than asking about eye color? The mathematics behind this is Information Gain and Gini Impurity - metrics that measure how much each question reduces uncertainty.

  • **Bank credit scoring:** decision trees at major banks determine whether to approve a loan - and unlike neural networks, they can explain why: "rejected because income < 30k and late payments > 2 in a year". Regulators require exactly this kind of transparency
  • **Medical diagnosis:** a tree classifier in the emergency room triages patients by urgency: if pulse > 120 and blood pressure < 90 - immediate care. Simplicity of interpretation saves lives when there is no time to wait for a complex model

0

1

Sign In

  • **Kaggle and production ML:** XGBoost and LightGBM - ensembles of thousands of decision trees - win most competitions on tabular data and are used in recommendation systems at Uber, Airbnb, Netflix
  • Предварительные знания

    • Metrics and Model Evaluation

    Entropy: a measure of uncertainty

    Consider two bags of balls. The first has 10 red balls. The second has 5 red and 5 blue. From which bag is the color of a randomly drawn ball more predictable? The first one - there is no uncertainty there. The second bag is maximally unpredictable. This exact idea is formalized by **entropy** - a measure of "disorder" or uncertainty in a dataset.

    Claude Shannon in 1948 introduced the entropy formula in information theory: **Entropy(S) = -sum(p_i * log2(p_i))**, where p_i is the fraction of elements of class i. This formula measures the average number of bits needed to encode a random outcome. The more "surprise" each outcome carries - the higher the entropy.

    **Why log2, not ln?** Because entropy is measured in **bits** - the minimal units of information. Entropy = 1 means exactly 1 bit is needed to encode one outcome (like one yes/no answer). Entropy = 0 means the outcome is fully predictable - zero bits are needed.

    For decision trees, entropy is a tool for evaluating the "quality" of a node. The tree tries to create splits such that child nodes are as **pure** as possible (entropy close to zero). If after splitting on a feature one child node contains only "yes" and the other only "no" - that is a perfect split with zero entropy in both children.

    **Edge case:** when p_i = 0, the formula contains log2(0) = -infinity. By convention 0 * log2(0) = 0, because lim(x -> 0) x * log2(x) = 0. In code this is handled by the check `if count == 0: continue`.

    A dataset has 8 examples of class A and 2 examples of class B. Which statement about the entropy of this dataset is correct?

    Information Gain: choosing the best question

    Now we have a way to measure "disorder" in data. The next question is: **which feature is best to split on?** Information Gain (IG) is the difference between entropy before the split and the weighted average entropy after. The decision tree at each step picks the feature with the **maximum** Information Gain - the one that reduces uncertainty the most.

    The classic example: deciding whether to play tennis based on the weather. The dataset contains 14 observations: 9 times played (Yes), 5 times didn't (No). Which feature best separates the data - Outlook (weather), Temperature, Humidity, or Wind?

    **Tree-building algorithm variants:** - **ID3** (Quinlan, 1986) - uses Information Gain, works only with categorical features - **C4.5** (Quinlan, 1993) - uses Gain Ratio (normalized IG), handles numeric features and missing values - **CART** (Breiman, 1984) - uses Gini Impurity, builds only binary trees; sklearn uses exactly CART

    Recursive splitting continues until one of the stopping conditions is met: all examples in a node belong to one class (pure leaf), no features remain for splitting, or a specified maximum tree depth is reached. Each path from root to leaf is a **rule**: for example, "If Outlook = Sunny AND Humidity = Normal, then play".

    Information Gain has a weakness: it **favors features with many unique values**. Consider a feature "customer ID" - each customer has a unique value, splitting by it will produce perfectly pure leaves (one example each). IG will be maximal, but such a tree is useless. That is why C4.5 uses **Gain Ratio** - Information Gain divided by the intrinsic entropy of the feature.

    A decision tree is choosing a feature to split on at a node. Feature A gives IG = 0.35, feature B gives IG = 0.12, feature C gives IG = 0.28. Which feature will the algorithm choose?

    Gini Impurity: an alternative to entropy

    Information Gain is based on entropy and the logarithm operation. But there is an alternative metric that is simpler to compute and gives nearly identical results: **Gini Impurity**. Formula: **Gini(S) = 1 - sum(p_i^2)**, where p_i is the fraction of class i in the node. Gini Impurity measures the probability of **incorrectly** classifying a randomly chosen element if we assign it a class randomly, proportional to the class distribution in the node.

    **Why does sklearn use Gini by default?** The CART (Classification And Regression Trees) algorithm implemented in sklearn historically uses Gini. Gini does not require computing a logarithm - only multiplication and addition - making it slightly faster. In practice, trees built with Gini and with Entropy agree in **99% of cases**. The difference is only noticeable on very specific data.

    **Regression Trees: MSE instead of Gini/Entropy** Decision trees work not only for classification. For regression tasks (predicting a number) a different split metric is used: - **MSE (Mean Squared Error):** mean squared error in the node - Prediction at a leaf = mean value of all examples in that leaf - The tree chooses the split that minimizes the total MSE of the children In sklearn: `DecisionTreeRegressor(criterion='squared_error')`

    To summarize: Gini Impurity and Entropy are two ways to measure the same thing - how "impure" a tree node is. Gini is computationally simpler, Entropy has an information-theoretic interpretation. In practice, the choice between them is a matter of preference, not model quality. Much more important is correctly tuning **tree depth and other constraints** - and that is the topic of the next concept.

    A tree node contains 6 examples of class A and 4 examples of class B. What is the Gini Impurity?

    Pruning: fighting overfitting

    If a decision tree is left unconstrained, it will grow until every leaf is pure - one example per leaf. Such a tree memorizes the training set with **100% accuracy**, but performs terribly on new data. This is classic **overfitting** - the model learned noise instead of patterns. A tree of depth 30 for 1000 examples will create a unique "rule" for every training example, including anomalies and data errors.

    There are two approaches to pruning. **Pre-pruning** (early stopping) - restrict tree growth *before* fully building it. **Post-pruning** - first grow the full tree, then cut branches that do not improve quality on the validation set.

    **Advantages of decision trees:** - **Interpretability** - a tree can be visualized and explained to stakeholders: "the client gets the loan because income > 50k AND tenure > 2 years" - **No normalization needed** - trees work with raw features, no scaling required - **Work with categorical features** - no one-hot encoding needed (for CART in sklearn it is still required, but ID3/C4.5 work directly) - **Feature importance for free** - the tree automatically ranks features by usefulness

    **Disadvantages of decision trees:** - **Instability** - a small change in data (remove 1 example) can completely restructure the tree. This is a consequence of the greedy algorithm: changing the root split changes the entire tree - **Axis-aligned splits** - the tree divides space only parallel to axes (x1 < 5, x2 < 3). Diagonal boundaries are approximated by a "staircase" of many splits - **Bias toward features with many values** - Information Gain prefers features with many unique values Exactly because of instability, **Random Forest** (ensemble of random trees) and **Gradient Boosting** were invented - they solve the main problems of a single tree.

    Decision trees are a weak algorithm not worth studying, since neural networks are more accurate

    Decision trees are the foundation of the most powerful ensemble methods (Random Forest, XGBoost, LightGBM), which win most competitions on tabular data

    A single tree does fall short of neural networks. But an ensemble of hundreds of trees (Random Forest) or a sequence of trees correcting each other's mistakes (Gradient Boosting) sets the bar for tabular data. XGBoost and LightGBM dominate Kaggle precisely because they use decision trees as the basic building block

    An unconstrained decision tree shows 100% accuracy on train and 58% on test. Which parameter will help THE MOST?

    Key ideas

    • **Entropy** = -sum(p_i * log2(p_i)) measures uncertainty in a node: 0 = pure node, 1 = maximum chaos (for two classes)
    • **Information Gain** = Entropy(parent) - Weighted_Avg(Entropy(children)) - the tree greedily picks the feature with the maximum IG so each question maximally narrows the uncertainty
    • **Gini Impurity** = 1 - sum(p_i^2) - computationally simpler than entropy and gives nearly the same trees; sklearn uses Gini by default (CART)
    • **Pruning** - the key protection against overfitting: max_depth, min_samples_leaf (pre-pruning) or ccp_alpha (post-pruning) prevent the tree from memorizing every training example
    • Decision trees are interpretable and need no normalization, but are unstable - which is exactly why a doctor makes a diagnosis with a series of questions, not one, and tree ensembles (Random Forest, XGBoost) combine hundreds of trees for stability and accuracy

    Related topics

    Decision trees are the foundation for ensemble methods, and understanding them is connected to evaluation metrics and other classification algorithms:

    • Model Evaluation Metrics — Accuracy, Precision, Recall, F1 - metrics we use to evaluate the tree and tune pruning hyperparameters
    • Random Forest — An ensemble of hundreds of random trees - solves the instability problem of a single tree through bagging and random feature selection
    • Gradient Boosting — A sequence of trees where each subsequent one corrects the errors of the previous - XGBoost, LightGBM, CatBoost are based on this idea
    • Naive Bayes — An alternative approach to classification via Bayes' theorem - uses probabilities instead of split trees

    Вопросы для размышления

    • A decision tree greedily picks the best feature at each step. Can it happen that a "bad" first split leads to a better tree overall? Why don't trees try all possible combinations?
    • Feature importance shows that one feature is responsible for 80% of the tree's decisions. Does that mean the other features are useless and can be removed?
    • Why in tasks where interpretability matters (medicine, finance, law) are decision trees sometimes preferred over more accurate models like neural networks?

    Связанные уроки

    • ml-12-random-forest — Random Forest is an ensemble of decision trees
    • ml-05-evaluation — Gini impurity and information gain are split-quality metrics
    • it-02 — Information gain = H(Y) - H(Y|X) from information theory
    • prob-04-bayes
    • alg-20-greedy
    Decision Trees