Decision Trees are non-parametric models for classification and regression that partition the feature space into axis-aligned regions via recursive binary splits. A non-parametric model does not assume a fixed functional form or a fixed number of parameters for the relationship between inputs and outputs; instead, it adapts its structure entirely to the training data. Decision Trees do not assume linearity, normality, or any specific distribution, which makes them flexible enough to model highly complex, nonlinear relationships.
At each node, a split is chosen to maximize impurity reduction: for classification, common impurities are Gini G = 1 − Σ pₖ² and entropy H = −Σ pₖ log₂ pₖ; for regression, splits minimize within-node variance (MSE). Trees are powerful for handling heterogeneous feature types, capturing non-linear interactions, and providing interpretable rules. However, they can overfit easily without pruning or constraints and require more data to generalize well than parametrically compact models. Training proceeds greedily; regularization includes limiting depth and post-pruning via cost-complexity (ccp_alpha).
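As a minimal sketch of this workflow (the dataset and hyperparameter values below are illustrative assumptions, not recommendations), a depth-limited tree with light cost-complexity pruning can be fit in scikit-learn:

```python
# Minimal sketch: fit a pre-pruned (max_depth) and post-pruned (ccp_alpha)
# decision tree on a toy dataset. Values chosen for illustration only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# max_depth limits growth up front; ccp_alpha prunes weak subtrees afterward.
clf = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01, random_state=42)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```

Both regularizers can be tuned jointly; in practice `ccp_alpha` is often selected by cross-validation.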
Explain the Technique (Four Levels)
For a 10-year-old
A tree asks simple yes/no questions about your data until it reaches an answer; each question splits the possibilities.
For a beginner student
A decision tree makes rule-based splits on features; each path ends in a leaf with a prediction.
For an intermediate student
Greedy recursive partitioning maximizes impurity reduction; control overfitting with depth limits and cost-complexity pruning.
For an expert
Axis-aligned partitions optimized by information gain; `ccp_alpha` regularizes subtree growth, and ensembles mitigate variance.
When to Use This Technique
Ideal Use Cases
Tabular data with mixed feature types (numeric, categorical)
Need for model interpretability and explainable rules
Non-linear relationships without manual feature engineering
Quick baseline models and exploratory analysis
Small to medium datasets where stability is less critical
What is a Decision Tree? A tree-structured model that makes sequential splits on features to predict labels or values.
What does "non-parametric model" mean? A model that does not assume a fixed functional form or fixed number of parameters; it adapts its structure based entirely on training data without assuming linearity, normality, or specific distributions.
When are Decision Trees most useful? On tabular data with mixed types, non-linear relationships, and a need for interpretability.
What is a node, split, and leaf? A node holds data; a split partitions data by feature/threshold; a leaf is a terminal node with a prediction.
Define impurity in classification. A measure of class mix at a node; lower impurity indicates purer class composition.
What is Gini impurity? G = 1 − Σ pₖ², where pₖ is the class proportion at the node.
What is entropy? H = −Σ pₖ log₂ pₖ; higher entropy means more class uncertainty.
What is information gain? Parent impurity minus the weighted child impurities achieved by a candidate split.
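These definitions can be checked by hand. The sketch below computes Gini, entropy (in bits, matching H = −Σ pₖ log₂ pₖ), and the information gain of one candidate split on a made-up label array:

```python
# Hand-computed impurities and information gain for a toy candidate split.
import numpy as np

def gini(labels):
    # G = 1 - sum(p_k^2) over class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # H = -sum(p_k * log2(p_k)) over class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 4 vs 4: maximally mixed
left, right = parent[:6], parent[6:]           # candidate split: 6 | 2 samples

# Information gain = parent impurity - weighted child impurities.
gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
```

Here the parent has entropy 1.0 bit and Gini 0.5; the split's pure right child drives a positive gain.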
What criterion is used for regression trees? Minimizing mean squared error (MSE) or variance within nodes.
How do trees handle numeric vs categorical features? Numeric: threshold splits; categorical: typically one-hot encode and split on the encoded columns.
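One common way to wire this up is a ColumnTransformer that one-hot encodes the categorical column before the tree; the column names and data below are made up for illustration:

```python
# Sketch: one-hot encode a categorical feature, pass numerics through,
# then fit a tree. Data and column names are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 40],
    "city": ["NY", "SF", "NY", "LA", "SF", "LA"],
})
y = [0, 1, 0, 1, 1, 0]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])],
    remainder="passthrough",  # keep numeric columns unchanged
)
model = Pipeline([("pre", pre), ("tree", DecisionTreeClassifier(random_state=0))])
model.fit(df, y)
preds = model.predict(df)
```

`handle_unknown="ignore"` keeps prediction from failing on categories unseen during training.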
How to deal with missing values? Impute with SimpleImputer; optionally add missing indicators.
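A small sketch of that idea, on an illustrative array with NaNs: SimpleImputer fills each column with its median and, with `add_indicator=True`, appends binary columns flagging where values were originally missing.

```python
# Sketch: median imputation plus missing-indicator columns (toy data).
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0]])

imp = SimpleImputer(strategy="median", add_indicator=True)
X_filled = imp.fit_transform(X)
# First two columns: imputed features; last two: 0/1 missingness flags.
```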
Key hyperparameters to control overfitting? max_depth, min_samples_split, min_samples_leaf, max_features, ccp_alpha.
What does ccp_alpha do? Controls cost-complexity pruning; larger values remove more weak subtrees to reduce overfitting.
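Rather than guessing a value, scikit-learn exposes the full pruning path via `cost_complexity_pruning_path`. A sketch (the dataset and the mid-range alpha choice are illustrative; in practice each candidate alpha would be cross-validated):

```python
# Sketch: enumerate effective ccp_alpha values, then refit with one of them
# and confirm the pruned tree is smaller than the fully grown tree.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-range pick
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
```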
What is pre-pruning vs post-pruning? Pre-pruning limits growth via hyperparameters; post-pruning removes branches after full growth using a complexity penalty.
What is the bias–variance trade-off in trees? Deeper trees reduce bias but increase variance; pruning mitigates variance.
Which metrics are suited for classification? Accuracy, precision, recall, F1, ROC-AUC.
Which metrics are suited for regression? RMSE, MAE, and R².
How to evaluate reliably? Use stratified cross-validation for classification; KFold or TimeSeriesSplit when appropriate.
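For the classification case, a stratified 5-fold evaluation can be sketched as follows (dataset and fold count are illustrative):

```python
# Sketch: stratified 5-fold cross-validation of a decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
mean_acc = scores.mean()  # average accuracy across the 5 folds
```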
How to visualize a tree? Use plot_tree or export_text.
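export_text gives a plain-text rule dump, handy when no plotting backend is available (plot_tree is the matplotlib counterpart). A small sketch on an illustrative dataset:

```python
# Sketch: print the learned if/then rules of a shallow tree as text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(data.data, data.target)
rules = export_text(clf, feature_names=list(data.feature_names))
print(rules)  # nested "feature <= threshold" lines ending in "class: k" leaves
```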
How to interpret feature importance? Use impurity-based or permutation importance.
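The two notions differ: impurity-based importances come from training-time splits (and can favor high-cardinality features), while permutation importance measures the score drop on held-out data when a feature is shuffled. A sketch on an illustrative dataset:

```python
# Sketch: compare impurity-based and permutation feature importances.
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

impurity_imp = clf.feature_importances_          # from training-time splits
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
perm_imp = perm.importances_mean                 # held-out score drop per feature
```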
Strategy for class imbalance? Use stratified splits, class_weight='balanced', and metrics beyond accuracy.
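Those three pieces fit together as below, on a synthetic imbalanced dataset (the 90/10 split and depth are illustrative): stratify the split, reweight classes, and score with F1 rather than accuracy.

```python
# Sketch: imbalance handling via stratified split, balanced class weights,
# and F1 as the evaluation metric. Synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights samples inversely to class frequency.
clf = DecisionTreeClassifier(class_weight="balanced", max_depth=4, random_state=0)
clf.fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))  # focuses on the minority class
```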