Bagging and Random Forests — Rapid Q&A Refresher (DSBA)

Date: December 21, 2025 · Author: P Baburaj Ambalam
Version 2.0 · Last updated: December 21, 2025

Technique Description

Bagging (Bootstrap Aggregating) reduces variance by training many base learners on bootstrap-resampled datasets and averaging predictions (or majority voting). It works best for unstable learners like decision trees. Random Forests extend bagging by sub-sampling features at each split (max_features), decorrelating trees and enhancing variance reduction. OOB (out-of-bag) samples give internal generalization estimates.

Explain the Technique (Four Levels)

For a 10-year-old

Many small tree models each make a guess, then they vote together for a better answer.

For a beginner student

Bagging trains many models on bootstrap samples and averages them; Random Forest adds random feature selection per split to decorrelate trees.

For an intermediate student

Averaging reduces variance; `max_features` decorrelates trees and out-of-bag samples provide internal validation.

For an expert

Bootstrap aggregation for unstable learners; RF’s mtry lowers correlation among base learners. Beware the cardinality bias of impurity-based importance and prefer permutation importance.
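The point about importance bias can be made concrete with `sklearn.inspection.permutation_importance`. A hedged sketch (dataset, split, and parameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Permutation importance: shuffle one feature at a time on held-out
# data and measure the drop in score. Unlike impurity importance, it
# is computed on unseen data and is not inflated by high-cardinality
# features.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
top5 = result.importances_mean.argsort()[::-1][:5]
print("Top-5 feature indices:", top5)
```

`result.importances_mean` has one entry per feature; computing it on the test split (rather than training data) is what makes it a more honest ranking.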

When to Use This Technique

Ideal Use Cases

  - Tabular data where a strong baseline is needed with minimal tuning.
  - High-variance base learners (e.g., deep decision trees) that benefit from averaging.
  - Settings where an internal generalization estimate (OOB) is useful without a separate validation set.

Python Example

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=300, max_depth=None, max_features='sqrt', n_jobs=-1, random_state=42, oob_score=True)
rf.fit(X_train, y_train)

# OOB estimate from bootstrap hold-outs, plus a stratified CV check
print(f"OOB score: {rf.oob_score_:.3f}")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(f"CV accuracy: {cross_val_score(rf, X_train, y_train, cv=cv, n_jobs=-1).mean():.3f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")

Quiz (15)

  1. What does bagging primarily reduce?
  2. Why use bootstrap sampling?
  3. What are OOB samples used for?
  4. Why feature subsampling in Random Forests?
  5. What is the default max_features for classification in scikit-learn RF?
  6. How do you decide n_estimators?
  7. Which hyperparameters curb overfitting in RF?
  8. How do you handle class imbalance in RF training?
  9. Which importance method is more reliable than impurity importance?
  10. Does RF require heavy feature scaling?
  11. Why set random_state in RF?
  12. What does oob_score_ report for regression?
  13. When should you limit max_depth?
  14. Do RF regressors extrapolate beyond the training range?
  15. When to prefer RF over a single tree?

Practical Checklist

Common Implementation Errors (10)

Quiz Answers

  1. Variance.
  2. To create diverse training sets for each learner.
  3. Estimating generalization error internally.
  4. To decorrelate trees and strengthen variance reduction.
  5. sqrt(n_features).
  6. Increase until performance plateaus and variance stabilizes.
  7. Limit depth, raise min_samples_leaf, adjust max_features, and use bootstrap with OOB checks.
  8. Use stratified splits, class_weight='balanced', and appropriate metrics.
  9. Permutation importance.
  10. No; trees split on feature thresholds, so scaling is unnecessary (it matters only for distance-sensitive models).
  11. For reproducible sampling and feature subsampling.
  12. The R^2 (or chosen metric) computed on OOB samples.
  13. When overfitting appears or for interpretability constraints.
  14. No; predictions are averages of training-set targets, so they cannot extend beyond the observed range.
  15. When you need stronger accuracy/stability with minimal tuning on tabular data.
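Answer 6 above ("increase until performance plateaus") can be sketched with `warm_start=True`, which adds trees to an existing forest instead of refitting from scratch. A minimal sketch; the tree counts are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=42)
scores = {}
for n in (25, 50, 100, 200):
    # With warm_start=True, raising n_estimators keeps the trees
    # already fitted and only trains the additional ones.
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    scores[n] = rf.oob_score_
    print(f"n_estimators={n:>3}  OOB score={rf.oob_score_:.3f}")
```

Once the OOB score stops improving between steps, adding more trees mostly costs compute without reducing variance further.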

References

  1. scikit-learn User Guide — Ensemble methods
  2. scikit-learn API — RandomForestClassifier
  3. Breiman — Random Forests (2001)
  4. Breiman — Bagging Predictors (1996)