Feature Engineering & Cross Validation — Rapid Q&A Refresher (DSBA)

Date: December 20, 2025 · Author: P Baburaj Ambalam
Version 2.0 · Last updated: December 21, 2025

Technique Description

Feature Engineering transforms raw data into informative representations that improve model performance and robustness. Core techniques include scaling numeric variables (StandardScaler, MinMaxScaler), encoding categorical data (OneHotEncoder, carefully applied OrdinalEncoder), handling missing values (SimpleImputer, IterativeImputer), and constructing domain-specific features (interactions, bins, polynomial terms). Feature selection reduces dimensionality and noise via filter, wrapper (RFE), or embedded methods (SelectFromModel).
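As a hedged sketch of the embedded selection approach mentioned above, SelectFromModel can prune low-importance features using a fitted estimator's importances. The dataset below is synthetic and the threshold is scikit-learn's default (mean importance), so the exact number of surviving columns will vary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only 3 carry signal
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Embedded selection: keep features whose importance exceeds the mean
selector = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # fewer columns than the original 10
```

In practice the selector would sit inside a Pipeline so it is refit on each training fold, consistent with the leakage guidance below.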

Cross Validation (CV) estimates generalization by partitioning data into multiple train/validation splits tailored to data characteristics: StratifiedKFold for imbalanced classification, GroupKFold to respect group boundaries, and TimeSeriesSplit for temporal ordering. Proper CV prevents leakage by fitting preprocessing only on training folds, typically enforced via Pipeline and ColumnTransformer. Metric choice should reflect business objectives.
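A minimal sketch of matching the splitter to the data characteristics; the toy arrays below (balanced labels, four subjects with repeated measures) are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)                    # balanced labels for stratification
groups = np.repeat([0, 1, 2, 3], 3)         # e.g. repeated measures per subject

skf = StratifiedKFold(n_splits=3)           # preserves class ratios per fold
gkf = GroupKFold(n_splits=4)                # a group never spans train and val
tss = TimeSeriesSplit(n_splits=3)           # train always precedes validation

for train_idx, val_idx in tss.split(X):
    assert train_idx.max() < val_idx.min()  # no lookahead bias
```

Each splitter plugs into cross_val_score or GridSearchCV via the cv argument.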

Explain the Technique (Four Levels)

For a 10-year-old

We tidy and translate data so the computer understands it, and we test fairly by slicing data into parts.

For a beginner student

Scale numbers, encode categories, and fill missing values; cross-validation splits data into folds to estimate performance reliably.

For an intermediate student

Use Pipelines and ColumnTransformer to avoid leakage by fitting transforms only on training folds; pick Stratified/Group/TimeSeries CV appropriately.

For an expert

Compose per-column pipelines with tunable hyperparameters, audit for leakage, choose metrics aligned with the business objective, and match the CV strategy to data dependencies for trustworthy estimates.

When to Use This Technique

Ideal Use Cases

Avoid When

Related Techniques

ML Pipelines (combines preprocessing and models)
Decision Trees (benefits from feature engineering)
Random Forests (less sensitive to scaling but needs encoding)

Q&A

Python Example

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Sample data (10 rows with balanced classes, so 3-fold stratified CV is valid;
# with fewer members per class than folds, scikit-learn raises a ValueError)
df = pd.DataFrame({
    'age': [25, np.nan, 35, 40, 28, 52, np.nan, 31, 45, 38],
    'income': [50000, 60000, 75000, 80000, 55000,
               90000, 62000, 58000, 85000, 70000],
    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA', 'SF', 'NYC', 'LA', 'SF', 'NYC'],
    'target': [0, 1, 1, 0, 1, 0, 1, 0, 1, 0]
})

X = df.drop('target', axis=1)
y = df['target']

# Define preprocessing for numeric and categorical columns
numeric_features = ['age', 'income']
categorical_features = ['city']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create full pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Cross-validation (no leakage: transforms are fit only on each training fold)
scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")

# Fit on all data
clf.fit(X, y)

Quiz (15)

  1. What is data leakage?
  2. When should you use StratifiedKFold?
  3. How do you handle unseen categories in one-hot encoding?
  4. Why wrap preprocessing and model in a Pipeline?
  5. When is IterativeImputer useful?
  6. What caution applies to target encoding?
  7. Difference between filter and wrapper feature selection?
  8. When should you use GroupKFold?
  9. Why use TimeSeriesSplit for temporal data?
  10. Which metrics suit regression here?
  11. Does scaling affect tree models much?
  12. What does ColumnTransformer enable?
  13. Why run nested cross-validation?
  14. Typical range for number of CV folds?
  15. How can you calibrate predicted probabilities?

Practical Checklist

Quiz Answers

  1. Using validation/test information during preprocessing/modeling, inflating performance.
  2. Imbalanced classification to preserve class ratios in folds.
  3. Set OneHotEncoder handle_unknown='ignore'.
  4. Prevent leakage and ensure consistent transforms across folds and inference.
  5. When multivariate relationships help estimate missing values.
  6. Apply within folds and regularize to avoid leakage/overfit.
  7. Filter is univariate; wrapper searches subsets via a model.
  8. When observations share groups that must not split across train/val.
  9. To respect temporal order and avoid lookahead bias.
  10. RMSE, MAE, R^2.
  11. Little; trees are scale-invariant; scaling matters more for other models.
  12. Different preprocessing per column subsets in one object.
  13. Outer loop estimates generalization; inner loop tunes hyperparameters.
  14. Commonly 5–10 folds depending on data.
  15. CalibratedClassifierCV or other calibration techniques.
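Answer 13's nested scheme can be sketched as follows, again on synthetic data; the parameter grid here is illustrative, not prescriptive:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=120, random_state=0)

# Inner loop: GridSearchCV tunes hyperparameters on each outer training fold
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'max_depth': [2, None]},
    cv=3,
)

# Outer loop: estimates generalization of the entire tuning procedure,
# so the reported score is not biased by the hyperparameter search
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.2f}")
```

The outer mean is the honest performance estimate; refit the inner search on all data only for the final deployed model.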

Common Implementation Errors (10)
