Feature Engineering & Cross Validation — Rapid Q&A Refresher (DSBA)

Date: December 20, 2025 · Author: P Baburaj Ambalam
Version 2.0 · Last updated: December 21, 2025

Technique Description

Feature Engineering transforms raw data into informative representations that improve model performance and robustness. Core techniques include scaling numeric variables (StandardScaler, MinMaxScaler), encoding categorical data (OneHotEncoder, carefully applied OrdinalEncoder), handling missing values (SimpleImputer, IterativeImputer), and constructing domain-specific features (interactions, bins, polynomial terms). Feature selection reduces dimensionality and noise via filter, wrapper (RFE), or embedded methods (SelectFromModel).
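As a hedged sketch of the embedded selection approach mentioned above, SelectFromModel can prune low-importance features using a fitted estimator's importances. The dataset below is synthetic and the threshold is scikit-learn's default (mean importance), so the exact number of surviving columns will vary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only 3 carry signal
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Embedded selection: keep features whose importance exceeds the mean
selector = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # fewer columns than the original 10
```

In practice the selector would sit inside a Pipeline so it is refit on each training fold, consistent with the leakage guidance below.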

Cross Validation (CV) estimates generalization by partitioning data into multiple train/validation splits tailored to data characteristics: StratifiedKFold for imbalanced classification, GroupKFold to respect group boundaries, and TimeSeriesSplit for temporal ordering. Proper CV prevents leakage by fitting preprocessing only on training folds, typically enforced via Pipeline and ColumnTransformer. Metric choice should reflect business objectives.
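A minimal sketch of matching the splitter to the data characteristics; the toy arrays below (balanced labels, four subjects with repeated measures) are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)                    # balanced labels for stratification
groups = np.repeat([0, 1, 2, 3], 3)         # e.g. repeated measures per subject

skf = StratifiedKFold(n_splits=3)           # preserves class ratios per fold
gkf = GroupKFold(n_splits=4)                # a group never spans train and val
tss = TimeSeriesSplit(n_splits=3)           # train always precedes validation

for train_idx, val_idx in tss.split(X):
    assert train_idx.max() < val_idx.min()  # no lookahead bias
```

Each splitter plugs into cross_val_score or GridSearchCV via the cv argument.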

Explain the Technique (Four Levels)

For a 10-year-old

We tidy and translate data so the computer understands it, and we test fairly by slicing data into parts.

For a beginner student

Scale numbers, encode categories, and fill missing values; cross-validation splits data into folds to estimate performance reliably.

For an intermediate student

Use Pipelines and ColumnTransformer to avoid leakage by fitting transforms only on training folds; pick Stratified/Group/TimeSeries CV appropriately.

For an expert

Compose per-column pipelines with tunable hyperparameters, audit for leakage, choose metrics aligned with the business objective, and match the CV strategy to data dependencies for trustworthy estimates.

When to Use This Technique

Ideal Use Cases

Avoid When

Related Techniques

ML Pipelines (combines preprocessing and models)
Decision Trees (benefits from feature engineering)
Random Forests (less sensitive to scaling but needs encoding)

Q&A

Python Example

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Sample data (10 rows with balanced classes, so 3-fold stratified CV is valid;
# with fewer members per class than folds, scikit-learn raises a ValueError)
df = pd.DataFrame({
    'age': [25, np.nan, 35, 40, 28, 52, np.nan, 31, 45, 38],
    'income': [50000, 60000, 75000, 80000, 55000,
               90000, 62000, 58000, 85000, 70000],
    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA', 'SF', 'NYC', 'LA', 'SF', 'NYC'],
    'target': [0, 1, 1, 0, 1, 0, 1, 0, 1, 0]
})

X = df.drop('target', axis=1)
y = df['target']

# Define preprocessing for numeric and categorical columns
numeric_features = ['age', 'income']
categorical_features = ['city']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create full pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Cross-validation (no leakage: transforms are fit only on each training fold)
scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")

# Fit on all data
clf.fit(X, y)

Quiz (15)

  1. What is data leakage?
  2. When should you use StratifiedKFold?
  3. How do you handle unseen categories in one-hot encoding?
  4. Why wrap preprocessing and model in a Pipeline?
  5. When is IterativeImputer useful?
  6. What caution applies to target encoding?
  7. Difference between filter and wrapper feature selection?
  8. When should you use GroupKFold?
  9. Why use TimeSeriesSplit for temporal data?
  10. Which metrics suit regression here?
  11. Does scaling affect tree models much?
  12. What does ColumnTransformer enable?
  13. Why run nested cross-validation?
  14. Typical range for number of CV folds?
  15. How can you calibrate predicted probabilities?

Practical Checklist

Quiz Answers

  1. Using validation/test information during preprocessing/modeling, inflating performance.
  2. Imbalanced classification to preserve class ratios in folds.
  3. Set OneHotEncoder handle_unknown='ignore'.
  4. Prevent leakage and ensure consistent transforms across folds and inference.
  5. When multivariate relationships help estimate missing values.
  6. Apply within folds and regularize to avoid leakage/overfit.
  7. Filter is univariate; wrapper searches subsets via a model.
  8. When observations share groups that must not split across train/val.
  9. To respect temporal order and avoid lookahead bias.
  10. RMSE, MAE, R^2.
  11. Little; trees are scale-invariant; scaling matters more for other models.
  12. Different preprocessing per column subsets in one object.
  13. Outer loop estimates generalization; inner loop tunes hyperparameters.
  14. Commonly 5–10 folds depending on data.
  15. CalibratedClassifierCV or other calibration techniques.
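Answer 13's nested scheme can be sketched as follows, again on synthetic data; the parameter grid here is illustrative, not prescriptive:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=120, random_state=0)

# Inner loop: GridSearchCV tunes hyperparameters on each outer training fold
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'max_depth': [2, None]},
    cv=3,
)

# Outer loop: estimates generalization of the entire tuning procedure,
# so the reported score is not biased by the hyperparameter search
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.2f}")
```

The outer mean is the honest performance estimate; refit the inner search on all data only for the final deployed model.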

Common Implementation Errors (10)
