Hierarchical Clustering & Principal Component Analysis (PCA) — Rapid Q&A Refresher (DSBA)

Date: December 21, 2025 · Author: P Baburaj Ambalam
Version 2.0 · Last updated: December 21, 2025

Technique Description — Hierarchical Clustering & Principal Component Analysis (PCA)

Agglomerative Hierarchical Clustering builds a tree (dendrogram) by iteratively merging the closest clusters according to a linkage criterion: Ward (minimizes total within-cluster variance, Euclidean only), complete (max of pairwise distances), average (mean of pairwise distances), among others. The choice of linkage and distance metric shapes cluster geometry. The dendrogram provides a multiscale view; selecting a cut level yields a partition with n_clusters.
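The merge-and-cut workflow can be sketched with scipy's hierarchy utilities (the toy coordinates below are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small 2-D toy dataset (hypothetical values) with two obvious groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Ward linkage (Euclidean only): each row of Z records one merge
# (cluster i, cluster j, merge distance, size of the new cluster).
Z = linkage(X, method='ward')

# Cutting the tree so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # two groups, e.g. [1 1 1 2 2 2]
```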

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that projects data onto orthogonal directions of maximum variance. With centered data, PCA can be derived via eigen-decomposition of the covariance matrix or SVD of the data matrix. Key outputs—components_, explained_variance_, and explained_variance_ratio_—quantify directions and information retained. Selecting n_components balances compression against information loss (fixed count, variance threshold, 'mle').
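The equivalence of the two derivations (covariance eigen-decomposition vs. SVD) can be checked numerically; a sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)          # centering is required for PCA

# SVD route: singular values give the explained variances.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S**2 / (len(X) - 1)
ratio = explained_variance / explained_variance.sum()

# Covariance route yields the same variances (eigvalsh returns
# ascending order, so reverse to match the descending SVD order).
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
print(np.allclose(eigvals, explained_variance))  # True
```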

Explain the Technique (Four Levels)

For a 10-year-old

We build a family tree of similar items, and make a smaller picture of the data that keeps the most important directions.

For a beginner student

Hierarchical clustering merges the closest groups step by step; PCA compresses data by keeping directions with the most variation.

For an intermediate student

Choose a linkage (Ward/complete/average) and cut the dendrogram at a height; scale before PCA and choose n_components by explained variance.

For an expert

Ward requires the Euclidean metric; PCA can be derived via SVD or the covariance eigen-decomposition; whitening trade-offs apply; apply the same fitted transform at inference.
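The whitening trade-off can be illustrated in a few lines (synthetic data; `whiten=True` is scikit-learn's standard flag for rescaling component scores to unit variance):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stretch the axes so the raw features have very different variances.
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

# whiten=True divides each component score by its standard deviation,
# discarding the relative-variance information between components.
Xw = PCA(whiten=True).fit_transform(X)
print(Xw.std(axis=0, ddof=1).round(3))  # ~[1. 1. 1.]
```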

When to Use This Technique

Ideal Use Cases

Avoid When

Related Techniques

K-means Clustering (faster partition-based alternative)
Feature Scaling (essential preprocessing for both)
Decision Trees (can use PCA features)

Q&A Pairs

Q: What is agglomerative hierarchical clustering?
A: A bottom-up clustering method that merges the nearest clusters iteratively.

Q: Common linkage criteria?
A: Ward, complete, average; Ward minimizes within-cluster variance.

Q: Why does Ward require Euclidean distance?
A: It optimizes variance in Euclidean space; other metrics break this assumption.

Q: How to interpret a dendrogram?
A: Heights reflect merge distances; cutting at a level yields clusters.

Q: How to choose n_clusters?
A: Select a dendrogram cut level or use silhouette/stability analyses.

Q: PCA objective?
A: Maximize explained variance along orthogonal components.

Q: Why center (and often scale) before PCA?
A: Centering is required; scaling avoids dominance by large-scale features.

Q: How to pick n_components?
A: Fixed count, cumulative variance threshold (e.g., 0.95), 'mle', or None.

Q: Key PCA outputs?
A: components_, explained_variance_, explained_variance_ratio_, singular_values_, mean_.

Q: What is whitening in PCA?
A: Scaling components to unit variance; helpful for some models but changes variance structure.

Q: Can PCA help clustering?
A: Yes; reduces noise and dimensionality before clustering.

Q: Common pitfalls?
A: Misreading dendrograms, wrong metric/linkage, skipping scaling, too few/too many components.
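The cumulative-variance way of picking n_components can be sketched directly (standard scikit-learn API; iris is used only as a convenient dataset):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

# Passing a float in (0, 1) keeps the fewest components whose
# cumulative explained variance reaches that threshold.
pca = PCA(n_components=0.95).fit(Xs)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```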

Python Example (PCA + Agglomerative)

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Load and standardize so no single feature dominates the distances.
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Reduce to 2 components before clustering (denoising + visualization).
pca = PCA(n_components=2, random_state=42)
X2 = pca.fit_transform(X)

# Ward linkage minimizes within-cluster variance (Euclidean only).
clust = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = clust.fit_predict(X2)
print('Silhouette:', silhouette_score(X2, labels))
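To inspect the merge structure behind the n_clusters choice, the dendrogram can be drawn with scipy on the same PCA-reduced data (a sketch; matplotlib is assumed to be available):

```python
import matplotlib
matplotlib.use('Agg')  # headless rendering; drop this in a notebook
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Same Ward criterion as AgglomerativeClustering(linkage='ward').
Z = linkage(X2, method='ward')
dendrogram(Z, truncate_mode='lastp', p=12)  # show only the last 12 merges
plt.ylabel('merge distance (Ward)')
plt.savefig('dendrogram.png')
```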

Quiz (15)

Practical Checklist (10)

Quiz Answers

  • 1) Ward, complete, average (any two).
  • 2) Ward optimizes variance in Euclidean space; other metrics break the assumption.
  • 3) Heights show merge distances; higher merges indicate less similar clusters.
  • 4) Cut the dendrogram at a chosen height to yield that many clusters.
  • 5) To assess cluster separation quality after clustering.
  • 6) Centering is required; scaling prevents large-scale features from dominating.
  • 7) Maximize variance along orthogonal components.
  • 8) Choose a cumulative variance threshold (e.g., 0.9–0.95) or fixed count.
  • 9) Fraction of total variance explained by each component.
  • 10) Principal directions (loadings) onto which data are projected.
  • 11) Scaling components to unit variance; can help some models but distorts variance structure.
  • 12) Ward with non-Euclidean metrics.
  • 13) To denoise, reduce dimensionality, and aid cluster separation/visualization.
  • 14) randomized (fast approximate SVD).
  • 15) Distances become dominated by large-scale features, distorting clusters.
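Answer 15 can be demonstrated with two features on very different scales (the numbers below are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: income vs. a 1-5 rating.
X = np.array([[30_000.0, 1.0], [30_500.0, 5.0], [90_000.0, 1.0]])

# Raw Euclidean distances from the first point are driven almost
# entirely by the income axis; the rating barely registers.
d_raw = np.linalg.norm(X - X[0], axis=1)
print(d_raw.round(1))

# After standardization, both features contribute comparably.
Xs = StandardScaler().fit_transform(X)
d_std = np.linalg.norm(Xs - Xs[0], axis=1)
print(d_std.round(2))
```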

Common Implementation Errors (10)

References