Technique Description — Hierarchical Clustering & Principal Component Analysis (PCA)
Agglomerative Hierarchical Clustering builds a tree (dendrogram) by iteratively merging the closest clusters according to a linkage criterion: Ward (minimizes total within-cluster variance, Euclidean only), complete (max of pairwise distances), average (mean of pairwise distances), among others. The choice of linkage and distance metric shapes cluster geometry. The dendrogram provides a multiscale view; selecting a cut level yields a partition with n_clusters.
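The merge-and-cut process described above can be sketched with SciPy's hierarchy tools; the two-blob data and the cluster count here are illustrative placeholders, not from the text:

```python
# Minimal sketch (assumed data): build a Ward-linkage merge tree and cut it
# into a fixed number of clusters, mirroring a dendrogram cut.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 10 points each
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])

Z = linkage(X, method="ward")                     # each row records one merge
labels = fcluster(Z, t=2, criterion="maxclust")   # cut so that 2 clusters remain
print(sorted(set(labels.tolist())))               # → [1, 2]
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the multiscale tree; `fcluster` with `criterion="maxclust"` is the programmatic equivalent of choosing a cut level.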
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that projects data onto orthogonal directions of maximum variance. With centered data, PCA can be derived via eigen-decomposition of the covariance matrix or SVD of the data matrix. Key outputs—components_, explained_variance_, and explained_variance_ratio_—quantify directions and information retained. Selecting n_components balances compression against information loss (fixed count, variance threshold, 'mle').
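The SVD/covariance equivalence mentioned above can be verified numerically; the random data below is a stand-in, not from the text:

```python
# Illustrative check: for centered data, the singular values of the data matrix
# reproduce sklearn's explained_variance_ (S**2 / (n - 1)).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                  # centering is required for PCA

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
var_svd = S**2 / (len(X) - 1)            # explained variance from singular values

pca = PCA(n_components=4).fit(X)         # sklearn centers internally
print(np.allclose(var_svd, pca.explained_variance_))  # → True
```

The rows of `Vt` match `pca.components_` up to sign, since each principal direction is only defined up to a flip.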
Explain the Technique (Four Levels)
For a 10-year-old
We build a family tree of similar items, and we draw a smaller picture of the data that keeps only its most important directions.
For a beginner student
Hierarchical clustering merges the closest groups step by step; PCA compresses data by keeping directions with the most variation.
For an intermediate student
Choose linkage (Ward/complete/average), cut the dendrogram at a height; scale before PCA and choose n_components by explained variance.
For an expert
Ward requires the Euclidean metric; derive PCA via SVD or the covariance matrix; whitening trade-offs apply; apply the same fitted transform at inference.
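The expert point about reusing the same transform at inference can be sketched as follows; the arrays are synthetic placeholders:

```python
# Sketch (assumed setup): fit the scaler and PCA on training data only, then
# reuse those fitted objects on new data -- never re-fit at inference time.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train, X_new = rng.normal(size=(80, 5)), rng.normal(size=(10, 5))

scaler = StandardScaler().fit(X_train)        # statistics from training data only
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# Inference: transform new data with the training-time objects
X_new_2d = pca.transform(scaler.transform(X_new))
print(X_new_2d.shape)  # → (10, 2)
```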
When to Use This Technique
Ideal Use Cases
Unknown number of clusters; dendrogram guides selection
Need to visualize cluster relationships hierarchically
High-dimensional data that benefits from PCA visualization
Small to moderate datasets (hierarchical clustering is typically O(n²) in memory and up to O(n³) in time)
Dimensionality reduction with interpretable components (PCA)
Avoid When
Very large datasets → Use K-means or mini-batch variants
Non-linear manifold structure → Consider t-SNE, UMAP instead of PCA
Flashcards (12)
Q: What is agglomerative hierarchical clustering?
A: A bottom-up clustering method that merges the nearest clusters iteratively.
Q: Common linkage criteria?
A: Ward, complete, average; Ward minimizes within-cluster variance.
Q: Why does Ward require Euclidean distance?
A: It optimizes variance in Euclidean space; other metrics break this assumption.
Q: How to interpret a dendrogram?
A: Heights reflect merge distances; cutting at a level yields clusters.
Q: How to choose n_clusters?
A: Select a dendrogram cut level or use silhouette/stability analyses.
Q: PCA objective?
A: Maximize explained variance along orthogonal components.
Q: Why center (and often scale) before PCA?
A: Centering is required; scaling avoids dominance by large-scale features.
Q: How to pick n_components?
A: Fixed count, cumulative variance threshold (e.g., 0.95), 'mle', or None.
Q: Key PCA outputs?
A: components_, explained_variance_, explained_variance_ratio_, singular_values_, mean_.
Q: What is whitening in PCA?
A: Scaling components to unit variance; helpful for some models but changes variance structure.
Q: Can PCA help clustering?
A: Yes; reduces noise and dimensionality before clustering.
Q: Common pitfalls?
A: Misreading dendrograms, wrong metric/linkage, skipping scaling, too few/too many components.
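The variance-threshold strategy for picking n_components can be sketched two ways; the 0.95 threshold and the Iris data are illustrative choices:

```python
# Sketch: pick n_components by cumulative explained variance, both manually
# and via sklearn's float shortcut (PCA(n_components=0.95) does the same).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)

full = PCA().fit(X)
cumvar = np.cumsum(full.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.95) + 1)   # smallest k reaching 95% variance

pca = PCA(n_components=0.95).fit(X)          # float threshold, same rule
print(k, pca.n_components_)  # → 2 2
```

On standardized Iris, the first two components already carry over 95% of the variance, so both routes select k = 2.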
Python Example (PCA + Agglomerative)
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # standardize before PCA
pca = PCA(n_components=2, random_state=42)  # random_state only affects randomized solvers
X2 = pca.fit_transform(X)
clust = AgglomerativeClustering(n_clusters=3, linkage='ward')  # Ward implies Euclidean
labels = clust.fit_predict(X2)
print('Silhouette:', silhouette_score(X2, labels))
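A natural follow-up, shown here as an assumed workflow rather than a prescribed one, is to sweep n_clusters and compare silhouette scores in the same PCA space:

```python
# Sketch: choose n_clusters by comparing silhouette scores over a small range.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X2 = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(load_iris().data))

scores = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k, linkage='ward').fit_predict(X2)
    scores[k] = silhouette_score(X2, labels)   # higher = better separation

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Silhouette favors compact, well-separated partitions, so it should be read alongside the dendrogram rather than in isolation.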
Quiz (15)
1) Name two linkage types.
2) Why does Ward require Euclidean distance?
3) How do you read dendrogram heights?
4) How can you pick n_clusters from a dendrogram?
5) When is silhouette score useful here?
6) Why center and scale before PCA?
7) What is the PCA objective?
8) How do you choose n_components by variance?
9) What does explained_variance_ratio_ show?
10) What do components_ represent?
11) What is whitening and a caution about it?
12) Which metric/linkage pair must you avoid?
13) Why run PCA before clustering?
14) Name an SVD solver option for large data.
15) What is a common pitfall when clustering without scaling?
Practical Checklist (10)
Choose linkage consistent with the metric (Ward + Euclidean).
Standardize features before PCA.
Inspect explained variance and scree plots.
Select n_components via threshold or domain needs.
Visualize clusters in PCA space.
Validate with silhouette/stability.
Avoid non-Euclidean metrics with Ward.
Be cautious with whitening.
Verify dendrogram interpretation.
Log random_state and configs.
Quiz Answers
1) Ward, complete, average (any two).
2) Ward optimizes variance in Euclidean space; other metrics break the assumption.
3) Heights show merge distances; higher merges indicate less similar clusters.
4) Cut the dendrogram at a chosen height, often across the largest gap between merge distances; the number of branches the cut crosses gives n_clusters.
5) To assess cluster separation quality after clustering.
6) Centering is required; scaling prevents large-scale features from dominating.
7) Maximize variance along orthogonal components.
8) Choose a cumulative variance threshold (e.g., 0.9–0.95) or fixed count.
9) Fraction of total variance explained by each component.
10) Principal directions (loadings) onto which data are projected.
11) Scaling components to unit variance; can help some models but distorts variance structure.
12) Ward with non-Euclidean metrics.
13) To denoise, reduce dimensionality, and aid cluster separation/visualization.
14) randomized (fast approximate SVD).
15) Distances become dominated by large-scale features, distorting clusters.
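Answer 11's caution about whitening can be checked directly; the scaled random data below is illustrative:

```python
# Small check: with whiten=True, each projected component has unit variance,
# so the original variance ordering/scale is discarded.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) * np.array([10.0, 3.0, 0.5])  # unequal scales

Z = PCA(n_components=3, whiten=True).fit_transform(X)
print(np.round(Z.var(axis=0, ddof=1), 3))  # → [1. 1. 1.]
```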
Common Implementation Errors (10)
Using Ward linkage with non-Euclidean distances; violates assumptions.
Skipping centering/scaling before PCA; large-scale features dominate components.
Choosing too few/many PCA components without checking explained variance or reconstruction error.
Misinterpreting dendrogram heights and arbitrary cut levels; poor cluster selection.
Applying whitening indiscriminately; alters variance structure and downstream distances.
Mixing distance metrics and linkage choices inconsistently (e.g., complete + cosine without validation).
Projecting to PCA space for clustering but forgetting to apply the same transform at inference.
Ignoring outliers; they distort PCA directions and cluster assignments.
Not validating cluster stability (bootstraps/silhouette) before reporting results.
Building transforms outside a Pipeline; risk of leakage between training and evaluation.
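The remedy for the last error above can be sketched with a Pipeline; the classifier and dataset are assumed for illustration, since the text does not prescribe them:

```python
# Sketch: scaling and PCA inside a Pipeline are re-fit on each CV training
# fold, so no test-fold statistics leak into the transforms.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)  # transforms fit per fold
print(round(scores.mean(), 3))
```

Fitting the scaler or PCA on the full dataset before splitting would leak evaluation data into the transform, inflating scores.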