K-means Clustering — Rapid Q&A Refresher (DSBA)
Date: December 20, 2025 · Author: P Baburaj Ambalam
Version 2.0 · Last updated: December 21, 2025
Quick Navigation: Description · Explain · When to Use · Q&A · Example · Quiz · Checklist · Answers · Errors · References
Technique Description
K-means partitions data into k clusters by minimizing the within-cluster sum of squares (WCSS), assigning points to the nearest centroid and updating centroids iteratively (Lloyd's algorithm) until convergence. Initialization strongly influences results; k-means++ selects spread-out seeds to improve stability, with multiple restarts (n_init) recommended.
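The assign-update loop described above can be sketched in plain NumPy. This is a minimal illustration of Lloyd's algorithm, not a replacement for scikit-learn's KMeans: the function name lloyd_kmeans and the two-blob toy data are made up for this sketch, and it uses naive random seeding rather than k-means++.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign to nearest centroid, update, repeat."""
    rng = np.random.default_rng(seed)
    # Naive seeding: pick k distinct data points (k-means++ would spread these out)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid (squared Euclidean)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 3])])
centroids, labels = lloyd_kmeans(X, k=2)
```

Because WCSS is non-increasing at each step, the loop always terminates; it just may stop at a local minimum, which is why initialization matters.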
Explain the Technique (Four Levels)
For a 10-year-old
We group similar points and place a center in each group, moving centers until groups fit well.
For a beginner student
Assign points to the nearest centroid, update centroids, and repeat; choose k via elbow or silhouette.
For an intermediate student
Use k-means++ seeds and a high n_init, scale features, and know K-means struggles with non-spherical clusters.
For an expert
Optimize WCSS via Lloyd’s algorithm; initialization sensitivity matters; consider GMM/DBSCAN for complex structures.
When to Use This Technique
Ideal Use Cases
Well-separated, spherical clusters of similar density
Fast clustering on large datasets (scales linearly)
Number of clusters k can be estimated or tested
Customer segmentation, image compression, data preprocessing
Initial exploration before more complex clustering
Avoid When
Non-spherical or varying-density clusters → Use DBSCAN, GMM, or Hierarchical Clustering
Unknown k without good heuristics → Try hierarchical methods
Heavy outliers present → Use robust alternatives or preprocess
Categorical-only data → K-modes or other categorical clustering
Related Techniques
→ Hierarchical Clustering (alternative approach with dendrograms)
→ PCA (dimensionality reduction before clustering)
→ Feature Scaling (essential preprocessing)
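The PCA-before-clustering and feature-scaling steps listed above compose naturally in a pipeline. A short sketch on synthetic data (the 5-feature dataset and the choice of 2 components are illustrative assumptions, not from the original):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 5-D data where only the first two dimensions carry cluster structure
X = rng.normal(size=(200, 5))
X[:100, :2] += 4

# Scale -> reduce -> cluster, so distances are computed on comparable, compact features
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)
```

The pipeline guarantees the same scaling and projection are applied at fit and predict time, avoiding a common leakage mistake.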
Q&A
What objective does K-means optimize? Minimizing WCSS (inertia) across k clusters.
What algorithm does K-means use? Assign to nearest centroid, update centroids, repeat (Lloyd's algorithm).
Why k-means++ initialization? Spreads initial centers probabilistically; reduces poor local minima.
Key hyperparameters? n_clusters, init, n_init, max_iter, tol, random_state.
Why scale features before K-means? Distances are scale-sensitive; larger-scale features dominate without scaling.
What is inertia? Sum of squared distances from samples to their nearest cluster center (WCSS).
How does the elbow method work? Plot inertia vs k; choose k where the curve bends (diminishing returns).
What does silhouette score measure? Cluster cohesion and separation; ranges from -1 to 1, higher is better.
Why set n_init high (e.g., 10)? Runs multiple initializations; returns best result, avoiding bad local minima.
When does K-means converge? When assignments don't change or improvement is below tol.
What if K-means doesn't converge? Increase max_iter or check for degenerate data/extreme outliers.
Can K-means handle categorical data? No directly; use encoding (one-hot) or k-modes for categorical clustering.
What is the role of random_state? Controls initialization randomness for reproducibility.
Is K-means suitable for non-spherical clusters? No; it assumes spherical clusters. Use DBSCAN, GMM, or spectral clustering.
How to handle outliers in K-means? Remove/downweight outliers, use robust scaling, or try k-medoids.
What distance metric does K-means use? Euclidean distance (L2 norm) between samples and centroids.
Can you change the distance metric? Standard K-means uses Euclidean; k-medoids/PAM allow other metrics.
What is the computational complexity? O(n × k × i × d) where n=samples, k=clusters, i=iterations, d=features.
How to evaluate clustering quality without labels? Use inertia, silhouette score, Davies-Bouldin index, or Calinski-Harabasz score.
What is the difference between K-means and K-medoids? K-means uses mean centroids; K-medoids uses actual data points (more robust).
Can K-means produce empty clusters? Yes, if no points fall nearest to a centroid; scikit-learn handles this by relocating the empty centroid to a distant sample.
What is mini-batch K-means? Faster variant using random batches; trades accuracy for speed on large datasets.
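The mini-batch variant and the label-free quality metrics mentioned in the answers above are all available in scikit-learn. A short sketch on synthetic three-blob data (the blob centers and batch_size are illustrative choices):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in ([0, 0], [4, 0], [2, 4])])

# MiniBatchKMeans fits on random batches: faster, slightly noisier than full KMeans
mbk = MiniBatchKMeans(n_clusters=3, batch_size=64, n_init=10, random_state=42)
labels = mbk.fit_predict(X)

# Label-free metrics: lower Davies-Bouldin is better; higher CH and silhouette are better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))
print("Silhouette:", silhouette_score(X, labels))
```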
Python Example
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np
# Generate sample data
np.random.seed(42)
X = np.random.randn(300, 2)
X[:100] += [2, 2] # Cluster 1
X[100:200] += [-2, -2] # Cluster 2
X[200:] += [2, -2] # Cluster 3
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit K-means
kmeans = KMeans(
n_clusters=3,
init='k-means++',
n_init=10,
max_iter=300,
random_state=42
)
kmeans.fit(X_scaled)
labels = kmeans.labels_
# Evaluation
inertia = kmeans.inertia_
silhouette = silhouette_score(X_scaled, labels)
print(f"Inertia (WCSS): {inertia:.2f}")
print(f"Silhouette Score: {silhouette:.3f}")
print(f"Cluster Centers:\n{kmeans.cluster_centers_}")
# Elbow method - find optimal k
inertias = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
# Plot elbow curve
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()
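The elbow curve can be cross-checked with silhouette scores per k, which give a single "higher is better" number rather than a bend to eyeball. A self-contained sketch that regenerates the same three-blob data so it runs on its own:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Recreate the three-blob data from the example above
np.random.seed(42)
X = np.random.randn(300, 2)
X[:100] += [2, 2]
X[100:200] += [-2, -2]
X[200:] += [2, -2]
X_scaled = StandardScaler().fit_transform(X)

# Silhouette analysis: pick the k with the highest average silhouette
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    scores[k] = silhouette_score(X_scaled, km.fit_predict(X_scaled))

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

When the elbow and silhouette disagree, domain knowledge about how many segments are actionable should break the tie.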
Quiz (15)
What objective does K-means optimize?
What does k-means++ improve?
Why set a high n_init?
Why scale features before K-means?
How do you choose k?
What does inertia represent?
What does silhouette score represent?
Is K-means good for non-spherical clusters?
How should you handle outliers?
Can K-means handle categorical features directly?
When does K-means stop iterating?
What is the effect of max_iter?
Why is initialization important?
Which distance does standard K-means rely on?
Name an alternative for varying-density clusters.
Practical Checklist
Scale features before clustering.
Use k-means++ and high n_init.
Inspect inertia and silhouette.
Try multiple k values.
Consider PCA for visualization.
Quiz Answers
Minimizing within-cluster sum of squares (inertia/WCSS).
Better seeding to reduce poor local minima.
Increase stability and avoid bad local minima.
Distances are scale-sensitive; scaling prevents dominance by large-scale features.
Elbow, silhouette analysis, or domain guidance.
Sum of squared distances to centroids.
Compactness/separation; -1 to 1, higher is better.
No; consider DBSCAN/GMM/spectral clustering.
Use robust scaling, trim outliers, or k-medoids.
No; requires encoding or alternative algorithms.
When assignments stabilize or improvement is below tol.
Caps iterations; too low may stop before convergence.
Poor seeds can trap in bad local minima.
Euclidean distance.
DBSCAN, HDBSCAN, or Gaussian Mixture Models.
Common Implementation Errors (10)
Skipping feature scaling; large-scale variables dominate Euclidean distances.
Using too small n_init; unstable clustering due to poor seeds.
Choosing k solely by inertia without validating with silhouette or domain needs.
Applying K-means to non-spherical or varying-density clusters; poor separation.
Not handling outliers; centroids pulled off-center and distort assignments.
Using K-means directly on categorical features without suitable encoding.
Failing to fix random_state; non-reproducible clustering across runs.
Interpreting silhouette incorrectly (e.g., comparing across different feature scalings or datasets).
Stopping early with low max_iter; convergence not reached.
Assuming Euclidean distance is appropriate without checking feature semantics.