K-means Clustering — Rapid Q&A Refresher (DSBA)

Date: December 20, 2025 · Author: P Baburaj Ambalam
Version 2.0 · Last updated: December 21, 2025

Technique Description

K-means partitions data into k clusters by minimizing the within-cluster sum of squares (WCSS), assigning points to the nearest centroid and updating centroids iteratively (Lloyd's algorithm) until convergence. Initialization strongly influences results; k-means++ selects spread-out seeds to improve stability, with multiple restarts (n_init) recommended.
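The assign/update loop described above can be sketched in a few lines of NumPy. This is a teaching sketch on synthetic blobs, not scikit-learn's optimized implementation; `lloyd_kmeans` is an illustrative name, and the naive "pick k random points" initialization is exactly the weakness k-means++ addresses.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign to nearest centroid, update, repeat."""
    rng = np.random.default_rng(seed)
    # Naive initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    wcss = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, centroids, wcss

# Three well-separated synthetic blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ([0, 0], [3, 3], [0, 3])])
labels, centroids, wcss = lloyd_kmeans(X, k=3)
```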

Explain the Technique (Four Levels)

For a 10-year-old

We group similar points and place a center in each group, moving centers until groups fit well.

For a beginner student

Assign points to the nearest centroid, update centroids, and repeat; choose k via elbow or silhouette.

For an intermediate student

Use k-means++ seeding with a high n_init, scale features first, and remember that K-means struggles with non-spherical or unevenly sized clusters.

For an expert

Lloyd's algorithm only locally minimizes WCSS, so results depend on initialization; mitigate with k-means++ and restarts, and consider GMM, DBSCAN, or spectral clustering for complex structures.
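The k-means++ seeding mentioned at several levels above is simple to sketch: the first center is uniform at random, and each subsequent center is sampled with probability proportional to its squared distance from the nearest already-chosen center. `kmeans_pp_seeds` is an illustrative helper on synthetic data, not sklearn's internal routine.

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng):
    """k-means++ seeding: spread-out initial centers via D^2 sampling."""
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # Sample the next center proportionally to D^2: far points are favored
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, size=(40, 2)) for m in ([0, 0], [5, 5], [0, 5])])
seeds = kmeans_pp_seeds(X, k=3, rng=rng)
```

Because far-away points are favored, the three seeds almost always land in three different blobs, which is exactly why k-means++ reduces bad local minima.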

When to Use This Technique

Ideal Use Cases

Compact, roughly spherical clusters of similar size and density
Numeric, scaled features where Euclidean distance is meaningful
Large datasets where speed and scalability matter
Quick segmentation or vector quantization as a baseline

Avoid When

Clusters are non-convex, elongated, or vary widely in density or size
Data contain heavy outliers that would drag centroids
Features are categorical or mixed-type without careful encoding
k is unknown and no validation signal (elbow, silhouette, domain) is available

Related Techniques

Hierarchical Clustering (alternative approach with dendrograms)
PCA (dimensionality reduction before clustering)
Feature Scaling (essential preprocessing)
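The related techniques above chain together naturally: scaling, then PCA, then K-means can be combined in a scikit-learn Pipeline. A minimal sketch on synthetic data where only the first two of six columns carry cluster structure:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two blobs in the first 2 dimensions; the remaining 4 are pure noise
informative = np.vstack([rng.normal(m, 0.5, size=(60, 2)) for m in ([0, 0], [4, 4])])
X = np.hstack([informative, rng.normal(0, 1.0, size=(120, 4))])

pipe = Pipeline([
    ("scale", StandardScaler()),   # distances are scale-sensitive
    ("pca", PCA(n_components=2)),  # keep the dominant (correlated) directions
    ("km", KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)),
])
labels = pipe.fit_predict(X)
```

The Pipeline also prevents leakage if the clustering is later reused on new data: the same fitted scaler and PCA are applied at transform time.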

Q&A

Python Example

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
np.random.seed(42)
X = np.random.randn(300, 2)
X[:100] += [2, 2]  # Cluster 1
X[100:200] += [-2, -2]  # Cluster 2
X[200:] += [2, -2]  # Cluster 3

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-means
kmeans = KMeans(
    n_clusters=3,
    init='k-means++',
    n_init=10,
    max_iter=300,
    random_state=42
)

kmeans.fit(X_scaled)
labels = kmeans.labels_

# Evaluation
inertia = kmeans.inertia_
silhouette = silhouette_score(X_scaled, labels)

print(f"Inertia (WCSS): {inertia:.2f}")
print(f"Silhouette Score: {silhouette:.3f}")
print(f"Cluster Centers:\n{kmeans.cluster_centers_}")

# Elbow method - find optimal k
inertias = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# Plot elbow curve
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()
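The elbow can be ambiguous, so it is common to cross-check it with silhouette scores across candidate k and pick the maximum. A self-contained sketch on synthetic blobs similar to those above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(m, 0.4, size=(60, 2)) for m in ([2, 2], [-2, -2], [2, -2])])

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    scores[k] = silhouette_score(X, km.fit_predict(X))

best_k = max(scores, key=scores.get)  # highest mean silhouette wins
```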

Quiz (15)

  1. What objective does K-means optimize?
  2. What does k-means++ improve?
  3. Why set a high n_init?
  4. Why scale features before K-means?
  5. How do you choose k?
  6. What does inertia represent?
  7. What does silhouette score represent?
  8. Is K-means good for non-spherical clusters?
  9. How should you handle outliers?
  10. Can K-means handle categorical features directly?
  11. When does K-means stop iterating?
  12. What is the effect of max_iter?
  13. Why is initialization important?
  14. Which distance does standard K-means rely on?
  15. Name an alternative for varying-density clusters.

Practical Checklist

Scale features before fitting (StandardScaler, or RobustScaler with outliers)
Use init='k-means++' with n_init ≥ 10 and a fixed random_state
Validate k with elbow and silhouette, then sanity-check against domain knowledge
Inspect cluster sizes; tiny or empty clusters suggest a poor k or outliers
Re-run with different seeds to confirm the clustering is stable

Quiz Answers

  1. Minimizing within-cluster sum of squares (inertia/WCSS).
  2. Better seeding to reduce poor local minima.
  3. Increase stability and avoid bad local minima.
  4. Distances are scale-sensitive; scaling prevents dominance by large-scale features.
  5. Elbow, silhouette analysis, or domain guidance.
  6. Sum of squared distances to centroids.
  7. Compactness/separation; -1 to 1, higher is better.
  8. No; consider DBSCAN/GMM/spectral clustering.
  9. Use robust scaling, trim outliers, or k-medoids.
  10. No; requires encoding or alternative algorithms.
  11. When assignments stabilize or improvement is below tol.
  12. Caps iterations; too low may stop before convergence.
  13. Poor seeds can trap in bad local minima.
  14. Euclidean distance.
  15. DBSCAN, HDBSCAN, or Gaussian Mixture Models.
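Answers 8 and 15 can be demonstrated concretely: on non-convex shapes such as interleaving half-moons, K-means draws a straight boundary while a density-based method recovers the true groups. The eps=0.2 value is hand-picked for this synthetic dataset, not a general recommendation.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: non-convex clusters
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, init="k-means++", n_init=10,
                   random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Agreement with the true moon labels (1.0 = perfect)
km_ari = adjusted_rand_score(y, km_labels)
db_ari = adjusted_rand_score(y, db_labels)
```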

Common Implementation Errors (10)

  1. Skipping feature scaling, letting large-scale features dominate distances.
  2. Running with n_init=1 and accepting a poor local minimum.
  3. Choosing k arbitrarily without elbow, silhouette, or domain checks.
  4. Feeding raw categorical codes as if they were numeric.
  5. Ignoring outliers, which drag centroids toward them.
  6. Omitting random_state, making results irreproducible.
  7. Comparing inertia across different k as if lower always meant better (it decreases monotonically with k).
  8. Setting max_iter too low, stopping before convergence.
  9. Treating cluster label IDs as stable identities across runs.
  10. Assuming convergence implies the global optimum.

References

  1. scikit-learn — Clustering
  2. scikit-learn — KMeans
  3. Lloyd (1982), Arthur & Vassilvitskii (2007)