K-means Clustering — Rapid Q&A Refresher (DSBA)
Date: December 20, 2025 · Author: P Baburaj Ambalam
Version 2.0 · Last updated: December 21, 2025
Quick Navigation: Description · Explain · When to Use · Q&A · Example · Quiz · Checklist · Answers · Errors · References
Technique Description
K-means partitions data into k clusters by minimizing the within-cluster sum of squares (WCSS), assigning points to the nearest centroid and updating centroids iteratively (Lloyd's algorithm) until convergence. Initialization strongly influences results; k-means++ selects spread-out seeds to improve stability, with multiple restarts (n_init) recommended.
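The assign-update loop described above can be sketched in plain NumPy. This is a minimal illustration of Lloyd's algorithm, not a replacement for scikit-learn's KMeans: the function name lloyd_kmeans and the two-blob toy data are made up for this sketch, and it uses naive random seeding rather than k-means++.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign to nearest centroid, update, repeat."""
    rng = np.random.default_rng(seed)
    # Naive seeding: pick k distinct data points (k-means++ would spread these out)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid (squared Euclidean)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 3])])
centroids, labels = lloyd_kmeans(X, k=2)
```

Because WCSS is non-increasing at each step, the loop always terminates; it just may stop at a local minimum, which is why initialization matters.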
Explain the Technique (Four Levels)
For a 10-year-old
We group similar points and place a center in each group, moving centers until groups fit well.
For a beginner student
Assign points to the nearest centroid, update centroids, and repeat; choose k via elbow or silhouette.
For an intermediate student
Use k-means++ seeds and a high n_init, scale features, and know K-means struggles with non-spherical clusters.
For an expert
Optimize WCSS via Lloyd’s algorithm; initialization sensitivity matters; consider GMM/DBSCAN for complex structures.
When to Use This Technique
Ideal Use Cases
Well-separated, spherical clusters of similar density
Fast clustering on large datasets (scales linearly)
Number of clusters k can be estimated or tested
Customer segmentation, image compression, data preprocessing
Initial exploration before more complex clustering
Avoid When
Non-spherical or varying-density clusters → Use DBSCAN, GMM, or Hierarchical Clustering
Unknown k without good heuristics → Try hierarchical methods
Heavy outliers present → Use robust alternatives or preprocess
Categorical-only data → K-modes or other categorical clustering
Related Techniques
→ Hierarchical Clustering (alternative approach with dendrograms)
→ PCA (dimensionality reduction before clustering)
→ Feature Scaling (essential preprocessing)
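The PCA-before-clustering and feature-scaling steps listed above compose naturally in a pipeline. A short sketch on synthetic data (the 5-feature dataset and the choice of 2 components are illustrative assumptions, not from the original):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 5-D data where only the first two dimensions carry cluster structure
X = rng.normal(size=(200, 5))
X[:100, :2] += 4

# Scale -> reduce -> cluster, so distances are computed on comparable, compact features
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)
```

The pipeline guarantees the same scaling and projection are applied at fit and predict time, avoiding a common leakage mistake.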
Q&A
What objective does K-means optimize? Minimizing WCSS (inertia) across k clusters.
What algorithm does K-means use? Assign to nearest centroid, update centroids, repeat (Lloyd's algorithm).
Why k-means++ initialization? Spreads initial centers probabilistically; reduces poor local minima.
Key hyperparameters? n_clusters, init, n_init, max_iter, tol, random_state.
Why scale features before K-means? Distances are scale-sensitive; larger-scale features dominate without scaling.
What is inertia? Sum of squared distances from samples to their nearest cluster center (WCSS).
How does the elbow method work? Plot inertia vs k; choose k where the curve bends (diminishing returns).
What does silhouette score measure? Cluster cohesion and separation; ranges from -1 to 1, higher is better.
Why set n_init high (e.g., 10)? Runs multiple initializations; returns best result, avoiding bad local minima.
When does K-means converge? When assignments don't change or improvement is below tol.
What if K-means doesn't converge? Increase max_iter or check for degenerate data/extreme outliers.
Can K-means handle categorical data? No directly; use encoding (one-hot) or k-modes for categorical clustering.
What is the role of random_state? Controls initialization randomness for reproducibility.
Is K-means suitable for non-spherical clusters? No; it assumes spherical clusters. Use DBSCAN, GMM, or spectral clustering.
How to handle outliers in K-means? Remove/downweight outliers, use robust scaling, or try k-medoids.
What distance metric does K-means use? Euclidean distance (L2 norm) between samples and centroids.
Can you change the distance metric? Standard K-means uses Euclidean; k-medoids/PAM allow other metrics.
What is the computational complexity? O(n × k × i × d) where n=samples, k=clusters, i=iterations, d=features.
How to evaluate clustering quality without labels? Use inertia, silhouette score, Davies-Bouldin index, or Calinski-Harabasz score.
What is the difference between K-means and K-medoids? K-means uses mean centroids; K-medoids uses actual data points (more robust).
Can K-means produce empty clusters? Yes, if no points fall nearest to a centroid; scikit-learn handles this by relocating the empty centroid to a distant sample.
What is mini-batch K-means? Faster variant using random batches; trades accuracy for speed on large datasets.
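The mini-batch variant and the label-free quality metrics mentioned in the answers above are all available in scikit-learn. A short sketch on synthetic three-blob data (the blob centers and batch_size are illustrative choices):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in ([0, 0], [4, 0], [2, 4])])

# MiniBatchKMeans fits on random batches: faster, slightly noisier than full KMeans
mbk = MiniBatchKMeans(n_clusters=3, batch_size=64, n_init=10, random_state=42)
labels = mbk.fit_predict(X)

# Label-free metrics: lower Davies-Bouldin is better; higher CH and silhouette are better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))
print("Silhouette:", silhouette_score(X, labels))
```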
Python Example
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np
# Generate sample data
np.random.seed(42)
X = np.random.randn(300, 2)
X[:100] += [2, 2] # Cluster 1
X[100:200] += [-2, -2] # Cluster 2
X[200:] += [2, -2] # Cluster 3
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit K-means
kmeans = KMeans(
n_clusters=3,
init='k-means++',
n_init=10,
max_iter=300,
random_state=42
)
kmeans.fit(X_scaled)
labels = kmeans.labels_
# Evaluation
inertia = kmeans.inertia_
silhouette = silhouette_score(X_scaled, labels)
print(f"Inertia (WCSS): {inertia:.2f}")
print(f"Silhouette Score: {silhouette:.3f}")
print(f"Cluster Centers:\n{kmeans.cluster_centers_}")
# Elbow method - find optimal k
inertias = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
# Plot elbow curve
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()
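The elbow curve can be cross-checked with silhouette scores per k, which give a single "higher is better" number rather than a bend to eyeball. A self-contained sketch that regenerates the same three-blob data so it runs on its own:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Recreate the three-blob data from the example above
np.random.seed(42)
X = np.random.randn(300, 2)
X[:100] += [2, 2]
X[100:200] += [-2, -2]
X[200:] += [2, -2]
X_scaled = StandardScaler().fit_transform(X)

# Silhouette analysis: pick the k with the highest average silhouette
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    scores[k] = silhouette_score(X_scaled, km.fit_predict(X_scaled))

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

When the elbow and silhouette disagree, domain knowledge about how many segments are actionable should break the tie.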
Quiz (15)
What objective does K-means optimize?
What does k-means++ improve?
Why set a high n_init?
Why scale features before K-means?
How do you choose k?
What does inertia represent?
What does silhouette score represent?
Is K-means good for non-spherical clusters?
How should you handle outliers?
Can K-means handle categorical features directly?
When does K-means stop iterating?
What is the effect of max_iter?
Why is initialization important?
Which distance does standard K-means rely on?
Name an alternative for varying-density clusters.
Practical Checklist
Scale features before clustering.
Use k-means++ and high n_init.
Inspect inertia and silhouette.
Try multiple k values.
Consider PCA for visualization.
Quiz Answers
Minimizing within-cluster sum of squares (inertia/WCSS).
Better seeding to reduce poor local minima.
Increase stability and avoid bad local minima.
Distances are scale-sensitive; scaling prevents dominance by large-scale features.
Elbow, silhouette analysis, or domain guidance.
Sum of squared distances to centroids.
Compactness/separation; -1 to 1, higher is better.
No; consider DBSCAN/GMM/spectral clustering.
Use robust scaling, trim outliers, or k-medoids.
No; requires encoding or alternative algorithms.
When assignments stabilize or improvement is below tol.
Caps iterations; too low may stop before convergence.
Poor seeds can trap in bad local minima.
Euclidean distance.
DBSCAN, HDBSCAN, or Gaussian Mixture Models.
Common Implementation Errors (10)
Skipping feature scaling; large-scale variables dominate Euclidean distances.
Using too small n_init; unstable clustering due to poor seeds.
Choosing k solely by inertia without validating with silhouette or domain needs.
Applying K-means to non-spherical or varying-density clusters; poor separation.
Not handling outliers; centroids pulled off-center and distort assignments.
Using K-means directly on categorical features without suitable encoding.
Failing to fix random_state; non-reproducible clustering across runs.
Interpreting silhouette incorrectly (e.g., comparing across different feature scalings or datasets).
Stopping early with low max_iter; convergence not reached.
Assuming Euclidean distance is appropriate without checking feature semantics.