==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------
CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 3 OF 6
Grouping by Nearest Centre: K-Means From a Blank Sheet
Posted: 2026-06-09 . Author: Rahul Rai . Tags: k-means, clustering, unsupervised
============================================================================================
PATH . post 18 of 28
<- prev: Chapter 6, Part 2: The Strongest Direction (PCA)
next: Chapter 6, Part 4: The Family Tree ->
Part 1 measured the gap between every pair of rows. Part 2 crushed many columns down
to a flat page. Neither one actually GROUPED anything -- they just measured and drew.
This post does the grouping. Still no answer column: nobody tells us which states or
wines belong together. We have to carve the cloud of dots into K piles ourselves, using
nothing but the straight-line gap from Part 1.
K-means is the simplest grouping machine there is, and the whole thing is two moves
repeated until nothing changes: ASSIGN every dot to its nearest centre, then MOVE each
centre to the middle of the dots that chose it. That is it. Two moves, on a loop.
## The Job
a cloud of dots (rows), and a number K you pick
carve the cloud into K piles so each pile is tight (its dots sit close together)
* * * [ * * ] [ * ]
* * * * -> [ * * ] and [ * * ] (K = 2 piles)
* * * tight pile tight pile
You choose K up front -- "I want 3 groups." The machine does not invent K; it only
finds the best K piles once you name the number. Picking K well is its own problem,
and we get to it at the end.
## The Two Moves
Start by dropping K centres anywhere (pick K random dots to be the first centres).
Then loop:
MOVE 1 -- ASSIGN. For every dot, measure its straight-line gap to all K centres.
The dot joins the pile of its NEAREST centre.
MOVE 2 -- UPDATE. For each pile, find the middle (the mean of its dots, column by
column). That mean becomes the pile's new centre.
repeat ASSIGN, then UPDATE, then ASSIGN, then UPDATE ...
STOP when no dot changes pile (the centres stop moving)
Each loop can only make the piles tighter or leave them the same -- never worse. So it
always settles. The settling point is the answer.
Why can UPDATE never make things worse? Because the mean IS the spot that makes its
own pile's squared gaps smallest -- that is what a mean does. Moving the centre to
the mean either lowers the pile's tightness or leaves it alone; it cannot raise it.
(A quiz likes to claim "the update step increases the within-group sum of squares" --
that is exactly backwards.)
>> NOTE: COMPARE SQUARED GAPS -- SKIP THE SQUARE ROOT
ASSIGN only asks WHICH centre is nearest, never HOW FAR. If gap A-squared is
smaller than gap B-squared, then gap A is smaller than gap B -- squaring keeps
the order. So the machine compares squared gaps directly and never takes a
root. Same winner, less arithmetic.
## A Worked Example, by Hand
Six dots on one ruler (one column, to keep the arithmetic small). K = 2.
dots: 1 2 3 10 11 12
Drop two starting centres. Say the random pick lands on 2 and 10.
ROUND 1 -- ASSIGN (each dot to nearest centre):
dot 1 -> centre 2 (gap 1 vs 9) pile A
dot 2 -> centre 2 (gap 0) pile A
dot 3 -> centre 2 (gap 1 vs 7) pile A
dot 10 -> centre 10 pile B
dot 11 -> centre 10 (gap 1) pile B
dot 12 -> centre 10 (gap 2) pile B
ROUND 1 -- UPDATE (mean of each pile):
pile A = mean(1,2,3) = 2
pile B = mean(10,11,12) = 11
new centres: 2 and 11
ROUND 2 -- ASSIGN: every dot stays in the same pile.
Nobody moved. STOP.
Final piles: {1, 2, 3} and {10, 11, 12}. Centres 2 and 11.
The machine found the two obvious clumps on its own, with no answer key.
>> YOUR TURN
A seventh dot walks in at 5. The settled centres are 2 and 11. Which pile does
ASSIGN hand it to? Compare squared gaps -- no roots needed.
check your slate: gap to centre 2: 5 - 2 = 3, squared 9; gap to centre 11:
5 - 11 = -6, squared 36. 9 < 36, so the dot joins pile A -- and since squaring
keeps the order, the winner is the same one a square root would have crowned.
## The Score: Tightness (Inertia)
IN HAND: six dots carved into two piles, {1,2,3} and {10,11,12}, by two moves on a
loop; the settled centres are mean(1,2,3) = 6/3 = 2 and mean(10,11,12) = 33/3 = 11.
This section adds: one number that says how GOOD that grouping is.
How good is a grouping? Measure how tight the piles are. For every dot, take its
squared gap to its OWN centre, and add them all up:
tightness = sum over all dots of (gap from dot to its centre)^2
small tightness = dots hug their centres = tight, good piles
big tightness = dots sprawl far from their centres = loose, poor piles
Foot it on the six dots. Pile A first, centre mean(1,2,3) = 6/3 = 2:
dot gap to centre 2 squared
-----------------------------------
1 1 - 2 = -1 1
2 2 - 2 = 0 0
3 3 - 2 = 1 1
pile A sum = 2
>> YOUR TURN
Foot pile B the same way (dots 10, 11, 12; centre mean(10,11,12) = 33/3 = 11),
then add the two piles into one tightness for the whole grouping.
check your slate: 10 - 11 = -1, squared 1; 11 - 11 = 0, squared 0;
12 - 11 = 1, squared 1; pile B sum = 1 + 0 + 1 = 2. Whole grouping:
2 + 2 = 4. Four is the number the two moves were driving down all along.
K-means is just the machine that drives this one number as low as it can, by the two
moves on a loop. This is the same "bottom of the bowl" idea from Chapter 1 -- squared
misses, pushed to their smallest -- only now the "miss" is a dot's gap from its centre,
and there is no answer column anywhere.
## The Catch: Where You Drop the First Centres Matters
K-means settles, but it can settle into a BAD grouping if the starting centres were
unlucky. It finds a low point, not always the LOWEST point.
unlucky start: two centres land inside the same clump
-> the machine splits one real clump in half and merges two others
-> it settles, but the piles are wrong
The fix is cheap: run the whole thing several times from different random starts, and
keep the run with the smallest tightness. The toolbox does this for you (n_init), and
a smarter starting trick (k-means++) spreads the first centres far apart on purpose.
## Choosing K: The Elbow
IN HAND: two moves on a loop, a tightness to score what they settle on (the six-dot
grouping foots to 1+0+1 = 2 per pile, 2 + 2 = 4 in all), and the warning that an
unlucky start can trap the machine in a poor low spot. This section adds: how to
choose K itself.
You picked K by hand -- but which K? Run K-means for K = 1, 2, 3, 4, 5, ... and plot
the tightness for each:
tightness |*
| \
| \
| *
| \___
| *----*----*----* <- the curve flattens out
+------------------------- K
1 2 3 4 5 6
Tightness always drops as K grows (more centres -> dots sit closer to one). At K = the
number of real clumps, the drop suddenly flattens -- adding more centres barely helps.
That bend is the ELBOW, and it is your best guess at the true number of groups.
>> NOTE: THE ELBOW IS A JUDGEMENT CALL, NOT A FORMULA
There is no equation that spits out "K = 3." You look at the bend and decide.
Sometimes the bend is sharp and obvious; sometimes it is a gentle curve with no
clear corner, and reasonable people pick different K. That is honest -- with no
answer column, there is no single right number of groups.
If "big vs small by eye" feels like astrology, there is a judgement-free trick:
lay a ruler from the FIRST dot of the curve to the LAST dot, then measure how far
each dot sags below that ruler. The dot with the BIGGEST sag is the elbow. The
two endpoints always sag zero, so the answer is forced to sit somewhere in between.
## A Second Ruler: How Well Does Each Dot Fit Its Pile?
IN HAND: a grouping for every K, each scored by one tightness total (at K = 2 the six
dots foot to 2 + 2 = 4), and an elbow to squint at on the tightness curve. This
section adds: a second ruler, read per dot, with a peak instead of a bend.
The elbow reads ONE number for the whole grouping (total tightness) and asks you to
eyeball a bend. There is a second, sharper question you can ask of EVERY single dot:
"Are you snug in your own pile -- or would the pile next door fit you better?"
For one dot, compute two averages:
a = average gap from this dot to its OWN pile-mates (how far from home)
b = average gap from this dot to the NEAREST OTHER pile (how far to next door)
fit score = (b - a) / whichever of a, b is bigger
near +1 -> home is much closer than next door. snug. well-grouped.
near 0 -> home and next door are the same distance. on the fence.
negative -> next door is CLOSER than home. this dot is in the WRONG pile.
By hand, on the six dots from before ({1,2,3} and {10,11,12}):
dot 1: a = avg gap to {2,3} = (1 + 2) / 2 = 1.5
b = avg gap to {10,11,12} = (9 + 10 + 11) / 3 = 10
fit = (10 - 1.5) / 10 = 0.85 <- snug at home
dot 3: a = avg gap to {1,2} = (2 + 1) / 2 = 1.5
b = avg gap to {10,11,12} = (7 + 8 + 9) / 3 = 8
fit = (8 - 1.5) / 8 = 0.81 <- the pile's edge dot, still snug
Average the fit score over ALL dots and you get one number for the whole grouping.
Now run K-means for K = 2, 3, 4, 5 ... and plot that average:
avg fit | *
| / \
| / \
| * *---*
| /
+--------------------- K
2 3 4 5
No bend to squint at -- you just pick the K with the HIGHEST peak. A clean
maximum instead of a fuzzy corner.
** KEY: TWO RULERS, SAME QUESTION, PICK DIFFERENTLY
The elbow reads TIGHTNESS (squared gaps to centres) and you hunt a BEND.
The fit score reads HOME-VS-NEXT-DOOR gaps and you hunt a PEAK.
They answer the same "how many groups?" question with different measurements,
and the peak is the easier read. When both agree on K, trust it.
>> NOTE: THE FIT SCORE STARTS AT K = 2, NOT K = 1
With one pile there is no "next door" -- b does not exist, so the score is
undefined. The elbow curve can start at K = 1; the fit-score curve cannot.
That is why the loop below runs K = 2 upward.
One more read the fit score gives for free: average it PER PILE, not just overall.
A whole pile averaging near 0 means that entire pile straddles the border with its
neighbour -- two piles drawn where the data holds one.
## What Makes ANY Grouper Good
Before the tripwires, the yardstick that outlives K-means. Whatever grouping
machine you ever meet, it is judged on the same short list:
within a pile: look-alikes hugging -> want gaps SMALL
between piles: piles far from each other -> want gaps BIG
bonus: handles bent or stretched crowds, not only round blobs
bonus: still runs when the sheet holds a million dots
never: depends on an answer column -- there isn't one
K-means scores well on the million-dots line and fails the bent-crowds line: one
gap-to-centre treats every pile as a round, same-size blob. Crowds shaped like
ribbons or crescents get butchered. The family tree (Part 4, single linkage) chains
along bent shapes; there are also machines that fit each pile a lean-and-stretch
recipe (Gaussian mixtures) instead of a single centre.
## Common Tripwires I Caught
TRIPWIRE 1: Standardise first -- same as Part 1.
K-means lives entirely on the straight-line gap, so a column measured in
thousands hijacks every distance. Put every column on the same ruler
(mean 0, spread 1) before the first centre is dropped.
TRIPWIRE 2: K-means needs round, similar-sized clumps.
The "nearest centre" rule draws straight walls halfway between centres. If
the real groups are long ribbons or crescents, K-means slices them wrongly.
It assumes blobs. When the shape is weird, the family tree in Part 4 or
other methods do better.
TRIPWIRE 3: The answer changes with the random start.
Two runs can hand back different piles. This is not a bug -- it is the
unlucky-start problem. Run several times (n_init) and keep the tightest.
TRIPWIRE 4: K is yours to choose; the machine will obey blindly.
Ask for K = 5 on data with 2 real clumps and it WILL hand you 5 piles,
happily splitting real groups to do it. The elbow is your guard against
a silly K.
TRIPWIRE 5: Tightness (inertia) always falls as K rises.
You cannot pick K by "smallest tightness" -- that just picks the biggest K
(one centre per dot, tightness 0, useless). You pick K at the ELBOW, where
the falling stops paying off. Chasing the "smallest DROP between neighbouring
K values" is equally wrong: the smallest drop is always at the far right, where
tightness is nearly flat and each extra centre barely moves the needle. The
smallest drop is a way of finding the most useless K, not the most useful one.
TRIPWIRE 6: A centre is not a real dot.
The centre is the MEAN of a pile -- usually a point with no actual row
sitting on it. It is the pile's centre of gravity, not a member.
TRIPWIRE 7: The fit score's b uses the NEAREST other pile, not all of them.
With 3+ piles it is tempting to average the gaps to every foreign dot.
Wrong -- b is the average gap to the one CLOSEST foreign pile only. The
question is "would next door fit better?", and next door means the nearest
neighbour pile, not the whole neighbourhood.
TRIPWIRE 8: Tightness is ONE total, not one-per-dot.
Each dot gets a pile label; the GROUPING gets a single tightness number
(all the squared gaps summed). The fit score is the opposite: one score
per dot first, averaged into one number at the end. Do not mix the two
shapes up.
TRIPWIRE 9: .labels_ gives pile number, not distance.
After .fit, km.labels_ holds one integer per row -- which pile that
row landed in (0, 1, 2 ... K-1). It tells you WHICH pile, not HOW
FAR from its centre. To get distances, call km.transform(X), which
returns a gap-to-every-centre matrix.
TRIPWIRE 10: STOP means no dot switched pile -- not "distances stopped changing."
The loop halts when an entire round produces zero pile-switches: every
dot ends in the same pile it was already in. A dot CAN keep the same
label while its distance to the centre changes -- the centre moved
because its other pile-mates shifted. The STOP check is "did this
round's assignment exactly match the previous round's?" not "did the
distances stop shrinking?"
TRIPWIRE 11: Distance lives only in the ASSIGN step.
Two moves, one loop. ASSIGN uses the straight-line gap to choose the
nearest centre. UPDATE uses column means to recompute each centre --
no distances anywhere. The UPDATE step averages coordinates; it never
measures a gap. Keeping the two moves separate makes the whole
algorithm a lot cleaner to reason about.
## The Labels, Last
Plain term used above Standard label
----------------------------------- ------------------------------------------
grouping with no answer column clustering (unsupervised)
a pile of dots a cluster
the centre of a pile centroid / cluster mean
assign each dot to nearest centre the assignment step
move each centre to its pile's mean the update step
tightness (sum of squared gaps) inertia / within-cluster sum of squares
the bend in the tightness curve the elbow method
the biggest sag below the ruler the perpendicular-distance elbow rule
home vs next-door fit score silhouette coefficient
average fit over all dots mean silhouette score
lean-and-stretch pile recipes Gaussian mixture models (GMM)
pick K at the highest peak silhouette analysis
run from several random starts n_init / multiple restarts
spread first centres far apart k-means++ initialisation
number of piles you pick K (the n_clusters hyperparameter)
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
>> NEW TO PYTHON? Each named once:
KMeans(n_clusters=K) -- the grouping machine, K piles
.fit(X) -- run the two moves on a loop until settled
.labels_ -- which pile each row landed in (0..K-1)
.cluster_centers_ -- the final centre of each pile
.inertia_ -- the tightness score (lower = tighter)
silhouette_score(X, labels) -- the average home-vs-next-door fit score
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# same-ruler first
X_scaled = StandardScaler().fit_transform(df)
# one grouping with K = 3
km = KMeans(n_clusters=3, n_init=10, random_state=42)
km.fit(X_scaled)
labels = km.labels_ # pile per row
centres = km.cluster_centers_ # the 3 centres
print(km.inertia_) # the tightness
# the elbow: tightness for K = 1..8
tightness = []
for k in range(1, 9):
km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
tightness.append(km.inertia_)
plt.plot(range(1, 9), tightness, marker="o")
plt.xlabel("K (number of piles)")
plt.ylabel("tightness (inertia)")
plt.title("Elbow: where does adding piles stop helping?")
plt.grid(True, linestyle="--", alpha=0.4)
plt.show()
# pick K at the bend
# the second ruler: average fit score per K (starts at K=2 -- see the note)
from sklearn.metrics import silhouette_score
fits = []
for k in range(2, 9):
km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
fits.append(silhouette_score(X_scaled, km.labels_))
plt.plot(range(2, 9), fits, marker="o")
plt.xlabel("K (number of piles)")
plt.ylabel("average fit score (silhouette)")
plt.title("Fit score: pick K at the highest peak")
plt.grid(True, linestyle="--", alpha=0.4)
plt.show()
# pick K at the PEAK -- no squinting at a bend
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 6 -- Finding Patterns Without Answers):
Part 1 -- Looking at a Sheet With No Answers .
Part 2 -- The Strongest Direction (PCA) .
Part 3 (this post) .
Part 4 -- The Family Tree (Hierarchical Clustering) .
Part 5 -- Both Tools on NCI60 (Re-visited) .
Part 6 -- Filling the Blanks (Recommender Systems)
<- Back to all posts
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
home . source on GitHub
==============================================================================================