==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 3 OF 6
  Grouping by Nearest Centre: K-Means From a Blank Sheet
  Posted: 2026-06-09 . Author: Rahul Rai . Tags: k-means, clustering, unsupervised
  ============================================================================================

  PATH . post 18 of 28
    <- prev:  Chapter 6, Part 2: The Strongest Direction (PCA)
       next:  Chapter 6, Part 4: The Family Tree ->

  Part 1 measured the gap between every pair of rows.  Part 2 crushed many columns down
  to a flat page.  Neither one actually GROUPED anything -- they just measured and drew.
  This post does the grouping.  Still no answer column: nobody tells us which states or
  wines belong together.  We have to carve the cloud of dots into K piles ourselves, using
  nothing but the straight-line gap from Part 1.

  K-means is the simplest grouping machine there is, and the whole thing is two moves
  repeated until nothing changes: ASSIGN every dot to its nearest centre, then MOVE each
  centre to the middle of the dots that chose it.  That is it.  Two moves, on a loop.

  ## The Job

    a cloud of dots (rows), and a number K you pick
    carve the cloud into K piles so each pile is tight (its dots sit close together)

      *  *           *                    [ * * ]         [ * ]
       * *      *  *           ->          [ * * ]   and   [ * * ]   (K = 2 piles)
     *      *  *                          tight pile      tight pile

  You choose K up front -- "I want 3 groups."  The machine does not invent K; it only
  finds the best K piles once you name the number.  Picking K well is its own problem,
  and we get to it at the end.

  ## The Two Moves

  Start by dropping K centres anywhere (pick K random dots to be the first centres).
  Then loop:

    MOVE 1 -- ASSIGN.  For every dot, measure its straight-line gap to all K centres.
              The dot joins the pile of its NEAREST centre.

    MOVE 2 -- UPDATE.  For each pile, find the middle (the mean of its dots, column by
              column).  That mean becomes the pile's new centre.

    repeat ASSIGN, then UPDATE, then ASSIGN, then UPDATE ...
    STOP when no dot changes pile (the centres stop moving)

  Each loop can only make the piles tighter or leave them the same -- never worse.  So it
  always settles.  The settling point is the answer.

  Why can UPDATE never make things worse?  Because the mean IS the spot that makes its
  own pile's squared gaps smallest -- that is what a mean does.  Moving the centre to
  the mean either lowers the pile's tightness or leaves it alone; it cannot raise it.
  (A quiz likes to claim "the update step increases the within-group sum of squares" --
  that is exactly backwards.)

  >> NOTE: COMPARE SQUARED GAPS -- SKIP THE SQUARE ROOT
     ASSIGN only asks WHICH centre is nearest, never HOW FAR.  If gap A-squared is
     smaller than gap B-squared, then gap A is smaller than gap B -- squaring keeps
     the order.  So the machine compares squared gaps directly and never takes a
     root.  Same winner, less arithmetic.

  ## A Worked Example, by Hand

  Six dots on one ruler (one column, to keep the arithmetic small).  K = 2.

    dots:   1   2   3       10  11  12

  Drop two starting centres.  Say the random pick lands on 2 and 10.

    ROUND 1 -- ASSIGN (each dot to nearest centre):
      dot 1  -> centre 2  (gap 1 vs 9)     pile A
      dot 2  -> centre 2  (gap 0)          pile A
      dot 3  -> centre 2  (gap 1 vs 7)     pile A
      dot 10 -> centre 10                  pile B
      dot 11 -> centre 10 (gap 1)          pile B
      dot 12 -> centre 10 (gap 2)          pile B

    ROUND 1 -- UPDATE (mean of each pile):
      pile A = mean(1,2,3)    = 2
      pile B = mean(10,11,12) = 11
      new centres: 2 and 11

    ROUND 2 -- ASSIGN: every dot stays in the same pile.
      Nobody moved.  STOP.

    Final piles:  {1, 2, 3}  and  {10, 11, 12}.  Centres 2 and 11.

  The machine found the two obvious clumps on its own, with no answer key.

  >> YOUR TURN
     A seventh dot walks in at 5.  The settled centres are 2 and 11.  Which pile does
     ASSIGN hand it to?  Compare squared gaps -- no roots needed.

     check your slate:  gap to centre 2: 5 - 2 = 3, squared 9;  gap to centre 11:
     5 - 11 = -6, squared 36.  9 < 36, so the dot joins pile A -- and since squaring
     keeps the order, the winner is the same one a square root would have crowned.

  ## The Score: Tightness (Inertia)

  IN HAND: six dots carved into two piles, {1,2,3} and {10,11,12}, by two moves on a
  loop; the settled centres are mean(1,2,3) = 6/3 = 2 and mean(10,11,12) = 33/3 = 11.
  This section adds: one number that says how GOOD that grouping is.

  How good is a grouping?  Measure how tight the piles are.  For every dot, take its
  squared gap to its OWN centre, and add them all up:

    tightness = sum over all dots of (gap from dot to its centre)^2

    small tightness = dots hug their centres = tight, good piles
    big tightness   = dots sprawl far from their centres = loose, poor piles

  Foot it on the six dots.  Pile A first, centre mean(1,2,3) = 6/3 = 2:

    dot    gap to centre 2    squared
    -----------------------------------
     1       1 - 2 = -1          1
     2       2 - 2 =  0          0
     3       3 - 2 =  1          1
                      pile A sum = 2

  >> YOUR TURN
     Foot pile B the same way (dots 10, 11, 12; centre mean(10,11,12) = 33/3 = 11),
     then add the two piles into one tightness for the whole grouping.

     check your slate:  10 - 11 = -1, squared 1;  11 - 11 = 0, squared 0;
     12 - 11 = 1, squared 1;  pile B sum = 1 + 0 + 1 = 2.  Whole grouping:
     2 + 2 = 4.  Four is the number the two moves were driving down all along.

  K-means is just the machine that drives this one number as low as it can, by the two
  moves on a loop.  This is the same "bottom of the bowl" idea from Chapter 1 -- squared
  misses, pushed to their smallest -- only now the "miss" is a dot's gap from its centre,
  and there is no answer column anywhere.

  ## The Catch: Where You Drop the First Centres Matters

  K-means settles, but it can settle into a BAD grouping if the starting centres were
  unlucky.  It finds a low point, not always the LOWEST point.

    unlucky start:  two centres land inside the same clump
                    -> the machine splits one real clump in half and merges two others
                    -> it settles, but the piles are wrong

  The fix is cheap: run the whole thing several times from different random starts, and
  keep the run with the smallest tightness.  The toolbox does this for you (n_init), and
  a smarter starting trick (k-means++) spreads the first centres far apart on purpose.

  ## Choosing K: The Elbow

  IN HAND: two moves on a loop, a tightness to score what they settle on (the six-dot
  grouping foots to 1+0+1 = 2 per pile, 2 + 2 = 4 in all), and the warning that an
  unlucky start can trap the machine in a poor low spot.  This section adds: how to
  choose K itself.

  You picked K by hand -- but which K?  Run K-means for K = 1, 2, 3, 4, 5, ... and plot
  the tightness for each:

    tightness |*
              | \
              |  \
              |   *
              |    \___
              |        *----*----*----*   <- the curve flattens out
              +------------------------- K
                1   2   3   4   5   6

  Tightness always drops as K grows (more centres -> dots sit closer to one).  At K = the
  number of real clumps, the drop suddenly flattens -- adding more centres barely helps.
  That bend is the ELBOW, and it is your best guess at the true number of groups.

  >> NOTE: THE ELBOW IS A JUDGEMENT CALL, NOT A FORMULA
     There is no equation that spits out "K = 3."  You look at the bend and decide.
     Sometimes the bend is sharp and obvious; sometimes it is a gentle curve with no
     clear corner, and reasonable people pick different K.  That is honest -- with no
     answer column, there is no single right number of groups.

     If "big vs small by eye" feels like astrology, there is a judgement-free trick:
     lay a ruler from the FIRST dot of the curve to the LAST dot, then measure how far
     each dot sags below that ruler.  The dot with the BIGGEST sag is the elbow.  The
     two endpoints always sag zero, so the answer is forced to sit somewhere in between.

  ## A Second Ruler: How Well Does Each Dot Fit Its Pile?

  IN HAND: a grouping for every K, each scored by one tightness total (at K = 2 the six
  dots foot to 2 + 2 = 4), and an elbow to squint at on the tightness curve.  This
  section adds: a second ruler, read per dot, with a peak instead of a bend.

  The elbow reads ONE number for the whole grouping (total tightness) and asks you to
  eyeball a bend.  There is a second, sharper question you can ask of EVERY single dot:

    "Are you snug in your own pile -- or would the pile next door fit you better?"

  For one dot, compute two averages:

    a = average gap from this dot to its OWN pile-mates        (how far from home)
    b = average gap from this dot to the NEAREST OTHER pile    (how far to next door)

    fit score = (b - a) / whichever of a, b is bigger

    near +1   ->  home is much closer than next door.  snug.  well-grouped.
    near  0   ->  home and next door are the same distance.  on the fence.
    negative  ->  next door is CLOSER than home.  this dot is in the WRONG pile.

  By hand, on the six dots from before ({1,2,3} and {10,11,12}):

    dot 1:  a = avg gap to {2,3}        = (1 + 2) / 2        = 1.5
            b = avg gap to {10,11,12}   = (9 + 10 + 11) / 3  = 10
            fit = (10 - 1.5) / 10  = 0.85      <- snug at home

    dot 3:  a = avg gap to {1,2}        = (2 + 1) / 2        = 1.5
            b = avg gap to {10,11,12}   = (7 + 8 + 9) / 3    = 8
            fit = (8 - 1.5) / 8    = 0.81      <- the pile's edge dot, still snug

  Average the fit score over ALL dots and you get one number for the whole grouping.
  Now run K-means for K = 2, 3, 4, 5 ... and plot that average:

    avg fit |        *
            |       / \
            |      /   \
            |     *     *---*
            |    /
            +--------------------- K
               2    3    4    5

  No bend to squint at -- you just pick the K with the HIGHEST peak.  A clean
  maximum instead of a fuzzy corner.

  ** KEY: TWO RULERS, SAME QUESTION, PICK DIFFERENTLY
     The elbow reads TIGHTNESS (squared gaps to centres) and you hunt a BEND.
     The fit score reads HOME-VS-NEXT-DOOR gaps and you hunt a PEAK.
     They answer the same "how many groups?" question with different measurements,
     and the peak is the easier read.  When both agree on K, trust it.

  >> NOTE: THE FIT SCORE STARTS AT K = 2, NOT K = 1
     With one pile there is no "next door" -- b does not exist, so the score is
     undefined.  The elbow curve can start at K = 1; the fit-score curve cannot.
     That is why the loop below runs K = 2 upward.

  One more read the fit score gives for free: average it PER PILE, not just overall.
  A whole pile averaging near 0 means that entire pile straddles the border with its
  neighbour -- two piles drawn where the data holds one.

  ## What Makes ANY Grouper Good

  Before the tripwires, the yardstick that outlives K-means.  Whatever grouping
  machine you ever meet, it is judged on the same short list:

    within a pile:    look-alikes hugging          -> want gaps SMALL
    between piles:    piles far from each other    -> want gaps BIG
    bonus:            handles bent or stretched crowds, not only round blobs
    bonus:            still runs when the sheet holds a million dots
    never:            depends on an answer column -- there isn't one

  K-means scores well on the million-dots line and fails the bent-crowds line: one
  gap-to-centre treats every pile as a round, same-size blob.  Crowds shaped like
  ribbons or crescents get butchered.  The family tree (Part 4, single linkage) chains
  along bent shapes; there are also machines that fit each pile a lean-and-stretch
  recipe (Gaussian mixtures) instead of a single centre.


  ## Common Tripwires I Caught

    TRIPWIRE 1:  Standardise first -- same as Part 1.
       K-means lives entirely on the straight-line gap, so a column measured in
       thousands hijacks every distance.  Put every column on the same ruler
       (mean 0, spread 1) before the first centre is dropped.

    TRIPWIRE 2:  K-means needs round, similar-sized clumps.
       The "nearest centre" rule draws straight walls halfway between centres.  If
       the real groups are long ribbons or crescents, K-means slices them wrongly.
       It assumes blobs.  When the shape is weird, the family tree in Part 4 or
       other methods do better.

    TRIPWIRE 3:  The answer changes with the random start.
       Two runs can hand back different piles.  This is not a bug -- it is the
       unlucky-start problem.  Run several times (n_init) and keep the tightest.

    TRIPWIRE 4:  K is yours to choose; the machine will obey blindly.
       Ask for K = 5 on data with 2 real clumps and it WILL hand you 5 piles,
       happily splitting real groups to do it.  The elbow is your guard against
       a silly K.

    TRIPWIRE 5:  Tightness (inertia) always falls as K rises.
       You cannot pick K by "smallest tightness" -- that just picks the biggest K
       (one centre per dot, tightness 0, useless).  You pick K at the ELBOW, where
       the falling stops paying off.  Chasing the "smallest DROP between neighbouring
       K values" is equally wrong: the smallest drop is always at the far right, where
       tightness is nearly flat and each extra centre barely moves the needle.  The
       smallest drop is a way of finding the most useless K, not the most useful one.

    TRIPWIRE 6:  A centre is not a real dot.
       The centre is the MEAN of a pile -- usually a point with no actual row
       sitting on it.  It is the pile's centre of gravity, not a member.

    TRIPWIRE 7:  The fit score's b uses the NEAREST other pile, not all of them.
       With 3+ piles it is tempting to average the gaps to every foreign dot.
       Wrong -- b is the average gap to the one CLOSEST foreign pile only.  The
       question is "would next door fit better?", and next door means the nearest
       neighbour pile, not the whole neighbourhood.

    TRIPWIRE 8:  Tightness is ONE total, not one-per-dot.
       Each dot gets a pile label; the GROUPING gets a single tightness number
       (all the squared gaps summed).  The fit score is the opposite: one score
       per dot first, averaged into one number at the end.  Do not mix the two
       shapes up.

    TRIPWIRE 9:  .labels_ gives pile number, not distance.
       After .fit, km.labels_ holds one integer per row -- which pile that
       row landed in (0, 1, 2 ... K-1).  It tells you WHICH pile, not HOW
       FAR from its centre.  To get distances, call km.transform(X), which
       returns a gap-to-every-centre matrix.

    TRIPWIRE 10:  STOP means no dot switched pile -- not "distances stopped changing."
       The loop halts when an entire round produces zero pile-switches: every
       dot ends in the same pile it was already in.  A dot CAN keep the same
       label while its distance to the centre changes -- the centre moved
       because its other pile-mates shifted.  The STOP check is "did this
       round's assignment exactly match the previous round's?" not "did the
       distances stop shrinking?"

    TRIPWIRE 11:  Distance lives only in the ASSIGN step.
       Two moves, one loop.  ASSIGN uses the straight-line gap to choose the
       nearest centre.  UPDATE uses column means to recompute each centre --
       no distances anywhere.  The UPDATE step averages coordinates; it never
       measures a gap.  Keeping the two moves separate makes the whole
       algorithm a lot cleaner to reason about.


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    grouping with no answer column        clustering (unsupervised)
    a pile of dots                        a cluster
    the centre of a pile                  centroid / cluster mean
    assign each dot to nearest centre     the assignment step
    move each centre to its pile's mean   the update step
    tightness (sum of squared gaps)       inertia / within-cluster sum of squares
    the bend in the tightness curve       the elbow method
    the biggest sag below the ruler       the perpendicular-distance elbow rule
    home vs next-door fit score           silhouette coefficient
    average fit over all dots             mean silhouette score
    lean-and-stretch pile recipes         Gaussian mixture models (GMM)
    pick K at the highest peak            silhouette analysis
    run from several random starts        n_init / multiple restarts
    spread first centres far apart        k-means++ initialisation
    number of piles you pick              K (the n_clusters hyperparameter)


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  >> NEW TO PYTHON? Each named once:
       KMeans(n_clusters=K)   -- the grouping machine, K piles
       .fit(X)                -- run the two moves on a loop until settled
       .labels_               -- which pile each row landed in (0..K-1)
       .cluster_centers_      -- the final centre of each pile
       .inertia_              -- the tightness score (lower = tighter)
       silhouette_score(X, labels) -- the average home-vs-next-door fit score

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # same-ruler first
    X_scaled = StandardScaler().fit_transform(df)

    # one grouping with K = 3
    km = KMeans(n_clusters=3, n_init=10, random_state=42)
    km.fit(X_scaled)
    labels = km.labels_              # pile per row
    centres = km.cluster_centers_    # the 3 centres
    print(km.inertia_)               # the tightness

    # the elbow: tightness for K = 1..8
    tightness = []
    for k in range(1, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
        tightness.append(km.inertia_)

    plt.plot(range(1, 9), tightness, marker="o")
    plt.xlabel("K (number of piles)")
    plt.ylabel("tightness (inertia)")
    plt.title("Elbow: where does adding piles stop helping?")
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.show()
    # pick K at the bend

    # the second ruler: average fit score per K  (starts at K=2 -- see the note)
    from sklearn.metrics import silhouette_score
    fits = []
    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
        fits.append(silhouette_score(X_scaled, km.labels_))

    plt.plot(range(2, 9), fits, marker="o")
    plt.xlabel("K (number of piles)")
    plt.ylabel("average fit score (silhouette)")
    plt.title("Fit score: pick K at the highest peak")
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.show()
    # pick K at the PEAK -- no squinting at a bend


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 6 -- Finding Patterns Without Answers):
    Part 1 -- Looking at a Sheet With No Answers .
    Part 2 -- The Strongest Direction (PCA) .
    Part 3 (this post) .
    Part 4 -- The Family Tree (Hierarchical Clustering) .
    Part 5 -- Both Tools on NCI60 (Re-visited) .
    Part 6 -- Filling the Blanks (Recommender Systems)

  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
  home . source on GitHub
==============================================================================================