The Family Tree: Hierarchical Clustering and the Dendrogram

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 4 OF 6
  The Family Tree: Hierarchical Clustering and the Dendrogram
  Posted: 2026-06-09 . Author: Rahul Rai . Tags: hierarchical-clustering, dendrogram, linkage
  ============================================================================================

  PATH . post 19 of 28
    <- prev:  Chapter 6, Part 3: Grouping by Nearest Centre (K-Means)
       next:  Chapter 6, Part 5: Both Tools on NCI60 ->

  K-means in Part 3 made you pick K up front -- "I want 3 piles" -- and handed back 3
  piles, no questions asked.  But what if you do not KNOW how many groups there are?  What
  if you want to see the groups-inside-groups -- states that pair off, those pairs joining
  into regions, regions joining into halves of the country?

  Hierarchical clustering builds the whole family tree at once.  It never asks for K.  It
  starts with every row alone and keeps marrying the two closest groups, one wedding at a
  time, until everyone is in one giant family.  Then you read the tree and cut it wherever
  you like -- and the height of your cut decides how many groups you get.

  ## The Sheet (Same as Part 1)

    50 states, 3 measurements (Murder, Assault, UrbanPop).  No answer column.
    Standardised first, so every column sits on the same ruler.

  The one tool we need is a 50x50 SHEET OF GAPS: the straight-line distance between every
  pair of states (for each pair, subtract column by column, square the gaps, add them, take
  the root).  Hierarchical clustering is just a disciplined way to read that sheet, smallest
  gap first.

  Count that sheet in clerk-steps.  50 states make 50 x 49 / 2 = 1,225 distinct pairs;
  each straight-line gap over 3 columns costs 3 subtracts, 3 squarings, 2 adds, 1 root
  = 9 strokes, so the whole sheet is 1,225 x 9 = 11,025 strokes -- one clerk's long
  morning, and (by the row-merge trick below) the ONLY measuring the whole tree needs.

  ## The One Move: Marry the Two Closest

  Start with 50 groups -- every state is its own lonely group of one.  Then repeat one
  move until a single group remains:

    find the TWO closest groups in the whole sheet
    marry them into one new group
    -> 50 groups become 49, then 48, then 47 ... down to 1

  Each wedding is recorded with the HEIGHT at which it happened (the gap between the two
  groups that married).  Early weddings happen at tiny heights -- near-twins joining.
  Late weddings happen way up high -- forcing together groups that are nothing alike.

  >> NOTE: NO DICE IN THIS MACHINE
     K-means (Part 3) starts from random flags and can land badly -- you re-run it ten
     times and keep the best.  The family tree has no random start at all: the same
     sheet of gaps gives the SAME tree, every single run.  Nothing to re-roll.

  ## A Worked Example, by Hand

  Four dots on one ruler.  Gaps are just differences.

    dots:   1    2         6    8

  Start: every dot alone -> {1} {2} {6} {8}

    WEDDING 1 -- closest pair is {1} and {2}, gap 1 (smallest in the sheet).
       marry them at height 1  ->  {1,2} {6} {8}

    WEDDING 2 -- closest pair now is {6} and {8}, gap 2.
       marry them at height 2  ->  {1,2} {6,8}

    WEDDING 3 -- only two groups left: {1,2} and {6,8}.
       Using COMPLETE linkage (farthest cross-pair), list all four cross-pair gaps:
         |1-6| = 5,  |1-8| = 7,  |2-6| = 4,  |2-8| = 6
       Complete linkage takes the FARTHEST: max(5, 7, 4, 6) = 7.
       marry at height 7.
       ->  {1,2,6,8}   one family.  done.

  >> YOUR TURN
     Re-run WEDDING 3 under SINGLE linkage (closest cross-pair) and AVERAGE linkage
     (mean of the four cross-pairs |1-6|=5, |1-8|=7, |2-6|=4, |2-8|=6) instead of
     complete.

     check your slate:  SINGLE = min(5, 7, 4, 6) = 4;  AVERAGE = (5 + 7 + 4 + 6)/4
     = 22/4 = 5.5;  COMPLETE (shown above) = max = 7.  So the same four dots give
     root heights 4 < 5.5 < 7 -- single marries soonest, complete waits longest.

  Three weddings, at heights 1, 2, and 7.  Draw them as a tree:

    height
      7 |      +--------------+
        |      |              |
      2 |      |          +---+---+
        |      |          |       |
      1 |  +---+---+      |       |
        |  |       |      |       |
      0 +--1-------2------6-------8---
             twins        twins
        \____ joined late at height 7 ____/

  That tree is the DENDROGRAM.  Tall joins = forced marriages between unlike groups.
  Short joins = natural twins.

  ## Cut the Tree to Get Groups

  IN HAND: four dots 1, 2, 6, 8 married into one family closest-pair-first -- {1,2} at
  height 1, {6,8} at height 2, then the two pairs at height max(5,7,4,6) = 7.  This
  section reads groups back OUT of that finished tree.

  Here is the magic K-means could not do: you choose the number of groups AFTER seeing the
  tree, by drawing one horizontal line across it.

    cut LOW (height 1.5):  3 groups -> {1,2} {6} {8}
    cut MID (height 3):    2 groups -> {1,2} {6,8}
    cut HIGH (height 8):   1 group  -> everybody

  >> YOUR TURN
     The three weddings sit at heights 1, 2, and 7.  Slide your cut to height 5.  How
     many groups fall out?

     check your slate:  a cut keeps every wedding BELOW it and undoes every wedding
     ABOVE it.  Below 5: the height-1 and height-2 weddings stand ({1,2} and {6,8});
     above 5: the height-7 wedding is undone.  So 2 groups -- {1,2} and {6,8}.  Any
     cut between 2 and 7 gives the same answer.

  The number of branches your line slices through = the number of groups.  One tree, every
  possible K, read off by sliding a ruler up and down.  Cut just below the tallest jump and
  you get the most natural split -- the groups that resisted marrying the longest.

  ## How Do You Measure the Gap BETWEEN Two GROUPS?

  Marrying two lone dots is easy -- the gap is just their distance.  But once a group holds
  several dots, "the gap between two groups" needs a rule.  This rule is called the LINKAGE,
  and the choice changes the whole tree:

    SINGLE linkage    = gap between the two CLOSEST members (nearest edge to nearest edge)
                        -> makes long straggly chains: one close pair of toes is enough
                           to bridge two whole crowds, so groups grow by CHAINING
    COMPLETE linkage  = gap between the two FARTHEST members (worst case)
                        -> makes tight compact balls; cautious about marrying.  The
                           price: one loner inside a group speaks for the WHOLE group's
                           gap to everyone -- loner-sensitive
    AVERAGE linkage   = average gap over all cross-pairs
                        -> a sensible middle

  Build the tree under all three from the same sheet and the root heights line up
  single < average < complete -- the cautious worst-case rule always waits longest.
    WARD linkage      = marry the pair that adds the LEAST extra tightness (Part 3's
                        inertia idea) -> tends to give even, round, similar-sized groups

  ** KEY: THE LINKAGE IS A REAL CHOICE, NOT A DEFAULT TO IGNORE
     Single, complete, average, and Ward can build genuinely different trees from the
     SAME sheet of gaps.  Ward is the common default for blob-like data; single linkage
     is the one to reach for when groups are long thin shapes (and the one to fear when
     you do not want chaining).  Always say which linkage you used.

  ## The Engine Never Re-Measures (the Row-Merge Trick)

  A worry, caught while building it by hand: after every wedding, must you re-measure
  the gap from the new group to everybody else?  With 150 rows the starting sheet
  already holds 150 x 149 / 2 = 11,175 gaps -- re-measuring after each of 149 weddings
  sounds crushing.

  You never re-measure.  Most gaps do not involve the newly-weds at all and sit
  untouched.  And under complete linkage, even the gaps that DO touch the new group are
  built from numbers already on the sheet:

    new group {A,B}, any outsider C:
      gap(new group, C) = max( gap(A,C), gap(B,C) )   <- both already written down

  So one wedding costs three cheap moves:

    1. delete A's row and B's row from the sheet
    2. add ONE new row: position by position, keep the BIGGER of the two old numbers
    3. scan for the new smallest gap -> that pair marries next

  The ruler comes out once, at the very start.  After that, the entire tree is built by
  comparing numbers you already have -- no dot is ever measured twice.


  ## Agglomerative (Bottom-Up) vs Divisive (Top-Down)

  What we just did -- start with singletons and marry upward -- is BOTTOM-UP
  (agglomerative).  It is by far the common one.  The opposite, TOP-DOWN (divisive),
  starts with everyone in one group and splits the loosest group again and again.  Same
  tree idea, opposite direction.  This whole post is bottom-up.


  ## Reading the Merge Diary (the Z Table)

  Every wedding writes one line in a diary.  With 50 states there are 49 weddings,
  so the diary has 49 rows.  Four columns per row:

    column:    0          1         2           3
              who         who      gap        headcount
    row 0     dot17      dot42    0.31          2        <- first wedding (closest pair)
    row 1     dot3       dot9     0.44          2
    ...
    row -1    group      group    7.0          50        <- last wedding = the ROOT

  Column 2 is the GAP each wedding happened at -- the same height the dendrogram
  draws.  Because the family tree always marries the CLOSEST pair left, that gap can
  only grow as you walk down the diary: every wedding is at least as far apart as the
  one before.  So the BIGGEST gap is always the LAST row -- the root, where the final
  two camps are forced together.

    Z[-1, 2]  =  last row, gap column  =  the root's height  =  the biggest merge gap

  That one number is the fingerprint of the whole tree's top: big means the two
  top-level camps stood genuinely far apart (the groups mean something); small means
  everything was mushed close together anyway.

  >> NOTE: ROW -1 MEANS "LAST ROW", COUNTING FROM THE BOTTOM
     Instead of remembering the diary has 49 rows and writing Z[48, 2], the minus
     sign counts from the end: -1 is the last row, -2 the second-to-last.  Z[-1, 2]
     reads the root's gap without you ever counting the rows.


  ## Common Tripwires I Caught

    TRIPWIRE 1:  Standardise first -- again.
       Like every distance method in this chapter, hierarchical clustering reads the
       sheet of gaps, which is hijacked by a loud column unless you put every column
       on the same ruler first.

    TRIPWIRE 2:  The dendrogram HEIGHT is the gap, not the count.
       The vertical axis is the distance at which two groups married.  It is NOT how
       many dots are in a group.  A tall bar means "these two were far apart and got
       married late," not "this is a big group."

    TRIPWIRE 3:  Different linkage -> different tree.
       If your tree looks like one long chain, you are probably on single linkage
       and seeing chaining.  Switch to Ward or complete for compact groups.  Neither
       is wrong; they answer different questions.

    TRIPWIRE 4:  No K up front, but you still choose in the end.
       Hierarchical clustering postpones the K decision -- it does not erase it.  You
       pick K when you cut the tree.  Cutting just below the biggest vertical gap is
       the usual rule of thumb.

    TRIPWIRE 5:  It is greedy and permanent.
       Once two groups marry, they never divorce.  An early wedding made on a small
       gap is locked in even if it looks wrong later.  Like the decision tree in
       Chapter 5, every step is locally best, not globally guaranteed.

    TRIPWIRE 6:  It does not scale cheaply -- count it.
       The sheet of gaps holds N x (N-1) / 2 numbers.  50 rows -> 1,225 gaps.
       150 rows -> 11,175.  10,000 rows -> about 50 MILLION, before the first
       wedding even happens.  K-means is far cheaper at large scale; the family
       tree shines when you have a few hundred rows and want the whole hierarchy.

    TRIPWIRE 7:  An unpaired dot is a SINGLE, not a group of one.
       After the very first wedding you have N-2 singles and 1 pair.  Every
       dot that has not yet married is still just itself -- do NOT call it
       "a group of one."  A GROUP is formed by a wedding; a dot that has not
       married is a SINGLE.  Calling lone dots "groups" inflates the group
       count and makes later steps confusing to count.

    TRIPWIRE 8:  The biggest merge gap is always the LAST row, never the first.
       The diary's gap column (column 2) only grows downward -- closest pairs
       marry first, far-apart camps marry last.  So the root's gap sits in row
       -1, and reaching for row 0 gives you the SMALLEST gap (the first twins),
       the exact opposite of what you wanted.

    TRIPWIRE 9:  .idxmax() hands you the NAME; .max() hands you the NUMBER.
       Describing each group ("which group runs highest on Murder?"), the question
       wants a group NAME -- that is .idxmax().  Reaching for .max() returns the
       value itself, a different question, and the two read almost identically in
       code.  Say out loud which one the question asks for BEFORE typing the dot.


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    the family tree                       dendrogram
    marry the two closest groups          agglomerative (bottom-up) merge step
    height of a wedding                   merge distance / cophenetic height
    cut the tree with a flat line         choosing the number of clusters by threshold
    gap between two groups                linkage criterion
    nearest-edge gap                      single linkage
    farthest-edge gap                     complete linkage
    average cross-pair gap                average linkage
    least-extra-tightness marriage        Ward linkage
    start whole, split down               divisive (top-down) clustering
    the sheet of gaps (from Part 1)       the pairwise distance matrix
    the merge diary (Z table)             the linkage matrix
    the root's gap (Z[-1, 2])             the maximum merge distance
    the row-merge trick (keep the bigger)  the Lance-Williams update (complete linkage)


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  >> NEW TO PYTHON? Each named once:
       linkage(X, method=...)        -- build the whole family tree (the weddings)
       dendrogram(Z)                 -- draw the tree
       fcluster(Z, t, criterion)     -- cut the tree into flat groups
       AgglomerativeClustering(...)  -- the sklearn one-call version

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    # same-ruler first
    X_scaled = StandardScaler().fit_transform(df)

    # build the tree (Ward linkage = compact, even groups)
    Z = linkage(X_scaled, method="ward")

    # draw it
    plt.figure(figsize=(10, 5))
    dendrogram(Z, labels=df.index.tolist(), leaf_rotation=90)
    plt.ylabel("merge height (gap)")
    plt.title("The Family Tree of States")
    plt.tight_layout()
    plt.show()

    # cut into 3 groups
    groups = fcluster(Z, t=3, criterion="maxclust")   # one label per row

    # describe each group: the middle of every column, per group
    desc = df.assign(group=groups).groupby("group").mean()
    desc["Murder"].idxmax()   # NAME of the group where Murder runs highest
    desc["Murder"].max()      # the number itself -- a different question

    # the sklearn one-shot equivalent
    from sklearn.cluster import AgglomerativeClustering
    agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
    labels = agg.fit_predict(X_scaled)


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 6 -- Finding Patterns Without Answers):
    Part 1 -- Looking at a Sheet With No Answers .
    Part 2 -- The Strongest Direction (PCA) .
    Part 3 -- Grouping by Nearest Centre (K-Means) .
    Part 4 (this post) .
    Part 5 -- Both Tools on NCI60 (Re-visited) .
    Part 6 -- Filling the Blanks (Recommender Systems)

  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
  home . source on GitHub
==============================================================================================