==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------
CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 4 OF 6
The Family Tree: Hierarchical Clustering and the Dendrogram
Posted: 2026-06-09 . Author: Rahul Rai . Tags: hierarchical-clustering, dendrogram, linkage
============================================================================================
PATH . post 19 of 28
<- prev: Chapter 6, Part 3: Grouping by Nearest Centre (K-Means)
next: Chapter 6, Part 5: Both Tools on NCI60 ->
K-means in Part 3 made you pick K up front -- "I want 3 piles" -- and handed back 3
piles, no questions asked. But what if you do not KNOW how many groups there are? What
if you want to see the groups-inside-groups -- states that pair off, those pairs joining
into regions, regions joining into halves of the country?
Hierarchical clustering builds the whole family tree at once. It never asks for K. It
starts with every row alone and keeps marrying the two closest groups, one wedding at a
time, until everyone is in one giant family. Then you read the tree and cut it wherever
you like -- and the height of your cut decides how many groups you get.
## The Sheet (Same as Part 1)
50 states, 3 measurements (Murder, Assault, UrbanPop). No answer column.
Standardised first, so every column sits on the same ruler.
The one tool we need is a 50x50 SHEET OF GAPS: the straight-line distance between every
pair of states (for each pair, subtract column by column, square the gaps, add them, take
the root). Hierarchical clustering is just a disciplined way to read that sheet, smallest
gap first.
Count that sheet in clerk-steps. 50 states make 50 x 49 / 2 = 1,225 distinct pairs;
each straight-line gap over 3 columns costs 3 subtracts, 3 squarings, 2 adds, 1 root
= 9 strokes, so the whole sheet is 1,225 x 9 = 11,025 strokes -- one clerk's long
morning, and (by the row-merge trick below) the ONLY measuring the whole tree needs.
## The One Move: Marry the Two Closest
Start with 50 groups -- every state is its own lonely group of one. Then repeat one
move until a single group remains:
find the TWO closest groups in the whole sheet
marry them into one new group
-> 50 groups become 49, then 48, then 47 ... down to 1
Each wedding is recorded with the HEIGHT at which it happened (the gap between the two
groups that married). Early weddings happen at tiny heights -- near-twins joining.
Late weddings happen way up high -- forcing together groups that are nothing alike.
>> NOTE: NO DICE IN THIS MACHINE
K-means (Part 3) starts from random flags and can land badly -- you re-run it ten
times and keep the best. The family tree has no random start at all: the same
sheet of gaps gives the SAME tree, every single run. Nothing to re-roll.
## A Worked Example, by Hand
Four dots on one ruler. Gaps are just differences.
dots: 1 2 6 8
Start: every dot alone -> {1} {2} {6} {8}
WEDDING 1 -- closest pair is {1} and {2}, gap 1 (smallest in the sheet).
marry them at height 1 -> {1,2} {6} {8}
WEDDING 2 -- closest pair now is {6} and {8}, gap 2.
marry them at height 2 -> {1,2} {6,8}
WEDDING 3 -- only two groups left: {1,2} and {6,8}.
Using COMPLETE linkage (farthest cross-pair), list all four cross-pair gaps:
|1-6| = 5, |1-8| = 7, |2-6| = 4, |2-8| = 6
Complete linkage takes the FARTHEST: max(5, 7, 4, 6) = 7.
marry at height 7.
-> {1,2,6,8} one family. done.
>> YOUR TURN
Re-run WEDDING 3 under SINGLE linkage (closest cross-pair) and AVERAGE linkage
(mean of the four cross-pairs |1-6|=5, |1-8|=7, |2-6|=4, |2-8|=6) instead of
complete.
check your slate: SINGLE = min(5, 7, 4, 6) = 4; AVERAGE = (5 + 7 + 4 + 6)/4
= 22/4 = 5.5; COMPLETE (shown above) = max = 7. So the same four dots give
root heights 4 < 5.5 < 7 -- single marries soonest, complete waits longest.
Three weddings, at heights 1, 2, and 7. Draw them as a tree:
height
7 | +--------------+
| | |
2 | | +---+---+
| | | |
1 | +---+---+ | |
| | | | |
0 +--1-------2------6-------8---
twins twins
\____ joined late at height 7 ____/
That tree is the DENDROGRAM. Tall joins = forced marriages between unlike groups.
Short joins = natural twins.
## Cut the Tree to Get Groups
IN HAND: four dots 1, 2, 6, 8 married into one family closest-pair-first -- {1,2} at
height 1, {6,8} at height 2, then the two pairs at height max(5,7,4,6) = 7. This
section reads groups back OUT of that finished tree.
Here is the magic K-means could not do: you choose the number of groups AFTER seeing the
tree, by drawing one horizontal line across it.
cut LOW (height 1.5): 3 groups -> {1,2} {6} {8}
cut MID (height 3): 2 groups -> {1,2} {6,8}
cut HIGH (height 8): 1 group -> everybody
>> YOUR TURN
The three weddings sit at heights 1, 2, and 7. Slide your cut to height 5. How
many groups fall out?
check your slate: a cut keeps every wedding BELOW it and undoes every wedding
ABOVE it. Below 5: the height-1 and height-2 weddings stand ({1,2} and {6,8});
above 5: the height-7 wedding is undone. So 2 groups -- {1,2} and {6,8}. Any
cut between 2 and 7 gives the same answer.
The number of branches your line slices through = the number of groups. One tree, every
possible K, read off by sliding a ruler up and down. Cut just below the tallest jump and
you get the most natural split -- the groups that resisted marrying the longest.
## How Do You Measure the Gap BETWEEN Two GROUPS?
Marrying two lone dots is easy -- the gap is just their distance. But once a group holds
several dots, "the gap between two groups" needs a rule. This rule is called the LINKAGE,
and the choice changes the whole tree:
SINGLE linkage = gap between the two CLOSEST members (nearest edge to nearest edge)
-> makes long straggly chains: one close pair of toes is enough
to bridge two whole crowds, so groups grow by CHAINING
COMPLETE linkage = gap between the two FARTHEST members (worst case)
-> makes tight compact balls; cautious about marrying. The
price: one loner inside a group speaks for the WHOLE group's
gap to everyone -- loner-sensitive
AVERAGE linkage = average gap over all cross-pairs
-> a sensible middle
Build the tree under all three from the same sheet and the root heights line up
single < average < complete -- the cautious worst-case rule always waits longest.
WARD linkage = marry the pair that adds the LEAST extra tightness (Part 3's
inertia idea) -> tends to give even, round, similar-sized groups
** KEY: THE LINKAGE IS A REAL CHOICE, NOT A DEFAULT TO IGNORE
Single, complete, average, and Ward can build genuinely different trees from the
SAME sheet of gaps. Ward is the common default for blob-like data; single linkage
is the one to reach for when groups are long thin shapes (and the one to fear when
you do not want chaining). Always say which linkage you used.
## The Engine Never Re-Measures (the Row-Merge Trick)
A worry, caught while building it by hand: after every wedding, must you re-measure
the gap from the new group to everybody else? With 150 rows the starting sheet
already holds 150 x 149 / 2 = 11,175 gaps -- re-measuring after each of 149 weddings
sounds crushing.
You never re-measure. Most gaps do not involve the newly-weds at all and sit
untouched. And under complete linkage, even the gaps that DO touch the new group are
built from numbers already on the sheet:
new group {A,B}, any outsider C:
gap(new group, C) = max( gap(A,C), gap(B,C) ) <- both already written down
So one wedding costs three cheap moves:
1. delete A's row and B's row from the sheet
2. add ONE new row: position by position, keep the BIGGER of the two old numbers
3. scan for the new smallest gap -> that pair marries next
The ruler comes out once, at the very start. After that, the entire tree is built by
comparing numbers you already have -- no dot is ever measured twice.
## Agglomerative (Bottom-Up) vs Divisive (Top-Down)
What we just did -- start with singletons and marry upward -- is BOTTOM-UP
(agglomerative). It is by far the common one. The opposite, TOP-DOWN (divisive),
starts with everyone in one group and splits the loosest group again and again. Same
tree idea, opposite direction. This whole post is bottom-up.
## Reading the Merge Diary (the Z Table)
Every wedding writes one line in a diary. With 50 states there are 49 weddings,
so the diary has 49 rows. Four columns per row:
column: 0 1 2 3
who who gap headcount
row 0 dot17 dot42 0.31 2 <- first wedding (closest pair)
row 1 dot3 dot9 0.44 2
...
row -1 group group 7.0 50 <- last wedding = the ROOT
Column 2 is the GAP each wedding happened at -- the same height the dendrogram
draws. Because the family tree always marries the CLOSEST pair left, that gap can
only grow as you walk down the diary: every wedding is at least as far apart as the
one before. So the BIGGEST gap is always the LAST row -- the root, where the final
two camps are forced together.
Z[-1, 2] = last row, gap column = the root's height = the biggest merge gap
That one number is the fingerprint of the whole tree's top: big means the two
top-level camps stood genuinely far apart (the groups mean something); small means
everything was mushed close together anyway.
>> NOTE: ROW -1 MEANS "LAST ROW", COUNTING FROM THE BOTTOM
Instead of remembering the diary has 49 rows and writing Z[48, 2], the minus
sign counts from the end: -1 is the last row, -2 the second-to-last. Z[-1, 2]
reads the root's gap without you ever counting the rows.
## Common Tripwires I Caught
TRIPWIRE 1: Standardise first -- again.
Like every distance method in this chapter, hierarchical clustering reads the
sheet of gaps, which is hijacked by a loud column unless you put every column
on the same ruler first.
TRIPWIRE 2: The dendrogram HEIGHT is the gap, not the count.
The vertical axis is the distance at which two groups married. It is NOT how
many dots are in a group. A tall bar means "these two were far apart and got
married late," not "this is a big group."
TRIPWIRE 3: Different linkage -> different tree.
If your tree looks like one long chain, you are probably on single linkage
and seeing chaining. Switch to Ward or complete for compact groups. Neither
is wrong; they answer different questions.
TRIPWIRE 4: No K up front, but you still choose in the end.
Hierarchical clustering postpones the K decision -- it does not erase it. You
pick K when you cut the tree. Cutting just below the biggest vertical gap is
the usual rule of thumb.
TRIPWIRE 5: It is greedy and permanent.
Once two groups marry, they never divorce. An early wedding made on a small
gap is locked in even if it looks wrong later. Like the decision tree in
Chapter 5, every step is locally best, not globally guaranteed.
TRIPWIRE 6: It does not scale cheaply -- count it.
The sheet of gaps holds N x (N-1) / 2 numbers. 50 rows -> 1,225 gaps.
150 rows -> 11,175. 10,000 rows -> about 50 MILLION, before the first
wedding even happens. K-means is far cheaper at large scale; the family
tree shines when you have a few hundred rows and want the whole hierarchy.
TRIPWIRE 7: An unpaired dot is a SINGLE, not a group of one.
After the very first wedding you have N-2 singles and 1 pair. Every
dot that has not yet married is still just itself -- do NOT call it
"a group of one." A GROUP is formed by a wedding; a dot that has not
married is a SINGLE. Calling lone dots "groups" inflates the group
count and makes later steps confusing to count.
TRIPWIRE 8: The biggest merge gap is always the LAST row, never the first.
The diary's gap column (column 2) only grows downward -- closest pairs
marry first, far-apart camps marry last. So the root's gap sits in row
-1, and reaching for row 0 gives you the SMALLEST gap (the first twins),
the exact opposite of what you wanted.
TRIPWIRE 9: .idxmax() hands you the NAME; .max() hands you the NUMBER.
Describing each group ("which group runs highest on Murder?"), the question
wants a group NAME -- that is .idxmax(). Reaching for .max() returns the
value itself, a different question, and the two read almost identically in
code. Say out loud which one the question asks for BEFORE typing the dot.
## The Labels, Last
Plain term used above Standard label
----------------------------------- ------------------------------------------
the family tree dendrogram
marry the two closest groups agglomerative (bottom-up) merge step
height of a wedding merge distance / cophenetic height
cut the tree with a flat line choosing the number of clusters by threshold
gap between two groups linkage criterion
nearest-edge gap single linkage
farthest-edge gap complete linkage
average cross-pair gap average linkage
least-extra-tightness marriage Ward linkage
start whole, split down divisive (top-down) clustering
the sheet of gaps (from Part 1) the pairwise distance matrix
the merge diary (Z table) the linkage matrix
the root's gap (Z[-1, 2]) the maximum merge distance
the row-merge trick (keep the bigger) the Lance-Williams update (complete linkage)
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
>> NEW TO PYTHON? Each named once:
linkage(X, method=...) -- build the whole family tree (the weddings)
dendrogram(Z) -- draw the tree
fcluster(Z, t, criterion) -- cut the tree into flat groups
AgglomerativeClustering(...) -- the sklearn one-call version
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
# same-ruler first
X_scaled = StandardScaler().fit_transform(df)
# build the tree (Ward linkage = compact, even groups)
Z = linkage(X_scaled, method="ward")
# draw it
plt.figure(figsize=(10, 5))
dendrogram(Z, labels=df.index.tolist(), leaf_rotation=90)
plt.ylabel("merge height (gap)")
plt.title("The Family Tree of States")
plt.tight_layout()
plt.show()
# cut into 3 groups
groups = fcluster(Z, t=3, criterion="maxclust") # one label per row
# describe each group: the middle of every column, per group
desc = df.assign(group=groups).groupby("group").mean()
desc["Murder"].idxmax() # NAME of the group where Murder runs highest
desc["Murder"].max() # the number itself -- a different question
# the sklearn one-shot equivalent
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X_scaled)
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 6 -- Finding Patterns Without Answers):
Part 1 -- Looking at a Sheet With No Answers .
Part 2 -- The Strongest Direction (PCA) .
Part 3 -- Grouping by Nearest Centre (K-Means) .
Part 4 (this post) .
Part 5 -- Both Tools on NCI60 (Re-visited) .
Part 6 -- Filling the Blanks (Recommender Systems)
<- Back to all posts
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
home . source on GitHub
==============================================================================================