==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------
CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 5 OF 6
Both Tools on NCI60: PCA and Clustering on Real Gene Data
Posted: 2026-06-09 . Author: Rahul Rai . Tags: pca, clustering, nci60, case-study
============================================================================================
Earlier we built three tools in clean rooms: PCA (find the few directions the data spreads
along most, and crush every row onto them), K-means (drop K centres and pull each dot to
its nearest, then move each centre to its pile's middle, repeat), and the family tree
(marry the two closest groups over and over, recording each wedding's height). This post
turns all three loose on ONE hard, real sheet at once -- and shows what happens when the
room has more walls than you have rows.
The twist of this chapter is honesty: we DO have labels for these samples (the cancer
type of each), but we hide them while we cluster. We let the unsupervised tools find
groups blind, and ONLY at the end do we peek at the labels to ask: did the machine's
blind groups line up with the real cancer types? That peek is a check, never an input.
## The Sheet: Wide, Not Tall
NCI60 -- 64 cell-line samples, ~6,830 gene measurements each
(each measurement = log-brightness of one gene's activity in that cell line:
+0.30 means that gene is a little more active than average; -1.20 means well below)
sample   gene1   gene2   gene3   ...   gene6830  |  (hidden) type
------   -----   -----   -----   ...   --------  |  -------------
1         0.30   -1.20    0.05   ...       0.88  |  MELANOMA
2        -0.45    0.10    1.33   ...      -0.21  |  LEUKEMIA
...        ...     ...     ...   ...        ...  |  ...
64        0.12    0.77   -0.90   ...       0.04  |  RENAL
64 rows. 6,830 columns. The type column is sealed away during clustering.
Stop and look at the shape: 64 rows, 6,830 columns. WAY more columns than rows. This
is the opposite of every sheet so far, and it breaks your gut feeling. In a 6,830-wall
room with only 64 dots, EVERYTHING is far from everything else, and almost any direction
looks like it separates the dots. This is the curse of dimensionality from Chapter 1,
Part 2, in its purest form.
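To put a number on "everything is far from everything", here is a minimal sketch. It uses random dots, NOT the NCI60 sheet, and the helper name `distance_spread` is my own invention. It compares the nearest pair to the farthest pair in a 2-wall room and in a 6,830-wall room.

```python
import numpy as np

def distance_spread(n_rows, n_cols, seed=0):
    """Return (smallest gap) / (largest gap) over all pairs of random dots."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_rows, n_cols))   # random dots, an assumption
    # Squared distances via the Gram matrix: |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    upper = np.triu_indices(n_rows, k=1)        # each pair once, no self-pairs
    dists = np.sqrt(np.clip(sq_dists[upper], 0.0, None))
    return dists.min() / dists.max()

ratio_narrow = distance_spread(64, 2)       # 64 dots in a 2-wall room
ratio_wide = distance_spread(64, 6830)      # 64 dots in a 6,830-wall room
print(f"2 columns:     nearest/farthest = {ratio_narrow:.3f}")
print(f"6,830 columns: nearest/farthest = {ratio_wide:.3f}")
```

A ratio near 0 means some pairs are much closer than others, so "nearest" carries information. A ratio near 1 means the nearest and farthest pairs are almost the same distance apart, which is the bunching-up this section warns about.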
Hand a clerk the full sheet and ask him to copy every cell once: 64 rows x 6,830 columns
= 64 x 6,830 = 437,120 cells. One blackboard cannot hold it; the room can, but the bill
is plain -- nearly half a million strokes just to write the sheet down once.
## You Cannot Eyeball 6,830 Walls, So Crush With PCA First
You cannot eyeball a 6,830-wall room. So reach for Part 2: PCA crushes the 6,830 columns
down to the few strongest shadows. Plot PC1 vs PC2 and color each dot by its (peeked)
type just to see:
  PC2 ^
      |   L L             M M
      |  L L L           M M M        L = leukemia clump
      |   L L             M M         M = melanoma clump
      |            R R
      |           R R R               R = renal clump
      +-------------------------> PC1
Even crushed to 2 numbers out of 6,830, samples of the same cancer often land near each
other. That is the headline: the strongest shadows of the gene soup already carry a lot
of the cancer-type signal, with no labels used to find them.
>> NOTE: PCA IS THE WARM-UP, NOT THE ANSWER
On wide data, people often run PCA FIRST (down to, say, the top few PCs) and THEN
cluster on those few PC scores instead of all 6,830 noisy columns. With only 64 rows
there are at most 63 directions with any spread at all, and most of those are weak;
keeping just the strongest few throws out noise and makes the clustering steadier.
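The crush itself can be sketched by hand with numpy's SVD. This is a sketch under loud assumptions: the sheet below is a made-up stand-in with four planted groups, NOT the NCI60 data, and the variable names are mine. It shows the crush to 5 scores per row and counts how many directions carry any spread at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 64, 6830

# Hypothetical stand-in for the gene sheet: 4 planted groups of 16 samples
# whose first 50 columns are shifted, plus noise everywhere. NOT NCI60.
planted = np.repeat(np.arange(4), 16)
X = rng.standard_normal((n_rows, n_cols))
X[:, :50] += 3.0 * rng.standard_normal((4, 50))[planted]

# PCA by hand: centre each column, then take the SVD of the centred sheet.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
variance_share = S**2 / (S**2).sum()

scores = U[:, :5] * S[:5]                  # each row crushed to 5 numbers
n_live = int((S > 1e-8 * S[0]).sum())      # directions with non-zero spread

print("scores shape:", scores.shape)
print("share kept by top 5 PCs:", variance_share[:5].sum().round(3))
print("directions with any spread:", n_live)
```

Centring uses up one degree of freedom, so 64 rows can spread along at most 63 directions no matter how many columns the sheet has. The `scores` array is what a clustering step would then work on.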
## Now Carve the Crushed Sheet Into a Family Tree
IN HAND: a sheet of 64 samples x 6,830 genes = 64 x 6,830 = 437,120 cells, crushed by
PCA down to a few honest shadows we could plot and eyeball. This section stops looking
and starts carving: build the family tree on those samples and cut it into groups.
Now Part 4 on the same sheet. Standardise the genes, build the dendrogram (with Ward or
complete linkage) -- a branching tree where each fork's height is the gap at that merge --
and cut it into a handful of groups:
height
  |          +-------------------+
  |          |                   |
  |      +---+---+         +-----+-----+
  |      |       |         |           |
  |  leukemia   ...     melanoma   renal-ish
  +--------------------------------------------
Cut the tree at a chosen height to get, say, 4 groups. Then -- and only then -- unseal
the type column and lay the blind groups beside the real types.
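Building the tree and cutting it can be sketched with SciPy's `linkage` and `fcluster`. The data here is a hypothetical stand-in (four well-separated blobs pretending to be PC scores), NOT the NCI60 sheet, so the clean result says nothing about the real data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)

# Hypothetical stand-in for the crushed sheet: 64 samples x 5 PC scores,
# built as 4 well-separated blobs of 16. NOT the NCI60 data.
planted = np.repeat(np.arange(4), 16)
centres = 20.0 * np.eye(4, 5)              # one centre per blob, far apart
scores = centres[planted] + rng.standard_normal((64, 5))

# Build the tree: every row of Z records one merge and its height.
Z = linkage(scores, method="ward")

# Cut the tree so that at most 4 groups remain (labels run from 1).
blind_groups = fcluster(Z, t=4, criterion="maxclust")

print("merges recorded:", Z.shape[0])      # one fewer than the sample count
print("group sizes:", np.bincount(blind_groups)[1:])
```

Note that `planted` is never passed to `linkage` or `fcluster`; it exists only so the blind groups can be graded afterward, the same seal-then-peek order the post insists on.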
## Only Now, Unseal the Types: Did the Blind Groups Match?
IN HAND: 64 samples crushed by PCA, then carved by the family tree into a handful of
blind groups -- built from gene spread alone, with the type column still sealed. Now,
and only now, unseal the types and lay them beside the blind groups to grade the match.
Make a little table -- blind group down the side, true type across the top -- and count:
           MELANOMA   LEUKEMIA   RENAL   OTHER
group 1  [    8          0         1       1   ]   <- mostly melanoma
group 2  [    0          6         0       0   ]   <- pure leukemia
group 3  [    1          0         7       2   ]   <- mostly renal
group 4  [     mixed bag of the leftovers      ]
Read group 1's row across: 8 + 0 + 1 + 1 = 10 samples in that group, and 8 of the 10
are melanoma -- 8 / 10 = 0.8, four-fifths pure. Group 2's row: 0 + 6 + 0 + 0 = 6
samples, all leukemia -- 6 / 6 = 1, dead pure.
>> YOUR TURN
Grade group 3 the same way. Add its row across, then take its biggest pile over
its total.
check your slate: row is 1, 0, 7, 2; total = 1 + 0 + 7 + 2 = 10; biggest pile
is renal at 7; purity = 7 / 10 = 0.7. Group 3 is seven-tenths renal -- cleaner
than a coin-toss muddle, dirtier than the pure leukemia group.
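The same grading, spoken in a few lines of plain Python. The rows are copied from the count table above; group 4 is left out because the post never gives its counts. The helper name `purity` is mine.

```python
# Rows copied from the count table; columns are the true types, in order.
types = ["MELANOMA", "LEUKEMIA", "RENAL", "OTHER"]
table = {
    "group 1": [8, 0, 1, 1],
    "group 2": [0, 6, 0, 0],
    "group 3": [1, 0, 7, 2],
}

def purity(row):
    """Biggest pile in the row divided by the row total."""
    return max(row) / sum(row)

for group, row in table.items():
    top_type = types[row.index(max(row))]
    print(f"{group}: {sum(row)} samples, {purity(row):.1f} {top_type}")
```

Purity is graded one group at a time, so a sheet can hold one dead-pure group next to a muddle, which is what the table shows.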
Some groups come out almost pure -- leukemia samples cluster tightly together. Others
are a muddle. That is the real-world result: unsupervised tools recover SOME of the true
structure, not all of it. A pure group suggests those genes really do separate that
cancer; a muddled group suggests the gene signal for those types overlaps.
** KEY: PEEKING AT LABELS IS A SCORECARD, NOT A STEP
The cancer types never touched the clustering -- they were sealed away the entire
time. We only unsealed them at the very end to GRADE the blind groups. If you let
the labels influence the grouping, you are no longer doing unsupervised learning; you
are just drawing the answer you already had.
## Scoring Two Groupings Against Each Other (One Number)
IN HAND: a sheet of 64 samples x 6,830 genes = 437,120 cells, crushed by PCA and carved
by the family tree into blind groups we graded by hand (group 1 came out 8 / 10 = 0.8
melanoma, group 2 came out 6 / 6 = 1 leukemia). Now run a SECOND tool, K-means, on the
same sheet, and ask one number: do the two tools' groupings agree?
Run K-means and the family tree on the same sheet and you get two columns of pile
numbers. Did the two tools find the SAME groups? You cannot just line the columns
up and check row by row, because the pile numbers are arbitrary STICKERS:
dot   K-means   family tree
1        2           1
2        2           1
3        0           3
Every row "disagrees" on the number -- yet the two tools may have built identical
groups. K-means called a pile "2" where the tree called the same pile "1". Asking
"is 2 == 1?" tests the stickers, not the grouping.
The fix is to stop comparing names and compare PAIRS instead. Take two dots at a
time and ask each tool one sticker-free question: are these two TOGETHER or APART?
pair (dot 1, dot 2): K-means 2,2 -> together. tree 1,1 -> together. AGREE.
pair (dot 1, dot 3): K-means 2,0 -> apart. tree 1,3 -> apart. AGREE.
Walk every pair, count how often the two tools AGREE (both "together" or both
"apart"), divide by the number of pairs. That fraction is the raw agreement.
>> YOUR TURN
The third pair is (dot 2, dot 3). Read it off the little table above -- K-means
gave dot 2 a "2" and dot 3 a "0"; the tree gave dot 2 a "1" and dot 3 a "3". Do
the two tools agree on this pair? And with 3 dots, how many pairs are there in
all, and what is the raw agreement?
check your slate: K-means 2,0 -> apart; tree 1,3 -> apart; both apart -> AGREE.
Pairs among 3 dots: (1,2), (1,3), (2,3) = 3 pairs. All three AGREE, so raw
agreement = 3 / 3 = 1. The two tools built the very same grouping here.
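The pair walk is short enough to write out in plain Python. The two sticker lists are the three dots from the little table; the function name `raw_agreement` is my own.

```python
from itertools import combinations

kmeans_stickers = [2, 2, 0]   # dots 1, 2, 3 from the little table
tree_stickers = [1, 1, 3]

def raw_agreement(a, b):
    """Fraction of dot pairs where both groupings say 'together'
    or both say 'apart'. The sticker values themselves never meet."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# The naive check compares stickers row by row and finds no matches at all.
naive_matches = sum(k == t for k, t in zip(kmeans_stickers, tree_stickers))

print("row-by-row sticker matches:", naive_matches)
print("raw pair agreement:", raw_agreement(kmeans_stickers, tree_stickers))
```

The comparison `(a[i] == a[j]) == (b[i] == b[j])` only ever compares a tool's stickers with that same tool's stickers, which is why renaming the piles cannot change the answer.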
But raw agreement flatters garbage. Two blindfolded taggers slapping random pile
numbers on the dots will still AGREE on most pairs -- because once there are more than
a couple of piles, most pairs land APART in both groupings, and apart-plus-apart counts
as agreement by sheer luck. So the raw fraction sits fat above zero even for nonsense.
The cure is to slide the zero onto that luck mark:
raw ruler:         0 ------------ luck ------- 1
re-zeroed ruler:   (minus) ------- 0 --------- 1
               worse-than-luck    luck   identical groupings
The re-zeroed score is the ADJUSTED RAND SCORE. It reads +1 when the two groupings
are identical, 0 when they agree no better than two blindfolded taggers, and goes
negative when they agree even LESS than random.
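Here is the re-zeroing worked by hand, as a sketch in plain Python. It follows the usual pair-counting form of the adjusted Rand formula, which I believe is the quantity `adjusted_rand_score` reports, but treat that match as an assumption. The function name `pair_scores` and the two random taggers are made up for illustration, and the sketch does not guard the degenerate case where the luck mark equals the ceiling (for example, both groupings put every dot in one pile).

```python
import random
from collections import Counter
from math import comb

def pair_scores(a, b):
    """Return (raw agreement, chance-adjusted agreement) for two label lists."""
    total_pairs = comb(len(a), 2)
    # Pairs kept together by BOTH groupings, by grouping a, and by grouping b.
    both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    in_a = sum(comb(c, 2) for c in Counter(a).values())
    in_b = sum(comb(c, 2) for c in Counter(b).values())
    # Raw agreement: together-in-both plus apart-in-both, over all pairs.
    raw = (total_pairs + 2 * both - in_a - in_b) / total_pairs
    # Re-zero: subtract the together-in-both count expected by luck alone.
    luck = in_a * in_b / total_pairs
    ceiling = (in_a + in_b) / 2
    return raw, (both - luck) / (ceiling - luck)

rng = random.Random(0)
tagger_a = [rng.randrange(4) for _ in range(300)]   # blindfolded tagger
tagger_b = [rng.randrange(4) for _ in range(300)]   # another one
raw_random, adj_random = pair_scores(tagger_a, tagger_b)

renamed = [(label + 1) % 4 for label in tagger_a]   # same piles, new stickers
raw_same, adj_same = pair_scores(tagger_a, renamed)

print(f"two random taggers: raw = {raw_random:.2f}, adjusted = {adj_random:.2f}")
print(f"same piles renamed: raw = {raw_same:.2f}, adjusted = {adj_same:.2f}")
```

The random taggers should land well above zero on the raw ruler and close to zero on the adjusted one, while the renamed copy reads 1 on both.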
** KEY: ONE NUMBER, AND IT DOES NOT CARE WHICH TOOL YOU NAME FIRST
The score asks "do these two dots get the same fate in both groupings?" -- a
question that reads the same backwards. Swapping the two columns leaves every
pair's together/apart answer untouched, so the score never changes. It is a
symmetric agreement number, not "how well did tool B copy tool A".
## Why Two Tools, Not One
PCA and clustering answer different questions and back each other up:
PCA -> "draw me the cloud so I can SEE it" (crush walls, keep the picture)
clustering -> "carve the cloud into named piles" (K-means or the family tree)
On NCI60 they work as a pair: PCA crushes 6,830 noisy walls to a few honest ones, and
clustering carves the crushed cloud into groups. Looking at either alone tells half the
story; together they show both the shape AND the groupings.
## Common Tripwires I Caught
TRIPWIRE 1: More columns than rows flips your gut feeling.
With 6,830 walls and 64 dots, distances bunch up -- everything is roughly
equally far from everything. Do not trust a clustering on all raw columns;
crush with PCA first or the gaps are mostly noise.
TRIPWIRE 2: Standardise the genes before any distance.
Same rule as all of Chapter 6. A handful of loud genes would otherwise drown
out thousands of quiet ones.
TRIPWIRE 3: Clustering and the real labels need not agree.
The blind groups are built from gene SPREAD, not from the cancer names. A
mismatch is not a bug -- it means the genes do not cleanly separate those
types. A match is a pleasant confirmation, not a guarantee.
TRIPWIRE 4: Do not feed the labels into the clustering.
It is tempting to "help" the machine by sneaking the type in. That destroys
the whole point. Cluster blind, grade afterward.
TRIPWIRE 5: PCA-then-cluster vs cluster-on-raw give different groups.
Clustering on the top few PCs (denoised) and clustering on all 6,830 raw
columns can hand back different piles. The PCA-first route is usually
steadier on wide data, but say which one you did.
TRIPWIRE 6: K-means and the family tree can disagree here too.
The two grouping methods need not produce the same piles on the same sheet --
K-means assumes round blobs; the tree depends on the linkage. Comparing them
is itself informative: agreement is a good sign the groups are real.
TRIPWIRE 7: Pile numbers are stickers -- never compare them directly.
Tool A's "pile 2" and tool B's "pile 1" can be the EXACT same dots. Lining
the two label columns up and checking "2 == 1?" tests the arbitrary names,
not the grouping. Compare PAIRS (together/apart) instead, which is what the
adjusted Rand score does under the hood.
TRIPWIRE 8: Raw agreement flatters random labels; re-zero at chance.
With more than a couple of piles, most pairs land apart, so apart-plus-apart
racks up "agreement" by luck -- two random taggers still score fat above zero.
The adjusted Rand score subtracts that luck baseline, so 0 means chance-level and
+1 means identical. Reading the raw fraction makes nonsense look like a match.
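Tripwire 2 is easy to see in numbers. A minimal numpy sketch on a made-up sheet (one loud gene, 99 quiet ones; NOT the NCI60 data) measures how much of the total squared spread, and so of the average squared distance between samples, comes from the loud column before and after standardising. The helper name `loud_share` is mine.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sheet: 64 samples, 1 loud gene (spread 50), 99 quiet (spread 1).
X = rng.standard_normal((64, 100))
X[:, 0] *= 50.0

def loud_share(sheet):
    """Share of the total squared spread that column 0 contributes."""
    spread = ((sheet - sheet.mean(axis=0)) ** 2).sum(axis=0)
    return spread[0] / spread.sum()

# Standardise: subtract each column's mean, divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(f"loud gene's share before: {loud_share(X):.2f}")
print(f"loud gene's share after:  {loud_share(X_scaled):.2f}")
```

After standardising, every column contributes the same amount, so one loud gene gets one vote in a hundred rather than nearly all of them.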
## The Labels, Last
Plain term used above                    Standard label
--------------------------------------   ------------------------------------------
wide sheet (more columns than rows)      high-dimensional / p >> n data
crush first, then cluster                PCA as a preprocessing / denoising step
blind groups                             unsupervised cluster assignments
peek at the sealed types                 external validation against ground truth
pure group                               high-purity cluster
muddled group                            low-purity / mixed cluster
side-by-side count table                 contingency table / confusion of clusters vs labels
everything is far from everything        the curse of dimensionality
together-or-apart, then agree/disagree   pair-counting (the Rand index)
re-zeroed agreement score                adjusted Rand score (adjusted_rand_score):
                                         1 = identical, about 0 = chance, negative =
                                         worse than chance
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
>> NEW TO PYTHON? Each named once:
pd.crosstab(a, b)                 -- the side-by-side count table (groups vs true types)
PCA(n_components=k)               -- crush to k strongest shadows (from Part 2)
KMeans / AgglomerativeClustering  -- the grouping tools (from Parts 3 and 4)
adjusted_rand_score(a, b)         -- do two groupings agree? one number: 1 = identical,
                                     about 0 = chance, negative = worse than chance
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# X = (64, 6830) gene matrix ; types = the sealed labels (used only to grade)
# Both must be loaded before this point; they are not defined here.
X_scaled = StandardScaler().fit_transform(X)   # every gene: mean 0, spread 1

# crush first: keep the strongest shadows
pca = PCA(n_components=5)
scores = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.cumsum())  # how much we kept

# see it: PC1 vs PC2, coloured by the peeked type (for the picture only)
type_codes = pd.factorize(types)[0]            # one integer per cancer type
plt.scatter(scores[:, 0], scores[:, 1], c=type_codes, cmap="tab10", edgecolor="k")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.title("NCI60 crushed to 2D")
plt.show()

# cluster on the crushed scores (the types are NOT used here)
km_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scores)
agg_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(scores)

# do the two tools agree? one number, sticker-free (+1 identical, 0 chance)
print(adjusted_rand_score(km_labels, agg_labels))  # swapping args = same score

# grade blind groups against the sealed types (the peek)
print(pd.crosstab(pd.Series(km_labels, name="group"),
                  pd.Series(types, name="true_type")))
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================