Both Tools on NCI60: PCA and Clustering on Real Gene Data

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================

  CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 5 OF 6
  Posted: 2026-06-09 . Author: Rahul Rai . Tags: pca, clustering, nci60, case-study
  ============================================================================================

  Earlier we built three tools in clean rooms: PCA (find the few directions the data spreads
  along most, and crush every row onto them), K-means (drop K centres and pull each dot to
  its nearest, then move each centre to its pile's middle, repeat), and the family tree
  (marry the two closest groups over and over, recording each wedding's height).  This post
  turns all three loose on ONE hard, real sheet at once -- and shows what happens when the
  room has more walls than you have rows.

  The twist of this chapter is honesty: we DO have labels for these samples (the cancer
  type of each), but we hide them while we cluster.  We let the unsupervised tools find
  groups blind, and ONLY at the end do we peek at the labels to ask: did the machine's
  blind groups line up with the real cancer types?  That peek is a check, never an input.

  ## The Sheet: Wide, Not Tall

    NCI60 -- 64 cell-line samples, 6,830 gene measurements each
    (each measurement = log-brightness of one gene's activity in that cell line:
     +0.30 means that gene is a little more active than average; -1.20 means well below)

      sample   gene1   gene2   gene3   ...   gene6830   | (hidden) type
      ------   -----   -----   -----   ...   --------   | -------------
       1        0.30   -1.20    0.05   ...      0.88    | MELANOMA
       2       -0.45    0.10    1.33   ...     -0.21    | LEUKEMIA
       ...      ...      ...     ...   ...       ...    | ...
       64       0.12    0.77   -0.90   ...      0.04    | RENAL

    64 rows.  6,830 columns.  The type column is sealed away during clustering.

  Stop and look at the shape: 64 rows, 6,830 columns.  WAY more columns than rows.  This
  is the opposite of every sheet so far, and it breaks your gut feeling.  In a 6,830-wall
  room with only 64 dots, EVERYTHING is far from everything else, and almost any direction
  looks like it separates the dots.  This is the curse of dimensionality from Chapter 1,
  Part 2, in its purest form.
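
  To watch the "everything is far from everything" effect happen, here is a small sketch
  on made-up data.  The sheet is pure random noise (NOT the NCI60 values), and the helper
  name `spread_ratio` is my own invention for this illustration.  It compares the nearest
  pair of dots to the farthest pair, first in a 2-wall room, then in a 6,830-wall room.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)  # fixed seed so the run repeats


def spread_ratio(n_rows, n_cols):
    """Nearest-pair gap divided by farthest-pair gap on a sheet of pure noise."""
    sheet = rng.standard_normal((n_rows, n_cols))  # assumption: noise, no real groups
    gaps = pdist(sheet)                            # every pairwise Euclidean gap
    return gaps.min() / gaps.max()


narrow = spread_ratio(64, 2)      # 64 dots in a 2-wall room
wide = spread_ratio(64, 6830)     # 64 dots in a 6,830-wall room

print(f"2 walls:     nearest/farthest = {narrow:.3f}")
print(f"6,830 walls: nearest/farthest = {wide:.3f}")
```

  A ratio near 0 means some dots are close neighbours and others are far away.  A ratio
  near 1 means the nearest pair is almost as far apart as the farthest pair, which is the
  bunching-up that makes "nearest" a shaky idea on a wide sheet.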

  Hand a clerk the full sheet and ask him to copy every cell once: 64 rows x 6,830 columns
  = 437,120 cells.  No blackboard holds that; the bill is plain -- well over four hundred
  thousand strokes just to write the sheet down once.

  ## You Cannot Eyeball 6,830 Walls, So Crush With PCA First

  You cannot eyeball a 6,830-wall room.  So reach for Part 2: PCA crushes the 6,830 columns
  down to the few strongest shadows.  Plot PC1 vs PC2, then colour each dot by its (peeked)
  type just to see -- the colour is paint added afterwards; PCA never reads the type column:

    PC2 ^
        |   L L              M M
        |  L  L L           M  M M     L = leukemia clump
        |   L L              M M       M = melanoma clump
        |          R  R
        |         R R  R               R = renal clump
        +-------------------------> PC1

  Even crushed to 2 numbers out of 6,830, samples of the same cancer often land near each
  other.  That is the headline: the strongest shadows of the gene soup already carry a lot
  of the cancer-type signal, with no labels used to find them.

  >> NOTE: PCA IS THE WARM-UP, NOT THE ANSWER
     On wide data, people often run PCA FIRST (down to, say, the top few PCs) and THEN
     cluster on those few PC scores instead of all 6,830 noisy columns.  With only 64 rows
     there are at most 63 directions with any spread at all; keeping just the strongest
     few drops the weak, noisy ones and usually makes the clustering steadier.
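
  A quick sketch of how many shadows a wide sheet can actually hold.  The data here is a
  made-up noise sheet with the same shape as NCI60 (an assumption for illustration, not
  the real genes), so on it no direction stands out.  The point is the count: scikit-learn
  hands back min(rows, columns) = 64 components, and the last one carries essentially
  nothing because centring the columns uses up one direction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_fake = rng.standard_normal((64, 6830))   # stand-in sheet: right shape, pure noise

X_std = StandardScaler().fit_transform(X_fake)
pca_full = PCA().fit(X_std)                # no n_components: keep every one available

n_kept = pca_full.n_components_            # min(64, 6830)
var = pca_full.explained_variance_ratio_   # share of spread per component

print(n_kept)
print(var[:5].round(4))                    # the strongest five shares
print(var[-1])                             # the 64th share: numerically zero
```

  On real gene data the first few shares should be much larger than the rest; that gap is
  what makes "keep the top few" a sensible cut.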

  ## Now Carve the Crushed Sheet Into a Family Tree

  IN HAND: a sheet of 64 samples x 6,830 genes = 437,120 cells, crushed by PCA down to a
  few honest shadows we could plot and eyeball.  This section stops looking and starts
  carving: build the family tree on those samples and cut it into groups.

  Now Part 4 on the same sheet.  Standardise the genes first, then build the dendrogram
  (Ward or complete linkage) -- a branching tree where each fork's height is the gap at
  that merge -- and cut it into a handful of groups:

    height
        |        +-------------------+
        |        |                   |
        |    +---+---+          +----+----+
        |    |       |          |         |
        |  leukemia  ...      melanoma  renal-ish
        +--------------------------------------------

  Cut the tree at a chosen height to get, say, 4 groups.  Then -- and only then -- unseal
  the type column and lay the blind groups beside the real types.
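
  Here is a minimal sketch of the two ways to cut a tree, using SciPy's `linkage` and
  `fcluster`.  The dots are made up: three invented clumps of 10 in 5 dimensions standing
  in for PC scores, so the names `fake_scores` and `centres` are assumptions, not the
  NCI60 sheet.  You can either ask for a number of groups, or name a height and let every
  wedding above it come undone.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)

# Stand-in for PC scores: 3 made-up clumps of 10 dots each, in 5 dimensions.
centres = np.array([[0, 0, 0, 0, 0],
                    [10, 0, 0, 0, 0],
                    [0, 10, 0, 0, 0]], dtype=float)
fake_scores = np.vstack([c + rng.standard_normal((10, 5)) for c in centres])

# Each row of `tree` records one wedding; column 2 is its height.
tree = linkage(fake_scores, method="ward")

# Way 1: ask for a number of groups.
by_count = fcluster(tree, t=3, criterion="maxclust")

# Way 2: name a height.  tree[-3] is the merge that leaves 3 groups and
# tree[-2] is the merge that leaves 2, so cutting between them keeps 3.
cut_height = (tree[-3, 2] + tree[-2, 2]) / 2
by_height = fcluster(tree, t=cut_height, criterion="distance")

print(by_count)
print(by_height)
```

  Both cuts should carve the same three piles here, because the chosen height sits exactly
  in the gap between "3 groups left" and "2 groups left".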

  ## Only Now, Unseal the Types: Did the Blind Groups Match?

  IN HAND: 64 samples crushed by PCA, then carved by the family tree into a handful of
  blind groups -- built from gene spread alone, with the type column still sealed.  Now,
  and only now, unseal the types and lay them beside the blind groups to grade the match.

  Make a little table -- blind group down the side, true type across the top -- and count:

                MELANOMA  LEUKEMIA  RENAL  OTHER
    group 1 [      8         0        1      1  ]   <- mostly melanoma
    group 2 [      0         6        0      0  ]   <- pure leukemia
    group 3 [      1         0        7      2  ]   <- mostly renal
    group 4 [      mixed bag of the leftovers     ]

  Read group 1's row across: 8 + 0 + 1 + 1 = 10 samples in that group, and 8 of the 10
  are melanoma -- 8 / 10 = 0.8, four-fifths pure.  Group 2's row: 0 + 6 + 0 + 0 = 6
  samples, all leukemia -- 6 / 6 = 1, dead pure.

  >> YOUR TURN
     Grade group 3 the same way.  Add its row across, then take its biggest pile over
     its total.

     check your slate:  row is 1, 0, 7, 2;  total = 1 + 0 + 7 + 2 = 10;  biggest pile
     is renal at 7;  purity = 7 / 10 = 0.7.  Group 3 is seven-tenths renal -- cleaner
     than a coin-toss muddle, dirtier than the pure leukemia group.
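
  The same grading, as a short pandas sketch.  The three rows are copied from the count
  table in this post; group 4 is left out because no counts are given for it.  The
  variable names (`table`, `purity`, `winner`) are my own.

```python
import pandas as pd

# Rows copied from the count table above; group 4 has no counts, so it is omitted.
table = pd.DataFrame(
    [[8, 0, 1, 1],
     [0, 6, 0, 0],
     [1, 0, 7, 2]],
    index=["group 1", "group 2", "group 3"],
    columns=["MELANOMA", "LEUKEMIA", "RENAL", "OTHER"],
)

totals = table.sum(axis=1)       # add each row across
biggest = table.max(axis=1)      # the biggest pile in each row
purity = biggest / totals        # biggest pile over total
winner = table.idxmax(axis=1)    # which type that biggest pile belongs to

print(pd.DataFrame({"total": totals, "winner": winner, "purity": purity}))
```

  Purity is only one way to grade a row; it says nothing about a type that is split
  across several groups, so read the whole table, not just this one column.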

  Some groups come out almost pure -- leukemia samples cluster tightly together.  Others
  are a muddle.  That is the real-world result: unsupervised tools recover SOME of the true
  structure, not all of it.  A pure group means those genes really do separate that cancer;
  a muddled group means the gene signal for those types overlaps.

  ** KEY: PEEKING AT LABELS IS A SCORECARD, NOT A STEP
     The cancer types never touched the clustering -- they were sealed away the entire
     time.  We only unsealed them at the very end to GRADE the blind groups.  If you let
     the labels influence the grouping, you are no longer doing unsupervised learning; you
     are just drawing the answer you already had.

  ## Scoring Two Groupings Against Each Other (One Number)

  IN HAND: a sheet of 64 samples x 6,830 genes = 437,120 cells, crushed by PCA and carved
  by the family tree into blind groups we graded by hand (group 1 came out 8 / 10 = 0.8
  melanoma, group 2 came out 6 / 6 = 1 leukemia).  Now run a SECOND tool, K-means, on the
  same sheet, and ask for one number: do the two tools' groupings agree?

  Run K-means and the family tree on the same sheet and you get two columns of pile
  numbers.  Did the two tools find the SAME groups?  You cannot just line the columns
  up and check row by row, because the pile numbers are arbitrary STICKERS:

    dot    K-means    family tree
     1        2            1
     2        2            1
     3        0            3

  Every row "disagrees" on the number -- yet the two tools may have built identical
  groups.  K-means called a pile "2" where the tree called the same pile "1".  Asking
  "is 2 == 1?" tests the stickers, not the grouping.

  The fix is to stop comparing names and compare PAIRS instead.  Take two dots at a
  time and ask each tool one sticker-free question: are these two TOGETHER or APART?

    pair (dot 1, dot 2):  K-means 2,2 -> together.  tree 1,1 -> together.  AGREE.
    pair (dot 1, dot 3):  K-means 2,0 -> apart.     tree 1,3 -> apart.     AGREE.

  Walk every pair, count how often the two tools AGREE (both "together" or both
  "apart"), divide by the number of pairs.  That fraction is the raw agreement.

  >> YOUR TURN
     The third pair is (dot 2, dot 3).  Read it off the little table above -- K-means
     gave dot 2 a "2" and dot 3 a "0"; the tree gave dot 2 a "1" and dot 3 a "3".  Do
     the two tools agree on this pair?  And with 3 dots, how many pairs are there in
     all, and what is the raw agreement?

     check your slate:  K-means 2,0 -> apart;  tree 1,3 -> apart;  both apart -> AGREE.
     Pairs among 3 dots: (1,2), (1,3), (2,3) = 3 pairs.  All three AGREE, so raw
     agreement = 3 / 3 = 1.  The two tools built the very same grouping here.
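
  The pair walk above, written out as a small sketch in plain Python.  The function name
  `raw_agreement` is my own; the two columns are the ones from the three-dot table.

```python
from itertools import combinations

kmeans_col = [2, 2, 0]   # the K-means column from the three-dot table
tree_col = [1, 1, 3]     # the family-tree column from the same table


def raw_agreement(a, b):
    """Fraction of pairs where both groupings give the same together/apart answer."""
    pairs = list(combinations(range(len(a)), 2))
    agree = 0
    for i, j in pairs:
        together_a = a[i] == a[j]    # sticker-free: only "same pile or not" matters
        together_b = b[i] == b[j]
        if together_a == together_b:
            agree += 1
    return agree / len(pairs)


print(raw_agreement(kmeans_col, tree_col))
```

  Note that the code never compares a K-means sticker to a tree sticker; it only compares
  stickers inside one column at a time, which is why the arbitrary names drop out.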

  But raw agreement flatters garbage.  Two blindfolded taggers slapping random pile
  numbers on the dots will still AGREE on most pairs -- because once there are more than
  a couple of piles, most pairs land APART, and apart-plus-apart counts as agreement by
  sheer luck.  So the raw fraction sits fat above zero even for nonsense.

  The cure is to slide the zero onto that luck mark:

    raw ruler:        0 ----------- luck ------- 1
    re-zeroed ruler:      (minus) -- 0 --------- 1
                      worse-than-luck  luck      identical groupings

  The re-zeroed score is the ADJUSTED RAND SCORE.  It reads +1 when the two groupings
  are identical, 0 when they agree no better than two blindfolded taggers, and goes
  negative when they agree even LESS than random.
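
  A sketch of the two blindfolded taggers, to see the luck mark for yourself.  The setup
  is made up: 64 dots, 4 piles, 200 repeats, each tagger picking pile numbers at random.
  With 4 equally likely piles, any one tagger puts a pair together 1/4 of the time, so
  two independent taggers agree by luck about (1/4)(1/4) + (3/4)(3/4) = 10/16 of the
  time.  `raw_agreement` is my own helper; `adjusted_rand_score` is scikit-learn's.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(3)
n_dots, n_piles, n_trials = 64, 4, 200   # assumption: an invented setup


def raw_agreement(a, b):
    """Fraction of pairs where both groupings give the same together/apart answer."""
    same_a = a[:, None] == a[None, :]        # together/apart for every pair, tagger 1
    same_b = b[:, None] == b[None, :]        # the same for tagger 2
    upper = np.triu_indices(len(a), k=1)     # count each pair once, skip self-pairs
    return np.mean(same_a[upper] == same_b[upper])


raw_scores, adj_scores = [], []
for _ in range(n_trials):
    tagger_1 = rng.integers(0, n_piles, size=n_dots)   # random stickers
    tagger_2 = rng.integers(0, n_piles, size=n_dots)   # independent random stickers
    raw_scores.append(raw_agreement(tagger_1, tagger_2))
    adj_scores.append(adjusted_rand_score(tagger_1, tagger_2))

mean_raw = float(np.mean(raw_scores))
mean_adj = float(np.mean(adj_scores))
print(f"raw agreement, averaged: {mean_raw:.3f}")
print(f"adjusted Rand, averaged: {mean_adj:.3f}")
```

  The raw average should sit well above one half even though neither tagger knows
  anything, while the adjusted average should hover around zero.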

  ** KEY: ONE NUMBER, AND IT DOES NOT CARE WHICH TOOL YOU NAME FIRST
     The score asks "do these two dots get the same fate in both groupings?" -- a
     question that reads the same backwards.  Swapping the two columns leaves every
     pair's together/apart answer untouched, so the score never changes.  It is a
     symmetric agreement number, not "how well did tool B copy tool A".
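
  Both claims in the box can be checked in a few lines.  The pile numbers below are made
  up for 8 dots; only `adjusted_rand_score` comes from scikit-learn.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up pile numbers for 8 dots.
tool_a = [0, 0, 0, 1, 1, 2, 2, 2]
tool_b = [1, 1, 0, 0, 0, 2, 2, 2]

forward = adjusted_rand_score(tool_a, tool_b)
backward = adjusted_rand_score(tool_b, tool_a)   # same call, columns swapped

# Exactly tool_a's grouping, with the stickers renamed 0->7, 1->5, 2->9.
renamed = [7, 7, 7, 5, 5, 9, 9, 9]
same_groups = adjusted_rand_score(tool_a, renamed)

print(forward, backward)   # the two numbers match
print(same_groups)         # identical groupings, whatever the stickers say
```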


  ## Why Two Tools, Not One

  PCA and clustering answer different questions and back each other up:

    PCA           -> "draw me the cloud so I can SEE it"   (crush walls, keep the picture)
    clustering    -> "carve the cloud into named piles"     (K-means or the family tree)

  On NCI60 they work as a pair: PCA crushes 6,830 noisy walls to a few honest ones, and
  clustering carves the crushed cloud into groups.  Looking at either alone tells half the
  story; together they show both the shape AND the groupings.


  ## Common Tripwires I Caught

    TRIPWIRE 1:  More columns than rows flips your gut feeling.
       With 6,830 walls and 64 dots, distances bunch up -- everything is roughly
       equally far from everything.  Do not trust a clustering on all raw columns;
       crush with PCA first or the gaps are mostly noise.

    TRIPWIRE 2:  Standardise the genes before any distance.
       Same rule as all of Chapter 6.  A handful of loud genes would otherwise drown
       out thousands of quiet ones.

    TRIPWIRE 3:  Clustering and the real labels need not agree.
       The blind groups are built from gene SPREAD, not from the cancer names.  A
       mismatch is not a bug -- it means the genes do not cleanly separate those
       types.  A match is a pleasant confirmation, not a guarantee.

    TRIPWIRE 4:  Do not feed the labels into the clustering.
       It is tempting to "help" the machine by sneaking the type in.  That destroys
       the whole point.  Cluster blind, grade afterward.

    TRIPWIRE 5:  PCA-then-cluster vs cluster-on-raw give different groups.
       Clustering on the top few PCs (denoised) and clustering on all 6,830 raw
       columns can hand back different piles.  The PCA-first route is usually
       steadier on wide data, but say which one you did.

    TRIPWIRE 6:  K-means and the family tree can disagree here too.
       The two grouping methods need not produce the same piles on the same sheet --
       K-means assumes round blobs; the tree depends on the linkage.  Comparing them
       is itself informative: agreement is a good sign the groups are real.

    TRIPWIRE 7:  Pile numbers are stickers -- never compare them directly.
       Tool A's "pile 2" and tool B's "pile 1" can be the EXACT same dots.  Lining
       the two label columns up and checking "2 == 1?" tests the arbitrary names,
       not the grouping.  Compare PAIRS (together/apart) instead, which is what the
       adjusted Rand score does under the hood.

    TRIPWIRE 8:  Raw agreement flatters random labels; re-zero at chance.
       With more than a couple of piles, most pairs land apart, so apart-plus-apart
       racks up "agreement" by luck -- two random taggers still score fat above zero.
       The adjusted Rand score subtracts that luck baseline, so 0 means chance-level and
       +1 means identical.  Reading the raw fraction makes nonsense look like a match.
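
  Tripwire 2 is easy to see with numbers.  This sketch builds a made-up sheet with one
  "loud" column on a big scale and 50 "quiet" columns on a small one (the scales 100 and
  1 are invented for illustration), then asks what share of all the squared gaps between
  dots comes from the loud column, before and after standardising.  The helper name
  `loud_share` is my own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 64

# Made-up sheet: column 0 is loud (scale 100), the other 50 are quiet (scale 1).
loud = rng.normal(0, 100, size=(n, 1))
quiet = rng.normal(0, 1, size=(n, 50))
sheet = np.hstack([loud, quiet])


def loud_share(data):
    """Share of all squared pairwise gaps that comes from column 0."""
    # Summed over every pair of dots, a column's squared gaps are
    # proportional to that column's variance, so variances are enough.
    col_var = data.var(axis=0)
    return col_var[0] / col_var.sum()


before = loud_share(sheet)
after = loud_share(StandardScaler().fit_transform(sheet))

print(f"loud column's share before standardising: {before:.3f}")
print(f"loud column's share after standardising:  {after:.3f}")
```

  After standardising, every column has the same variance, so the loud column's share
  falls to 1 in 51: one voice among equals instead of the whole choir.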


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    wide sheet (more columns than rows)   high-dimensional / p >> n data
    crush first, then cluster             PCA as a preprocessing / denoising step
    blind groups                          unsupervised cluster assignments
    peek at the sealed types              external validation against ground truth
    pure group                            high-purity cluster
    muddled group                         low-purity / mixed cluster
    side-by-side count table              contingency table / confusion of clusters vs labels
    everything is far from everything     the curse of dimensionality
    together/apart agreement over pairs   pair-counting (the Rand index)
    re-zeroed agreement score (luck = 0)  adjusted Rand score (adjusted_rand_score)


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  >> NEW TO PYTHON? Each named once:
       pd.crosstab(a, b)          -- the side-by-side count table (groups vs true types)
       PCA(n_components=k)        -- crush to k strongest shadows (from Part 2)
       KMeans                     -- grouping by nearest centre (from Part 3)
       AgglomerativeClustering    -- the family tree, cut into groups (from Part 4)
       adjusted_rand_score(a, b)  -- do two groupings agree? one number: +1 identical,
                                     0 luck, below 0 worse than luck

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans, AgglomerativeClustering
    from sklearn.metrics import adjusted_rand_score

    # Bring your own data -- this snippet does not load it:
    #   X     = (64, 6830) gene matrix, one row per sample
    #   types = the 64 sealed labels (used only to grade, never to cluster)
    X_scaled = StandardScaler().fit_transform(X)

    # crush first: keep the strongest shadows
    pca = PCA(n_components=5)
    scores = pca.fit_transform(X_scaled)
    print(pca.explained_variance_ratio_.cumsum())   # how much we kept

    # see it: PC1 vs PC2 (no labels used anywhere in this picture)
    plt.scatter(scores[:, 0], scores[:, 1], edgecolor="k")
    plt.xlabel("PC1"); plt.ylabel("PC2"); plt.title("NCI60 crushed to 2D")
    plt.show()

    # cluster on the crushed scores
    km_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scores)
    agg_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(scores)

    # do the two tools agree?  one number, sticker-free (+1 identical, 0 chance)
    print(adjusted_rand_score(km_labels, agg_labels))   # swapping args = same score

    # grade blind groups against the sealed types (the peek);
    # np.asarray drops any pandas index on types so rows pair up by position
    print(pd.crosstab(km_labels, np.asarray(types),
                      rownames=["group"], colnames=["true_type"]))

