The Strongest Direction: Crushing a Many-Wall Room Into a Flat Page (PCA)

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================

  CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 2 OF 6
  The Strongest Direction: Crushing a Many-Wall Room Into a Flat Page
  Posted: 2026-06-09 . Author: Rahul Rai . Tags: pca, dimensionality-reduction, visualization
  ============================================================================================

  Part 1 measured gaps between states with 3 columns.  Three columns is easy -- you can
  imagine three rulers at right angles.  But what if you have 13 columns?  Or 100?  You
  cannot draw a 13-wall room on flat paper.  The dots live in a space too many-walled to
  picture.

  PCA (Principal Component Analysis) is the trick that CRUSHES that many-wall room down
  to a flat page while keeping the shape of the dots.  The idea is simple: shine a
  flashlight on the cloud of dots from different angles.  The shadow with the
  LONGEST spread is the first "principal component."  The next-longest shadow, at a right
  angle to the first, is the second.  Crush to those two shadows and draw what you see.

  ## The Sheet

    alcohol, malic_acid, ash, alcalinity_of_ash, magnesium, total_phenols,
    flavanoids, nonflavanoid_phenols, proanthocyanins, color_intensity,
    hue, od280/od315_of_diluted_wines, proline

    178 wines.  13 chemical measurements.  One row per wine.

      wine    alcohol   malic_acid   ash    ...   proline
      -----   -------   ---------   ----   ...   -------
       1       14.23      1.71      2.43   ...   1065.0
       2       13.20      1.78      2.14   ...   1050.0
       3       13.16      2.36      2.67   ...   1185.0
      ...       ...       ...        ...   ...     ...

  Look at the columns.  Alcohol is ~13.  Proline is ~1000.  Ash is ~2.  The ruler problem
  is even worse here -- a column measured in thousands will dominate a column measured in
  single digits, simply because its raw gaps are bigger, not because it matters more.
  First step: standardise every column to mean=0, spread=1.  Same ruler for all 13.

  Count the standardising in clerk-steps: 178 wines x 13 columns = 2,314 numbers, each
  costing one subtract and one divide = 2 strokes, so 2,314 x 2 = 4,628 strokes before
  PCA even begins.  Then one PC score per wine costs 13 multiplies + 12 adds = 25
  strokes, and all 178 wines on one component run 178 x 25 = 4,450 strokes.  A room of
  clerks clears the whole crush by lunch; you would still be sharpening your pencil.
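
  The standardising step can be sketched in a few lines of Python.  This is a
  minimal sketch, not the full sheet: it uses only the three wines and four columns
  printed in the table above, and the names `toy` and `standardise` are made up for
  illustration.  It divides by the n-1 spread; scikit-learn's StandardScaler divides
  by n instead, so its numbers come out slightly larger on a sheet this small.

```python
import numpy as np

# Toy slice of the sheet: the three wines and four columns printed above.
# Columns: alcohol, malic_acid, ash, proline.
toy = np.array([
    [14.23, 1.71, 2.43, 1065.0],
    [13.20, 1.78, 2.14, 1050.0],
    [13.16, 2.36, 2.67, 1185.0],
])

def standardise(sheet):
    """Put every column on the same ruler: mean 0, spread 1 (n-1 spread)."""
    mean = sheet.mean(axis=0)
    spread = sheet.std(axis=0, ddof=1)
    return (sheet - mean) / spread

z = standardise(toy)
print(np.round(z, 2))

# Clerk-step counts from the text, for the full 178 x 13 sheet.
n_wines, n_cols = 178, 13
standardise_strokes = n_wines * n_cols * 2          # one subtract + one divide each
score_strokes = n_wines * (n_cols + (n_cols - 1))   # 13 multiplies + 12 adds per wine
print(standardise_strokes, score_strokes)           # 4628 4450
```

  After standardising, proline no longer towers over ash: every column has the
  same mean and the same spread, whatever its original units.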

  ## The Core Metaphor: Flashlight and Shadows

  Imagine each wine is a dot floating in a 13-wall room.  You cannot draw this room.
  But you CAN shine a flashlight through it and trace the shadow on the wall.

    |  flashlight  |
    |      *       |  13-wall room (invisible)
    |   *     *    |
    | *       *  * |
    |              |
    ----------------
    shadow on the wall

  The shadow flattens the 13 walls into 1 line.  Dots that were close in the room
  always land close on the shadow.  Dots that were far apart may land far apart or
  close together, depending on the angle of the flashlight.

  Rotate the flashlight.  The shadow gets longer or shorter.  The LONGEST shadow
  -- the one that spreads the dots out the most -- is the **first principal
  component (PC1)**.  The direction that gives this longest shadow is the single
  most informative way to look at the data.

  Rotate 90 degrees from PC1.  The next-longest shadow at that right angle is
  **PC2**.

  Now you have two shadows (PC1 and PC2) at right angles.  Plot a dot at
  (PC1 coordinate, PC2 coordinate) for each wine.  That dot on flat paper
  keeps MORE of the original spread than any other pair of straight-line shadows.
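
  The flashlight search can be acted out literally in a 2-wall room.  The sketch
  below is an illustration under stated assumptions: the cloud `dots` is made-up
  random data (not the wine sheet), and trying 1,800 angles one by one is a
  brute-force stand-in for the formula a real PCA routine uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-wall cloud: 200 dots stretched along a diagonal.
x = rng.normal(size=200)
y = 0.8 * x + 0.3 * rng.normal(size=200)
dots = np.column_stack([x, y])
dots = dots - dots.mean(axis=0)          # centre the cloud on the origin

def shadow_spread(angle):
    """Variance of the dots' shadows on a line at the given angle."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    shadows = dots @ direction
    return shadows.var()

# Try 1,800 flashlight angles across half a turn (the other half repeats).
angles = np.linspace(0.0, np.pi, 1800, endpoint=False)
spreads = np.array([shadow_spread(a) for a in angles])
best = angles[spreads.argmax()]

pc1 = np.array([np.cos(best), np.sin(best)])
pc2 = np.array([-np.sin(best), np.cos(best)])   # pc1 turned 90 degrees

print("PC1 direction:", np.round(pc1, 2))
print("spread along PC1:", round(shadow_spread(best), 3))
print("spread along PC2:", round(shadow_spread(best + np.pi / 2), 3))
```

  In two walls there is only one direction at a right angle to PC1, so PC2 needs
  no search of its own.  In 13 walls the same idea holds, but the search is done
  by formula rather than by trying angles.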

  ## A 2-Wall Worked Example (Not 13, So You Can See It)

  Take 2 measurements (alcohol, color_intensity) for 4 wines so you can draw
  the room on paper and see the shadow with your own eyes.

    wine    alcohol    color_intensity
    -----   -------    ---------------
     A        13.0          5.0
     B        13.5          7.0
     C        14.0          4.0
     D        14.5          6.0

  First, standardise (put on same ruler).

    alcohol mean=13.75,  spread=0.65
    color   mean=5.5,    spread=1.29

    A: alcohol z = (13.0-13.75)/0.65 = -1.15,  color z = (5.0-5.5)/1.29 = -0.39
    B: alcohol z = (13.5-13.75)/0.65 = -0.38,  color z = (7.0-5.5)/1.29 =  1.16
    C: alcohol z = (14.0-13.75)/0.65 =  0.38,  color z = (4.0-5.5)/1.29 = -1.16
    D: alcohol z = (14.5-13.75)/0.65 =  1.15,  color z = (6.0-5.5)/1.29 =  0.39
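
  The same four rows can be checked in Python.  A minimal sketch: the array name
  `wines` is made up, and `ddof=1` is chosen because the spreads 0.65 and 1.29 above
  are the n-1 kind.  The code divides by the unrounded spread, so a few of its
  second decimals can differ from the hand figures by about 0.01.

```python
import numpy as np

# The four wines from the table: columns are alcohol, color_intensity.
wines = np.array([
    [13.0, 5.0],   # A
    [13.5, 7.0],   # B
    [14.0, 4.0],   # C
    [14.5, 6.0],   # D
])

mean = wines.mean(axis=0)
spread = wines.std(axis=0, ddof=1)   # n-1 spread, as in the hand calculation
z = (wines - mean) / spread

print("mean:  ", mean)
print("spread:", np.round(spread, 2))
for name, row in zip("ABCD", z):
    print(name, np.round(row, 2))
```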

  Then find the longest shadow.  In a 2-wall room this means spinning a
  line until the dots spread along it as far as possible.  The answer (by
  formula, not flashlight) is a weighted combination of the two columns:

    PC1 = 0.71 * alcohol_z + 0.71 * color_z

  This is a RECIPE: take 0.71 parts of alcohol score, add 0.71 parts of color
  score.  The resulting number is each wine's PC1 coordinate.

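    A: 0.71 * (-1.15) + 0.71 * (-0.39) = -0.82 + -0.28 = -1.09
    B: 0.71 * (-0.38) + 0.71 * (1.16)  = -0.27 +  0.82 =  0.55
    C: 0.71 * (0.38)  + 0.71 * (-1.16) =  0.27 + -0.82 = -0.55
    D: 0.71 * (1.15)  + 0.71 * (0.39)  =  0.82 +  0.28 =  1.09

    (For A and D the two-decimal products add to 1.10; the unrounded products,
    0.8165 and 0.2769, add to 1.0934, which rounds to the 1.09 shown.)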

  PC1 spreads from -1.09 to +1.09.  PC2 is the next shadow at a right angle:

    PC2 = -0.71 * alcohol_z + 0.71 * color_z

    A: -0.71 * (-1.15) + 0.71 * (-0.39) = 0.82 + -0.28 = 0.54
    B: -0.71 * (-0.38) + 0.71 * (1.16)  = 0.27 +  0.82 = 1.09
    C: -0.71 * (0.38)  + 0.71 * (-1.16) = -0.27 + -0.82 = -1.09
    D: -0.71 * (1.15)  + 0.71 * (0.39)  = -0.82 +  0.28 = -0.54
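
  Both recipes are just dot products, so the eight sums above collapse into one
  matrix multiply.  A minimal sketch, with the z-scores typed in from the hand
  calculation; the names `z`, `recipes` and `scores` are made up for illustration.

```python
import numpy as np

# Standardised scores copied from the hand calculation (alcohol_z, color_z).
z = np.array([
    [-1.15, -0.39],   # A
    [-0.38,  1.16],   # B
    [ 0.38, -1.16],   # C
    [ 1.15,  0.39],   # D
])

# Each row is one recipe: the weights for alcohol_z and color_z.
recipes = np.array([
    [ 0.71, 0.71],    # PC1
    [-0.71, 0.71],    # PC2
])

scores = z @ recipes.T               # 4 wines x 2 PCs
for name, (pc1, pc2) in zip("ABCD", scores):
    print(f"{name}: PC1 = {pc1:6.2f}   PC2 = {pc2:6.2f}")

# The two recipes sit at right angles: their dot product is zero.
print("PC1 . PC2 =", recipes[0] @ recipes[1])
```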

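  Plot each wine at (PC1, PC2):

    PC2
     1.0 |                      B
         |
     0.5 |  A
         |
     0.0 |
         |
    -0.5 |                            D
         |
    -1.0 |        C
         +---+-----+-----+-----+-----+---> PC1
           -1.0  -0.5    0    0.5   1.0

  The dots spread just as far along PC2 as along PC1: both run from -1.09 to
  +1.09.  That tie is a quirk of these four wines.  Alcohol and color happen to be
  uncorrelated here, so no direction spreads the dots more than any other, and the
  45-degree recipe is one valid choice among many.  On real data the columns are
  linked, PC1 comes out visibly longer than PC2, and PC1 captures the stronger
  pattern.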

  >> YOUR TURN
     A fifth wine E (made-up) lands at alcohol_z = 1.0 and color_z = 1.0.  Work its
     PC1 and PC2 from the two recipes above.

     check your slate:  PC1 = 0.71 * 1.0 + 0.71 * 1.0 = 0.71 + 0.71 = 1.42;
     PC2 = -0.71 * 1.0 + 0.71 * 1.0 = -0.71 + 0.71 = 0.  E sits far out along PC1
     and dead centre on PC2 -- a wine the strongest shadow finds extreme.

  ## How Much Does Each Shadow Capture?

  IN HAND: four wines put on the same ruler, then a recipe PC1 = 0.71*alcohol_z +
  0.71*color_z that spread them from -1.09 to +1.09, and PC2 at a right angle.  This
  section asks how much of the total spread each shadow actually caught.

  PC1 explained fraction = variance_of_PC1_scores / (variance_of_PC1 + variance_of_PC2)

  We already have the PC scores.  Variance = mean of squared values (mean is 0 by
  construction since the data was centred):

    PC1 scores: -1.09,  0.55,  -0.55,  1.09
    PC1 variance = (1.09^2 + 0.55^2 + 0.55^2 + 1.09^2) / 4
                 = (1.19 + 0.30 + 0.30 + 1.19) / 4
                 = 2.98 / 4  = 0.75

    PC2 scores:  0.54,  1.09,  -1.09,  -0.54
    PC2 variance = (0.54^2 + 1.09^2 + 1.09^2 + 0.54^2) / 4
                 = (0.29 + 1.19 + 1.19 + 0.29) / 4
                 = 2.96 / 4  = 0.74

    total = 0.75 + 0.74 = 1.49

    PC1 fraction = 0.75 / 1.49 = 0.50  (50%)
    PC2 fraction = 0.74 / 1.49 = 0.50  (50%)

  The even 50/50 split is special to this toy sheet: alcohol and color happen to be
  uncorrelated across these four wines, so no direction holds more spread than any
  other.  Two columns that rise and fall together would split unevenly, with PC1
  taking the larger share.  The fractions get interesting on the real 13-column wine
  data, where PC1 carries ~36%, PC2 carries ~19%, and the rest scatters across
  PCs 3-13, down to about 1% for the 12th shadow.
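
  The fraction arithmetic can be checked in Python.  A minimal sketch under stated
  assumptions: the helper `pc_fractions` is a made-up name, it finds the shadows by
  eigen-decomposition rather than by hand, and the second sheet (`linked`) is
  invented data in which the two columns rise and fall together, included only to
  show how the split moves when columns are linked.

```python
import numpy as np

def pc_fractions(z):
    """Share of the total spread caught by each PC of a centred sheet."""
    cov = (z.T @ z) / len(z)                 # mean of squared values
    spreads = np.linalg.eigvalsh(cov)[::-1]  # biggest shadow first
    return spreads / spreads.sum()

# The four wines from the worked example: alcohol, color_intensity.
wines = np.array([[13.0, 5.0], [13.5, 7.0], [14.0, 4.0], [14.5, 6.0]])
wines_z = (wines - wines.mean(axis=0)) / wines.std(axis=0, ddof=1)
wine_fracs = pc_fractions(wines_z)
print("four wines:    ", np.round(wine_fracs, 2))

# Hypothetical contrast: two made-up columns that rise and fall together.
t = np.array([-1.5, -0.5, 0.5, 1.5])
linked = np.column_stack([t, 0.9 * t + np.array([0.3, -0.3, -0.3, 0.3])])
linked_z = (linked - linked.mean(axis=0)) / linked.std(axis=0, ddof=1)
linked_fracs = pc_fractions(linked_z)
print("linked columns:", np.round(linked_fracs, 2))
```

  The four wines come out at an even split, while the linked columns hand nearly
  everything to PC1.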

  ## Why Kept + Lost Always Adds Back to the Same Total

  Here is the lock that makes the fraction trustworthy.  Take one dot and the line
  it casts its shadow on.  Draw three lengths:

    stick    = straight line from the middle (origin) to the dot
    shadow   = how far the dot's shadow lands along the line  (KEPT)
    perp     = the dot's sideways gap off the line            (LOST)

         * dot
        /|
   stick |  perp (off the line)
      /  |
    -+---+-------- the line
     middle  shadow (along the line)

  Those three make a right-angled triangle, so Pythagoras locks them:

    stick^2  =  shadow^2  +  perp^2

  >> YOUR TURN
     A dot's shadow on the line is 4 long, and its perp (the walk off the line to
     the dot) is 3 long.  How long is the stick from the origin to the dot?

     check your slate:  stick^2 = shadow^2 + perp^2 = 4*4 + 3*3 = 16 + 9 = 25, so
     stick = sqrt(25) = 5.  The 3-4-5 triangle: kept (shadow) and lost (perp) always
     square back to the same fixed stick.

  The stick was fixed the moment you measured the dot -- spinning the line never
  changes it.  Spinning only shuffles the split between shadow and perp.  Add this
  over all the dots:

    sum of stick^2  =  sum of shadow^2  +  sum of perp^2
    (what the dots ARE)  (what the drawing KEEPS)  (what flattening LOSES)

  The left side never moves.  So the line that KEEPS the most (biggest sum of
  shadow^2) is automatically the line that LOSES the least (smallest sum of perp^2).
  Most-kept and least-lost are the same line seen from two sides -- which is why PC1,
  the longest-shadow direction, is also the smallest-reconstruction-error direction.
  And the kept fraction is just:

    kept fraction  =  sum of shadow^2  /  sum of stick^2
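
  The lock is easy to test by spinning a line through a cloud and adding up both
  sides.  A minimal sketch on made-up data: `dots` is a random 2-wall cloud, and
  `split` is a hypothetical helper that returns the kept and lost totals for one
  angle.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical centred cloud: 50 dots in a 2-wall room.
dots = rng.normal(size=(50, 2))
dots = dots - dots.mean(axis=0)

stick_sq = (dots ** 2).sum(axis=1)           # fixed once the dots are measured

def split(angle):
    """Return (sum of shadow^2, sum of perp^2) for a line at this angle."""
    line = np.array([np.cos(angle), np.sin(angle)])
    shadow = dots @ line                     # distance along the line (kept)
    perp = dots - np.outer(shadow, line)     # what is left off the line (lost)
    return (shadow ** 2).sum(), (perp ** 2).sum()

for degrees in (0, 30, 60, 90):
    kept, lost = split(np.radians(degrees))
    print(f"{degrees:3d} deg: kept {kept:7.3f} + lost {lost:7.3f} "
          f"= {kept + lost:7.3f}   fraction kept {kept / stick_sq.sum():.2f}")

print("sum of stick^2 =", round(stick_sq.sum(), 3))
```

  Whatever the angle, the kept and lost columns trade off against each other while
  their total stays pinned to the sum of stick^2.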

  ## Choosing How Many Shadows to Keep

  Add the fractions from the biggest shadows downward until you hit 80%:

    PC1: 36%   cumulative: 36%
    PC2: 19%   cumulative: 55%
    PC3: 11%   cumulative: 66%
    PC4:  7%   cumulative: 73%
    PC5:  7%   cumulative: 80%  <- reaches 80%
    PC6:  5%   cumulative: 85%

  With 4 PCs you have ~73%.  With 5 PCs you reach ~80%.  Most of the structure
  is captured in the first 5 shadows.  The remaining 8 PCs share the last ~20%,
  a few percent each.

  ## The Recipe (Loadings)

  Each PC is a RECIPE -- how much of each original column goes into it.

    PC1 = 0.14 * alcohol + 0.16 * malic_acid + ... + 0.32 * proline

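  The numbers (loadings) tell you which columns the PC leans on.  A loading far
  from zero (positive or negative) means that column is important for that PC.
  For PC1 on the wine data, the largest loading is typically on flavanoids, with
  total_phenols close behind.  After standardising, every column has the same
  spread, so a big loading does not mean the column varies the most; it means the
  column moves together with many of the others along that direction.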

  ## The Scores (Transform)

  Every wine gets a PC1 score and a PC2 score.  These are the coordinates you
  plot.  The original 13 columns are crushed into 2 numbers -- 1 dot on a flat
  page.

    wine    PC1       PC2
    -----   ------    ------
     1      2.13     -0.45
     2      1.62     -0.85
     3      2.05      0.32
    ...     ...       ...
    178    -2.41      0.78

  The scatter plot of PC1 vs PC2 shows the cloud of wines crushed to 2D.
  Each dot is one wine.  Dots close together = wines that look chemically
  similar through these two shadows (about 55% of the spread).

  ## Blowing It Back Up (Reconstruction)

  If you take only PC1 and PC2 scores and multiply back by the loadings,
  you get a BLURRY version of the original 13 columns -- blurry because
  you threw away PCs 3-13.

    original alcohol = 14.23
    reconstructed (2 PCs) ~= 13.85   (off by ~0.4)

  The more PCs you keep, the less blur (MSE measured on the standardised columns):

    keep 2 PCs:  MSE ~ 0.45
    keep 5 PCs:  MSE ~ 0.20
    keep all 13: MSE = 0.00 (perfect, but pointless)

  Now line those numbers up against the kept fractions from earlier:

    keep 2 PCs:  kept ~ 55%  ->  lost ~ 45%  ->  MSE ~ 0.45
    keep 5 PCs:  kept ~ 80%  ->  lost ~ 20%  ->  MSE ~ 0.20

  Not a coincidence.  On standardised data the reconstruction MSE IS the
  thrown-away fraction -- the same kept + lost = fixed total from the
  stick-shadow-perp section, read from the lost side.  You never need to
  run the reconstruction to know its error: 1 minus the kept fraction
  already told you.

  Reconstruction error measures "how much structure was lost" when you
  crushed the room.  Your goal is to lose as little as possible while
  still being able to draw the picture on flat paper.
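
  The "MSE is the lost fraction" claim can be checked without the wine file.  A
  minimal sketch under stated assumptions: the sheet is made-up random data with
  the same shape as the wine sheet (178 x 13), the shadows come from numpy's SVD
  rather than scikit-learn, and the standardising divides by n (as StandardScaler
  does) so that every column has variance exactly 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sheet: 178 rows, 13 columns built from 4 hidden factors plus noise.
hidden = rng.normal(size=(178, 4))
mixing = rng.normal(size=(4, 13))
sheet = hidden @ mixing + 0.5 * rng.normal(size=(178, 13))

# Standardise with the divide-by-n spread, so every column has variance 1.
z = (sheet - sheet.mean(axis=0)) / sheet.std(axis=0)

# SVD gives the recipes (rows of vt) sorted from longest shadow to shortest.
_, singular, vt = np.linalg.svd(z, full_matrices=False)
fractions = singular ** 2 / (singular ** 2).sum()

def reconstruction_mse(k):
    """Crush to k shadows, blow back up, and measure the blur."""
    recipes = vt[:k]
    scores = z @ recipes.T
    rebuilt = scores @ recipes
    return ((z - rebuilt) ** 2).mean()

for k in (2, 5, 13):
    kept = fractions[:k].sum()
    print(f"keep {k:2d} PCs: kept {kept:.3f}  lost {1 - kept:.3f}  "
          f"MSE {reconstruction_mse(k):.3f}")
```

  The lost column and the MSE column agree for every k.  The match depends on the
  standardising: on raw, unscaled columns the MSE would be in mixed units and
  would no longer equal a fraction.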


  ## Common Tripwires I Caught

    TRIPWIRE 1:  Standardise BEFORE PCA, not after.
       PCA hunts for the direction of greatest SPREAD.  If one column has a
       spread in the hundreds (proline) and another a spread below 1 (ash),
       PCA fixates on proline and ignores everything else.  Standardise first
       or the first PC is just "the column with the largest numbers."

    TRIPWIRE 2:  Loadings are NOT correlations.
       The loading tells you the recipe for the PC.  A loading of 0.5 means
       "half a part of this column goes into the PC."  It does NOT mean
       "this column correlates 0.5 with the PC."  Those are different numbers.

    TRIPWIRE 3:  PC1 vs PC2 scatter has near-zero correlation.
       PCA forces every PC to be at RIGHT ANGLES (uncorrelated) with every
       other PC.  If your PC1 and PC2 are correlated, something is wrong
       with the fit or the data.

    TRIPWIRE 4:  More PCs is not always better.
       Keeping all 13 PCs means MSE=0, but you also kept all the noise.
       The point of PCA is to drop the noisy dimensions and keep only the
       strong patterns.  The 80% cumulative threshold is a rule of thumb,
       not a law.

    TRIPWIRE 5:  Reconstruction gives you standardised data, not raw.
       When you inverse_transform from PCA, you get back numbers in the
       standardised space (mean=0, spread=1), not in the original units.
       To get raw units, you need to reverse the standardisation as well.

    TRIPWIRE 6:  explained_variance_ratio_ vs explained_variance_.
       The ratio is the fraction (0 to 1) of total variance.  The raw
       variance is the actual spread value.  The ratio is what you use for
       "how much is captured" and for the cumulative plot.

    TRIPWIRE 7:  The furthest dot from the origin on PC1-PC2 is unusual.
       In the scatter plot, the dot farthest from (0, 0) is the most
       extreme wine in the 2D crushed view.  It might be an outlier or
       just a very distinctive chemical profile.  Worth checking.

    TRIPWIRE 8:  A low kept-fraction means the PICTURE LIES -- do not read groups off it.
       When PC1+PC2 keep only, say, 40% of the spread, 60% died in the
       sideways gaps -- and that lost 60% can hold two dots that are far
       apart in truth but land on the SAME spot in the drawing.  Two real
       islands can print as one, or one as two.  Trust the islands you see
       only when the kept fraction is high (close to 1); a low fraction means
       the flat page is hiding most of the real placement.

    TRIPWIRE 9:  The kept fraction divides by the TOTAL spread, not the dot count.
       The fraction is sum(shadow^2) / sum(stick^2) -- a slice of the spread
       pie over the whole pie.  Dividing by the number of dots instead gives
       "spread per dot", a per-head number that is not a fraction at all and
       will not land between 0 and 1.
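
  Three of these tripwires (2, 3 and 5) can be seen in one short run.  A minimal
  sketch on made-up data: `raw` is an invented 3-column sheet on very different
  rulers, and the PCA is done with numpy's SVD rather than scikit-learn, so none
  of the names below come from the wine code.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical raw sheet: 100 rows, 3 columns on very different rulers.
base = rng.normal(size=100)
raw = np.column_stack([
    13.0 + 0.8 * base + 0.3 * rng.normal(size=100),
    750.0 + 300.0 * (0.6 * base + 0.8 * rng.normal(size=100)),
    2.4 + 0.3 * rng.normal(size=100),
])

mean, spread = raw.mean(axis=0), raw.std(axis=0)
z = (raw - mean) / spread

_, singular, vt = np.linalg.svd(z, full_matrices=False)
scores = z @ vt.T
pc_var = singular ** 2 / len(z)

# Tripwire 2: a loading is a recipe weight, not a correlation.
loadings_pc1 = vt[0]
corr_pc1 = np.array([np.corrcoef(z[:, j], scores[:, 0])[0, 1] for j in range(3)])
print("PC1 loadings:       ", np.round(loadings_pc1, 2))
print("PC1 correlations:   ", np.round(corr_pc1, 2))
# On standardised data the two are linked through the PC's own spread.
print("loading * sqrt(var):", np.round(loadings_pc1 * np.sqrt(pc_var[0]), 2))

# Tripwire 3: PC1 and PC2 scores are uncorrelated.
print("corr(PC1, PC2):", round(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1], 6))

# Tripwire 5: undo the standardising to get back to raw units.
rebuilt_z = scores[:, :2] @ vt[:2]
rebuilt_raw = rebuilt_z * spread + mean
print("first row, raw:    ", np.round(raw[0], 2))
print("first row, rebuilt:", np.round(rebuilt_raw[0], 2))
```

  The loadings and the correlations print as different numbers, and they only
  line up once each loading is scaled by the square root of the PC's variance.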


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    crush a many-wall room to flat        dimensionality reduction
    stick from middle to dot              the centred data vector (its norm)
    shadow along the line (kept)          the PC score / projection
    sideways gap off the line (lost)      the reconstruction residual
    stick^2 = shadow^2 + perp^2           Pythagoras / orthogonal decomposition
    longest shadow / strongest direction  first principal component (PC1)
    second shadow (at right angle)        second principal component (PC2)
    the recipe for a shadow               loadings / components_
    each wine's coordinate on the shadow  score / transformed data
    how much each shadow carries          explained variance ratio (PVE)
    keep shadows until 80% captured       cumulative PVE threshold
    blow the shadow back up               inverse transform / reconstruction
    blurriness after blowing up           reconstruction error (MSE)
    standardise before crushing           StandardScaler before PCA
    column importance in the recipe       loading magnitude (absolute value)
    the 13 numbers crunched into 2        2D embedding / projection


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  >> NEW TO PYTHON? Each named once:
       PCA(n_components=...)      -- the shadow-finder machine
       .fit(X)                    -- learn the recipes (loadings)
       .transform(X)              -- get each wine's PC coordinates (scores)
       .inverse_transform(scores) -- blow the shadow back up to original dims
       .components_               -- the loadings (recipe per PC)
       .explained_variance_ratio_ -- fraction per PC

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.metrics import mean_squared_error

    # load
    df = pd.read_csv("wine.csv")
    print(df.shape)        # (178, 13)

    # standardise
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df)

    # fit PCA (keep all to see how much each PC carries)
    pca = PCA()
    pca.fit(X_scaled)

    # variance explained
    pve = pca.explained_variance_ratio_
    cum_pve = np.cumsum(pve)
    print(f"PC1: {pve[0]:.3f}, PC2: {pve[1]:.3f}")
    print(f"First 2 PCs capture: {cum_pve[1]:.3f}")

    # how many to reach 80%?
    n_80 = int(np.argmax(cum_pve >= 0.80) + 1)
    print(f"Need {n_80} PCs for 80%")

    # loadings (the recipe)
    loadings = pca.components_
    top_feat = df.columns[np.argmax(np.abs(loadings[0]))]
    print(f"PC1 leans hardest on: {top_feat}")

    # transform (scores): one row per wine, one column per PC
    scores = pca.transform(X_scaled)
    scores_df = pd.DataFrame(
        scores,
        columns=[f"PC{i+1}" for i in range(scores.shape[1])],
    )

    # plot PC1 vs PC2
    plt.figure(figsize=(7, 6))
    plt.scatter(scores[:, 0], scores[:, 1], alpha=0.7, edgecolor="k")
    plt.axhline(0, color="gray", lw=1)
    plt.axvline(0, color="gray", lw=1)
    plt.xlabel(f"PC1 ({pve[0]*100:.1f}%)")
    plt.ylabel(f"PC2 ({pve[1]*100:.1f}%)")
    plt.title("Wine: PC1 vs PC2")
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.show()

    # furthest from origin
    dist = np.sqrt(scores[:, 0]**2 + scores[:, 1]**2)
    print(f"Furthest wine (row index): {dist.argmax()}")

    # reconstruction
    def reconstruction_error(k):
        pca_k = PCA(n_components=k)
        s = pca_k.fit_transform(X_scaled)
        recon = pca_k.inverse_transform(s)
        return round(mean_squared_error(X_scaled, recon), 4)

    print(f"Recon MSE with 2 PCs: {reconstruction_error(2)}")
    print(f"Recon MSE with 5 PCs: {reconstruction_error(5)}")
    print(f"Recon MSE with all PCs: {reconstruction_error(13)}")


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 6 -- Finding Patterns Without Answers):
    Part 1 -- Looking at a Sheet With No Answers .
    Part 2 (this post) .
    Part 3 -- Grouping by Nearest Centre (K-Means) .
    Part 4 -- The Family Tree (Hierarchical Clustering) .
    Part 5 -- Both Tools on NCI60 (Re-visited) .
    Part 6 -- Filling the Blanks (Recommender Systems)

----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================