==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 2 OF 6
The Strongest Direction: Crushing a Many-Wall Room Into a Flat Page
Posted: 2026-06-09 . Author: Rahul Rai . Tags: pca, dimensionality-reduction, visualization
============================================================================================
Part 1 measured gaps between states with 3 columns. Three columns is easy -- you can
imagine three rulers at right angles. But what if you have 13 columns? Or 100? You
cannot draw a 13-wall room on flat paper. The dots live in a space too many-walled to
picture.
PCA (Principal Component Analysis) is the trick that CRUSHES that many-wall room down
to a flat page while keeping as much of the shape of the dots as a flat page can hold.
The idea is simple: shine a flashlight on the cloud of dots from different angles. The
angle whose shadow has the LONGEST spread gives the first "principal component." The
next-longest shadow, at a right angle to the first, is the second. Crush to those two
shadows and draw what you see.
## The Sheet
alcohol, malic_acid, ash, alcalinity_of_ash, magnesium, total_phenols,
flavanoids, nonflavanoid_phenols, proanthocyanins, color_intensity,
hue, od280/od315_of_diluted_wines, proline
178 wines. 13 chemical measurements. One row per wine.
    wine   alcohol   malic_acid   ash    ...   proline
    -----  -------   ----------   ----   ...   -------
    1      14.23     1.71         2.43   ...   1065.0
    2      13.20     1.78         2.14   ...   1050.0
    3      13.16     2.36         2.67   ...   1185.0
    ...    ...       ...          ...    ...   ...
Look at the columns. Alcohol is ~13. Proline is ~1000. Ash is ~2. The ruler problem
is even worse here -- a column measured in thousands will dominate a column measured in
single digits, simply because its raw gaps are bigger, not because it matters more.
First step: standardise every column to mean=0, spread=1. Same ruler for all 13.
Count the standardising in clerk-steps: 178 wines x 13 columns = 2,314 numbers, each
costing one subtract and one divide = 2 strokes, so 2,314 x 2 = 4,628 strokes before
PCA even begins. Then one PC score per wine costs 13 multiplies + 12 adds = 25
strokes, and all 178 wines on one component run 178 x 25 = 4,450 strokes. A room of
clerks clears the whole crush by lunch; you would still be sharpening your pencil.
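The clerk-step tally above is plain arithmetic, so it can be double-checked in a few lines. This is only a sketch: the variable names are mine, and it counts just the strokes named in the paragraph (working out each column's mean and spread first would cost extra).

```python
# Tally of hand "strokes" for the wine sheet (names are illustrative).
n_wines, n_columns = 178, 13

# Standardising: one subtract + one divide per number on the sheet.
numbers = n_wines * n_columns
standardise_strokes = numbers * 2

# One PC score per wine: 13 multiplies + 12 adds.
strokes_per_score = n_columns + (n_columns - 1)
one_component_strokes = n_wines * strokes_per_score

print(numbers, standardise_strokes)              # 2314 4628
print(strokes_per_score, one_component_strokes)  # 25 4450
```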
## The Core Metaphor: Flashlight and Shadows
Imagine each wine is a dot floating in a 13-wall room. You cannot draw this room.
But you CAN shine a flashlight through it and trace the shadow on the wall.
    |   flashlight   |
    |       *        |   13-wall room (invisible)
    |     *   *      |
    |   *   *   *    |
    |                |
    ------------------
    shadow on the wall
The shadow flattens the 13 walls into 1 line. Dots that were close in the room
always land close on the shadow. Dots that were far apart land far apart only if
the flashlight is at a good angle; at a bad angle two distant dots can share a
shadow.
Rotate the flashlight. The shadow gets longer or shorter. The LONGEST shadow
-- the one that spreads the dots out the most -- is the **first principal
component (PC1)**. The direction that gives this longest shadow is the single
most informative way to look at the data.
Now look only at directions at a right angle to PC1. Among those, the one with
the longest shadow is **PC2**.
Now you have two shadows (PC1 and PC2) at right angles. Plot a dot at
(PC1 coordinate, PC2 coordinate) for each wine. That dot on flat paper
keeps MORE of the original spread than any other pair of straight-line
(weighted-sum) summaries of the columns.
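The flashlight sweep can be acted out literally in a 2-wall room. The sketch below is an illustration under my own assumptions: the cloud is made up (not the wine data), the flashlight turns in 1-degree steps, and the result is cross-checked against the formula route (the top eigenvector of the covariance matrix).

```python
import numpy as np

# Made-up 2-column cloud (an assumption for illustration, not the wine data).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + 0.6 * rng.normal(size=200)
dots = np.column_stack([x, y])
dots = dots - dots.mean(axis=0)          # centre the cloud on the origin

def shadow_spread(angle_deg):
    """Variance of the shadows when the line points at angle_deg."""
    theta = np.deg2rad(angle_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])   # unit vector
    shadows = dots @ direction
    return shadows.var()

angles = np.arange(0, 180)               # a line at 180 deg is the line at 0 deg
spreads = np.array([shadow_spread(a) for a in angles])
best = int(angles[spreads.argmax()])
second = (best + 90) % 180               # in 2 walls, the right angle is forced

print(f"longest shadow at {best} deg, spread {spreads.max():.3f}")
print(f"right-angle shadow at {second} deg, spread {shadow_spread(second):.3f}")

# Cross-check against the formula route: top eigenvector of the covariance.
eigvals, eigvecs = np.linalg.eigh(np.cov(dots.T, ddof=0))
pc1 = eigvecs[:, -1]                     # eigh sorts eigenvalues ascending
best_dir = np.array([np.cos(np.deg2rad(best)), np.sin(np.deg2rad(best))])
print(f"agreement with eigenvector: {abs(best_dir @ pc1):.4f}")
```

The absolute value in the last line is there because a direction and its opposite describe the same line, so the sign of a principal component is arbitrary.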
## A 2-Wall Worked Example (Not 13, So You Can See It)
Take 2 measurements (alcohol, color_intensity) for 4 wines so you can draw
the room on paper and see the shadow with your own eyes.
    wine   alcohol   color_intensity
    -----  -------   ---------------
    A      13.0      5.0
    B      13.5      7.0
    C      14.0      4.0
    D      14.5      6.0
First, standardise (put on same ruler). Spread here is the standard deviation,
worked with n-1 = 3 in the divisor.
    alcohol  mean=13.75, spread=0.65
    color    mean=5.5,   spread=1.29
A: alcohol z = (13.0-13.75)/0.65 = -1.15, color z = (5.0-5.5)/1.29 = -0.39
B: alcohol z = (13.5-13.75)/0.65 = -0.38, color z = (7.0-5.5)/1.29 = 1.16
C: alcohol z = (14.0-13.75)/0.65 = 0.38, color z = (4.0-5.5)/1.29 = -1.16
D: alcohol z = (14.5-13.75)/0.65 = 1.15, color z = (6.0-5.5)/1.29 = 0.39
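The same standardising step can be sketched in NumPy. The array layout and names are my own. One assumption worth flagging: `ddof=1` is chosen because it reproduces the hand spreads of 0.65 and 1.29. The printed z-scores can differ from the hand values in the last digit (for example -1.16 against -1.15), because the hand work divides by the already-rounded spread.

```python
import numpy as np

# The four toy wines: columns are alcohol, color_intensity.
wines = np.array([
    [13.0, 5.0],   # A
    [13.5, 7.0],   # B
    [14.0, 4.0],   # C
    [14.5, 6.0],   # D
])

mean = wines.mean(axis=0)
spread = wines.std(axis=0, ddof=1)   # ddof=1 divides by n-1
z = (wines - mean) / spread

print("mean:  ", mean)
print("spread:", spread.round(2))
print(z.round(2))
```

scikit-learn's `StandardScaler` divides by n rather than n-1, so its z-scores come out slightly larger. Every column is stretched by the same factor, so the PCA directions and the explained fractions do not change.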
Then find the longest shadow. In a 2-walled room this means spinning a
line until the dots spread along it as far as possible. The answer (by
formula, not flashlight) is a weighted combination of the two columns:
PC1 = 0.71 * alcohol_z + 0.71 * color_z
This is a RECIPE: take 0.71 parts of alcohol score, add 0.71 parts of color
score. The resulting number is each wine's PC1 coordinate.
    A: 0.71 * (-1.15) + 0.71 * (-0.39) = -0.817 + -0.277 = -1.09
    B: 0.71 * (-0.38) + 0.71 * (1.16)  = -0.270 +  0.824 =  0.55
    C: 0.71 * (0.38)  + 0.71 * (-1.16) =  0.270 + -0.824 = -0.55
    D: 0.71 * (1.15)  + 0.71 * (0.39)  =  0.817 +  0.277 =  1.09
PC1 spreads from -1.09 to +1.09. PC2 is the next shadow at a right angle:
PC2 = -0.71 * alcohol_z + 0.71 * color_z
A: -0.71 * (-1.15) + 0.71 * (-0.39) = 0.82 + -0.28 = 0.54
B: -0.71 * (-0.38) + 0.71 * (1.16) = 0.27 + 0.82 = 1.09
C: -0.71 * (0.38) + 0.71 * (-1.16) = -0.27 + -0.82 = -1.09
D: -0.71 * (1.15) + 0.71 * (0.39) = -0.82 + 0.28 = -0.54
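Both recipes are weighted sums, so all eight hand calculations collapse into one matrix multiply. A small sketch, assuming the z-scores exactly as rounded in the hand work above (the array names are mine):

```python
import numpy as np

# z-scores as rounded in the hand calculation: columns alcohol_z, color_z.
z = np.array([
    [-1.15, -0.39],   # A
    [-0.38,  1.16],   # B
    [ 0.38, -1.16],   # C
    [ 1.15,  0.39],   # D
])

# One recipe per row: PC1 then PC2 (0.71 is 1/sqrt(2) rounded).
recipes = np.array([
    [ 0.71, 0.71],
    [-0.71, 0.71],
])

scores = z @ recipes.T          # shape (4 wines, 2 PCs)
for name, (pc1, pc2) in zip("ABCD", scores):
    print(f"{name}: PC1 = {pc1:+.2f}, PC2 = {pc2:+.2f}")

# The two recipes sit at a right angle: their dot product is zero.
print("PC1 . PC2 =", recipes[0] @ recipes[1])
```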
Plot each wine at (PC1, PC2):
  PC2 ^
  1.0 |                  B
      |
  0.5 | A
      |
  0.0 +-------------------------> PC1
      |
 -0.5 |                       D
      |
 -1.0 |      C
      |
      +--+----+----+----+----+--
       -1.0 -0.5   0   0.5  1.0
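In this toy set the dots spread equally along PC1 and PC2: both run from -1.09
to +1.09. Alcohol and color happen to be uncorrelated across these four wines,
so the 45-degree line is one of many tied choices for PC1. On real data the
columns move together, one direction clearly wins, and PC1 captures the
stronger pattern.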
>> YOUR TURN
A fifth wine E (made-up) lands at alcohol_z = 1.0 and color_z = 1.0. Work its
PC1 and PC2 from the two recipes above.
check your slate: PC1 = 0.71 * 1.0 + 0.71 * 1.0 = 0.71 + 0.71 = 1.42;
PC2 = -0.71 * 1.0 + 0.71 * 1.0 = -0.71 + 0.71 = 0. E sits far out along PC1
and dead centre on PC2 -- a wine the strongest shadow finds extreme.
## How Much Does Each Shadow Capture?
IN HAND: four wines put on the same ruler, then a recipe PC1 = 0.71*alcohol_z +
0.71*color_z that spread them from -1.09 to +1.09, and PC2 at a right angle. This
section asks how much of the total spread each shadow actually caught.
PC1 explained fraction = variance_of_PC1_scores / (variance_of_PC1 + variance_of_PC2)
We already have the PC scores. Variance = mean of squared values (mean is 0 by
construction since the data was centred):
PC1 scores: -1.09, 0.55, -0.55, 1.09
PC1 variance = (1.09^2 + 0.55^2 + 0.55^2 + 1.09^2) / 4
= (1.19 + 0.30 + 0.30 + 1.19) / 4
= 2.98 / 4 = 0.75
PC2 scores: 0.54, 1.09, -1.09, -0.54
PC2 variance = (0.54^2 + 1.09^2 + 1.09^2 + 0.54^2) / 4
= (0.29 + 1.19 + 1.19 + 0.29) / 4
= 2.96 / 4 = 0.74
total = 0.75 + 0.74 = 1.49
PC1 fraction = 0.75 / 1.49 = 0.50 (50%)
PC2 fraction = 0.74 / 1.49 = 0.50 (50%)
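The even split is a quirk of this toy set, not a law of 2-column data: alcohol
and color happen to be exactly uncorrelated across these four wines, so no
direction casts a longer shadow than any other. When two standardised columns
have correlation r, PC1 carries (1 + |r|) / 2 of the spread, which is 50% only
at r = 0. (The total is about 1.5 rather than 2 because the spreads were worked
with n-1 = 3 in the divisor while the variances here divide by n = 4; the
fractions are unaffected.)
With the real 13-column wine data, PC1 carries ~36%, PC2 carries ~19%, and the
rest scatters across PCs 3-13, down to about 1% for the 12th shadow.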
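The fraction arithmetic is short enough to replay, and the same few lines show where the even split comes from. A sketch using the hand-rounded scores and the four raw wines; the (1 + |r|) / 2 line is the standard result for two standardised columns with correlation r, which I add here as a cross-check rather than something this example derives.

```python
import numpy as np

# PC scores from the worked example.
pc1 = np.array([-1.09, 0.55, -0.55, 1.09])
pc2 = np.array([0.54, 1.09, -1.09, -0.54])

# The scores are centred, so variance = mean of the squares.
var1 = np.mean(pc1 ** 2)
var2 = np.mean(pc2 ** 2)
total = var1 + var2
print(f"PC1 fraction: {var1 / total:.2f}")   # 0.50
print(f"PC2 fraction: {var2 / total:.2f}")   # 0.50

# Why the split is even: the two raw columns are uncorrelated in this toy set.
alcohol = np.array([13.0, 13.5, 14.0, 14.5])
color = np.array([5.0, 7.0, 4.0, 6.0])
r = np.corrcoef(alcohol, color)[0, 1]
print(f"correlation(alcohol, color) = {abs(r):.2f}")

# For two standardised columns with correlation r, PC1 carries (1 + |r|) / 2.
print(f"PC1 share predicted from r: {(1 + abs(r)) / 2:.2f}")
```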
## Why Kept + Lost Always Adds Back to the Same Total
Here is the lock that makes the fraction trustworthy. Take one dot and the line
it casts its shadow on. Draw three lengths:
    stick  = straight line from the middle (origin) to the dot
    shadow = how far the dot's shadow lands along the line (KEPT)
    perp   = the dot's sideways gap off the line (LOST)
                  * dot
                 /|
         stick  / |  perp (off the line)
               /  |
        ------+---+-------- the line
          middle   shadow (along the line)
Those three make a right angle, so Pythagoras locks them:
    stick^2 = shadow^2 + perp^2
>> YOUR TURN
A dot's shadow on the line is 4 long, and its perp (the walk off the line to
the dot) is 3 long. How long is the stick from the origin to the dot?
check your slate: stick^2 = shadow^2 + perp^2 = 4*4 + 3*3 = 16 + 9 = 25, so
stick = sqrt(25) = 5. The 3-4-5 triangle: kept (shadow) and lost (perp) always
square back to the same fixed stick.
The stick was fixed the moment you measured the dot -- spinning the line never
changes it. Spinning only shuffles the split between shadow and perp. Add this
over all the dots:
sum of stick^2 = sum of shadow^2 + sum of perp^2
(what the dots ARE) (what the drawing KEEPS) (what flattening LOSES)
The left side never moves. So the line that KEEPS the most (biggest sum of
shadow^2) is automatically the line that LOSES the least (smallest sum of perp^2).
Most-kept and least-lost are the same line seen from two sides -- which is why PC1,
the longest-shadow direction, is also the smallest-reconstruction-error direction.
And the kept fraction is just:
kept fraction = sum of shadow^2 / sum of stick^2
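The kept + lost lock is easy to test by spinning a line through a cloud and checking the sums at every angle. A minimal sketch, assuming a made-up 2-wall cloud (not the wine data); the variable names follow the stick / shadow / perp vocabulary above.

```python
import numpy as np

# Made-up centred dots in a 2-wall room (illustrative only).
rng = np.random.default_rng(1)
dots = rng.normal(size=(50, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
dots = dots - dots.mean(axis=0)

stick_sq = (dots ** 2).sum()             # fixed by the data alone

for angle_deg in (0, 30, 60, 90, 120, 150):
    theta = np.deg2rad(angle_deg)
    line = np.array([np.cos(theta), np.sin(theta)])    # unit direction
    shadow = dots @ line                               # signed length along the line
    perp = dots - np.outer(shadow, line)               # what is left off the line
    shadow_sq = (shadow ** 2).sum()
    perp_sq = (perp ** 2).sum()
    assert np.isclose(stick_sq, shadow_sq + perp_sq)   # kept + lost = total
    print(f"{angle_deg:3d} deg  kept {shadow_sq / stick_sq:.3f}  "
          f"lost {perp_sq / stick_sq:.3f}")
```

The kept and lost columns trade off against each other from row to row, but each row adds to 1.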
## Choosing How Many Shadows to Keep
Add the fractions from the biggest shadows downward until you hit 80%:
    PC1: 36%   cumulative: 36%
    PC2: 19%   cumulative: 55%
    PC3: 11%   cumulative: 66%
    PC4:  7%   cumulative: 73%
    PC5:  7%   cumulative: 80%   <- reaches 80%
    PC6:  5%   cumulative: 85%
With 4 PCs you have ~73%. With 5 PCs you have ~80%. Most of the structure
is captured in the first 5 or 6 shadows. The remaining PCs each carry only a
few percent, much of it noise.
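The running-total rule can be written as a tiny helper. This is a sketch: `components_needed` is a hypothetical name of mine, and it is fed only the first four fractions from the table, so the thresholds tried here are lower than 80%.

```python
import numpy as np

def components_needed(fractions, threshold):
    """Smallest number of leading PCs whose fractions sum to >= threshold.

    Returns None if the threshold is never reached (hypothetical helper).
    """
    cumulative = np.cumsum(fractions)
    reached = np.nonzero(cumulative >= threshold)[0]
    return int(reached[0]) + 1 if reached.size else None

# First four fractions from the table above.
first_four = [0.36, 0.19, 0.11, 0.07]

print(components_needed(first_four, 0.50))   # 2
print(components_needed(first_four, 0.60))   # 3
print(components_needed(first_four, 0.80))   # None: four shadows are not enough
```

The explicit None matters because `np.argmax` on an all-False array returns 0, which would silently report "1 PC" when the threshold is never reached.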
## The Recipe (Loadings)
Each PC is a RECIPE -- how much of each original column goes into it.
PC1 = 0.14 * alcohol + 0.16 * malic_acid + ... + 0.32 * proline
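The numbers (loadings) tell you which columns the PC leans on. A loading that
is large in absolute value (plus or minus) means that column is important for
that PC. For PC1 on the wine data, the largest loadings tend to fall on
flavanoids and total_phenols. After standardising, every column has the same
spread, so no column "varies the most" any more; a column earns a large loading
by moving in step with many other columns, and that shared movement is what
drives the longest shadow.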
## The Scores (Transform)
Every wine gets a PC1 score and a PC2 score. These are the coordinates you
plot. The original 13 columns are crushed into 2 numbers -- 1 dot on a flat
page.
    wine    PC1      PC2
    -----   ------   ------
    1        2.13    -0.45
    2        1.62    -0.85
    3        2.05     0.32
    ...      ...      ...
    178     -2.41     0.78
The scatter plot of PC1 vs PC2 shows the cloud of wines crushed to 2D.
Each dot is one wine. Dots close together = wines that look chemically similar
through PC1 and PC2; whatever separates them in PCs 3-13 is invisible here.
## Blowing It Back Up (Reconstruction)
If you take only PC1 and PC2 scores and multiply back by the loadings,
you get a BLURRY version of the original 13 columns -- blurry because
you threw away PCs 3-13.
original alcohol = 14.23
reconstructed (2 PCs) ~= 13.85 (off by ~0.4)
The more PCs you keep, the less blur:
    keep 2 PCs:  MSE ~ 0.45
    keep 5 PCs:  MSE ~ 0.20
    keep all 13: MSE = 0.00 (perfect, but pointless)
Now line those numbers up against the kept fractions from earlier:
    keep 2 PCs: kept ~ 55% -> lost ~ 45% -> MSE ~ 0.45
    keep 5 PCs: kept ~ 80% -> lost ~ 20% -> MSE ~ 0.20
Not a coincidence. On data standardised to spread 1 (dividing by n, as
StandardScaler does), the reconstruction MSE averaged over every cell IS the
thrown-away fraction -- the same kept + lost = fixed total from the
stick-shadow-perp section, read from the lost side. You never need to
run the reconstruction to know its error: 1 minus the kept fraction
already told you.
Reconstruction error measures "how much structure was lost" when you
crushed the room. Your goal is to lose as little as possible while
still being able to draw the picture on flat paper.
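The "MSE is the lost fraction" claim can be checked without scikit-learn. The sketch below assumes a made-up 5-column sheet (not the wine data) and does the PCA by SVD in NumPy; the assert inside the loop is the claim itself.

```python
import numpy as np

# Made-up sheet: 100 rows, 5 correlated columns (illustrative, not the wine data).
rng = np.random.default_rng(2)
raw = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

# Standardise with the divide-by-n spread, the convention StandardScaler uses.
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# PCA via SVD: rows of Vt are the recipes, strongest first.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
fractions = S ** 2 / np.sum(S ** 2)

for k in (1, 2, 5):
    recipes = Vt[:k]                      # keep the first k recipes
    scores = X @ recipes.T                # crush
    recon = scores @ recipes              # blow back up
    mse = np.mean((X - recon) ** 2)
    lost = 1.0 - fractions[:k].sum()
    print(f"keep {k}: MSE = {mse:.4f}, lost fraction = {lost:.4f}")
    assert np.isclose(mse, lost)
```

The match depends on the standardising: with raw columns, or with a spread worked using n-1, the MSE is still proportional to the lost fraction but no longer equal to it.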
## Common Tripwires I Caught
TRIPWIRE 1: Standardise BEFORE PCA, not after.
PCA hunts for the direction of greatest SPREAD. If one column has a
spread in the hundreds (proline) and another has a spread well under 1
(ash), PCA fixates on proline and ignores everything else. Standardise
first or the first PC is just "the column with the largest numbers."
TRIPWIRE 2: Loadings are NOT correlations.
The loading tells you the recipe for the PC. A loading of 0.5 means
"half a part of this column goes into the PC." It does NOT mean
"this column correlates 0.5 with the PC." Those are different numbers.
TRIPWIRE 3: PC1 vs PC2 scatter has near-zero correlation.
PCA forces every PC to be at RIGHT ANGLES (uncorrelated) with every
other PC. If your PC1 and PC2 are correlated, something is wrong
with the fit or the data.
TRIPWIRE 4: More PCs is not always better.
Keeping all 13 PCs means MSE=0, but you also kept all the noise.
The point of PCA is to drop the noisy dimensions and keep only the
strong patterns. The 80% cumulative threshold is a rule of thumb,
not a law.
TRIPWIRE 5: Reconstruction gives you standardised data, not raw.
When you inverse_transform from PCA, you get back numbers in the
standardised space (mean=0, spread=1), not in the original units.
To get raw units, you need to reverse the standardisation as well.
TRIPWIRE 6: explained_variance_ratio_ vs explained_variance_.
The ratio is the fraction (0 to 1) of total variance. The raw
variance is the actual spread value. The ratio is what you use for
"how much is captured" and for the cumulative plot.
TRIPWIRE 7: The furthest dot from the origin on PC1-PC2 is unusual.
In the scatter plot, the dot farthest from (0, 0) is the most
extreme wine in the 2D crushed view. It might be an outlier or
just a very distinctive chemical profile. Worth checking.
TRIPWIRE 8: A low kept-fraction means the PICTURE LIES -- do not read groups off it.
When PC1+PC2 keep only, say, 40% of the spread, 60% died in the
sideways gaps -- and that lost 60% can hold two dots that are far
apart in truth but land on the SAME spot in the drawing. Two real
islands can print as one, or one as two. Trust the islands you see
only when the kept fraction is high (close to 1); a low fraction means
the flat page is hiding most of the real placement.
TRIPWIRE 9: The kept fraction divides by the TOTAL spread, not the dot count.
The fraction is sum(shadow^2) / sum(stick^2) -- a slice of the spread
pie over the whole pie. Dividing by the number of dots instead gives
"spread per dot", a per-head variance that is not a fraction at all and
is not guaranteed to land between 0 and 1.
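Tripwires 2 and 3 are the easiest to confirm with numbers. A sketch on a made-up standardised sheet (not the wine data), with the PCA done by SVD; the names are mine. It prints a loading next to the matching correlation, then the correlation between PC1 and PC2 scores.

```python
import numpy as np

# Made-up standardised sheet (illustrative, not the wine data).
rng = np.random.default_rng(3)
raw = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T                          # one column of scores per PC

# Tripwire 2: loading vs correlation between column 0 and PC1.
loading = Vt[0, 0]
corr = np.corrcoef(X[:, 0], scores[:, 0])[0, 1]
pc1_spread = scores[:, 0].std()
print(f"loading     = {loading:+.3f}")
print(f"correlation = {corr:+.3f}")
print(f"loading * spread of PC1 = {loading * pc1_spread:+.3f}")

# Tripwire 3: PC1 and PC2 scores are uncorrelated on the data used for the fit.
r12 = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]
print(f"corr(PC1, PC2) = {r12:.1e}")
```

On standardised data the correlation equals the loading times the spread of that PC's scores, so the two numbers coincide only when that spread happens to be 1.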
## The Labels, Last
    Plain term used above                  Standard label
    -----------------------------------    ------------------------------------------
    crush a many-wall room to flat         dimensionality reduction
    stick from middle to dot               the centred data vector (its norm)
    shadow along the line (kept)           the PC score / projection
    sideways gap off the line (lost)       the reconstruction residual
    stick^2 = shadow^2 + perp^2            Pythagoras / orthogonal decomposition
    longest shadow / strongest direction   first principal component (PC1)
    second shadow (at right angle)         second principal component (PC2)
    the recipe for a shadow                loadings / components_
    each wine's coordinate on the shadow   score / transformed data
    how much each shadow carries           explained variance ratio (PVE)
    keep shadows until 80% captured        cumulative PVE threshold
    blow the shadow back up                inverse transform / reconstruction
    blurriness after blowing up            reconstruction error (MSE)
    standardise before crushing            StandardScaler before PCA
    column importance in the recipe        loading magnitude (absolute value)
    the 13 numbers crunched into 2         2D embedding / projection
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
>> NEW TO PYTHON? Each named once:
    PCA(n_components=...)        -- the shadow-finder machine
    .fit(X)                      -- learn the recipes (loadings)
    .transform(X)                -- get each wine's PC coordinates (scores)
    .inverse_transform(scores)   -- blow the scores back up to original dims
    .components_                 -- the loadings (recipe per PC)
    .explained_variance_ratio_   -- fraction per PC
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
# load
df = pd.read_csv("wine.csv")
print(df.shape) # (178, 13)
# standardise
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# fit PCA (keep all to see how much each PC carries)
pca = PCA()
pca.fit(X_scaled)
# variance explained
pve = pca.explained_variance_ratio_
cum_pve = np.cumsum(pve)
print(f"PC1: {pve[0]:.3f}, PC2: {pve[1]:.3f}")
print(f"First 2 PCs capture: {cum_pve[1]:.3f}")
# how many to reach 80%?
n_80 = int(np.argmax(cum_pve >= 0.80) + 1)
print(f"Need {n_80} PCs for 80%")
# loadings (the recipe)
loadings = pca.components_
top_feat = df.columns[np.argmax(np.abs(loadings[0]))]
print(f"PC1 leans hardest on: {top_feat}")
# transform (scores)
scores = pca.transform(X_scaled)
scores_df = pd.DataFrame(
    scores,
    columns=[f"PC{i+1}" for i in range(scores.shape[1])],  # PC1 ... PC13
)
# plot PC1 vs PC2
plt.figure(figsize=(7, 6))
plt.scatter(scores[:, 0], scores[:, 1], alpha=0.7, edgecolor="k")
plt.axhline(0, color="gray", lw=1)
plt.axvline(0, color="gray", lw=1)
plt.xlabel(f"PC1 ({pve[0]*100:.1f}%)")
plt.ylabel(f"PC2 ({pve[1]*100:.1f}%)")
plt.title("Wine: PC1 vs PC2")
plt.grid(True, linestyle="--", alpha=0.4)
plt.show()
# furthest from origin
dist = np.sqrt(scores[:, 0]**2 + scores[:, 1]**2)
print(f"Furthest wine (row index): {dist.argmax()}")
# reconstruction
def reconstruction_error(k):
    """MSE between the standardised data and its rebuild from k PCs."""
    pca_k = PCA(n_components=k)
    s = pca_k.fit_transform(X_scaled)      # crush to k scores per wine
    recon = pca_k.inverse_transform(s)     # blow back up to 13 columns
    return round(mean_squared_error(X_scaled, recon), 4)
print(f"Recon MSE with 2 PCs: {reconstruction_error(2)}")
print(f"Recon MSE with 5 PCs: {reconstruction_error(5)}")
print(f"Recon MSE with all PCs: {reconstruction_error(13)}")
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================