==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 1 OF 6
Looking at a Sheet With No Answers: Means, Distance, and the Ruler Problem
Posted: 2026-06-09 . Author: Rahul Rai . Tags: eda, distance, standardization, unsupervised
============================================================================================
PATH . post 16 of 28 (start of the unsupervised-learning chapter)
Every post so far had an answer column to check against -- house price, car mileage,
sick or well, diabetes score. This post has NO answer column at all. Just 50 states
and 3 measurements. No right answer means no "did the machine get it right?" Instead
the question is: which states look like each other? Which ones are strange?
The tools are simple: mean, spread, and a ruler that measures the gap between two rows.
The trap is also simple: one column measured in hundreds shouts louder than one measured
in tens, and the gap gets hijacked by the loudest column unless you put every column on
the SAME ruler first.
## The Sheet
State         Murder   Assault   UrbanPop
----------    ------   -------   --------
Alabama         13.2       236         58
Alaska          10.0       263         48
Arizona          8.1       294         80
Arkansas         8.8       190         50
California       9.0       276         91
Colorado         7.9       204         78
...              ...       ...        ...
Wyoming          6.8       161         60
50 states (rows). 3 measurements (columns). Zero answer columns.
Murder = murder arrests per 100,000 people
Assault = assault arrests per 100,000 people
UrbanPop = percent of people living in cities
No right answer. The goal: find which states are most alike.
## The Middle and the Spread
For each column, compute two numbers -- the middle (mean) and the spread
(standard deviation). Add up all 50 values, divide by 50 = mean.
For the spread: measure each value's gap from the mean, square it, average
the squares, take the square root.
With the full sheet in front of you (all 50 states):
column        mean    spread
--------    ------    ------
Murder        7.79      4.36
Assault     170.76     83.34
UrbanPop     65.54     14.47
Assault's mean is ~171 -- about twenty-two times Murder's mean (170.76 / 7.79
~= 21.9). Its spread is ~83 -- about nineteen times Murder's (83.34 / 4.36 ~= 19.1).
Already a warning light: Assault's numbers are much bigger. It will dominate any
gap measurement unless we intervene.
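The mean-and-spread recipe above can be sketched in plain Python. As an assumption for illustration, this uses only the six Murder values visible in the sheet excerpt, so its results will not match the full 50-state table, and the variable names are my own.

```python
import math
import statistics

# Murder values for the six states shown in the excerpt (NOT the full sheet)
murder = [13.2, 10.0, 8.1, 8.8, 9.0, 7.9]

# middle: add everything up, divide by the count
mean = sum(murder) / len(murder)

# spread: gap from the mean, squared, averaged, then square-rooted
squared_gaps = [(value - mean) ** 2 for value in murder]
spread = math.sqrt(sum(squared_gaps) / len(murder))

# cross-check against the standard library:
# pstdev divides by n (ddof=0), stdev divides by n - 1 (ddof=1)
spread_population = statistics.pstdev(murder)
spread_sample = statistics.stdev(murder)

print(round(mean, 2), round(spread, 2), round(spread_sample, 2))
```

The hand-rolled spread should agree with `statistics.pstdev`, and the sample version comes out a little larger because it divides by one less.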
## The Ruler Problem
Look at the ranges (max - min):
Murder: 17.4 - 0.8 = 16.6
Assault: 337 - 45 = 292
UrbanPop: 91 - 32 = 59
Assault has the widest range by far. When you measure "how far apart are two
states", the Assault gap can be as large as 292 units, while the Murder gap is at
most 16.6 units. On range alone, the Assault column shouts about 17.6 times louder
than Murder (292 / 16.6 ~= 17.6). The gap number mostly reflects Assault, not the
full picture.
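The range arithmetic above is small enough to check in a few lines. This is a minimal sketch that hard-codes the minimum and maximum of each column as quoted in this section; the dictionary names are hypothetical.

```python
# (min, max) per column, copied from the ranges listed above
extremes = {
    "Murder": (0.8, 17.4),
    "Assault": (45, 337),
    "UrbanPop": (32, 91),
}

# range = max - min
ranges = {col: high - low for col, (low, high) in extremes.items()}

# how many times wider each range is than Murder's
loudness = {col: width / ranges["Murder"] for col, width in ranges.items()}

for col in ranges:
    print(col, round(ranges[col], 1), round(loudness[col], 1))
```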
## The Gap Between Two States: Two Rulers
IN HAND: a sheet of 50 states x 3 columns, a middle and a spread for each column,
and a warning: Assault's range (337 - 45 = 292) dwarfs Murder's (17.4 - 0.8 = 16.6).
This section adds the ruler itself -- two ways to measure the gap between two rows.
You have two states with three numbers each. How far apart are they?
--- STRAIGHT-LINE GAP (Euclidean distance) ---
gap = sqrt( (m1 - m2)^2 + (a1 - a2)^2 + (u1 - u2)^2 )
Square each difference so negative and positive gaps don't cancel. Add them.
Take the square root to get back to the original units.
--- CITY-BLOCK GAP (Manhattan distance) ---
gap = |m1 - m2| + |a1 - a2| + |u1 - u2|
No squaring. No square root. Just absolute differences added up.
Called "city-block" because it measures like walking around a grid
instead of cutting diagonally through the block.
## A Worked Example: Alabama vs Alaska, by Hand
Alabama: Murder=13.2 Assault=236 UrbanPop=58
Alaska: Murder=10.0 Assault=263 UrbanPop=48
STRAIGHT-LINE:
Murder gap: 13.2 - 10.0 = 3.2 squared = 3.2 * 3.2 = 10.24
Assault gap: 236 - 263 = -27 squared = 27 * 27 = 729
UrbanPop gap: 58 - 48 = 10 squared = 10 * 10 = 100
sum of squares: 10.24 + 729 + 100 = 839.24
sqrt: sqrt(839.24) ~= 28.97
Straight-line gap = 28.97
CITY-BLOCK:
Murder gap: |13.2 - 10.0| = 3.2
Assault gap: |236 - 263| = 27
UrbanPop gap: |58 - 48| = 10
sum: 3.2 + 27 + 10 = 40.2
City-block gap = 40.2
The city-block gap is never smaller than the straight-line gap for the same
two points (the diagonal is never longer than the L-shaped walk). The two are
equal only when the points differ in at most one column.
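The hand calculation above can be double-checked with two small functions. This is a sketch using only the standard library; the function names are my own, chosen to match the plain terms used in this post.

```python
import math

def straight_line_gap(x, y):
    """Euclidean distance: square each difference, add, take the root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block_gap(x, y):
    """Manhattan distance: add up the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# (Murder, Assault, UrbanPop) for the two states in the worked example
alabama = (13.2, 236, 58)
alaska = (10.0, 263, 48)

print(round(straight_line_gap(alabama, alaska), 2))  # 28.97
print(round(city_block_gap(alabama, alaska), 2))     # 40.2
```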
>> YOUR TURN
Two more states (made-up): X has Murder 6.0, Assault 200, UrbanPop 50; Y has
Murder 9.0, Assault 240, UrbanPop 60. Work BOTH gaps on the slate.
check your slate: STRAIGHT-LINE -- gaps 6 - 9 = -3 (sq 9), 200 - 240 = -40
(sq 1600), 50 - 60 = -10 (sq 100); sum 9 + 1600 + 100 = 1709; sqrt(1709)
~= 41.3. CITY-BLOCK -- 3 + 40 + 10 = 53. City-block (53) tops straight-line
(41.3), as the L-walk does whenever more than one column differs -- and Assault's
1600 drowns the other two.
## The Sheet of Gaps
Now repeat the straight-line gap for ALL pairs of states. 50 states -> 50x50
sheet where each cell holds the gap between state i and state j:
         AL      AK      AZ      AR      CA      CO    ...
AL [      0   28.97     ...     ...     ...     ...  ]
AK [  28.97       0     ...     ...     ...     ...  ]
AZ [    ...     ...       0     ...     ...     ...  ]
AR [    ...     ...     ...       0     ...     ...  ]
CA [    ...     ...     ...     ...       0     ...  ]
CO [    ...     ...     ...     ...     ...       0  ]
...
Diagonal = 0 (a state is zero distance from itself).
Symmetric: AL-AK = AK-AL.
This sheet IS the foundation of clustering. Every grouping method starts
from some version of "who is closest to whom."
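A small version of the sheet of gaps can be built for the six states visible in the excerpt. This is a sketch, not the full 50x50 sheet: the two-letter labels and variable names are my own, and `math.dist` (Python 3.8+) stands in for the straight-line formula.

```python
import math

# the six rows visible in the excerpt: (Murder, Assault, UrbanPop)
rows = {
    "AL": (13.2, 236, 58),
    "AK": (10.0, 263, 48),
    "AZ": (8.1, 294, 80),
    "AR": (8.8, 190, 50),
    "CA": (9.0, 276, 91),
    "CO": (7.9, 204, 78),
}
labels = list(rows)

# cell [i][j] holds the straight-line gap between state i and state j
gaps = [[math.dist(rows[a], rows[b]) for b in labels] for a in labels]

for label, line in zip(labels, gaps):
    print(label, [round(gap, 2) for gap in line])
```

The printed sheet should show zeros down the diagonal and the same number on both sides of it.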
## The Same Ruler: Standardisation
The gap between Alabama and Alaska (28.97) is heavily influenced by Assault
because Assault's numbers are huge (236 vs 263 = 27 gap). Of the 839.24 sum of
squares, Assault supplies 729, about 87%. The Murder gap (3.2, squared 10.24)
barely registers.
Fix: put every column on the SAME ruler before measuring gaps.
standardised value = (raw value - column mean) / column spread
Murder: mean=7.79, spread=4.36
Assault: mean=170.76, spread=83.34
UrbanPop: mean=65.54, spread=14.47
Alabama after standardisation:
Murder z: (13.2 - 7.79) / 4.36 = 5.41 / 4.36 = 1.24
Assault z: (236 - 170.76) / 83.34 = 65.24 / 83.34 = 0.78
UrbanPop z: (58 - 65.54) / 14.47 = -7.54 / 14.47 = -0.52
Alaska after standardisation:
Murder z: (10.0 - 7.79) / 4.36 = 2.21 / 4.36 = 0.51
Assault z: (263 - 170.76) / 83.34 = 92.24 / 83.34 = 1.11
UrbanPop z: (48 - 65.54) / 14.47 = -17.54 / 14.47 = -1.21
Do the same for all 50 states and every column ends up with mean ~0 and spread ~1.
No column shouts louder.
Standardised Alabama:
Murder: 1.24 Assault: 0.78 UrbanPop: -0.52
Standardised Alaska:
Murder: 0.51 Assault: 1.11 UrbanPop: -1.21
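The same-ruler formula can be sketched as a tiny function. It assumes the column means and spreads quoted in this post rather than recomputing them from a file, and the function and variable names are hypothetical.

```python
# column means and spreads as quoted in the post (full 50-state sheet)
means = {"Murder": 7.79, "Assault": 170.76, "UrbanPop": 65.54}
spreads = {"Murder": 4.36, "Assault": 83.34, "UrbanPop": 14.47}

def standardise(row):
    """Return the z-score of every column in a {column: raw value} row."""
    return {col: (value - means[col]) / spreads[col] for col, value in row.items()}

alabama = {"Murder": 13.2, "Assault": 236, "UrbanPop": 58}
alaska = {"Murder": 10.0, "Assault": 263, "UrbanPop": 48}

z_alabama = standardise(alabama)
z_alaska = standardise(alaska)

print({col: round(z, 2) for col, z in z_alabama.items()})
print({col: round(z, 2) for col, z in z_alaska.items()})
```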
## The Fair Gap (After Same-Ruler)
Straight-line gap on standardised numbers:
Murder gap: 1.24 - 0.51 = 0.73 squared = 0.73 * 0.73 = 0.53
Assault gap: 0.78 - 1.11 = -0.33 squared = 0.33 * 0.33 = 0.11
UrbanPop: -0.52 - (-1.21) = 0.69 squared = 0.69 * 0.69 = 0.48
sum of squares: 0.53 + 0.11 + 0.48 = 1.12
sqrt: sqrt(1.12) ~= 1.06
Fair gap = 1.06
The raw gap was 28.97. The fair gap is 1.06. The raw gap was dominated by
Assault's units (hundreds). The fair gap treats every column equally.
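To see how the balance shifts, here is a sketch that computes the fair gap and each column's share of the squared gap, before and after standardising. It assumes the means and spreads quoted in this post; `squared_shares` is a hypothetical helper, not a library function.

```python
import math

# order of every tuple: Murder, Assault, UrbanPop
means = (7.79, 170.76, 65.54)    # from the post
spreads = (4.36, 83.34, 14.47)   # from the post
alabama = (13.2, 236, 58)
alaska = (10.0, 263, 48)

def zscores(row):
    """Put one row on the same ruler, column by column."""
    return [(v - m) / s for v, m, s in zip(row, means, spreads)]

def squared_shares(x, y):
    """Fraction of the squared gap that each column contributes."""
    squares = [(a - b) ** 2 for a, b in zip(x, y)]
    total = sum(squares)
    return [sq / total for sq in squares]

fair_gap = math.dist(zscores(alabama), zscores(alaska))
raw_shares = squared_shares(alabama, alaska)
fair_shares = squared_shares(zscores(alabama), zscores(alaska))

print(round(fair_gap, 2))                    # 1.06
print([round(s, 2) for s in raw_shares])     # [0.01, 0.87, 0.12]
print([round(s, 2) for s in fair_shares])    # [0.48, 0.09, 0.43]
```

On the raw ruler Assault supplies most of the squared gap; on the same ruler, for this particular pair, Murder and UrbanPop carry most of it.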
>> YOUR TURN
A state (made-up) has Murder = 16.0. Murder's mean is 7.79 and its spread is
4.36. Put that one number on the same ruler.
check your slate: z = (16.0 - 7.79) / 4.36 = 8.21 / 4.36 ~= 1.88. The state
sits about 1.88 spreads ABOVE the average murder rate -- a high-murder state,
now in plain ruler-units any other column can be compared against.
## The Closest Pair
Build the 50x50 fair gap sheet (standardised distances instead of raw).
Find the smallest non-zero entry.
The two states with the smallest gap = most similar state-pair. (Which pair?
That is for you to look up from the full sheet.)
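The search itself can be sketched on a small scale without giving away the full-sheet answer. The example below is an illustration under two assumptions: it looks only at the six states visible in the excerpt, and it standardises them with the full-sheet means and spreads quoted in this post.

```python
import math

means = (7.79, 170.76, 65.54)    # Murder, Assault, UrbanPop (from the post)
spreads = (4.36, 83.34, 14.47)

# only the six rows visible in the excerpt, not the full sheet
rows = {
    "Alabama": (13.2, 236, 58),
    "Alaska": (10.0, 263, 48),
    "Arizona": (8.1, 294, 80),
    "Arkansas": (8.8, 190, 50),
    "California": (9.0, 276, 91),
    "Colorado": (7.9, 204, 78),
}

def zscores(row):
    return [(v - m) / s for v, m, s in zip(row, means, spreads)]

scaled = {name: zscores(row) for name, row in rows.items()}
names = list(scaled)

# visit each pair once (b always comes after a), which also skips
# the zero state-to-self gaps on the diagonal
best_pair, best_gap = None, math.inf
for i, a in enumerate(names):
    for b in names[i + 1:]:
        gap = math.dist(scaled[a], scaled[b])
        if gap < best_gap:
            best_pair, best_gap = (a, b), gap

print(best_pair, round(best_gap, 2))  # ('Arizona', 'California') 0.82
```

With all 50 rows in `rows`, the same loop would answer the question for the full sheet.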
## Ask the Closest Neighbors
For any state, you can find its k nearest neighbors -- the k states with the
smallest fair gaps to it. For California's 3 nearest neighbors, your paper
would list the three states with the smallest standardised gaps to California.
This is the same "ask the closest rows" idea from Chapter 1 (k-nearest
neighbours), but with NO answer column. You just list who is nearby.
That is the core of unsupervised learning -- looking at neighbours without
a right answer to check.
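A neighbour lookup can be sketched in a few lines. As before, this is an illustration on the six states visible in the excerpt only, standardised with the means and spreads quoted in this post, so the neighbours it lists are not California's neighbours in the full sheet. `k_nearest` is a hypothetical helper name.

```python
import math

means = (7.79, 170.76, 65.54)    # Murder, Assault, UrbanPop (from the post)
spreads = (4.36, 83.34, 14.47)

# only the six rows visible in the excerpt, not the full sheet
rows = {
    "Alabama": (13.2, 236, 58),
    "Alaska": (10.0, 263, 48),
    "Arizona": (8.1, 294, 80),
    "Arkansas": (8.8, 190, 50),
    "California": (9.0, 276, 91),
    "Colorado": (7.9, 204, 78),
}

def zscores(row):
    return [(v - m) / s for v, m, s in zip(row, means, spreads)]

scaled = {name: zscores(row) for name, row in rows.items()}

def k_nearest(anchor, k):
    """Names of the k rows with the smallest fair gap to `anchor`."""
    gaps = [(math.dist(scaled[anchor], scaled[other]), other)
            for other in scaled if other != anchor]  # never list the anchor itself
    return [name for _, name in sorted(gaps)[:k]]

print(k_nearest("California", 3))  # ['Arizona', 'Colorado', 'Alabama']
```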
## Common Tripwires I Caught
TRIPWIRE 1: The raw distance is not "wrong," just unfair.
The raw Alabama-Alaska gap (28.97) is mathematically correct but
practically misleading -- Assault's large units dominate it.
When columns sit on very different scales, standardise before using any
distance-based method.
TRIPWIRE 2: ddof=0 vs ddof=1 for spread.
The assignment says ddof=0 (population standard deviation).
This matches sklearn's StandardScaler. ddof=1 gives a slightly
different number (sample standard deviation) that does not match
the scaler. When standardising, the formula must agree.
TRIPWIRE 3: Missing index_col=0 when loading.
If you forget index_col=0, the state names become a regular column
instead of row labels. The shape becomes (50, 4) instead of (50, 3),
and the text column of state names gets swept into the mean, spread,
and distance calculations, which will typically fail on it.
TRIPWIRE 4: Euclidean vs Manhattan -- both are valid, different uses.
Euclidean is the default, but Manhattan is less sensitive to outliers
(no squaring, so one large gap is not amplified as much). Neither is
"right" or "wrong" -- they measure different kinds of distance.
TRIPWIRE 5: The diagonal is zero but the minimum should not be zero.
When searching for the closest pair, you must exclude the diagonal
(state-to-self distance = 0). Set diagonal entries to infinity
before searching for the minimum.
TRIPWIRE 6: Manhattan >= Euclidean for the same two points.
The straight-line diagonal is the shortest path. City-block (walking
the grid) is always longer or equal. If your Manhattan is smaller
than Euclidean, you made an arithmetic mistake.
TRIPWIRE 7: Standardising can change the relative ordering of distances.
Two states that were closest in raw numbers might not be closest
after standardisation. A column with large range dominated the
raw gap; after standardisation, a different column might matter more.
When the ordering changes, that is expected, not a bug.
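Tripwire 7 can be seen on a small scale. The sketch below is an illustration on the six states visible in the excerpt only, standardised with the means and spreads quoted in this post; `nearest` is a hypothetical helper. It asks who sits closest to Alaska on the raw ruler and again on the same ruler.

```python
import math

means = (7.79, 170.76, 65.54)    # Murder, Assault, UrbanPop (from the post)
spreads = (4.36, 83.34, 14.47)

# only the six rows visible in the excerpt, not the full sheet
rows = {
    "Alabama": (13.2, 236, 58),
    "Alaska": (10.0, 263, 48),
    "Arizona": (8.1, 294, 80),
    "Arkansas": (8.8, 190, 50),
    "California": (9.0, 276, 91),
    "Colorado": (7.9, 204, 78),
}

def zscores(row):
    return [(v - m) / s for v, m, s in zip(row, means, spreads)]

scaled = {name: zscores(row) for name, row in rows.items()}

def nearest(anchor, table):
    """Name of the row closest to `anchor`, the anchor itself excluded."""
    others = [name for name in table if name != anchor]
    return min(others, key=lambda name: math.dist(table[anchor], table[name]))

print(nearest("Alaska", rows))    # raw ruler:  Alabama
print(nearest("Alaska", scaled))  # same ruler: Arkansas
```

Among these six, the raw ruler pushes Arkansas away from Alaska because of a 73-unit Assault gap; once Assault stops shouting, Arkansas becomes the closer of the two.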
## The Labels, Last
Plain term used above                 Standard label
-----------------------------------   ------------------------------------------
sheet with no answer column           unlabelled data / unsupervised learning
middle of a column                    mean / average
spread of a column                    standard deviation (std / sigma)
range (max - min)                     feature range
straight-line gap                     Euclidean distance (L2)
city-block gap                        Manhattan distance (L1)
sheet of gaps                         pairwise distance matrix
same ruler / standardised             z-score / StandardScaler
fair gap                              standardised Euclidean distance
closest pair (smallest gap)           nearest neighbours
closest k states to anchor            k-nearest neighbours (KNN lookup)
column                                feature / variable
row                                   observation / sample
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
>> NEW TO PYTHON? Each named once:
pd.read_csv() -- load a sheet from a .csv file
df.shape -- (rows, columns)
df.columns.tolist() -- list of column names
df.mean() -- mean of every column
df.std(ddof=0) -- population standard deviation
df.max() - df.min() -- range per column
StandardScaler() -- put every column on mean=0, spread=1
np.sqrt() -- square root
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # load the sheet; index_col=0 turns the state names into row labels
    df = pd.read_csv("us_arrests.csv", index_col=0)
    print(df.shape)             # (50, 3)
    print(df.columns.tolist())  # ['Murder', 'Assault', 'UrbanPop']

    # summary statistics (ddof=0 = population spread, the same as StandardScaler)
    means = df.mean().round(2).to_dict()
    stds = df.std(ddof=0).round(2).to_dict()
    ranges = (df.max() - df.min()).round(2).to_dict()

    # Euclidean (straight-line) distance between two rows, rounded to 3 places
    def euclidean_distance(x, y):
        return float(np.sqrt(np.sum((x - y) ** 2)).round(3))

    # Manhattan (city-block) distance between two rows, rounded to 3 places
    def manhattan_distance(x, y):
        return float(np.sum(np.abs(x - y)).round(3))

    # raw distance between Alabama and Alaska
    al = df.loc["Alabama"]
    ak = df.loc["Alaska"]
    print(euclidean_distance(al, ak))  # 28.97
    print(manhattan_distance(al, ak))  # 40.2

    # standardise: every column ends up with mean 0 and spread 1
    scaler = StandardScaler()
    scaled = scaler.fit_transform(df)
    df_scaled = pd.DataFrame(scaled, index=df.index, columns=df.columns)

    # fair distance matrix (50x50): cell [i, j] = gap between state i and state j
    n = len(df_scaled)
    dist_mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist_mat[i, j] = euclidean_distance(df_scaled.iloc[i],
                                                df_scaled.iloc[j])

    # closest pair: set the diagonal to infinity so state-to-self never wins
    np.fill_diagonal(dist_mat, np.inf)
    min_idx = np.unravel_index(dist_mat.argmin(), dist_mat.shape)
    pair = (df_scaled.index[min_idx[0]], df_scaled.index[min_idx[1]])
    print(pair)

    # k-nearest neighbours: the k states with the smallest fair gap to one state
    def find_k_nearest(state_name, k):
        idx = df_scaled.index.get_loc(state_name)
        row = dist_mat[idx].copy()
        row[idx] = np.inf  # exclude the state itself even if the diagonal is 0
        closest = np.argsort(row)[:k]
        return [df_scaled.index[i] for i in closest]
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 6 -- Finding Patterns Without Answers):
Part 1 (this post) .
Part 2 -- The Strongest Direction (PCA) .
Part 3 -- Grouping by Nearest Centre (K-Means) .
Part 4 -- The Family Tree (Hierarchical Clustering) .
Part 5 -- Both Tools on NCI60 (Re-visited) .
Part 6 -- Filling the Blanks (Recommender Systems)
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================