==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------
CHAPTER 1 . PREDICTING HOUSE PRICES . PART 2 OF 3
Ask the Closest Rows: Gap Formula, Same-Ruler Rule, and How Slow It Gets
Posted: 2026-06-01 . Author: Rahul Rai . Tags: knn, distance, non-parametric, complexity
============================================================================================
PATH . post 2 of 28
<- prev: House Prices 1: Guessing Prices End to End
next: House Prices 3: The Straight-Stick Rule ->
Here is the oldest trick in the book, and one you already use every day: to guess
something about a stranger, look at the people most like them. Want to know what a house
is worth? Find the houses most similar and see what they sold for. That is the whole
idea, and it barely needs a machine.
The ask-closest rule guesses by averaging the k closest working-pile right-answers.
Nothing is built ahead of time -- the whole pile is kept on a shelf and looked up each
time. Three honest steps -- measure the gap, pick the closest, average their answers.
But two quiet details decide whether it works at all: put every column on the same
ruler, and choose k with care. Get those wrong and the rule falls apart; get them right
and it is hard to beat.
## Working Pile and Hidden Pile
WORKING PILE (rows with answers) HIDDEN PILE (answers withheld)
rooms income age pop | price rooms income age pop | price
----------------------------- -----------------------------
4 3.2 6 320 | 3.4 3 3.0 5 410 | ?
8 1.5 1 880 | 1.5 ... | ?
Gaps are measured from the measured columns only. Hidden right-answers are read once,
at scoring -- never during guessing.
## Measuring the Gap
Straight-line gap over all measured columns:
gap(a, b) = sqrt( sum_j (a_j - b_j)^2 )
This is p = 2 of the general rule ( sum_j |a_j - b_j|^p )^(1/p). At p = 1 it counts
straight city-block steps (add up the plain gaps column by column, no squaring or root --
like walking a grid of streets, never cutting diagonally). Worked pair -- new row (3, 2, 5, 4):
row A: (4, 2, 6, 3) gaps (1, 0, 1, 1) gap = sqrt(1+0+1+1) = sqrt(3) ~= 1.73
row B: (8, 9, 1, 9) gaps (5, 7, 4, 5) gap = sqrt(25+49+16+25)= sqrt(115) ~= 10.72
>> YOUR TURN
Same new row (3, 2, 5, 4). A row C sits at (3, 4, 5, 2) (made-up). Compute its
gap before reading on.
check your slate: gaps 3-3 = 0, 2-4 = -2, 5-5 = 0, 4-2 = 2; squares 0, 4, 0, 4;
sum 0 + 4 + 0 + 4 = 8; gap = sqrt(8) ~= 2.83. Row C lands between A (sqrt(3) ~=
1.73) and B (sqrt(115) ~= 10.72) -- a middling neighbour, nowhere near the closest.
## Put Columns on the Same Ruler First
IN HAND: one gap rule -- subtract column by column, square, add, root -- worked on a
new row (3, 2, 5, 4): row A sat at sqrt(1+0+1+1) = sqrt(3) ~= 1.73, row B at
sqrt(25+49+16+25) = sqrt(115) ~= 10.72, so A is the neighbour. This section adds the
trap hiding inside that sum: columns measured on different rulers.
Here is the trap that sinks most first attempts. The gap adds up squared per-column
differences -- and a column measured in thousands (people) simply drowns out one
measured in single digits (rooms). Not because population matters more, but because its
numbers are bigger. An accident of units masquerades as importance. On the raw
California pile, the proud "8-column" rule is secretly a 1-column rule that only ever
listens to population.
raw columns:
rooms ~ 2 - 12 <- tiny gaps
people ~ 100 - 35000 <- huge gaps <- this column runs the whole show
Fix: shift each column to zero average and stretch it to spread of 1, using numbers
from the working pile only:
z_j = (x_j - mu_j) / sd_j
A concrete 3-row, 2-column example, by pencil. Working pile has 3 rows:
row rooms (x1) people (x2)
------------------------------------
P 4 320
Q 5 210
R 6 890
Column averages (mu):
mu1 = (4 + 5 + 6) / 3 = 15/3 = 5
mu2 = (320 + 210 + 890) / 3 = 1420/3 ~ 473.3
Column spreads (sd -- how far each column's numbers typically sit from its average):
sd1 = sqrt( ((4-5)^2 + (5-5)^2 + (6-5)^2) / 3 )
= sqrt( (1 + 0 + 1) / 3 ) = sqrt(2/3) ~ 0.816
sd2 = sqrt( ((320-473.3)^2 + (210-473.3)^2 + (890-473.3)^2) / 3 )
= sqrt( (23509 + 69389 + 173609) / 3 )
= sqrt( 266507 / 3 ) = sqrt(88836) ~ 298.1
Now shift each row:
row P: z1 = (4 - 5) / 0.816 = -1.225
z2 = (320 - 473.3) / 298.1 = -0.514
row Q: z1 = (5 - 5) / 0.816 = 0
z2 = (210 - 473.3) / 298.1 = -0.884
row R: z1 = (6 - 5) / 0.816 = 1.225
z2 = (890 - 473.3) / 298.1 = 1.398
After scaling, every column is centred at 0 and stretches to roughly -1.4 to +1.4.
Neither column dominates the gap by accident of its raw unit size.
Compute mu_j, sd_j on the working pile; apply the same constants to the hidden pile.
After this, every column contributes equally to the gap. (Squeezing to [0,1] instead of
unit spread is an alternative that works the same way.)
!! WARN: LEARN THE RULER ON THE WORKING PILE ONLY
Computing averages and spreads on the whole pile before splitting leaks hidden-pile
numbers into the build step. Learn on working, apply to both. Inside rotating folds,
relearn per round -- the bound pipeline does this.
## Guessing One Row
1. measure gap to every working-pile row O(n*d)
2. pick the k smallest gaps O(n) with a heap
3. average those k right-answers -> guess
An optional refinement weights closer rows more -- inverse-gap weights w_i = 1/gap_i:
guess = ( sum_i w_i * y_i ) / ( sum_i w_i )
Equal weights (w_i = 1) recover the plain average.
## How Slow Is It
The ask-closest rule is generous at study time and stingy at guess time -- it does no
work up front, then pays for it on every single question. The pile is stored, never
compressed, so every guess recomputes all gaps from scratch.
100 rows = 80 working + 20 hidden
work per hidden row = 80 * d gap computations
total work = 20 * 80 * d
time grows like people times columns
Count that bill in pencil strokes. In the toy above, one gap over d = 4 columns costs
4 subtractions, 4 squarings, 3 additions, 1 root = 12 strokes; 20 hidden rows x 80
working rows = 1,600 gaps, and 1,600 x 12 = 19,200 strokes -- one clerk's morning. On
the full pile (working = 20,640 x 8/10 = 16,512 rows; hidden = 20,640 - 16,512 = 4,128),
one guess over 8 columns costs 8 + 8 + 7 + 1 = 24 strokes per row, 16,512 x 24 = 396,288
in all -- and grading every hidden row runs to 4,128 x 396,288 = 1,635,876,864 strokes.
The room of clerks does NOT finish by lunch.
Sorting shelves (KD-tree, Ball-tree) cut average guess time to roughly
"time grows like columns times log of people" when
columns are few, but degrade toward brute force as columns pile up. Here is why:
few columns: nearby rows are truly nearby <- sorting shelf helps
many columns: all rows sit about the same gap away
"closest" stops meaning anything
## Worked by Hand
IN HAND: the gap rule (subtract column by column, square, add, root) and the same-ruler
fix z = (x - mu) / sd, with mu and sd learned on the working pile only. Each guess
costs one gap per working row, then the average of the k closest right-answers. This
section adds the whole rule, run end to end on a pile small enough for one slate.
One column (income) only, so the same-ruler step is moot here. New row at income 5.0,
k = 3:
working pile (income, price):
(5.6, 3.4) (4.0, 2.7) (3.8, 2.4) (2.5, 1.5) (7.3, 3.5)
gaps from 5.0:
|5.0 - 5.6| = 0.6 <- |5.0 - 4.0| = 1.0 <- |5.0 - 3.8| = 1.2 <-
|5.0 - 2.5| = 2.5 |5.0 - 7.3| = 2.3
3 closest right-answers: 3.4, 2.7, 2.4
guess = (3.4 + 2.7 + 2.4) / 3 = 8.5 / 3 ~= 2.83
>> YOUR TURN
Same five-row pile. A new row walks in at income 3.0 (made-up), k = 3. Work the
gaps, pick the 3 closest, average their prices.
check your slate: gaps |3.0-5.6| = 2.6, |3.0-4.0| = 1.0, |3.0-3.8| = 0.8,
|3.0-2.5| = 0.5, |3.0-7.3| = 4.3; 3 smallest gaps 0.5, 0.8, 1.0 -> prices 1.5,
2.4, 2.7; guess = (1.5 + 2.4 + 2.7) / 3 = 6.6 / 3 = 2.2. Move the question and
the guess moves with it -- the rule simply follows whoever its neighbours are.
## k Is the Stiff-vs-Jumpy Dial
k = 1 asks 1 row -> copies its noise exactly
zero mistake on working pile
bad on new rows (too jumpy)
k = large asks many rows -> smooths everything out
always a bit wrong, everywhere (too stiff)
at k = n_working, every guess = average price
right k -> somewhere in between -- found by rotating folds
## Picking k Without Touching the Hidden Pile
IN HAND: the full rule, run once by hand -- five working rows, new row at income 5.0,
three closest answers 3.4, 2.7, 2.4, guess (3.4 + 2.7 + 2.4) / 3 = 8.5 / 3 ~= 2.83 --
and one loose knob, k: small k too jumpy, large k too stiff. This section adds the
honest way to set that knob.
How do you find the right k without ever opening the hidden pile? The same rotating-folds
trick from Part 1: try each candidate k, score it by rotating folds on the working pile,
and keep the winner. Because k sets how jumpy the rule is, the fold scores fall to a
sweet spot and then climb again -- and that dip is your k. The hunt itself, in code, is
waiting at the end of the post.
## Origin
The rule feels too simple to be serious, so it is worth knowing it has a pedigree.
Evelyn Fix and Joseph Hodges wrote it down in 1951. Sixteen years later Thomas Cover and
Peter Hart proved something startling about it: as the pile grows without bound, even
the crude one-closest-row version makes at most TWICE as many mistakes as the best
possible rule on Earth. Ask a single neighbour and you are already halfway to perfect.
Not bad for a rule that does no thinking at all.
1. Guess = average of the k closest working right-answers (optionally weighted).
2. The gap adds squared per-column differences -- put columns on the same ruler,
learning it on the working pile only.
3. No building ahead; each guess costs O(n*d) and stores the whole pile.
4. k sets the stiff-vs-jumpy balance; pick it by rotating folds.
5. When columns pile up, close stops meaning close.
## Common Tripwires I Caught
TRIPWIRE 1: Shape slot -- rows vs columns
WRONG: X.shape[1] gives rows.
RIGHT: shape = (rows, cols). X.shape[0] = rows, X.shape[1] = columns.
Counting features = columns = X.shape[1].
TRIPWIRE 2: Bracket shape -- list vs bag
WRONG: X = df[{'a','b','c'}] (curly brackets = a bag, unordered).
RIGHT: X = df[['a','b','c']] (square brackets = a list, ordered).
Outer [] = fetch from the sheet. Inner [] = the list of names.
TRIPWIRE 3: random_state must be the SAME number everywhere
WRONG: random_state=1 in one place and RANDOM_STATE in another.
RIGHT: RANDOM_STATE = 42 everywhere. Different numbers give different
splits, so the hidden tests fail.
TRIPWIRE 4: k (neighbour count) and fold count (grading rounds) are
unrelated
WRONG: "k=25 with 5 folds -- that can't be right."
RIGHT: k=n_neighbors. 5 folds = n_splits. Totally different knobs.
TRIPWIRE 5: The toolbox maximises, so it negates the mistake
WRONG: pick the k with the highest raw score.
RIGHT: cross_val_score returns NEGATIVE numbers (it maximises).
"Biggest" = closest to zero = smallest mistake.
Un-negate with -scores.mean().
TRIPWIRE 6: min(cv, key=cv.get) vs plain min(cv)
WRONG: plain min(cv) on a shelf of {k: mistake} returns the smallest
KEY (k=1), not the value (mistake).
RIGHT: min(cv, key=cv.get) returns the KEY with the smallest VALUE.
Get the name of the winner, not the smallest name alphabetically.
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
Everything above was pencil and pictures. Here is the same machine in Python, using the
scikit-learn toolbox -- two snippets: the rule itself, and the hunt for the best k.
>> NEW TO PYTHON? Each piece below, named once:
from X import Y -- fetch a ready-made tool named Y out of toolbox X
name = value -- store a value under a name
thing.do(stuff) -- tell an object to do something (a "method")
def name(args): -- define your own reusable step (a "function")
[a, b, c] -- a list: an ordered row of things
{k: v for ...} -- a dict in one line: a labelled shelf, name -> value
Bind the ruler-learner to the ask-closest rule, so the ruler is relearned on each
round's building portion and never on the grading chunk:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_train, y_train) # stores pile; learns mu, sd on working rows
pred = knn.predict(X_test) # same ruler -> measure gaps -> average
rmse = mean_squared_error(y_test, pred) ** 0.5 # ** 0.5 means square root
print(round(rmse, 3))
And the hunt for the best k -- try each candidate, score it by rotating folds, keep the
winner. Passing the whole bound rule to the checker relearns the ruler each round, so
nothing leaks:
from sklearn.model_selection import KFold, cross_val_score
def fold_mistake(k, X_train, y_train):
rule = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scores = cross_val_score(rule, X_train, y_train, cv=kf,
scoring='neg_root_mean_squared_error')
return round(-scores.mean(), 3)
candidates = [1, 3, 5, 7, 9, 15, 25]
cv = {k: fold_mistake(k, X_train, y_train) for k in candidates} # k -> its mistake
best_k = min(cv, key=cv.get) # the k whose mistake is smallest
!! WARN: THREE THINGS THAT TRIP PEOPLE UP
(1) k (how many neighbours) and the fold count (how many grading rounds) are
unrelated -- k=25 with 5 folds is fine.
(2) The toolbox maximises its score, so it negates the mistake; un-negate with
-scores.mean().
(3) min(cv, key=cv.get) returns the key with the smallest value; plain min(cv)
returns the smallest key -- wrong answer.
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 1 -- Predicting House Prices):
Part 1 -- The Full Picture .
Part 2 (this post) .
Part 3 -- Straight-Stick Rule
<- Back to all posts
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
home . source on GitHub
==============================================================================================