==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------
CHAPTER 1 . PREDICTING HOUSE PRICES . PART 1 OF 3
Guessing House Prices, End to End (No Labels Until the End)
Posted: 2026-06-01 . Author: Rahul Rai . Tags: fundamentals, supervised-learning, knn
============================================================================================
PATH . post 1 of 28 (you are at the start -- read top to bottom)
Welcome -- this is where the blog begins, and it asks nothing of you but curiosity
and a pencil. No prior knowledge, no jargon you have not earned, no hand-waving past
the hard parts.
Here is the promise. By the end of this single post you will have built -- by hand, on
a real sheet of California house prices -- a machine that guesses the price of a house
it has never seen. Not a toy version. The same machinery the textbooks dress up in
Greek letters, except we build it first and name it last. Parts 2 and 3
then take the two rules apart screw by screw. For now, stand back and see the whole
thing at once.
** KEY: THE 1950 CONTRACT
Pretend it is 1950. Print these pages. Pencils, graph paper, a blackboard --
and a room of infinite, tireless clerks for the heavy arithmetic. No computer,
no calculator, anywhere in the teaching. Every number is recomputed where it is
needed (no page assumes you remember the last one), every worked example is
followed by a YOUR TURN drill for your slate, and every cost is counted in
clerk-steps. Your attention is the scarce thing; arithmetic is free.
>> HOW THIS BOOK IS LAID OUT (read once, then forget)
This blog is really a short book in six chapters (twenty-one posts), meant to be read top to bottom.
Every post draws the idea by hand first -- plain words, a pencil sketch, simple
arithmetic. Any runnable Python is gathered into a "The Code, If You Want It"
section at the very END of each post, so you can follow the whole story without ever
stepping over code. That code assumes almost no Python: the handful of things you
need are explained in one short line the first time they appear. Skip the code or
study it -- the idea stands on its own either way.
## The Job: Guess From a Sheet of Numbers
rooms income age ocean people | PRICE <- right-answer column
--------------------------------------------------------
4.2 3.5 25 1.0 320 | 1.4
5.1 5.6 10 0.5 210 | 2.8
3.8 2.1 42 3.2 890 | 0.9
+-- 5 of the 8 measured columns --+| the answer
20,640 rows. Each row is one California neighbourhood. 8 measured columns. One
right-answer column: median house price in $100,000s. The job sounds simple and hides
the whole game: build a rule that guesses the right answer for a row it has NEVER seen
-- not a row it already studied. Anyone can memorise answers they have been shown. The
trick is being right about the houses you have not.
!! WARN: .shape IS A PAIR, NOT A PRODUCT
df.shape returns (n_rows, n_cols) in that fixed order -- (20640, 9) -- never their
product.
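A quick way to see the pair for yourself, sketched on a blank stand-in sheet. The name `sheet` is hypothetical, and I am assuming a NumPy array here; its .shape follows the same (rows, columns) order the warning describes for df.shape.

```python
import numpy as np

# Stand-in for the sheet: 20,640 rows, 8 measured columns + 1 right-answer column.
sheet = np.zeros((20640, 9))

n_rows, n_cols = sheet.shape   # a pair, unpacked in (rows, columns) order
print(sheet.shape)             # (20640, 9)
print(sheet.size)              # 185760 -- the product lives in .size, not .shape
```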
## The One Rule: Hide Some Rows
ALL ROWS
################----
working pile (80%) hidden pile (20%)
^ locked away -- read once, at the very end
Before touching anything else, do the one thing every honest guesser does first: set
aside a random 20% of rows and lock them in a drawer. Their right answers must not
touch any decision -- not the rule, not any setting, not even how the numbers are
prepared. They exist for a single honest check, taken once, at the very end.
Why be so strict? The reason is mechanical, not moral. Any rule checked on the same
rows it was built from reports a flattering, false-low mistake -- like grading your own
exam with the answer key open. The ask-closest rule at k=1 (a rule we build a few
sections down: guess by copying the answer of the single most similar row) makes ZERO
mistakes on its own working pile -- every row's most similar row is ITSELF, so it copies
its own answer -- and then stumbles on new rows. The drawer is the only number that does
not lie.
** KEY: SHUFFLE FIRST
Rows often arrive sorted by district. Shuffle with a fixed starting point so the
split is reproducible and not ordered. The fixed point controls reproducibility; the
shuffle removes order bias.
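A minimal sketch of that shuffle-then-cut, using only Python's standard library. The helper name `shuffled_split` and the seed value 42 are my own choices for illustration, not part of any library; only the row count and the 80/20 cut come from the text.

```python
import random

N_ROWS = 20_640   # rows on the sheet
SEED = 42         # hypothetical fixed starting point; any fixed number works

def shuffled_split(n_rows, hidden_fraction=0.2, seed=SEED):
    """Return (working, hidden) lists of row numbers after a seeded shuffle."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)          # same seed -> same order every run
    n_hidden = round(n_rows * hidden_fraction)
    return rows[n_hidden:], rows[:n_hidden]    # working pile, hidden pile

working, hidden = shuffled_split(N_ROWS)
print(len(working), len(hidden))               # 16512 4128
```

Calling `shuffled_split(N_ROWS)` a second time hands back exactly the same two piles, which is the whole point of fixing the starting point.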
## Rule 1 -- Ask the Closest Rows
IN HAND: a sheet of 20,640 rows -- 8 measured columns plus one right-answer column
(price). Cut 80/20: working pile = 20,640 x 8/10 = 16,512 rows on the desk; hidden
pile = 20,640 - 16,512 = 4,128 rows locked in the drawer. Now the first guessing rule.
new row ?
|
measure gap to every working-pile row
|
pick the k closest
|
average their right-answers -> guess
gap(a, b) = sqrt( sum_j (a_j - b_j)^2 )
One honest IOU, named out loud: this formula is handed to you here, not derived.
WHY square-then-root measures distance (it is the ruler-on-graph-paper distance,
by Pythagoras) and what goes wrong with it when columns wear different units is
exactly Part 2's first job. For this post, watch it work on small numbers:
A concrete 2-column example, by pencil. New row has (rooms=4, income=3). Compare
to one working row (rooms=5, income=3):
column a_j (new) b_j (work) gap_j = a_j - b_j gap_j^2
-----------------------------------------------------------------
rooms 4 5 -1 1
income 3 3 0 0
sum = 1
gap = sqrt(1) = 1
Now compare the same new row to another working row (rooms=8, income=9):
column a_j (new) b_j (work) gap_j = a_j - b_j gap_j^2
-----------------------------------------------------------------
rooms 4 8 -4 16
income 3 9 -6 36
sum = 52
gap = sqrt(52) ~ 7.21
The first working row (gap=1) is much closer than the second (gap=7.21).
If k=1, the guess is the first row's price. If k=3, we average the 3
smallest-gap rows' prices.
>> YOUR TURN (one clerk, one slate)
Same new row (rooms=4, income=3). A third working row sits at (rooms=4,
income=7). Compute its gap before reading on.
check your slate: rooms 4-4 = 0, square 0; income 3-7 = -4, square 16;
sum 0 + 16 = 16; gap = sqrt(16) = 4. Of the three working rows the closest
is still the first (1 < 4 < 7.21) -- at k=1 the guess is that row's price.
This rule barely deserves the name "machine" -- it builds nothing ahead of time. It
just keeps the whole working pile on a shelf and, when a new house walks in, finds the
ones most like it and averages what they sold for. The one knob to set by hand is k:
how many neighbours to ask. Small k -> too jumpy; large k -> too stiff. The full
derivation, and how to find the sweet spot, is in Part 2.
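For the day you hand the slate to a computer, here is a small sketch of the same rule on the three working rows from this section. The function names are mine, and the three prices are made up for illustration (the worked examples above never gave any), so treat them as placeholders.

```python
import math

def gap(a, b):
    """Square each column's difference, add them up, take the root."""
    return math.sqrt(sum((aj - bj) ** 2 for aj, bj in zip(a, b)))

def ask_closest(new_row, working_rows, prices, k):
    """Average the right-answers of the k working rows with the smallest gap."""
    order = sorted(range(len(working_rows)),
                   key=lambda i: gap(new_row, working_rows[i]))
    nearest = order[:k]
    return sum(prices[i] for i in nearest) / k

new_row = (4, 3)                          # (rooms, income)
working_rows = [(5, 3), (8, 9), (4, 7)]   # the three rows worked by pencil above
prices = [1.4, 2.8, 0.9]                  # HYPOTHETICAL prices, for illustration only

print([round(gap(new_row, r), 2) for r in working_rows])   # [1.0, 7.21, 4.0]
print(ask_closest(new_row, working_rows, prices, k=1))     # 1.4
print(round(ask_closest(new_row, working_rows, prices, k=2), 2))
```

At k=2 the rule averages the rows at gap 1 and gap 4, ignoring the far one at 7.21.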
Count the clerk-steps for ONE guess, all 8 columns this time: per working row, 8
subtractions, 8 squarings, 7 additions, 1 root = 24 strokes; times 16,512 working
rows = 16,512 x 24 = 396,288 strokes -- call it four hundred thousand pencil strokes
for a single new house. A room of clerks does it by lunch, but remember the bill:
it comes due every time another house walks in.
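The same bill, tallied in a few lines so you can change the column or row count and watch it move. The function name `strokes_for_one_guess` is hypothetical; the counting is exactly the one above (and, like the text, it leaves out the work of picking the k smallest gaps).

```python
def strokes_for_one_guess(n_columns, n_working_rows):
    """Pencil strokes to measure the gap from one new row to every working row."""
    per_row = (n_columns          # subtractions
               + n_columns        # squarings
               + (n_columns - 1)  # additions to sum the squares
               + 1)               # one square root
    return per_row * n_working_rows

print(strokes_for_one_guess(8, 1))        # 24
print(strokes_for_one_guess(8, 16_512))   # 396288
```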
## Measuring How Wrong: RMSE
A guesser is only as good as its misses are small -- so before trusting any rule we
need an honest way to size up how wrong it is. Leftover for one row: r_i = y_i - yhat_i
(right answer minus guess).
RMSE = sqrt( (1/n) sum_i r_i^2 )
A concrete 4-person worked example, by pencil:
person truth y guess yhat miss r r^2
--------------------------------------------------
A 1.4 1.2 +0.2 0.04
B 2.8 3.1 -0.3 0.09
C 0.9 1.0 -0.1 0.01
D 3.5 3.2 +0.3 0.09
sum r^2 = 0.23
mean r^2 = 0.23 / 4 = 0.0575
RMSE = sqrt(0.0575) ~ 0.24
Typical miss: about $24,000. The bar to beat is the FOOL who always shouts the
average. Work the fool out on the same 4 people. His one guess: average y =
(1.4+2.8+0.9+3.5)/4 = 8.6/4 = 2.15, shouted four times:
person truth y fool's guess miss r r^2
----------------------------------------------------
A 1.4 2.15 -0.75 0.5625
B 2.8 2.15 +0.65 0.4225
C 0.9 2.15 -1.25 1.5625
D 3.5 2.15 +1.35 1.8225
sum r^2 = 4.3700
mean r^2 = 4.37 / 4 = 1.0925 fool's RMSE = sqrt(1.0925) ~ 1.05
Notice what we just computed: squared gaps from the average, averaged, rooted --
that is the SPREAD of the truth column itself. The fool's RMSE and std(y) are
the same recipe; they cannot help but be the same number. (Set the clerks on the
full 20,640-row column with these exact steps and the fool lands near 1.15.)
Our rule at RMSE = 0.24 beats the fool four times over.
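Both tables above fit in a short sketch, standard library only. The function name `rmse` is my own; `statistics.pstdev` is the divide-by-n spread, which I am assuming is the std(y) meant here because it matches the pencil recipe step for step.

```python
import math
import statistics

def rmse(truth, guesses):
    """miss -> square -> mean -> root."""
    misses = [y - yhat for y, yhat in zip(truth, guesses)]
    return math.sqrt(sum(r * r for r in misses) / len(misses))

truth = [1.4, 2.8, 0.9, 3.5]              # persons A-D
rule_guesses = [1.2, 3.1, 1.0, 3.2]       # our rule's guesses

fool_guess = sum(truth) / len(truth)      # the fool shouts the average
fool_guesses = [fool_guess] * len(truth)

rule_rmse = rmse(truth, rule_guesses)
fool_rmse = rmse(truth, fool_guesses)

print(round(rule_rmse, 2))                # 0.24
print(round(fool_rmse, 2))                # 1.05
# The fool's RMSE is the spread of the truth column itself:
print(math.isclose(fool_rmse, statistics.pstdev(truth)))   # True
```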
>> YOUR TURN
Person E walks in: truth 2.0, our rule guesses 2.4. Add him to OUR table
(sum r^2 was 0.23 over 4 people) and recompute the rule's RMSE.
check your slate: miss r = 2.0 - 2.4 = -0.4; r^2 = 0.16; new sum = 0.23 +
0.16 = 0.39; mean = 0.39 / 5 = 0.078; RMSE = sqrt(0.078) ~ 0.28. One sloppy
guess dragged the typical miss from $24,000 up to $28,000 -- squaring makes big
misses shout.
RMSE punishes large single misses more heavily than the plain average of absolute
leftovers, MAE = mean|r_i|. Check: person B missed by -0.3, squared to 0.09.
A single double-sized miss of -0.6 would square to 0.36 -- four times the weight.
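To see the two rulers side by side, here is a sketch on the five-person table (A to D plus person E from the drill). The variable names are mine; the numbers are the ones already worked above.

```python
import math

truth   = [1.4, 2.8, 0.9, 3.5, 2.0]   # persons A-E
guesses = [1.2, 3.1, 1.0, 3.2, 2.4]   # our rule's guesses, E included

misses = [y - yhat for y, yhat in zip(truth, guesses)]

rmse = math.sqrt(sum(r * r for r in misses) / len(misses))
mae = sum(abs(r) for r in misses) / len(misses)

print(round(rmse, 2))   # 0.28
print(round(mae, 2))    # 0.26
```

RMSE sits above MAE because person E's miss of 0.4 counts for 0.16 of the 0.39 squared total, but only 0.4 of the 1.3 absolute total.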
## Rule 2 -- The Straight-Stick Rule
IN HAND: one guessing rule (ask-closest: measure gaps, average the k nearest answers)
and one honest ruler (RMSE: miss -> square -> mean -> root). Now a second rule of the
opposite temperament.
d+1 dials: [nudge b0, dial1 b1, dial2 b2, ...]
guess = b0 + b1*(rooms) + b2*(income) + ...
= X*beta (one dot-product per row)
Where the ask-closest rule hoards every row, this one squeezes the whole working pile
into just d+1 dials (d is the number of measured columns: 8 here, so 9 dials), throws
the pile away, and from then on each guess is a single dot-product. The dials are not
searched for -- they drop out of an exact formula. Full derivation in Part 3.
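A taste of "dials from an exact formula", sketched for the simplest case of ONE measured column. I am assuming the formula in question is ordinary least squares, whose one-column form is slope = (co-movement of x and y) / (spread of x); Part 3 may present it differently. The function names and the four data points are hypothetical, chosen so the prices lie exactly on a stick.

```python
def fit_one_column(xs, ys):
    """Least-squares nudge b0 and dial b1 for a single measured column."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

def guess(dials, row):
    """b0 + b1*x1 + b2*x2 + ... : one dot-product per row."""
    b0, *slopes = dials
    return b0 + sum(b * x for b, x in zip(slopes, row))

incomes = [1.0, 2.0, 3.0, 4.0]   # HYPOTHETICAL working pile, one column
prices  = [1.5, 2.0, 2.5, 3.0]   # HYPOTHETICAL: exactly 1.0 + 0.5 * income

dials = fit_one_column(incomes, prices)
print(dials)                     # (1.0, 0.5)
print(guess(dials, [5.0]))       # 3.5
```

Once the two dials are in hand, the four working rows are never consulted again; the guess for income 5.0 needs one multiplication and one addition.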
## Picking Settings Without Peeking at the Hidden Pile
IN HAND: two rules (ask-closest with its knob k; straight-stick with its d+1 dials),
one ruler (RMSE), and one drawer that must stay shut. Left to do: pick k -- without
the drawer.
working pile split into 5 chunks: A B C D E
round 1 [#] . . . . -> mistake1
round 2 . [#] . . . -> mistake2
round 3 . . [#] . . -> mistake3
round 4 . . . [#] . -> mistake4
round 5 . . . . [#] -> mistake5
average -> honest estimate
Each round, one chunk grades the rule built on the other four. Rotate until every chunk
has graded once. Average the five mistakes. This gives an honest estimate without ever
touching the hidden pile.
!! WARN: PREPARE NUMBERS INSIDE EACH ROUND
Any step that learns from the numbers (an average, a spread) must be done on each
round's building portion only. Doing it on the whole working pile before splitting
leaks grading-chunk numbers into the build step. A bound pipeline enforces this
automatically (Part 2).
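Here is a hand-rolled sketch of five rotating rounds with the preparation done inside each round, standard library only. Everything in it is an assumption for illustration: the sixty synthetic rows, the helper names, and the choice to prepare numbers by subtracting each column's average and dividing by its spread. It is not the toolbox's own implementation.

```python
import math
import random

def gap(a, b):
    return math.sqrt(sum((aj - bj) ** 2 for aj, bj in zip(a, b)))

def knn_guess(new_row, rows, answers, k):
    order = sorted(range(len(rows)), key=lambda i: gap(new_row, rows[i]))[:k]
    return sum(answers[i] for i in order) / k

def column_stats(rows):
    """Average and spread of each column, learned from THESE rows only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    spreads = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
               for c, m in zip(cols, means)]
    return means, spreads

def rescale(rows, means, spreads):
    return [[(v - m) / s for v, m, s in zip(row, means, spreads)] for row in rows]

def rotating_fold_rmse(rows, answers, k, n_folds=5):
    fold_rmses = []
    for f in range(n_folds):
        grade_idx = list(range(f, len(rows), n_folds))   # every n_folds-th row
        grade_set = set(grade_idx)
        build_idx = [i for i in range(len(rows)) if i not in grade_set]
        build_rows = [rows[i] for i in build_idx]
        build_y = [answers[i] for i in build_idx]
        # Preparation learns from the building portion ONLY ...
        means, spreads = column_stats(build_rows)
        build_scaled = rescale(build_rows, means, spreads)
        # ... and is then applied, unchanged, to the grading chunk.
        grade_scaled = rescale([rows[i] for i in grade_idx], means, spreads)
        sq = [(answers[i] - knn_guess(r, build_scaled, build_y, k)) ** 2
              for i, r in zip(grade_idx, grade_scaled)]
        fold_rmses.append(math.sqrt(sum(sq) / len(sq)))
    return sum(fold_rmses) / n_folds

# SYNTHETIC working pile: (rooms, people) -> price driven by rooms plus noise.
rng = random.Random(42)
rows = [[rng.uniform(2, 8), rng.uniform(100, 1000)] for _ in range(60)]
answers = [0.5 * rooms + rng.gauss(0, 0.2) for rooms, _ in rows]

for k in (1, 3, 5, 9):
    print(k, round(rotating_fold_rmse(rows, answers, k), 3))
```

The rows are generated in random order, so taking every fifth row is already a fair chunk; on a sorted sheet you would shuffle first.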
That single discipline -- never let the drawer influence a choice -- is the spine the
whole blog hangs on. Procedure: pick settings using rotating folds on the working pile
-> refit the winner on all of the working pile -> read the hidden pile ONCE -> report
that number, and live with it.
>> NOTE: ISN'T CHECKING THE HIDDEN PILE PEEKING?
No -- using it to pick among rules is peeking; using it ONCE to report the chosen
rule's mistake is exactly what it is for. A hidden-pile mistake clearly larger than
the rotating-fold estimate usually means the rule memorised the working pile a little.
## The Whole Blog in Six Lines
Everything above, boiled down to six lines. Pin them somewhere; every later post is one
of these lines, slowed down and worked out in full.
1. Find a rule that guesses right on rows never seen; measure on the hidden pile only.
2. Ask-closest rule -- store the pile, look it up; put columns on the same ruler;
pick k.
3. Straight-stick rule -- d+1 dials, exact formula, one dot-product per guess.
4. Measure mistake with RMSE (same units as price); the bar to beat is std(y).
5. Pick settings by rotating folds, preparing numbers inside each round.
6. Refit on all of working pile; read the hidden pile exactly once.
## Common Tripwires I Caught
TRIPWIRE 1: .shape is a PAIR, not a product
WRONG: treat 20640 x 9 as 185760 (the product).
RIGHT: df.shape returns (n_rows, n_cols) in that fixed order --
(20640, 9) -- never their product.
TRIPWIRE 2: random_state must be the SAME number everywhere
WRONG: use random_state=1 here and random_state=42 there.
RIGHT: use RANDOM_STATE = 42 everywhere. Same seed -> same shuffle
-> your splits match the grader's splits.
TRIPWIRE 3: k (neighbour count) and fold count (grading rounds) are
unrelated
WRONG: "k=25 with 5 folds sounds wrong."
RIGHT: k is how many neighbours to ask; 5 folds is how many chunks
to grade with. Different jobs, different numbers.
TRIPWIRE 4: "Most negative" = MIN, not MAX
WRONG: -0.04 looks bigger than -0.007, so pick -0.007.
RIGHT: -0.04 is further below zero = SMALLER = most negative.
min(dials, key=dials.get) returns the NAME of the smallest.
TRIPWIRE 5: The toolbox maximises, so it negates the mistake
WRONG: read the raw cross_val_score numbers as mistakes and pick the
lowest (most negative) one.
RIGHT: cross_val_score(..., scoring='neg_root_mean_squared_error')
returns NEGATIVE numbers. The "biggest" (closest to zero) is
the smallest mistake. Un-negate with -scores.mean().
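Tripwires 4 and 5 are easiest to believe once you have watched them run. A small sketch with plain dictionaries; every number in it is MADE UP for illustration (these are not results from the housing sheet), and the names `dials` and `neg_scores` are hypothetical.

```python
# Tripwire 4: "most negative" is the MIN, and min(..., key=...) returns the NAME.
dials = {"rooms": -0.04, "income": 0.45, "age": -0.007}   # made-up dial values
most_negative = min(dials, key=dials.get)
print(most_negative)                  # rooms

# Tripwire 5: negated mistakes. Closest to zero = largest = smallest mistake.
neg_scores = {5: -0.66, 10: -0.64, 25: -0.65}   # made-up negated RMSEs, keyed by k
best_k = max(neg_scores, key=neg_scores.get)
print(best_k, -neg_scores[best_k])    # 10 0.64
```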
## The Labels, Last
Now -- and only now that you have built the thing -- here are the proper names. Every
post on this blog ends this way: plain words on the left, the textbook term on the
right. Practice it first; the label is just a handle to carry it by.
Plain term used above Standard label
----------------------------------- ------------------------------------------
sheet of written-down numbers dataset / design matrix X
one measured column feature
right-answer column target / label y
working pile training set
hidden pile test set
store the pile, look it up non-parametric / instance-based
ask-closest rule k-nearest neighbours (KNN)
straight-stick rule linear regression (OLS)
dials weights / coefficients beta
fixed nudge intercept / bias beta0
put columns on the same ruler standardisation / normalisation
leftover / mistake (RMSE) residual / loss / error
rotating folds k-fold cross-validation
memorising the working pile overfitting
mistake on rows never seen generalisation error
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same six steps, spoken in Python.
>> NEW TO PYTHON? Each named once:
fetch_california_housing() -- loads the sheet (no file needed)
train_test_split() -- carves off the hidden 20%
StandardScaler() -- puts every column on the same ruler
KNeighborsRegressor(n_neighbors=k) -- the ask-closest machine
LinearRegression() -- the straight-stick machine
cross_val_score() -- rotating-folds scorer (negated RMSE when asked for it via scoring=)
np.sqrt() -- square root
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
RANDOM_STATE = 42 # same seed everywhere so your splits match mine
# --- load the sheet ---
data = fetch_california_housing()
X, y = data.data, data.target # X is (20640, 8), y is (20640,)
print(X.shape, y.shape)
# --- carve off the hidden 20% ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE)
print(f"working pile: {X_train.shape[0]} rows")
print(f"hidden pile: {X_test.shape[0]} rows")
# --- the always-average fool (baseline) ---
# np.std divides by n by default: the same recipe as the fool's RMSE by pencil
baseline_rmse = np.std(y_train)
print(f"baseline RMSE (fool): {baseline_rmse:.4f}")
# --- ask-closest rule ---
knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=25))
])
knn_pipe.fit(X_train, y_train)
knn_preds = knn_pipe.predict(X_test)
knn_rmse = np.sqrt(np.mean((y_test - knn_preds)**2))
print(f"KNN RMSE (test): {knn_rmse:.4f}")
# --- straight-stick rule ---
lr_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LinearRegression())
])
lr_pipe.fit(X_train, y_train)
lr_preds = lr_pipe.predict(X_test)
lr_rmse = np.sqrt(np.mean((y_test - lr_preds)**2))
print(f"Linear RMSE (test): {lr_rmse:.4f}")
# --- pick k by rotating folds (never touch the hidden pile here) ---
cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
for k in [5, 10, 25, 50]:
    pipe = Pipeline([("scale", StandardScaler()),
                     ("knn", KNeighborsRegressor(n_neighbors=k))])
    scores = cross_val_score(pipe, X_train, y_train,
                             cv=cv,
                             scoring="neg_root_mean_squared_error")
    # scores are negative -- closest to zero is the smallest mistake
    print(f"k={k:3d} CV RMSE = {-scores.mean():.4f}")
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 1 -- Predicting House Prices):
Part 1 (this post) .
Part 2 -- Ask-Closest Rule .
Part 3 -- Straight-Stick Rule