==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------
CHAPTER 1 . PREDICTING HOUSE PRICES . PART 3 OF 3
The Straight-Stick Rule: Setting the Dials, Checking the Leftovers, Getting Numbers Right
Posted: 2026-06-01 . Author: Rahul Rai . Tags: linear-regression, ols, least-squares
============================================================================================
PATH . post 3 of 28
<- prev: House Prices 2: Ask the Closest Rows
next: Grading 1: Two Rulers, MSE and R^2 ->
The ask-closest rule never really learned anything -- it just kept the answer key handy.
This rule is its opposite, and its better-behaved sibling. It actually studies the pile,
distils it down to a handful of dials, and then throws the pile away. From that point on
it carries no baggage: every guess is a quick weighted sum of the columns plus a fixed
nudge.
And here is the part that still feels like magic the first time you see it. The best
possible dials are not hunted for by trial and error -- they fall straight out of one
exact formula. No searching, no luck. This post derives that formula by hand on three
rows, then shows it is the very same arithmetic the toolbox runs on sixteen thousand.
## The Rule
one row of measured things: [rooms, income, age, ...]
x
one dial per column: [b1, b2, b3, ...]
+ fixed nudge b0
= guess yhat
all rows at once: yhat = X*beta
(X has a column of 1s prepended to carry b0)
yhat_i = b0 + sum_j beta_j * x_ij <=> yhat = X*beta
Each dial beta_j is the change in the guess per unit change in column j, all other
columns held fixed. The columns can be squares, logs, or products of raw columns -- the
rule stays linear in the dials.
## The Squared Leftover to Shrink
\ /
\ / height = total squared leftover J(beta)
\ / bottom = the best dials beta
\ . /
Pick the dials that shrink the squared leftover sum:
J(beta) = ||X*beta - y||^2 = sum_i (yhat_i - y_i)^2
J is bowl-shaped -- its second-slope (Hessian) 2*X^T X is never-negative -- so any flat
point is the global bottom. No local traps, no searching.
## Finding the Dials: The Flat-Point Equations
Set the slope of the bowl to zero:
grad J = 2*X^T (X*beta - y) = 0 => X^T X beta = X^T y => beta = (X^T X)^-1 X^T y
Worked by hand at one column (d = 1) on three rows. First, build the raw table:
x = 1, 2, 3 y = 2, 3, 5
row x y x^2 x*y
---------------------------
1 1 2 1 2
2 2 3 4 6
3 3 5 9 15
---------------------------
sum 6 10 14 23
X has a column of 1s for the nudge, plus the x column:
X = [ 1 1 X^T X = [ 3 6 X^T y = [ 10
1 2 6 14 ] 23 ]
1 3 ]
The system X^T X beta = X^T y:
3*b + 6*w = 10 (equation 1)
6*b + 14*w = 23 (equation 2)
Solve by elimination. Multiply equation 1 by 2:
6*b + 12*w = 20 (equation 1, doubled)
Subtract from equation 2:
(6*b + 14*w) - (6*b + 12*w) = 23 - 20
2*w = 3
w = 3/2 = 1.5 <- the dial on x
Put w back into equation 1:
3*b + 6*(3/2) = 10
3*b + 9 = 10
3*b = 1
b = 1/3 ~ 0.333 <- the fixed nudge
stick: yhat = 1/3 + (3/2)*x
yhat = 0.333 + 1.5*x
Check at each x:
x=1: yhat = 0.333 + 1.5*1 = 1.833
x=2: yhat = 0.333 + 1.5*2 = 3.333
x=3: yhat = 0.333 + 1.5*3 = 4.833
>> YOUR TURN
New pile (made-up), same recipe: x = 1, 2, 3 and y = 1, 3, 5. Build the four
sums, write the two flat-point equations, and solve for the dials.
check your slate: sum x = 1+2+3 = 6; sum y = 1+3+5 = 9; sum x^2 = 1+4+9 = 14;
sum x*y = 1 + 6 + 15 = 22. Equations: 3*b + 6*w = 9 and 6*b + 14*w = 22.
Double the first: 6*b + 12*w = 18; subtract: 2*w = 22 - 18 = 4, so w = 2; put
back: 3*b + 6*2 = 9, 3*b = 9 - 12 = -3, b = -1. Stick: yhat = -1 + 2*x, which
hits 1, 3, 5 exactly -- every leftover is zero; this stick runs straight through
all three points.
## Checking the Leftovers
IN HAND: three rows (x = 1, 2, 3 against y = 2, 3, 5) and the dials the flat-point
solve dropped out: 2*w = 3 so w = 3/2 = 1.5, then 3*b = 10 - 6*1.5 = 10 - 9 = 1 so
b = 1/3 ~= 0.333. The stick reads yhat = 0.333 + 1.5*x. This section adds the test
every honest fit must pass: what its leftovers obey.
At the best dials, X^T (y - X*beta) = 0, written X^T e = 0: the leftover column sits at
right-angles to every measured column. Because one column is all ones, the leftovers
also sum to zero.
x=1: yhat=1.833, leftover = +0.167
x=2: yhat=3.333, leftover = -0.333
x=3: yhat=4.833, leftover = +0.167
sum leftover = 0 ok (right-angles to the ones column)
sum x * leftover = 0 ok (right-angles to x column)
RMSE, worked: the exact leftovers are 1/6, -1/3, 1/6; squares 1/36 + 4/36 + 1/36
= 6/36 = 1/6; mean = (1/6) / 3 = 1/18 ~= 0.0556; RMSE = sqrt(1/18) ~= 0.236
>> YOUR TURN
Verify the second right-angle check on your own slate: multiply each leftover by
its x and add. Use the exact fractions: leftovers 1/6, -1/3, 1/6.
check your slate: 1*(1/6) = 1/6; 2*(-1/3) = -2/3 = -4/6; 3*(1/6) = 3/6;
sum = (1 - 4 + 3)/6 = 0/6 = 0. The leftover column sits at right-angles to the
x column: the stick has already squeezed out everything x can explain.
Geometrically: yhat = H*y where the flat-shadow thrower H = X (X^T X)^-1 X^T projects y
onto the flat sheet spanned by the measured columns. H is symmetric and squaring it
gives itself back (H^2 = H). trace(H) equals the number of fitted dials -- here the
nudge b plus the dial w, 1 + 1 = 2. No point on the flat sheet sits closer to y.
** KEY: SAME WORK, BIGGER SHEET
The three-row pencil solve is identical to what the toolbox runs on 16,000-odd rows
(the working pile: 20,640 x 8/10 = 16,512) and 8 columns -- only the size of X^T X
changes. Fitted dials land in m.coef_, the fixed
nudge in m.intercept_.
## When the Straight-Stick Rule Is the Best Possible
The Gauss-Markov result: if the leftover has zero average given the rows (E[eps|X] = 0),
the same spread everywhere, leftovers do not echo each other, and X has full column rank
-- then the straight-stick rule is BLUE: the lowest-spread rule among all flat, honest
rules.
- Bell-curve shape of leftovers is NOT required for BLUE. You need it only for exact
confidence ranges and t-tests, not for the rule to be the best flat rule.
- Full column rank means no measured column is a recipe of the others. If one is,
X^T X is flat (no single solution) and (X^T X)^-1 does not exist.
## Getting the Dials Right: Don't Flip the Sheet
IN HAND: the exact formula beta = (X^T X)^-1 X^T y, proved bowl-bottomed and checked
by hand -- three rows gave w = 3/2 = 1.5 and b = 1/3 ~= 0.333, with leftovers
1/6 - 2/6 + 1/6 = 0 summing to zero. This section adds the bill for the solve, and
the safe way to run it.
build X^T X: O(n*d^2) <- n rows, each d^2 multiplications
solve for beta: O(d^3)
total: O(n*d^2 + d^3)
Pencil that bill in clerk-steps on our pile. With the ones column, d = 8 + 1 = 9, so
each row costs 9 x 9 = 81 multiplications; the working pile's 20,640 x 8/10 = 16,512
rows cost 16,512 x 81 = 1,337,472 strokes to build X^T X, and the solve adds about
9 x 9 x 9 = 729 more. A room of clerks clears it in a day or two -- ONCE. After that
every guess is 8 multiplications + 8 additions = 16 strokes: one clerk, one breath.
Do NOT compute (X^T X)^-1 by flipping the sheet directly. Forming X^T X squares the
shakiness number (kappa(X^T X) = kappa(X)^2) and loses precision. Instead:
- Solve the flat-point equations directly (Cholesky cut of X^T X), or better,
- Cut X itself with QR or SVD and solve without ever forming X^T X. The toolbox's
LinearRegression uses an SVD-based approach.
- For very many rows or a stream, roll downhill a little at a time: O(n*d) per step
instead of a full cut.
- If X^T X is flat or shaky (one column is a near-recipe of another), add a small bump
to the diagonal -- the ridge rule: beta = (X^T X + lambda*I)^-1 X^T y -- solvable
for any lambda > 0.
>> NOTE: SAME-RULER CONTRAST WITH ASK-CLOSEST RULE
The straight-stick rule does NOT care which ruler you used. Rescale a column and its
dial rescales inversely -- guesses and RMSE stay the same. (The ask-closest rule
breaks without same-ruler treatment.) You DO need same-ruler when rolling downhill or
when adding a bump, because those care about the size of the dials, which depends on
scale.
## Ask-Closest vs. Straight-Stick
IN HAND: both rules of this chapter, built and checked. Ask-closest keeps the whole
working pile (20,640 x 8/10 = 16,512 rows) and pays 16,512 x 24 = 396,288 strokes per
guess; straight-stick keeps 8 + 1 = 9 dials and pays 8 multiplications + 8 additions
= 16 strokes per guess. This section adds the two temperaments side by side.
Ask-Closest Rule (KNN) Straight-Stick Rule (OLS)
------------------ ------------------------ --------------------------
What it stores whole pile d+1 dials only
Build cost O(1) (store only) O(n*d^2 + d^3) (solve)
Guess cost O(n*d) per row O(d) per row
Setting by hand k (by rotating folds) none (lambda if adding bump)
Same-ruler needed? yes -- required no -- not for plain rule
Guess surface piecewise-flat, local one flat sheet, global
Two rules, opposite temperaments, one question: given the working rows, guess the right
answer for a new one. The ask-closest rule leans on stored neighbours and does its
thinking at the last second; the straight-stick rule does its thinking up front and
compresses everything into one flat sheet. Neither is "the" answer -- which to reach for
depends on the pile in front of you, and that judgement is the real craft.
## Going Deeper
- flat-point equations, ordinary least squares -- the solve
- QR / SVD least squares, shakiness number (condition number) -- the numerics
- flat-shadow thrower (hat matrix), right-angle leftovers -- the geometry
- Gauss-Markov, uneven miss spread -- the assumptions
- ridge, lasso -- when columns are near-recipes of each other
1. Rule: yhat = X*beta; set dials by shrinking ||X*beta - y||^2.
2. Bowl-shaped mistake -> exact dials from beta = (X^T X)^-1 X^T y.
3. Leftovers sit at right-angles to measured columns; yhat = H*y is a flat-sheet
shadow.
4. Straight-stick rule is BLUE under Gauss-Markov; bell-curve leftovers needed only
for tests.
5. Cut the sheet (QR/SVD), don't flip it; add a bump when X^T X is flat.
## Common Tripwires I Caught
TRIPWIRE 1: MSE has NO root in Lab 2
WRONG: tack on ** 0.5 to mean_squared_error(...).
RIGHT: MSE = miss -> square -> mean -> STOP. No root.
RMSE has the root. MSE does not. Read the name.
TRIPWIRE 2: leftover is the BAD part, not the good part
WRONG: "leftover = 2, that's the stick's score."
RIGHT: leftover is the misery STILL LEFT after the stick tried.
The good part is total wobble - leftover.
R^2 = (total wobble - leftover) / total wobble.
TRIPWIRE 3: R^2 divide by the WHOLE BAR, not the count
WRONG: R^2 = (total wobble - leftover) / 5 (row count).
RIGHT: R^2 = (total wobble - leftover) / total wobble.
The count cancels, which is WHY R^2 has no units.
TRIPWIRE 4: R^2 always rises when you add a column -- on the training
pile only
WRONG: "I added a noise column and R^2 went up -- the machine is better."
RIGHT: On the training pile, the old fit is still available, so it can
only do at least as well. Use the TEST R^2 or adjusted R^2.
TRIPWIRE 5: R^2 can go negative on the test pile
WRONG: "R^2 < 0 must be a bug."
RIGHT: On unseen rows, no right-angle guarantee holds. The leftover
CAN exceed the wobble. R^2 = 1 - RSS/TSS can dip below 0.
TRIPWIRE 6: "Most negative" dragging column = MIN, not MAX
WRONG: -0.04 is bigger than -0.007 by the number 4 vs 7.
RIGHT: -0.04 is further below zero = SMALLER = most negative.
min(dials, key=dials.get) returns the column NAME.
TRIPWIRE 7: Raw dials are NOT comparable across columns (different
units)
WRONG: compare -0.007 per pound directly with -0.04 per horsepower.
RIGHT: Put every column on one shared ruler first (StandardScaler).
Then each dial means "answer-units per one-spread step."
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
The whole pencil derivation above is what the toolbox does for you in four lines. Notice
there is no setting to tune and no rotating-fold hunt here -- the dials come straight from
the exact formula, so LinearRegression simply solves and stops.
>> NEW TO PYTHON? One new piece here:
thing.coef_ -- a value the object stores after being fitted; sklearn marks
every learned-from-data value with a trailing underscore
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
m = LinearRegression() # SVD-based flat-point solve under the hood
m.fit(X_train, y_train) # sets dials: X^T X beta = X^T y
pred = m.predict(X_test) # one dot-product per row
rmse = mean_squared_error(y_test, pred) ** 0.5
print(round(rmse, 3), m.coef_, m.intercept_)
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 1 -- Predicting House Prices):
Part 1 -- The Full Picture .
Part 2 -- Ask-Closest Rule .
Part 3 (this post)
Next chapter: Chapter 2 -- Grading a Guesser (MSE, R^2, reading the dials)
<- Back to all posts
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
home . source on GitHub
==============================================================================================