==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
CHAPTER 2 . GRADING A GUESSER . PART 1 OF 2
Two Rulers for One Guess: MSE and R^2
Posted: 2026-06-04 . Author: Rahul Rai . Tags: evaluation, mse, r-squared, regression
============================================================================================
We have built two guessers. Now comes the question that quietly decides everything, and
that beginners skip right past: once the stick has drawn its line -- WAS THE LINE ANY
GOOD? A guess you cannot grade is a guess you cannot trust.
Same straight-stick rule as before -- a dial on each column, multiply, add, plus a fixed
nudge -- but a new sheet to play on: cars now, not houses. Two different rulers will
step up to grade the very same guesses, and here is the twist that makes the post worth
reading: they hand back two different verdicts, and BOTH are right. One tells you how
big the misses are. The other tells you whether you have beaten the village idiot.
## Same Stick, New Sheet
displacement  horsepower  weight  acceleration  |  MPG   <- right answer
------------------------------------------------+------
     307          130      3504       12.0      |   18
     350          165      3693       11.5      |   15
     ...                                        |  ...
          one row = one car                     |  the answer
guess = d1*disp + d2*horse + d3*weight + d4*accel + nudge
Pick four columns, guess the fifth. The machinery is exactly the straight-stick rule --
set the dials by shrinking the squared leftover, one dot-product per guess. Nothing new
to build. The new work is grading.
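One guess, spelled out in Python. This is only a sketch: the dial values and the nudge are made up for illustration (they are not fitted to any sheet), and `guess_mpg` is a hypothetical name.

```python
# Made-up dials, one per column, plus a made-up nudge (NOT fitted values).
DIALS = {"displacement": -0.01, "horsepower": -0.04, "weight": -0.005, "acceleration": 0.1}
NUDGE = 45.0


def guess_mpg(car):
    """Straight-stick rule: multiply each column by its dial, add up, add the nudge."""
    return sum(DIALS[col] * car[col] for col in DIALS) + NUDGE


# First row of the sheet above (right answer: 18 mpg).
first_car = {"displacement": 307, "horsepower": 130, "weight": 3504, "acceleration": 12.0}
guess = guess_mpg(first_car)
miss = 18 - guess  # right answer minus guess
print(round(guess, 2), round(miss, 2))
```

With these invented dials the guess lands a little above the right answer, so the miss is negative. Real dials would come from shrinking the squared leftover on a training pile.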
## Ruler One: MSE -- Just the Size of the Miss
Earlier the mistake-score was RMSE. Four steps, in order:
miss -> square -> mean -> root = RMSE
miss -> square -> mean -> STOP = MSE
Stare at the name and it gives itself away:
M   S   E
|   |   +-- Error     the miss (y - guess)
|   +------ Squared   kill the sign
+---------- Mean      average them
No R. No Root. So do the first three steps and stop -- leave off the ** 0.5 you tacked
on last time.
MSE = (1/n) sum_i (y_i - yhat_i)^2
>> NOTE: WHAT MSE IS, EXACTLY
MSE is RMSE squared, so it lives in SQUARED answer-units -- "miles per gallon,
squared," which no one can picture. That is the price of dropping the root. What you
buy: MSE is the exact thing the straight-stick rule shrinks when it sets the dials, so
it is the rule's own home ruler. RMSE is just MSE pulled back into plain units for a
human to read. Same ordering of which fit is better -- the root is a rising step, so
it never swaps the order.
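A small sketch of both rulers side by side, on five made-up answers and two made-up sets of guesses. The function names are my own, and no library does the work here.

```python
import math


def mse(truths, guesses):
    """miss -> square -> mean -> STOP."""
    squared_misses = [(y - g) ** 2 for y, g in zip(truths, guesses)]
    return sum(squared_misses) / len(squared_misses)


def rmse(truths, guesses):
    """Same three steps, then the root, back in plain answer-units."""
    return math.sqrt(mse(truths, guesses))


# Five made-up answers and two made-up sets of guesses.
truths = [2, 4, 9, 3, 7]
fit_a = [2.5, 4.5, 8, 3.5, 6.5]
fit_b = [3, 5, 8, 2, 6]

print(mse(truths, fit_a), mse(truths, fit_b))    # 0.4 1.0
print(rmse(truths, fit_a), rmse(truths, fit_b))  # the roots of the two numbers above
# The root never swaps the order: whichever fit wins on MSE also wins on RMSE.
assert (mse(truths, fit_a) < mse(truths, fit_b)) == (rmse(truths, fit_a) < rmse(truths, fit_b))
```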
Count the clerk-steps for MSE on a whole sheet of 400 cars (made-up count): per car,
1 subtraction and 1 squaring = 2 strokes, so 400 x 2 = 800; then 399 additions to sum
and 1 division by the count -- 800 + 399 + 1 = 1,200 strokes. One clerk clears it in
an hour; the room, before the coffee cools.
## The Problem MSE Can't Solve
"the guess was off by 9."
good? bad? ...compared to WHAT?
A raw miss-size has no scale of its own. Off by 9 mpg is dreadful for a car; off by 9
feet is wonderful for an archer firing across a valley. The number alone won't say. You
need something to measure it AGAINST. That is the second ruler.
## Ruler Two: R^2 -- A Race Against a Fool
IN HAND: a sheet of cars, a stick that guesses mpg from four columns, and one ruler --
MSE: miss -> square -> mean -> stop -- which sizes the misses but cannot say whether
the size is good. This section adds the second ruler: a race against an opponent.
A score needs something to be scored against, so R^2 picks the dumbest honest opponent
imaginable and makes the stick race it. Meet the fool. He has not looked at a single
column -- he does not even know the cars have engines. All he has is a pile of past
answers and no clues at all:
answers: 2 4 9 3 7
The fool must shout one number for every car, no matter what. What is his safest shout?
The AVERAGE -- (2+4+9+3+7)/5 = 25/5 = 5 -- not because the fool is clever (he is not),
but because the average is the one spot that sits at the very bottom of the
squared-miss bowl. Any other single number he could pick misses more:
shout 5 -> miss^2 = 9 + 1 + 16 + 4 + 4 = 34
shout 6 -> miss^2 = 16 + 4 + 9 + 9 + 1 = 39 (worse)
** KEY: WHY THE AVERAGE IS THE FLOOR
Shout some number c. Total squared miss is f(c) = sum (y_i - c)^2. Tip it to find the
bottom: f'(c) = -2 sum (y_i - c) = 0. Divide both sides by -2: sum (y_i - c) = 0.
That sum is sum y_i minus n copies of c, so sum y_i - n*c = 0, i.e. c = (1/n) sum y_i
-- the average, exactly.
The curve is a bowl (second slope 2n > 0), so that point is the lowest, not the
highest. The fool's best constant shout is always the mean.
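The same argument can be checked with numbers rather than symbols. A minimal sketch on the five answers from above: the slope is zero at the average, and completing the square shows the bowl shape directly. The helper names are hypothetical.

```python
answers = [2, 4, 9, 3, 7]
n = len(answers)
mean = sum(answers) / n


def total_squared_miss(c):
    """f(c) = sum (y_i - c)^2 for one constant shout c."""
    return sum((y - c) ** 2 for y in answers)


def slope(c):
    """f'(c) = -2 * sum (y_i - c)."""
    return -2 * sum(y - c for y in answers)


print(mean, slope(mean) == 0)   # the slope vanishes at the average
print(slope(4), slope(6))       # negative left of the average, positive right of it

# Completing the square: f(c) = f(mean) + n * (c - mean)^2, a bowl centred on the mean.
for c in [3, 4, 5, 6, 7]:
    assert total_squared_miss(c) == total_squared_miss(mean) + n * (c - mean) ** 2
```

The last identity says every step away from the average costs n times the squared distance, which is why the totals climb equally on both sides of 5.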
A concrete check, by pencil. Try four different constant guesses against the
same answers (2, 4, 9, 3, 7):
guess c    misses^2                                      total squared miss
-----------------------------------------------------------------------------
4          (2-4)^2+(4-4)^2+(9-4)^2+(3-4)^2+(7-4)^2  =  4+0+25+1+9   = 39
5 (avg)    (2-5)^2+(4-5)^2+(9-5)^2+(3-5)^2+(7-5)^2  =  9+1+16+4+4   = 34  <- smallest
6          (2-6)^2+(4-6)^2+(9-6)^2+(3-6)^2+(7-6)^2  =  16+4+9+9+1   = 39
7          (2-7)^2+(4-7)^2+(9-7)^2+(3-7)^2+(7-7)^2  =  25+9+4+16+0  = 54
The average (5) gives the smallest total. Any other number misses more.
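The pencil table can be replayed in a few lines, and then widened to shouts the table did not try. A sketch only: the quarter-step grid from 0 to 10 is an arbitrary choice of mine.

```python
answers = [2, 4, 9, 3, 7]
average = sum(answers) / len(answers)


def total_squared_miss(c):
    return sum((y - c) ** 2 for y in answers)


# The four rows of the table above.
table = {c: total_squared_miss(c) for c in [4, 5, 6, 7]}
print(table)  # {4: 39, 5: 34, 6: 39, 7: 54}

# A finer sweep, in quarter steps from 0 to 10 (the grid itself is an arbitrary choice).
candidates = [k / 4 for k in range(41)]
best = min(candidates, key=total_squared_miss)
print(best, average)  # 5.0 5.0
```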
>> YOUR TURN
Same answers (2, 4, 9, 3, 7). Try the constant shout c = 3 and see where it
lands against the average's 34.
check your slate: (2-3)^2 = 1; (4-3)^2 = 1; (9-3)^2 = 36; (3-3)^2 = 0;
(7-3)^2 = 16; total = 1 + 1 + 36 + 0 + 16 = 54. Worse than the average's
34 -- the bowl rises on both sides of 5.
Call the fool's total misery the TOTAL WOBBLE. Now race:
truth:               2     4     9     3     7
FOOL (says 5):       miss^2 =  9,    1,    16,   4,    4     -> 34   <- total wobble
STICK (uses cols):   says      2.5,  4.5,  8,    3.5,  6.5
                     miss^2 =  .25,  .25,  1,    .25,  .25   ->  2   <- leftover
(stick's misses, worked: 2-2.5 = -.5, 4-4.5 = -.5, 9-8 = +1, 3-3.5 = -.5,
7-6.5 = +.5; squares .25, .25, 1, .25, .25; sum = 2.)
!! WARN: LEFTOVER IS THE BAD PART, NOT THE GOOD PART
It is tempting to read "leftover = 2" as the stick's score. It is the opposite:
leftover is the misery STILL LEFT after the stick tried -- the small bad bit it could
not explain. The GOOD part is what the stick ATE: total wobble - leftover.
total wobble  ####################  34            (the fool's whole job)
stick ATE     ###################   32 = 34 - 2   (the good part)
leftover      #                      2            (the bad bit still left)
R^2 is the slice the stick ate, as a fraction of the whole bar:
R^2 = (total wobble - leftover) / total wobble = 32 / 34 ~= 0.94
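The slice, as a small function. A sketch under one assumption worth naming: the answers must not all be equal, or the total wobble is zero and the division fails. The name `r_squared` is my own.

```python
def r_squared(truths, guesses):
    """Slice of the fool's wobble that the stick ate: (wobble - leftover) / wobble."""
    average = sum(truths) / len(truths)                              # the fool's shout
    total_wobble = sum((y - average) ** 2 for y in truths)           # TSS
    leftover = sum((y - g) ** 2 for y, g in zip(truths, guesses))    # RSS
    return (total_wobble - leftover) / total_wobble


truths = [2, 4, 9, 3, 7]
stick = [2.5, 4.5, 8, 3.5, 6.5]
print(round(r_squared(truths, stick), 2))  # 0.94
```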
!! WARN: DIVIDE BY THE WHOLE BAR, NOT THE COUNT
The divider is the WHOLE wobble (34), not the row count (5). R^2 asks "what fraction
of the WHOLE wobble did you eat?" -- a slice of a bar, not an average. Top and bottom
are both in squared answer-units, so the units cancel and R^2 is a bare number. The
count cancels the same way: divide top and bottom by n and you get
1 - (stick's MSE)/(fool's MSE), the very same slice.
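A quick check of that warning on the five cars. It computes the slice from sums and from means, then shows what dividing by the row count gives instead. The variable names are illustrative only.

```python
truths = [2, 4, 9, 3, 7]
stick = [2.5, 4.5, 8, 3.5, 6.5]
n = len(truths)
average = sum(truths) / n

total_wobble = sum((y - average) ** 2 for y in truths)        # 34, a sum
leftover = sum((y - g) ** 2 for y, g in zip(truths, stick))   # 2, a sum

from_sums = 1 - leftover / total_wobble                # slice of the bar
from_means = 1 - (leftover / n) / (total_wobble / n)   # stick's MSE over the fool's MSE
wrong = (total_wobble - leftover) / n                  # divided by the count: 6.4, not a fraction

print(from_sums, from_means, wrong)
```

The first two agree up to rounding in the last decimal places. The third is larger than 1, which no slice of a bar can be.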
>> YOUR TURN
A clumsier stick (made-up) guesses 3, 5, 8, 2, 6 for the same truths
(2, 4, 9, 3, 7). The fool's total wobble is still 9 + 1 + 16 + 4 + 4 = 34.
Work this stick's leftover and its R^2.
check your slate: misses 2-3 = -1, 4-5 = -1, 9-8 = +1, 3-2 = +1, 7-6 = +1;
squares 1, 1, 1, 1, 1; leftover = 5. Ate = 34 - 5 = 29. R^2 = 29/34 ~ 0.85.
A bigger leftover means a smaller slice eaten -- this stick grades below the
first one's 32/34 ~ 0.94.
The scale falls right out:
R^2 = 1.0 ate everything, leftover zero -- a perfect line
R^2 = 0.0 ate nothing -- no better than the fool
R^2 < 0 WORSE than the fool -- the line actively hurts
It can never top 1: you cannot eat more than the whole bar. But it CAN dip below 0, and
on a sheet the stick has never seen, it sometimes does.
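The three rungs of the scale, tried on the five cars. The "backwards" guesses are made up purely to show a stick that loses to the fool.

```python
def r_squared(truths, guesses):
    average = sum(truths) / len(truths)
    total_wobble = sum((y - average) ** 2 for y in truths)
    leftover = sum((y - g) ** 2 for y, g in zip(truths, guesses))
    return 1 - leftover / total_wobble


truths = [2, 4, 9, 3, 7]
perfect = [2, 4, 9, 3, 7]      # ate everything
fool = [5, 5, 5, 5, 5]         # shouts the average every time
backwards = [9, 2, 1, 8, 3]    # made-up guesses that run against the truth

print(r_squared(truths, perfect))    # 1.0
print(r_squared(truths, fool))       # 0.0
print(r_squared(truths, backwards))  # negative: leftover 158 against a wobble of 34
```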
## Guess Once, Grade Twice
IN HAND: two rulers for the same misses. MSE: miss -> square -> mean -> stop. R^2:
the slice eaten -- on the five cars the fool's wobble is 9+1+16+4+4 = 34, the stick's
leftover .25+.25+1+.25+.25 = 2, so R^2 = (34 - 2)/34 = 32/34 ~ 0.94. This section
adds the discipline: guess once, then hold both rulers to the same guesses.
set the dials -> guess the sealed pile -> guesses (computed ONCE)
                                              |
                                   +----------+----------+
                                   v                     v
                             MSE -> a SIZE         R^2 -> a SLICE
The two rulers do not need two sets of guesses. You guess once, then hold both rulers
against the same misses. MSE says how big the misses are; R^2 says how much better than
a fool you did. The few lines that do this -- one little helper that guesses a single
time and holds up both rulers -- are waiting at the end of the post.
The clerk bill agrees. Guessing is the dear part: one guess = 4 multiplications +
4 additions = 8 strokes, so 400 cars (made-up count) cost 400 x 8 = 3,200 strokes.
MSE after that is the 1,200 counted above; R^2 adds the fool's wobble (near 1,600
strokes) and a 2-stroke slice. Both rulers together cost less than guessing once.
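The bill, tallied. The breakdown of the fool's "near 1,600" is my own reading, not something the text spells out: 400 strokes to find the average, 800 to subtract and square, 399 to sum.

```python
n_cars = 400      # made-up count, as in the text
n_columns = 4

guessing = n_cars * (n_columns + n_columns)     # 4 multiplications + 4 additions per car
mse = n_cars * 2 + (n_cars - 1) + 1             # subtract + square per car, sum, divide
average = (n_cars - 1) + 1                      # fool's shout: sum the answers, divide
wobble = average + n_cars * 2 + (n_cars - 1)    # then subtract + square per car, and sum
r2_extra = wobble + 2                           # one subtraction, one division for the slice

print(guessing, mse, r2_extra, mse + r2_extra)  # 3200 1200 1601 2801
```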
And the old discipline holds: open the sealed pile ONCE, to grade -- never to choose.
Grading on rows the stick never saw is the whole job of the hidden pile.
## Three Things Worth Stating Precisely
IN HAND: one set of guesses off the sealed pile, graded twice -- MSE for the size of
the misses, R^2 for the slice of the fool's wobble eaten. This section adds the
fine print: the facts about the two rulers that are easiest to get wrong.
>> NOTE: ON TRAINING THE BAR SPLITS CLEANLY -- ON TEST IT NEED NOT
With a nudge column, the straight-stick rule forces its leftovers to sit at
right-angles to its guesses, so on the TRAINING pile the bar splits exactly:
total wobble = ate + leftover (textbook: TSS = ESS + RSS), and R^2 lands in [0, 1].
On the TEST pile no such right-angle holds: the leftover can be larger than the
wobble, so R^2 = 1 - RSS/TSS can fall below zero. A negative test R^2 is not a bug --
it is the line doing worse than shouting the average.
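Both halves of that note can be seen on a toy sheet. A sketch with assumptions flagged: the single column `x`, the training answers, and the test pile are all made up, and the fit uses the closed-form least-squares rule for one column plus a nudge.

```python
import math

# Made-up training pile: one column x, answers y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 9, 3, 7]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares dial and nudge for a single column (closed form).
dial = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum((a - x_bar) ** 2 for a in x)
nudge = y_bar - dial * x_bar
guesses = [dial * a + nudge for a in x]

tss = sum((b - y_bar) ** 2 for b in y)                 # total wobble
rss = sum((b - g) ** 2 for b, g in zip(y, guesses))    # leftover
ess = sum((g - y_bar) ** 2 for g in guesses)           # what the stick ate
print(tss, round(ess, 6), round(rss, 6))               # 34.0 8.1 25.9
assert math.isclose(tss, ess + rss)                    # the bar splits cleanly on training rows

# Made-up TEST pile the stick never saw: the same dials now do worse than the fool.
x_test, y_test = [1, 2, 3], [8, 5, 2]
test_bar = sum(y_test) / len(y_test)
test_tss = sum((b - test_bar) ** 2 for b in y_test)
test_rss = sum((b - (dial * a + nudge)) ** 2 for a, b in zip(x_test, y_test))
test_r2 = 1 - test_rss / test_tss
print(test_r2)                                         # below zero
```

On the test pile the fool shouts the test pile's own average, which is the usual convention.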
>> NOTE: ONE COLUMN -- R^2 IS JUST CORRELATION, SQUARED
Fit a straight stick on a single column and R^2 equals the squared correlation
between that column and the answer -- that is literally why it wears the name "R
squared" (r is the correlation). With several columns it generalises to the squared
correlation between the answers and the stick's guesses.
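The one-column claim, checked on a made-up column. The correlation is computed straight from its definition and the fit from the closed-form one-column rule, so no library is involved.

```python
import math

x = [1, 2, 3, 4, 5]       # one made-up column
y = [2, 4, 9, 3, 7]       # the answers
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)        # total wobble

r = s_xy / math.sqrt(s_xx * s_yy)              # correlation between column and answer

dial = s_xy / s_xx                             # one-column least-squares fit
nudge = y_bar - dial * x_bar
leftover = sum((b - (dial * a + nudge)) ** 2 for a, b in zip(x, y))
r_squared = 1 - leftover / s_yy

print(round(r ** 2, 4), round(r_squared, 4))   # 0.2382 0.2382
```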
!! WARN: R^2 NEVER DROPS WHEN YOU ADD A COLUMN -- ON THE TRAINING PILE
Hand the stick one more column and training R^2 never drops, even if the column is
pure noise -- the old fit is still available, so it can only do at least as well.
That is why training R^2 is a poor judge of whether a column earns its place: use the
TEST R^2, or the adjusted R^2 that docks you for each extra dial.
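A sketch of the noise-column trap, assuming NumPy is available. The data are synthetic (seeded, so repeatable), the helper names are hypothetical, and the adjusted score uses the standard textbook formula 1 - (1 - R^2)(n - 1)/(n - p - 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
useful = rng.normal(size=n)        # a column that really drives the answer
noise_col = rng.normal(size=n)     # a column of pure noise
y = 2.0 * useful + 1.0 + rng.normal(scale=0.5, size=n)


def training_r2(columns, y):
    """Least-squares fit with a nudge column, graded on the SAME rows it was fitted on."""
    A = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    leftover = np.sum((y - A @ coef) ** 2)
    wobble = np.sum((y - y.mean()) ** 2)
    return 1 - leftover / wobble


def adjusted_r2(r2, n_rows, n_dials):
    """Standard adjusted R^2: docks the score for each extra dial."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_dials - 1)


r2_one = training_r2([useful], y)
r2_two = training_r2([useful, noise_col], y)
print(r2_one, r2_two)              # the second is never smaller
print(adjusted_r2(r2_one, n, 1), adjusted_r2(r2_two, n, 2))
```

The training score cannot fall, because setting the new dial to zero recovers the old fit. Whether the adjusted score rises or falls depends on the draw, which is the point of using it.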
** KEY: WHERE THE MISS-SIZE COMES FROM (STIFF vs JUMPY)
Average a guesser's test MSE over many possible training piles and it splits three
ways: E[MSE] = bias^2 + variance + noise. Bias is the line being too stiff to follow
the true shape; variance is the line being too jumpy, swinging with each pile it sees;
noise is the part no rule can ever remove. The stiff-vs-jumpy dial from the neighbours
post is exactly the bias-vs-variance trade, seen through MSE.
One honest IOU, named out loud: that three-way split is STATED here, not derived.
The proof expands (truth - guess)^2 around the average guess and shows the cross
terms die -- average-of-averages algebra this post does not carry. Until the blog
works it by pencil, treat the split as a named promise, not an earned fact. What
IS earned here: the three words mean real things you can already see -- a flat
line on curvy truth (stiff), a wild line that changes with every re-deal (jumpy),
and the scatter no line removes (noise).
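Until the pencil proof arrives, a simulation can at least show the two sides landing close together. This is a numerical illustration, not a derivation, and everything in it is made up: the curvy truth x^2, the noise level, the pile size, and the choice of the always-average fool as the (very stiff) guesser.

```python
import random

random.seed(0)
NOISE_SD = 1.0            # made-up scatter that no rule can remove
X0 = 1.5                  # the one test car we grade on


def truth(x):
    return x ** 2         # made-up curvy truth


def train_fool(pile_size=10):
    """Deal a fresh training pile and return the stiffest guesser: its average answer."""
    ys = [truth(random.uniform(0, 2)) + random.gauss(0, NOISE_SD) for _ in range(pile_size)]
    return sum(ys) / len(ys)


trials = 20000
guesses = [train_fool() for _ in range(trials)]
test_answers = [truth(X0) + random.gauss(0, NOISE_SD) for _ in range(trials)]

mean_guess = sum(guesses) / trials
bias_sq = (truth(X0) - mean_guess) ** 2                            # too stiff
variance = sum((g - mean_guess) ** 2 for g in guesses) / trials    # too jumpy
noise = NOISE_SD ** 2                                              # irreducible
avg_mse = sum((y - g) ** 2 for y, g in zip(test_answers, guesses)) / trials

print(round(avg_mse, 2), round(bias_sq + variance + noise, 2))     # close, not identical
```

The two printed numbers differ slightly because the cross terms only vanish on average, not in any finite run.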
The whole post, in five lines:
1. MSE = miss -> square -> mean -> stop. RMSE without the root, in squared
   answer-units, and the very thing the stick shrinks.
2. A raw miss-size means nothing alone -- "off by 9" needs something to measure
   against.
3. R^2 races the stick against a fool who only ever shouts the average (the bottom of
   the squared-miss bowl).
4. R^2 = (total wobble - leftover) / total wobble = 1 - RSS/TSS. No units. Tops out at
   1; can go below 0 on unseen rows.
5. Guess once, grade twice: the same guesses feed both rulers.
## Common Tripwires I Caught
TRIPWIRE 1: MSE has NO root -- read the name
WRONG: mean_squared_error(...) then ** 0.5.
RIGHT: MSE = miss -> square -> mean -> STOP. No root.
The tool already squares-and-means. Do NOT root it.
RMSE has the root. MSE does not.
TRIPWIRE 2: leftover is the BAD part, not the good part
WRONG: "leftover = 2 -- that's the stick's score."
RIGHT: leftover is the misery STILL LEFT after the stick tried.
The good part is what the stick ATE: total wobble - leftover.
TRIPWIRE 3: R^2 divides by the WHOLE wobble, not the count
WRONG: R^2 = (total wobble - leftover) / 5 (5 rows).
RIGHT: R^2 = (total wobble - leftover) / total wobble.
Divide by the whole bar, not the row count.
Top and bottom share squared units, so R^2 has none.
TRIPWIRE 4: R^2 never drops when you add a column -- on the
training pile only
WRONG: "Added a noise column and R^2 went up -- better fit."
RIGHT: The old fit is still a candidate, so training R^2 can
only hold level or rise. Use TEST R^2 or adjusted R^2 to judge.
TRIPWIRE 5: R^2 can go negative on the test pile
WRONG: "R^2 < 0 is a bug."
RIGHT: On unseen rows, the leftover CAN exceed the wobble.
R^2 = 1 - RSS/TSS can dip below 0. Not a bug -- the
line is doing worse than shouting the average.
TRIPWIRE 6: Which pile feeds the rulers
WRONG: compute MSE and R^2 on the TRAINING pile.
RIGHT: Both MSE and R^2 grade on the SEALED test pile.
The training pile was used to set the dials. Grading
on it flatters the stick: MSE reads falsely low and
R^2 falsely high.
TRIPWIRE 7: "Most negative" = MIN, not MAX
WRONG: -0.04 looks bigger than -0.007, so it's the strongest.
RIGHT: -0.04 is further below zero = SMALLER = most negative.
min(dials, key=dials.get) returns the column name.
TRIPWIRE 8: Raw dials are NOT comparable across columns
WRONG: compare -0.007 per pound with -0.04 per horsepower.
RIGHT: Each dial wears its column's own units. Put every
column on one shared ruler first (StandardScaler) for
an honest comparison.
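Tripwire 7 in two lines of Python. The dictionary is made up, reusing the two numbers quoted in the tripwire; the column names attached to them follow Tripwire 8.

```python
# Made-up dials, reusing the two numbers from Tripwire 7.
dials = {"weight": -0.007, "horsepower": -0.04}

most_negative = min(dials, key=dials.get)   # compares the VALUES, returns the KEY
wrong_pick = max(dials, key=dials.get)      # closest to zero, not "most negative"

print(most_negative, wrong_pick)            # horsepower weight
```

Per Tripwire 8, this only answers "which raw dial is most negative". It says nothing about which column matters more until the columns share one ruler.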
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
Guess once, then hold both rulers against the same misses. A small helper keeps it
honest: it predicts a single time and hands back both scores together.
>> NEW TO PYTHON? One new piece here:
return (a, b) -- hand back two values at once, bundled as a pair (a "tuple")
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_guesser(model, X_test, y_test):
    """Guess once on the sealed pile, then grade the same guesses with both rulers."""
    guesses = model.predict(X_test)                        # guess ONCE
    # Argument order for both rulers: truth first, guesses second.
    mse = round(mean_squared_error(y_test, guesses), 3)    # ruler 1: a size
    r2 = round(r2_score(y_test, guesses), 3)               # ruler 2: a slice
    return (mse, r2)                                       # both scores, as a pair
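For anyone who wants to see inside the two library calls, here is a library-free sketch of the same helper, run on the five cars from the race. `FixedGuesser` and `evaluate_by_hand` are hypothetical names: the stand-in model simply returns canned guesses so the example runs without a fitted model.

```python
class FixedGuesser:
    """Hypothetical stand-in for a fitted model: predict() returns canned guesses."""

    def __init__(self, guesses):
        self._guesses = guesses

    def predict(self, X):
        return self._guesses[: len(X)]


def evaluate_by_hand(model, X_test, y_test):
    guesses = model.predict(X_test)                                  # guess ONCE
    n = len(y_test)
    average = sum(y_test) / n                                        # the fool's shout
    leftover = sum((y - g) ** 2 for y, g in zip(y_test, guesses))    # RSS
    wobble = sum((y - average) ** 2 for y in y_test)                 # TSS
    mse = round(leftover / n, 3)                                     # ruler 1: a size
    r2 = round(1 - leftover / wobble, 3)                             # ruler 2: a slice
    return (mse, r2)


rows = [[0], [0], [0], [0], [0]]          # placeholder rows; FixedGuesser ignores them
model = FixedGuesser([2.5, 4.5, 8, 3.5, 6.5])
mse, r2 = evaluate_by_hand(model, rows, [2, 4, 9, 3, 7])
print(mse, r2)                            # 0.4 0.941
```

The leftover of 2 over five rows gives the 0.4, and 32/34 rounds to the 0.941, matching the pencil work earlier in the post.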
## The Labels, Last
Plain term used above Standard label
------------------------------------- ----------------------------------------
size of miss, squared and averaged mean squared error (MSE)
...with the root put back root mean squared error (RMSE)
the always-average fool the mean / baseline predictor
total wobble total sum of squares (TSS)
leftover residual sum of squares (RSS)
what the stick ate explained sum of squares (ESS)
slice of wobble eaten coefficient of determination (R^2)
too stiff vs too jumpy bias-variance
docked for each extra dial adjusted R^2
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================