Two Rulers for One Guess: MSE and R^2

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------

  CHAPTER 2 . GRADING A GUESSER . PART 1 OF 2
  Two Rulers for One Guess: MSE and R^2
  Posted: 2026-06-04 . Author: Rahul Rai . Tags: evaluation, mse, r-squared, regression
  ============================================================================================

  We have built two guessers. Now comes the question that quietly decides everything, and
  that beginners skip right past: once the stick has drawn its line -- WAS THE LINE ANY
  GOOD? A guess you cannot grade is a guess you cannot trust.

  Same straight-stick rule as before -- a dial on each column, multiply, add, plus a fixed
  nudge -- but a new sheet to play on: cars now, not houses. Two different rulers will
  step up to grade the very same guesses, and here is the twist that makes the post worth
  reading: they hand back two different verdicts, and BOTH are right. One tells you how
  big the misses are. The other tells you whether you have beaten the village idiot.


  ## Same Stick, New Sheet

    displacement  horsepower  weight  acceleration  |  MPG  <- right answer
    -----------------------------------------------------------
       307          130        3504       12.0      |  18
       350          165        3693       11.5      |  15
        ...                                         |  ...
    one row = one car                               |  the answer

      guess = d1*disp + d2*horse + d3*weight + d4*accel + nudge

  Pick four columns, guess the fifth. The machinery is exactly the straight-stick rule --
  set the dials by shrinking the squared leftover, one dot-product per guess. Nothing new
  to build. The new work is grading.
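  A minimal sketch of that one-dot-product guess, in Python. The dial values and the
  nudge below are made up purely for illustration (nothing in this post fits them); only
  the first car's row is taken from the sheet above.

```python
# Hypothetical dials and nudge -- invented for illustration, NOT fitted values.
dials = {
    "displacement": -0.01,
    "horsepower": -0.04,
    "weight": -0.005,
    "acceleration": 0.1,
}
nudge = 45.0

def guess(car):
    """Multiply each column by its dial, add them up, then add the fixed nudge."""
    return sum(dials[column] * car[column] for column in dials) + nudge

# First row of the sheet: one row = one car.
first_car = {"displacement": 307, "horsepower": 130, "weight": 3504, "acceleration": 12.0}

print(guess(first_car))   # the stick's mpg guess for this car under the made-up dials
```

  Setting the dials well is the fitting job from the earlier posts; the sketch only shows
  what a single guess costs once the dials exist.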


  ## Ruler One: MSE -- Just the Size of the Miss

  Earlier the mistake-score was RMSE. Four steps, in order:

    miss -> square -> mean -> root   = RMSE
    miss -> square -> mean -> STOP   = MSE

  Stare at the name and it gives itself away:

    M S E
    | | +- Error    the miss   (y - guess)
    | +--- Squared  kill the sign
    +----- Mean     average them

  No R. No Root. So do the first three steps and stop -- leave off the ** 0.5 you tacked
  on last time.

      MSE = (1/n) sum_i (y_i - yhat_i)^2

  >> NOTE: WHAT MSE IS, EXACTLY
     MSE is RMSE squared, so it lives in SQUARED answer-units -- "miles per gallon,
     squared," which no one can picture. That is the price of dropping the root. What you
     buy: MSE is the exact thing the straight-stick rule shrinks when it sets the dials, so
     it is the rule's own home ruler. RMSE is just MSE pulled back into plain units for a
     human to read. Same ordering of which fit is better -- the root is a rising step, so
     it never swaps the order.
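  The three steps, and the fourth one MSE leaves off, can be sketched in plain Python.
  The two truths are the mpg values of the cars shown on the sheet; both sets of guesses
  are made up for illustration.

```python
def mse(truths, guesses):
    """miss -> square -> mean -> STOP."""
    squares = [(y - g) ** 2 for y, g in zip(truths, guesses)]
    return sum(squares) / len(squares)

def rmse(truths, guesses):
    """The same three steps, plus the root that pulls it back into plain units."""
    return mse(truths, guesses) ** 0.5

truths = [18, 15]        # mpg of the two cars shown on the sheet
guesses_a = [20, 14]     # made-up guesses
guesses_b = [19, 15]     # a second made-up set that misses less

print(mse(truths, guesses_a), rmse(truths, guesses_a))   # 2.5, then its root
print(mse(truths, guesses_b), rmse(truths, guesses_b))   # 0.5, then its root
```

  Whichever set wins on MSE also wins on RMSE, which is the "rising step" claim in
  miniature.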

  Count the clerk-steps for MSE on a whole sheet of 400 cars (made-up count): per car,
  1 subtraction and 1 squaring = 2 strokes, so 400 x 2 = 800; then 399 additions to sum
  and 1 division by the count -- 800 + 399 + 1 = 1,200 strokes.  One clerk clears it in
  an hour; the room, before the coffee cools.


  ## The Problem MSE Can't Solve

    "the guess was off by 9."

    good?  bad?   ...compared to WHAT?

  A raw miss-size has no scale of its own. Off by 9 mpg is dreadful for a car; off by 9
  feet is wonderful for an archer firing across a valley. The number alone won't say. You
  need something to measure it AGAINST. That is the second ruler.


  ## Ruler Two: R^2 -- A Race Against a Fool

  IN HAND: a sheet of cars, a stick that guesses mpg from four columns, and one ruler --
  MSE: miss -> square -> mean -> stop -- which sizes the misses but cannot say whether
  the size is good.  This section adds the second ruler: a race against an opponent.

  A score needs something to be scored against, so R^2 picks the dumbest honest opponent
  imaginable and makes the stick race it. Meet the fool. He has not looked at a single
  column -- he does not even know the cars have engines. All he has is a pile of past
  answers and no clues at all:

    answers:  2   4   9   3   7

  The fool must shout one number for every car, no matter what. What is his safest shout?
  The AVERAGE -- (2+4+9+3+7)/5 = 25/5 = 5 -- not because the fool is clever (he is not),
  but because the average is the one spot that sits at the very bottom of the
  squared-miss bowl. Any other single number he could pick misses more:

    shout 5 -> miss^2 = 9 + 1 + 16 + 4 + 4 = 34
    shout 6 -> miss^2 = 16 + 4 + 9 + 9 + 1 = 39   (worse)

  ** KEY: WHY THE AVERAGE IS THE FLOOR
     Shout some number c. Total squared miss is f(c) = sum (y_i - c)^2. Tip it to find the
     bottom: f'(c) = -2 sum (y_i - c) = 0. Divide both sides by -2: sum (y_i - c) = 0.
     That sum is sum y_i minus n copies of c, so sum y_i - n*c = 0, i.e. c = (1/n) sum y_i
     -- the average, exactly.
     The curve is a bowl (second slope 2n > 0), so that point is the lowest, not the
     highest. The fool's best constant shout is always the mean.

  A concrete check, by pencil.  Try four different constant guesses against the
  same answers (2, 4, 9, 3, 7):

    guess c    misses^2                         total squared miss
    ------------------------------------------------------------------------
    4          (2-4)^2+(4-4)^2+(9-4)^2+(3-4)^2+(7-4)^2 = 4+0+25+1+9  = 39
    5 (avg)    (2-5)^2+(4-5)^2+(9-5)^2+(3-5)^2+(7-5)^2 = 9+1+16+4+4  = 34  <- smallest
    6          (2-6)^2+(4-6)^2+(9-6)^2+(3-6)^2+(7-6)^2 = 16+4+9+9+1  = 39
    7          (2-7)^2+(4-7)^2+(9-7)^2+(3-7)^2+(7-7)^2 =25+9+4+16+0  = 54

    The average (5) gives the smallest total.  Any other number misses more.
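  The same pencil check can be handed to Python. This is only a sketch: it reuses the
  five answers above, and the grid of candidate shouts (0 to 10 in steps of 0.1) is an
  arbitrary choice made for the example.

```python
answers = [2, 4, 9, 3, 7]

def total_squared_miss(shout):
    """Total squared miss if the fool shouts the same number for every car."""
    return sum((y - shout) ** 2 for y in answers)

# The four rows of the table above.
for shout in (4, 5, 6, 7):
    print(shout, total_squared_miss(shout))   # 39, 34, 39, 54

# Scan a grid of candidate shouts and keep the one with the smallest total.
grid = [i / 10 for i in range(101)]           # 0.0, 0.1, ..., 10.0
best_shout = min(grid, key=total_squared_miss)
average = sum(answers) / len(answers)

print(best_shout, average)                    # both land on 5.0
```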

  >> YOUR TURN
     Same answers (2, 4, 9, 3, 7).  Try the constant shout c = 3 and see where it
     lands against the average's 34.

     check your slate:  (2-3)^2 = 1;  (4-3)^2 = 1;  (9-3)^2 = 36;  (3-3)^2 = 0;
     (7-3)^2 = 16;  total = 1 + 1 + 36 + 0 + 16 = 54.  Worse than the average's
     34 -- the bowl rises on both sides of 5.

  Call the fool's total misery the TOTAL WOBBLE. Now race:

    truth:          2     4     9     3     7

    FOOL  (says 5):  miss^2 =  9, 1, 16, 4, 4   -> 34   <- total wobble
    STICK (uses cols): says 2.5, 4.5, 8, 3.5, 6.5
                       miss^2 = .25,.25,1,.25,.25 ->  2  <- leftover

    (stick's misses, worked: 2-2.5 = -.5, 4-4.5 = -.5, 9-8 = +1, 3-3.5 = -.5,
     7-6.5 = +.5;  squares .25, .25, 1, .25, .25;  sum = 2.)

  !! WARN: LEFTOVER IS THE BAD PART, NOT THE GOOD PART
     It is tempting to read "leftover = 2" as the stick's score. It is the opposite:
     leftover is the misery STILL LEFT after the stick tried -- the small bad bit it could
     not explain. The GOOD part is what the stick ATE: total wobble - leftover.

    total wobble  ####################   34   (the fool's whole job)
    stick ATE     ##################     32   = 34 - 2   (the good part)
    leftover      #                       2   (the bad bit still left)

  R^2 is the slice the stick ate, as a fraction of the whole bar:

      R^2 = (total wobble - leftover) / total wobble = 32 / 34 ~= 0.94

  !! WARN: DIVIDE BY THE WHOLE BAR, NOT THE COUNT
     The divider is the WHOLE wobble (34), not the row count (5). R^2 asks "what fraction
     of the WHOLE wobble did you eat?" -- a slice of a bar, not an average. Top and bottom
     are both in squared answer-units, so the units cancel and R^2 is a bare number. The
     count cancels too: divide both by n and the ratio does not move.
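  A sketch of the race in Python, using the five truths and the stick's five guesses
  from above. It works the slice twice, once from the raw sums and once from the
  per-row means, to show that the row count makes no difference.

```python
import math

truths = [2, 4, 9, 3, 7]
guesses = [2.5, 4.5, 8, 3.5, 6.5]      # the stick's guesses from the race above
n = len(truths)

fool_shout = sum(truths) / n           # the fool's one number: the average
total_wobble = sum((y - fool_shout) ** 2 for y in truths)        # the fool's misery
leftover = sum((y - g) ** 2 for y, g in zip(truths, guesses))    # the stick's misery

r2_from_sums = (total_wobble - leftover) / total_wobble
r2_from_means = 1 - (leftover / n) / (total_wobble / n)          # stick MSE / fool MSE

print(total_wobble, leftover)          # 34.0 2.0
print(round(r2_from_sums, 2))          # 0.94
print(math.isclose(r2_from_sums, r2_from_means))   # True
```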

  >> YOUR TURN
     A clumsier stick (made-up) guesses 3, 5, 8, 2, 6 for the same truths
     (2, 4, 9, 3, 7).  The fool's total wobble is still 9 + 1 + 16 + 4 + 4 = 34.
     Work this stick's leftover and its R^2.

     check your slate:  misses 2-3 = -1, 4-5 = -1, 9-8 = +1, 3-2 = +1, 7-6 = +1;
     squares 1, 1, 1, 1, 1;  leftover = 5.  Ate = 34 - 5 = 29.  R^2 = 29/34 ~ 0.85.
     A bigger leftover means a smaller slice eaten -- this stick grades below the
     first one's 32/34 ~ 0.94.

  The scale falls right out:

    R^2 = 1.0   ate everything, leftover zero   -- a perfect line
    R^2 = 0.0   ate nothing -- no better than the fool
    R^2 < 0     WORSE than the fool -- the line actively hurts

  It can never top 1: you cannot eat more than the whole bar. But it CAN dip below 0, and
  on a sheet the stick has never seen, it sometimes does.
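  The three landmarks on that scale can be checked with a small helper. A sketch only:
  the "bad" guesses are invented to be deliberately wrong, and the helper assumes the
  truths are not all identical (otherwise the wobble is zero and the slice is undefined).

```python
def r_squared(truths, guesses):
    """Slice of the fool's wobble that the guesses ate: 1 - leftover / total wobble."""
    average = sum(truths) / len(truths)
    total_wobble = sum((y - average) ** 2 for y in truths)
    leftover = sum((y - g) ** 2 for y, g in zip(truths, guesses))
    return 1 - leftover / total_wobble

truths = [2, 4, 9, 3, 7]

perfect = list(truths)            # every guess dead on
fool = [5] * len(truths)          # the average, shouted five times
bad = [9, 9, 2, 9, 2]             # made-up guesses that point the wrong way

print(r_squared(truths, perfect))   # 1.0
print(r_squared(truths, fool))      # 0.0
print(r_squared(truths, bad))       # negative: leftover 184 against a wobble of 34
```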


  ## Guess Once, Grade Twice

  IN HAND: two rulers for the same misses.  MSE: miss -> square -> mean -> stop.  R^2:
  the slice eaten -- on the five cars the fool's wobble is 9+1+16+4+4 = 34, the stick's
  leftover .25+.25+1+.25+.25 = 2, so R^2 = (34 - 2)/34 = 32/34 ~ 0.94.  This section
  adds the discipline: guess once, then hold both rulers to the same guesses.

    set the dials -> guess the sealed pile -> guesses (computed ONCE)
                                                   |
                                        +----------+----------+
                                        v                     v
                                  MSE -> a SIZE          R^2 -> a SLICE

  The two rulers do not need two sets of guesses. You guess once, then hold both rulers
  against the same misses. MSE says how big the misses are; R^2 says how much better than
  a fool you did. The few lines that do this -- one little helper that guesses a single
  time and holds up both rulers -- are waiting at the end of the post.

  The clerk bill agrees.  Guessing is the dear part: one guess = 4 multiplications +
  4 additions = 8 strokes, so 400 cars (made-up count) cost 400 x 8 = 3,200 strokes.
  MSE after that is the 1,200 counted above; R^2 adds the fool's wobble (near 1,600
  strokes) and a 2-stroke slice.  Both rulers together cost less than guessing once.
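  The bill can be tallied in a few lines. The breakdown of the fool's wobble is an
  assumption made for this sketch (the post only says "near 1,600"): first the average,
  then a subtraction and a squaring per car, then the sum.

```python
def guessing_strokes(n_cars, n_columns=4):
    """Per car: one multiplication and one addition for each column."""
    return n_cars * (n_columns + n_columns)

def mse_strokes(n_cars):
    """Per car: subtract and square; then n-1 additions and 1 division."""
    return 2 * n_cars + (n_cars - 1) + 1

def wobble_strokes(n_cars):
    """Assumed breakdown: average (n-1 adds, 1 divide), subtract and square, then sum."""
    average = (n_cars - 1) + 1
    misses = 2 * n_cars
    summing = n_cars - 1
    return average + misses + summing

SLICE_STROKES = 2          # one subtraction, one division

n = 400                    # the post's made-up count
grading = mse_strokes(n) + wobble_strokes(n) + SLICE_STROKES

print(guessing_strokes(n))              # 3200
print(mse_strokes(n))                   # 1200
print(wobble_strokes(n))                # 1599
print(grading < guessing_strokes(n))    # True
```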

  And the old discipline holds: open the sealed pile ONCE, to grade -- never to choose.
  Grading on rows the stick never saw is the whole job of the hidden pile.


  ## Three Things Worth Stating Precisely

  IN HAND: one set of guesses off the sealed pile, graded twice -- MSE for the size of
  the misses, R^2 for the slice of the fool's wobble eaten.  This section adds the
  fine print: the facts about the two rulers that are easiest to get wrong.

  >> NOTE: ON TRAINING THE BAR SPLITS CLEANLY -- ON TEST IT NEED NOT
     With a nudge column, the straight-stick rule forces its leftovers to sit at
     right-angles to its guesses, so on the TRAINING pile the bar splits exactly:
     total wobble = ate + leftover (textbook: TSS = ESS + RSS), and R^2 lands in [0, 1].
     On the TEST pile no such right-angle holds: the leftover can be larger than the
     wobble, so R^2 = 1 - RSS/TSS can fall below zero. A negative test R^2 is not a bug --
     it is the line doing worse than shouting the average.
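  A sketch of both halves of that note, with a one-column stick so the fit can be done
  in closed form. Every number here is made up: six training cars (weight in thousands
  of pounds against mpg) and a tiny test pile chosen on purpose to embarrass the line.

```python
import math

def fit_line(xs, ys):
    """Least-squares dial and nudge for a single column."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    dial = sxy / sxx
    return dial, y_bar - dial * x_bar

def bar_parts(xs, ys, dial, nudge):
    """Return (total wobble, ate, leftover), measured against the mean of ys."""
    guesses = [dial * x + nudge for x in xs]
    y_bar = sum(ys) / len(ys)
    tss = sum((y - y_bar) ** 2 for y in ys)
    ess = sum((g - y_bar) ** 2 for g in guesses)
    rss = sum((y - g) ** 2 for y, g in zip(ys, guesses))
    return tss, ess, rss

train_x, train_y = [3.5, 3.7, 3.4, 2.4, 2.1, 4.3], [18, 15, 18, 24, 27, 14]
test_x, test_y = [2.0, 4.4], [20, 21]        # made-up rows the line never saw

dial, nudge = fit_line(train_x, train_y)

tss, ess, rss = bar_parts(train_x, train_y, dial, nudge)
print(math.isclose(tss, ess + rss))          # True: the training bar splits cleanly
train_r2 = 1 - rss / tss

tss_t, ess_t, rss_t = bar_parts(test_x, test_y, dial, nudge)
print(math.isclose(tss_t, ess_t + rss_t))    # False: no right-angle on the test pile
test_r2 = 1 - rss_t / tss_t
print(train_r2, test_r2)                     # training in [0, 1]; test far below zero
```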

  >> NOTE: ONE COLUMN -- R^2 IS JUST CORRELATION, SQUARED
     Fit a straight stick (with its nudge) on a single column and the TRAINING R^2 equals
     the squared correlation between that column and the answer -- that is literally why
     it wears the name "R squared" (r is the correlation). With several columns it
     generalises to the squared correlation between the answers and the stick's guesses,
     again on the training pile.
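  A quick numeric check of that identity, as a sketch. The six cars are made up
  (weight in thousands of pounds against mpg), and the fit and the correlation are
  both worked from scratch so nothing is hidden in a library call.

```python
import math

def fit_line(xs, ys):
    """Least-squares dial and nudge for a single column."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    dial = sxy / sxx
    return dial, y_bar - dial * x_bar

def correlation(a, b):
    """Pearson correlation r between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

weight = [3.5, 3.7, 3.4, 2.4, 2.1, 4.3]      # made-up training column
mpg = [18, 15, 18, 24, 27, 14]               # made-up answers

dial, nudge = fit_line(weight, mpg)
guesses = [dial * w + nudge for w in weight]

mpg_bar = sum(mpg) / len(mpg)
tss = sum((y - mpg_bar) ** 2 for y in mpg)
rss = sum((y - g) ** 2 for y, g in zip(mpg, guesses))
r2 = 1 - rss / tss

print(math.isclose(r2, correlation(weight, mpg) ** 2))    # True: column vs answer
print(math.isclose(r2, correlation(mpg, guesses) ** 2))   # True: answer vs guesses
```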

  !! WARN: R^2 NEVER DROPS WHEN YOU ADD A COLUMN -- ON THE TRAINING PILE
     Hand the stick one more column and training R^2 never drops, even if the column is
     pure noise -- the old fit is still available (set the new dial to zero), so the new
     fit can only do at least as well. In practice it nearly always creeps up a little.
     That is why training R^2 is a poor judge of whether a column earns its place: use the
     TEST R^2, or the adjusted R^2 that docks you for each extra dial.
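  A sketch of the noise-column trap, assuming NumPy is to hand. The data are synthetic
  (one real column, one column of pure noise, a fixed seed), and the adjusted R^2 uses
  the usual textbook form 1 - (1 - R^2)(n - 1)/(n - p - 1) for n rows and p columns.

```python
import numpy as np

def training_r2(X, y):
    """Fit least squares with a nudge column; return R^2 on the same rows."""
    A = np.column_stack([X, np.ones(len(y))])        # the columns plus the fixed nudge
    dials, *_ = np.linalg.lstsq(A, y, rcond=None)    # shrink the squared leftover
    leftover = np.sum((y - A @ dials) ** 2)          # RSS
    wobble = np.sum((y - y.mean()) ** 2)             # TSS
    return 1 - leftover / wobble

def adjusted_r2(r2, n_rows, n_columns):
    """Dock the score for each extra dial."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_columns - 1)

rng = np.random.default_rng(0)
n = 40
real = rng.normal(size=(n, 1))                       # a column that truly matters
y = 3.0 * real[:, 0] + rng.normal(size=n)            # synthetic answers
noise = rng.normal(size=(n, 1))                      # a column of pure noise

r2_one = training_r2(real, y)
r2_two = training_r2(np.column_stack([real, noise]), y)

print(r2_two >= r2_one)                              # True: training R^2 never drops
print(adjusted_r2(r2_one, n, 1), adjusted_r2(r2_two, n, 2))
```

  The adjusted figures are printed rather than asserted: whether the noise column gets
  docked enough to lose depends on the draw.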

  ** KEY: WHERE THE MISS-SIZE COMES FROM (STIFF vs JUMPY)
     Average a guesser's test MSE over many possible training piles and it splits three
     ways: E[MSE] = bias^2 + variance + noise. Bias is the line being too stiff to follow
     the true shape; variance is the line being too jumpy, swinging with each pile it sees;
     noise is the part no rule can ever remove. The stiff-vs-jumpy dial from the neighbours
     post is exactly the bias-vs-variance trade, seen through MSE.

     One honest IOU, named out loud: that three-way split is STATED here, not derived.
     The proof expands (truth - guess)^2 around the average guess and shows the cross
     terms die -- average-of-averages algebra this post does not carry.  Until the blog
     works it by pencil, treat the split as a named promise, not an earned fact.  What
     IS earned here: the three words mean real things you can already see -- a flat
     line on curvy truth (stiff), a wild line that changes with every re-deal (jumpy),
     and the scatter no line removes (noise).
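  Short of the proof, the split can at least be checked by brute force. What follows is
  a simulation sketch, not a derivation, and every choice in it is an assumption made
  for the example: a sine-wave truth, a straight-line guesser, noise of spread 0.3, a
  single test spot x0 = 0.25, and 10,000 re-dealt training piles.

```python
import math
import random

random.seed(0)

def truth(x):
    return math.sin(2 * math.pi * x)        # curvy truth a straight line cannot follow

def fit_line(xs, ys):
    """Least-squares dial and nudge for a single column."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    dial = sxy / sxx
    return dial, y_bar - dial * x_bar

xs = [i / 19 for i in range(20)]            # 20 training spots between 0 and 1
x0, sigma, trials = 0.25, 0.3, 10_000

guesses, squared_misses = [], []
for _ in range(trials):
    ys = [truth(x) + random.gauss(0, sigma) for x in xs]    # re-deal the training pile
    dial, nudge = fit_line(xs, ys)
    g = dial * x0 + nudge                                   # this pile's guess at x0
    y0 = truth(x0) + random.gauss(0, sigma)                 # a fresh unseen answer
    guesses.append(g)
    squared_misses.append((y0 - g) ** 2)

expected_mse = sum(squared_misses) / trials
avg_guess = sum(guesses) / trials
bias_sq = (truth(x0) - avg_guess) ** 2                      # too stiff
variance = sum((g - avg_guess) ** 2 for g in guesses) / trials   # too jumpy
noise = sigma ** 2                                          # no rule removes this

print(expected_mse, bias_sq + variance + noise)             # close, not identical
```

  The two printed numbers agree only approximately, since the left side is a finite
  average; the simulation is evidence for the split, and the IOU above still stands.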

  The whole post, in five lines:

    1. MSE = miss -> square -> mean -> stop. RMSE without the root, in squared
       answer-units, and the very thing the stick shrinks.
    2. A raw miss-size means nothing alone -- "off by 9" needs something to measure
       against.
    3. R^2 races the stick against a fool who only ever shouts the average (the bottom of
       the squared-miss bowl).
    4. R^2 = (total wobble - leftover) / total wobble = 1 - RSS/TSS. No units. Tops out at
       1; can go below 0 on unseen rows.
    5. Guess once, grade twice: the same guesses feed both rulers.


  ## Common Tripwires I Caught

    TRIPWIRE 1:  MSE has NO root -- read the name
       WRONG: mean_squared_error(...) then ** 0.5.
       RIGHT: MSE = miss -> square -> mean -> STOP.  No root.
              The tool already squares-and-means.  Do NOT root it.
              RMSE has the root.  MSE does not.

    TRIPWIRE 2:  leftover is the BAD part, not the good part
       WRONG: "leftover = 2 -- that's the stick's score."
       RIGHT: leftover is the misery STILL LEFT after the stick tried.
              The good part is what the stick ATE: total wobble - leftover.

    TRIPWIRE 3:  R^2 divides by the WHOLE wobble, not the count
       WRONG: R^2 = (total wobble - leftover) / 5  (5 rows).
       RIGHT: R^2 = (total wobble - leftover) / total wobble.
              Divide by the whole bar, not the row count.
              Top and bottom share squared units, so they cancel:
              that is why R^2 has no units.

    TRIPWIRE 4:  R^2 never drops when you add a column -- on the
                 training pile only
       WRONG: "Added a noise column and R^2 went up -- better fit."
       RIGHT: The old fit is still a candidate, so it can only do at
              least as well.  Use TEST R^2 or adjusted R^2 to judge.

    TRIPWIRE 5:  R^2 can go negative on the test pile
       WRONG: "R^2 < 0 is a bug."
       RIGHT: On unseen rows, the leftover CAN exceed the wobble.
              R^2 = 1 - RSS/TSS can dip below 0.  Not a bug -- the
              line is doing worse than shouting the average.

    TRIPWIRE 6:  Which pile feeds the rulers
       WRONG: compute MSE and R^2 on the TRAINING pile.
       RIGHT: Both MSE and R^2 grade on the SEALED test pile.
              The training pile was used to set the dials.  Grading
              on it flatters the stick: MSE reads too low and R^2
              reads too high.

    TRIPWIRE 7:  "Most negative" = MIN, not MAX
       WRONG: -0.04 looks bigger than -0.007, so it's the strongest.
       RIGHT: -0.04 is further below zero = SMALLER = most negative.
              min(dials, key=dials.get) returns the column name.

    TRIPWIRE 8:  Raw dials are NOT comparable across columns
       WRONG: compare -0.007 per pound with -0.04 per horsepower.
       RIGHT: Each dial wears its column's own units.  Put every
              column on one shared ruler first (StandardScaler) for
              an honest comparison.
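  Tripwire 7 in two lines of Python. The dictionary holds only the two dial values
  quoted in the tripwires above (-0.007 per pound, -0.04 per horsepower); the name
  `dials` matches the snippet in Tripwire 7, and the rest is a sketch.

```python
# The two raw dials quoted in Tripwires 7 and 8.
dials = {"weight": -0.007, "horsepower": -0.04}

most_negative = min(dials, key=dials.get)   # furthest below zero
least_negative = max(dials, key=dials.get)  # closest to zero

print(most_negative)    # horsepower
print(least_negative)   # weight
```

  Per Tripwire 8, this only says which raw dial is most negative; it does not say which
  column matters most until every column sits on one shared ruler.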


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  Guess once, then hold both rulers against the same misses. A small helper keeps it
  honest: it predicts a single time and hands back both scores together.

  >> NEW TO PYTHON? One new piece here:
       return (a, b)    -- hand back two values at once, bundled as a pair (a "tuple")

    from sklearn.metrics import mean_squared_error, r2_score

    def evaluate_guesser(model, X_test, y_test):
        """Guess once on the sealed pile; return (mse, r2), each rounded to 3 places."""
        guesses = model.predict(X_test)              # guess ONCE
        # Both rulers take the truth first, the guesses second.
        mse = round(mean_squared_error(y_test, guesses), 3)   # ruler 1: a size
        r2  = round(r2_score(y_test, guesses), 3)             # ruler 2: a slice
        return (mse, r2)                             # the pair: size, then slice


  ## The Labels, Last

    Plain term used above                     Standard label
    -------------------------------------     ----------------------------------------
    size of miss, squared and averaged        mean squared error (MSE)
    ...with the root put back                 root mean squared error (RMSE)
    the always-average fool                   the mean / baseline predictor
    total wobble                              total sum of squares (TSS)
    leftover                                  residual sum of squares (RSS)
    what the stick ate                        explained sum of squares (ESS)
    slice of wobble eaten                     coefficient of determination (R^2)
    too stiff vs too jumpy                    bias-variance
    docked for each extra dial                adjusted R^2

==============================================================================================