The Straight-Stick Rule :: Rahul's ML Blog

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  CHAPTER 1 . PREDICTING HOUSE PRICES . PART 3 OF 3
  The Straight-Stick Rule: Setting the Dials, Checking the Leftovers, Getting Numbers Right
  Posted: 2026-06-01 . Author: Rahul Rai . Tags: linear-regression, ols, least-squares
  ============================================================================================

  PATH . post 3 of 28
    <- prev:  House Prices 2: Ask the Closest Rows
       next:  Grading 1: Two Rulers, MSE and R^2 ->

  The ask-closest rule never really learned anything -- it just kept the answer key handy.
  This rule is its opposite, and its better-behaved sibling. It actually studies the pile,
  distils it down to a handful of dials, and then throws the pile away. From that point on
  it carries no baggage: every guess is a quick weighted sum of the columns plus a fixed
  nudge.

  And here is the part that still feels like magic the first time you see it. The best
  possible dials are not hunted for by trial and error -- they fall straight out of one
  exact formula. No searching, no luck. This post derives that formula by hand on three
  rows, then shows it is the very same arithmetic the toolbox runs on sixteen thousand.


  ## The Rule

    one row of measured things:  [rooms, income, age, ...]
          x
    one dial per column:         [b1,    b2,    b3,  ...]
          +  fixed nudge b0
          =  guess yhat

    all rows at once:  yhat = X*beta
    (X has a column of 1s prepended to carry b0)

      yhat_i = b0 + sum_j beta_j * x_ij      <=>      yhat = X*beta

  Each dial beta_j is the change in the guess per unit change in column j, all other
  columns held fixed. The columns can be squares, logs, or products of raw columns -- the
  rule stays linear in the dials.


  ## The Squared Leftover to Shrink

      \         /
       \       /        height  = total squared leftover J(beta)
        \     /         bottom  = the best dials beta
         \ . /

  Pick the dials that shrink the squared leftover sum:

      J(beta) = ||X*beta - y||^2 = sum_i (yhat_i - y_i)^2

  J is bowl-shaped -- its second-slope (Hessian) 2*X^T X is never-negative -- so any flat
  point is the global bottom. No local traps, no searching.


  ## Finding the Dials: The Flat-Point Equations

  Set the slope of the bowl to zero:

      grad J = 2*X^T (X*beta - y) = 0   =>   X^T X beta = X^T y   =>   beta = (X^T X)^-1 X^T y

  Worked by hand at one column (d = 1) on three rows.  First, build the raw table:

    x = 1, 2, 3     y = 2, 3, 5

    row   x    y    x^2    x*y
    ---------------------------
    1     1    2    1      2
    2     2    3    4      6
    3     3    5    9     15
    ---------------------------
    sum   6   10   14     23

    X has a column of 1s for the nudge, plus the x column:
      X = [ 1 1     X^T X = [ 3   6      X^T y = [ 10
            1 2              6  14 ]             23 ]
            1 3 ]

    The system X^T X beta = X^T y:

      3*b  +  6*w  =  10     (equation 1)
      6*b  + 14*w  =  23     (equation 2)

    Solve by elimination.  Multiply equation 1 by 2:

      6*b  + 12*w  =  20     (equation 1, doubled)

    Subtract from equation 2:

      (6*b + 14*w) - (6*b + 12*w) = 23 - 20
      2*w  =  3
      w  =  3/2  =  1.5      <- the dial on x

    Put w back into equation 1:

      3*b  +  6*(3/2)  =  10
      3*b  +  9        =  10
      3*b  =  1
      b  =  1/3  ~  0.333   <- the fixed nudge

    stick:  yhat = 1/3 + (3/2)*x
            yhat = 0.333 + 1.5*x

    Check at each x:
      x=1:  yhat = 0.333 + 1.5*1 = 1.833
      x=2:  yhat = 0.333 + 1.5*2 = 3.333
      x=3:  yhat = 0.333 + 1.5*3 = 4.833

  >> YOUR TURN
     New pile (made-up), same recipe:  x = 1, 2, 3 and y = 1, 3, 5.  Build the four
     sums, write the two flat-point equations, and solve for the dials.

     check your slate:  sum x = 1+2+3 = 6;  sum y = 1+3+5 = 9;  sum x^2 = 1+4+9 = 14;
     sum x*y = 1 + 6 + 15 = 22.  Equations:  3*b + 6*w = 9  and  6*b + 14*w = 22.
     Double the first: 6*b + 12*w = 18;  subtract: 2*w = 22 - 18 = 4, so w = 2;  put
     back: 3*b + 6*2 = 9, 3*b = 9 - 12 = -3, b = -1.  Stick: yhat = -1 + 2*x, which
     hits 1, 3, 5 exactly -- every leftover is zero; this stick runs straight through
     all three points.


  ## Checking the Leftovers

  IN HAND: three rows (x = 1, 2, 3 against y = 2, 3, 5) and the dials the flat-point
  solve dropped out: 2*w = 3 so w = 3/2 = 1.5, then 3*b = 10 - 6*1.5 = 10 - 9 = 1 so
  b = 1/3 ~= 0.333.  The stick reads yhat = 0.333 + 1.5*x.  This section adds the test
  every honest fit must pass: what its leftovers obey.

  At the best dials, X^T (y - X*beta) = 0, written X^T e = 0: the leftover column sits at
  right-angles to every measured column. Because one column is all ones, the leftovers
  also sum to zero.

    x=1: yhat=1.833, leftover = +0.167
    x=2: yhat=3.333, leftover = -0.333
    x=3: yhat=4.833, leftover = +0.167

    sum leftover        = 0  ok   (right-angles to the ones column)
    sum x * leftover    = 0  ok   (right-angles to x column)
    RMSE, worked: the exact leftovers are 1/6, -1/3, 1/6;  squares 1/36 + 4/36 + 1/36
    = 6/36 = 1/6;  mean = (1/6) / 3 = 1/18 ~= 0.0556;  RMSE = sqrt(1/18) ~= 0.236

  >> YOUR TURN
     Verify the second right-angle check on your own slate: multiply each leftover by
     its x and add.  Use the exact fractions: leftovers 1/6, -1/3, 1/6.

     check your slate:  1*(1/6) = 1/6;  2*(-1/3) = -2/3 = -4/6;  3*(1/6) = 3/6;
     sum = (1 - 4 + 3)/6 = 0/6 = 0.  The leftover column sits at right-angles to the
     x column: the stick has already squeezed out everything x can explain.

  Geometrically: yhat = H*y where the flat-shadow thrower H = X (X^T X)^-1 X^T projects y
  onto the flat sheet spanned by the measured columns. H is symmetric and squaring it
  gives itself back (H^2 = H). trace(H) equals the number of fitted dials -- here the
  nudge b plus the dial w, 1 + 1 = 2. No point on the flat sheet sits closer to y.

  ** KEY: SAME WORK, BIGGER SHEET
     The three-row pencil solve is identical to what the toolbox runs on 16,000-odd rows
     (the working pile: 20,640 x 8/10 = 16,512) and 8 columns -- only the size of X^T X
     changes. Fitted dials land in m.coef_, the fixed
     nudge in m.intercept_.


  ## When the Straight-Stick Rule Is the Best Possible

  The Gauss-Markov result: if the leftover has zero average given the rows (E[eps|X] = 0),
  the same spread everywhere, leftovers do not echo each other, and X has full column rank
  -- then the straight-stick rule is BLUE: the lowest-spread rule among all flat, honest
  rules.

    - Bell-curve shape of leftovers is NOT required for BLUE. You need it only for exact
      confidence ranges and t-tests, not for the rule to be the best flat rule.
    - Full column rank means no measured column is a recipe of the others. If one is,
      X^T X is flat (no single solution) and (X^T X)^-1 does not exist.


  ## Getting the Dials Right: Don't Flip the Sheet

  IN HAND: the exact formula beta = (X^T X)^-1 X^T y, proved bowl-bottomed and checked
  by hand -- three rows gave w = 3/2 = 1.5 and b = 1/3 ~= 0.333, with leftovers
  1/6 - 2/6 + 1/6 = 0 summing to zero.  This section adds the bill for the solve, and
  the safe way to run it.

    build X^T X: O(n*d^2)   <- n rows, each d^2 multiplications
    solve for beta: O(d^3)
    total: O(n*d^2 + d^3)

  Pencil that bill in clerk-steps on our pile.  With the ones column, d = 8 + 1 = 9, so
  each row costs 9 x 9 = 81 multiplications; the working pile's 20,640 x 8/10 = 16,512
  rows cost 16,512 x 81 = 1,337,472 strokes to build X^T X, and the solve adds about
  9 x 9 x 9 = 729 more.  A room of clerks clears it in a day or two -- ONCE.  After that
  every guess is 8 multiplications + 8 additions = 16 strokes: one clerk, one breath.

  Do NOT compute (X^T X)^-1 by flipping the sheet directly. Forming X^T X squares the
  shakiness number (kappa(X^T X) = kappa(X)^2) and loses precision. Instead:

    - Solve the flat-point equations directly (Cholesky cut of X^T X), or better,
    - Cut X itself with QR or SVD and solve without ever forming X^T X. The toolbox's
      LinearRegression uses an SVD-based approach.
    - For very many rows or a stream, roll downhill a little at a time: O(n*d) per step
      instead of a full cut.
    - If X^T X is flat or shaky (one column is a near-recipe of another), add a small bump
      to the diagonal -- the ridge rule: beta = (X^T X + lambda*I)^-1 X^T y -- solvable
      for any lambda > 0.

  >> NOTE: SAME-RULER CONTRAST WITH ASK-CLOSEST RULE
     The straight-stick rule does NOT care which ruler you used. Rescale a column and its
     dial rescales inversely -- guesses and RMSE stay the same. (The ask-closest rule
     breaks without same-ruler treatment.) You DO need same-ruler when rolling downhill or
     when adding a bump, because those care about the size of the dials, which depends on
     scale.


  ## Ask-Closest vs. Straight-Stick

  IN HAND: both rules of this chapter, built and checked.  Ask-closest keeps the whole
  working pile (20,640 x 8/10 = 16,512 rows) and pays 16,512 x 24 = 396,288 strokes per
  guess; straight-stick keeps 8 + 1 = 9 dials and pays 8 multiplications + 8 additions
  = 16 strokes per guess.  This section adds the two temperaments side by side.

                          Ask-Closest Rule (KNN)      Straight-Stick Rule (OLS)
    ------------------    ------------------------    --------------------------
    What it stores        whole pile                  d+1 dials only
    Build cost            O(1) (store only)           O(n*d^2 + d^3) (solve)
    Guess cost            O(n*d) per row              O(d) per row
    Setting by hand       k (by rotating folds)       none (lambda if adding bump)
    Same-ruler needed?    yes -- required             no -- not for plain rule
    Guess surface         piecewise-flat, local       one flat sheet, global

  Two rules, opposite temperaments, one question: given the working rows, guess the right
  answer for a new one. The ask-closest rule leans on stored neighbours and does its
  thinking at the last second; the straight-stick rule does its thinking up front and
  compresses everything into one flat sheet. Neither is "the" answer -- which to reach for
  depends on the pile in front of you, and that judgement is the real craft.


  ## Going Deeper

    - flat-point equations, ordinary least squares -- the solve
    - QR / SVD least squares, shakiness number (condition number) -- the numerics
    - flat-shadow thrower (hat matrix), right-angle leftovers -- the geometry
    - Gauss-Markov, uneven miss spread -- the assumptions
    - ridge, lasso -- when columns are near-recipes of each other

    1. Rule: yhat = X*beta; set dials by shrinking ||X*beta - y||^2.
    2. Bowl-shaped mistake -> exact dials from beta = (X^T X)^-1 X^T y.
    3. Leftovers sit at right-angles to measured columns; yhat = H*y is a flat-sheet
       shadow.
    4. Straight-stick rule is BLUE under Gauss-Markov; bell-curve leftovers needed only
       for tests.
     5. Cut the sheet (QR/SVD), don't flip it; add a bump when X^T X is flat.


  ## Common Tripwires I Caught

    TRIPWIRE 1:  MSE has NO root in Lab 2
       WRONG: tack on ** 0.5 to mean_squared_error(...).
       RIGHT: MSE = miss -> square -> mean -> STOP.  No root.
              RMSE has the root.  MSE does not.  Read the name.

    TRIPWIRE 2:  leftover is the BAD part, not the good part
       WRONG: "leftover = 2, that's the stick's score."
       RIGHT: leftover is the misery STILL LEFT after the stick tried.
              The good part is total wobble - leftover.
              R^2 = (total wobble - leftover) / total wobble.

    TRIPWIRE 3:  R^2 divide by the WHOLE BAR, not the count
       WRONG: R^2 = (total wobble - leftover) / 5  (row count).
       RIGHT: R^2 = (total wobble - leftover) / total wobble.
              The count cancels, which is WHY R^2 has no units.

    TRIPWIRE 4:  R^2 always rises when you add a column -- on the training
                 pile only
       WRONG: "I added a noise column and R^2 went up -- the machine is better."
       RIGHT: On the training pile, the old fit is still available, so it can
              only do at least as well.  Use the TEST R^2 or adjusted R^2.

    TRIPWIRE 5:  R^2 can go negative on the test pile
       WRONG: "R^2 < 0 must be a bug."
       RIGHT: On unseen rows, no right-angle guarantee holds.  The leftover
              CAN exceed the wobble.  R^2 = 1 - RSS/TSS can dip below 0.

    TRIPWIRE 6:  "Most negative" dragging column = MIN, not MAX
       WRONG: -0.04 is bigger than -0.007 by the number 4 vs 7.
       RIGHT: -0.04 is further below zero = SMALLER = most negative.
              min(dials, key=dials.get) returns the column NAME.

    TRIPWIRE 7:  Raw dials are NOT comparable across columns (different
                 units)
       WRONG: compare -0.007 per pound directly with -0.04 per horsepower.
       RIGHT: Put every column on one shared ruler first (StandardScaler).
              Then each dial means "answer-units per one-spread step."


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  The whole pencil derivation above is what the toolbox does for you in four lines. Notice
  there is no setting to tune and no rotating-fold hunt here -- the dials come straight from
  the exact formula, so LinearRegression simply solves and stops.

  >> NEW TO PYTHON? One new piece here:
       thing.coef_      -- a value the object stores after being fitted; sklearn marks
                           every learned-from-data value with a trailing underscore

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    m = LinearRegression()          # SVD-based flat-point solve under the hood
    m.fit(X_train, y_train)         # sets dials: X^T X beta = X^T y
    pred = m.predict(X_test)        # one dot-product per row

    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(round(rmse, 3), m.coef_, m.intercept_)

----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 1 -- Predicting House Prices):
    Part 1 -- The Full Picture .
    Part 2 -- Ask-Closest Rule .
    Part 3 (this post)

  Next chapter: Chapter 2 -- Grading a Guesser (MSE, R^2, reading the dials)
  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
  home . source on GitHub
==============================================================================================