Guessing House Prices, End to End

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================

  CHAPTER 1 . PREDICTING HOUSE PRICES . PART 1 OF 3
  Guessing House Prices, End to End (No Labels Until the End)
  Posted: 2026-06-01 . Author: Rahul Rai . Tags: fundamentals, supervised-learning, knn
  ============================================================================================

  PATH . post 1 of 28  (you are at the start -- read top to bottom)
       next:  House Prices 2: Ask the Closest Rows ->

  Welcome -- this is where the blog begins, and it asks nothing of you but curiosity
  and a pencil. No prior knowledge, no jargon you have not earned, no hand-waving past
  the hard parts.

  Here is the promise. By the end of this single post you will have built -- by hand, on
  a real sheet of California house prices -- a machine that guesses the price of a house
  it has never seen. Not a toy version. The same machinery the textbooks dress up in
  Greek letters, except we build it first and name it last. Parts 2 and 3 then take the
  two rules apart screw by screw. For now, stand back and see the whole thing at once.

  ** KEY: THE 1950 CONTRACT
     Pretend it is 1950.  Print these pages.  Pencils, graph paper, a blackboard --
     and a room of infinite, tireless clerks for the heavy arithmetic.  No computer,
     no calculator, anywhere in the teaching.  Every number is recomputed where it is
     needed (no page assumes you remember the last one), every worked example is
     followed by a YOUR TURN drill for your slate, and every cost is counted in
     clerk-steps.  Your attention is the scarce thing; arithmetic is free.

  >> HOW THIS BOOK IS LAID OUT (read once, then forget)
     This blog is really a short book in six chapters (twenty-one posts), meant to be
     read top to bottom. Every post draws the idea by hand first -- plain words, a pencil
     sketch, simple arithmetic. Any runnable Python is gathered into a "The Code, If You
     Want It" section at the very END of each post, so you can follow the whole story
     without ever stepping over code. That code assumes almost no Python: the handful of
     things you need are explained in one short line the first time they appear. Skip the
     code or study it -- the idea stands on its own either way.


  ## The Job: Guess From a Sheet of Numbers

    rooms  income  age  ocean  people  |  PRICE   <- right-answer column
    --------------------------------------------------------
     4.2    3.5    25    1.0    320    |   1.4
     5.1    5.6    10    0.5    210    |   2.8
     3.8    2.1    42    3.2    890    |   0.9
    +- 5 of 8 measured columns shown -+|  the answer

  20,640 rows. Each row is one California neighbourhood. 8 measured columns. One
  right-answer column: median house price in $100,000s. The job sounds simple and hides
  the whole game: build a rule that guesses the right answer for a row it has NEVER seen
  -- not a row it already studied. Anyone can memorise answers they have been shown. The
  trick is being right about the houses you have not.

  !! WARN: .shape IS A PAIR, NOT A PRODUCT
     df.shape returns (n_rows, n_cols) in that fixed order -- (20640, 9) -- never their
     product.


  ## The One Rule: Hide Some Rows

    ALL ROWS
    ################----
     working pile (80%)   hidden pile (20%)
                          ^ locked away -- read once, at the very end

  Before touching anything else, do the one thing every honest guesser does first: set
  aside a random 20% of rows and lock them in a drawer. Their right answers must not
  touch any decision -- not the rule, not any setting, not even how the numbers are
  prepared. They exist for a single honest check, taken once, at the very end.

  Why be so strict? The reason is mechanical, not moral. Any rule checked on the same
  rows it was built from reports a flattering, false-low mistake -- like grading your own
  exam with the answer key open. The ask-closest rule at k=1 (a rule we build a few
  sections down: guess by copying the answer of the single most similar row) makes ZERO
  mistakes on its own working pile -- every row's most similar row is ITSELF, so it copies
  its own answer -- and then stumbles on new rows. The drawer is the only number that does
  not lie.

  ** KEY: SHUFFLE FIRST
     Rows often arrive sorted by district. Shuffle with a fixed starting point so the
     split is reproducible and not ordered. The fixed point controls reproducibility; the
     shuffle removes order bias.


  ## Rule 1 -- Ask the Closest Rows

  IN HAND: a sheet of 20,640 rows -- 8 measured columns plus one right-answer column
  (price).  Cut 80/20: working pile = 20,640 x 8/10 = 16,512 rows on the desk; hidden
  pile = 20,640 - 16,512 = 4,128 rows locked in the drawer.  Now the first guessing rule.

    new row ?
        |
    measure gap to every working-pile row
        |
    pick the k closest
        |
    average their right-answers   ->   guess

      gap(a, b) = sqrt( sum_j (a_j - b_j)^2 )

  One honest IOU, named out loud: this formula is handed to you here, not derived.
  WHY square-then-root measures distance (it is the ruler-on-graph-paper distance,
  by Pythagoras) and what goes wrong with it when columns wear different units is
  exactly Part 2's first job.  For this post, watch it work on small numbers:

  A concrete 2-column example, by pencil.  New row has (rooms=4, income=3).  Compare
  to one working row (rooms=5, income=3):

    column     a_j (new)   b_j (work)   gap_j = a_j - b_j   gap_j^2
    -----------------------------------------------------------------
    rooms      4           5            -1                   1
    income     3           3             0                   0
                                                     sum = 1

    gap = sqrt(1) = 1

    Now compare the same new row to another working row (rooms=8, income=9):

    column     a_j (new)   b_j (work)   gap_j = a_j - b_j   gap_j^2
    -----------------------------------------------------------------
    rooms      4           8            -4                  16
    income     3           9            -6                  36
                                                     sum = 52

    gap = sqrt(52) ~ 7.21

    The first working row (gap=1) is much closer than the second (gap=7.21).
    If k=1, the guess is the first row's price.  If k=3, we average the 3
    smallest-gap rows' prices.

  >> YOUR TURN (one clerk, one slate)
     Same new row (rooms=4, income=3).  A third working row sits at (rooms=4,
     income=7).  Compute its gap before reading on.

     check your slate:  rooms 4-4 = 0, square 0;  income 3-7 = -4, square 16;
     sum 0 + 16 = 16;  gap = sqrt(16) = 4.  Of the three working rows the closest
     is still the first (1 < 4 < 7.21) -- at k=1 the guess is that row's price.

  This rule barely deserves the name "machine" -- it builds nothing ahead of time. It
  just keeps the whole working pile on a shelf and, when a new house walks in, finds the
  ones most like it and averages what they sold for. The one knob to set by hand is k:
  how many neighbours to ask. Small k -> too jumpy; large k -> too stiff. The full
  derivation, and how to find the sweet spot, is in Part 2.

  Count the clerk-steps for ONE guess, all 8 columns this time: per working row, 8
  subtractions, 8 squarings, 7 additions, 1 root = 24 strokes; times 16,512 working
  rows = 16,512 x 24 = 396,288 strokes -- call it four hundred thousand pencil strokes
  for a single new house.  A room of clerks does it by lunch, but remember the bill:
  it comes due every time another house walks in.


  ## Measuring How Wrong: RMSE

  A guesser is only as good as its misses are small -- so before trusting any rule we
  need an honest way to size up how wrong it is. Leftover for one row: r_i = y_i - yhat_i
  (right answer minus guess).

      RMSE = sqrt( (1/n) sum_i r_i^2 )

  A concrete 4-person worked example, by pencil:

    person   truth y    guess yhat    miss r     r^2
    --------------------------------------------------
      A       1.4       1.2          +0.2       0.04
      B       2.8       3.1          -0.3       0.09
      C       0.9       1.0          -0.1       0.01
      D       3.5       3.2          +0.3       0.09
                                       sum r^2 = 0.23

    mean r^2 = 0.23 / 4 = 0.0575
    RMSE = sqrt(0.0575) ~ 0.24

    Typical miss: about $24,000.  The bar to beat is the FOOL who always shouts the
    average.  Work the fool out on the same 4 people.  His one guess: average y =
    (1.4+2.8+0.9+3.5)/4 = 8.6/4 = 2.15, shouted four times:

    person   truth y    fool's guess    miss r     r^2
    ----------------------------------------------------
      A       1.4       2.15           -0.75      0.5625
      B       2.8       2.15           +0.65      0.4225
      C       0.9       2.15           -1.25      1.5625
      D       3.5       2.15           +1.35      1.8225
                                        sum r^2 = 4.3700

    mean r^2 = 4.37 / 4 = 1.0925        fool's RMSE = sqrt(1.0925) ~ 1.05

    Notice what we just computed: squared gaps from the average, averaged, rooted --
    that is the SPREAD of the truth column itself.  The fool's RMSE and std(y) are
    the same recipe; they cannot help but be the same number.  (Set the clerks on the
    full 20,640-row column with these exact steps and the fool lands near 1.15.)
    On these four people, our rule at RMSE = 0.24 beats the fool's 1.05 more than four
    times over.

  >> YOUR TURN
     Person E walks in: truth 2.0, our rule guesses 2.4.  Add him to OUR table
     (sum r^2 was 0.23 over 4 people) and recompute the rule's RMSE.

     check your slate:  miss r = 2.0 - 2.4 = -0.4;  r^2 = 0.16;  new sum = 0.23 +
     0.16 = 0.39;  mean = 0.39 / 5 = 0.078;  RMSE = sqrt(0.078) ~ 0.28.  One sloppy
     guess dragged the typical miss from $24,000 up to $28,000 -- squaring makes big
     misses shout.

  RMSE punishes large single misses more heavily than the plain average of absolute
  leftovers, MAE = mean|r_i|.  Check: person B missed by -0.3, squared to 0.09.
  A single double-sized miss of -0.6 would square to 0.36 -- four times the weight.


  ## Rule 2 -- The Straight-Stick Rule

  IN HAND: one guessing rule (ask-closest: measure gaps, average the k nearest answers)
  and one honest ruler (RMSE: miss -> square -> mean -> root).  Now a second rule of the
  opposite temperament.

    d+1 dials:  [nudge b0,  dial1 b1,  dial2 b2, ...]

    guess = b0 + b1*(rooms) + b2*(income) + ...
          = X*beta   (one dot-product per row; X carries a leading column of 1s for b0)

  Here d is the number of measured columns (8 on our sheet, so 9 dials). Where the
  ask-closest rule hoards every row, this one squeezes the whole working pile into just
  d+1 dials, throws the pile away, and from then on each guess is a single dot-product.
  The dials are not searched for -- they drop out of an exact formula. Full derivation in
  Part 3.


  ## Picking Settings Without Peeking at the Hidden Pile

  IN HAND: two rules (ask-closest with its knob k; straight-stick with its d+1 dials),
  one ruler (RMSE), and one drawer that must stay shut.  Left to do: pick k -- without
  the drawer.

    working pile split into 5 chunks:  A  B  C  D  E

    round 1  [#]  .    .    .    .    -> mistake1
    round 2  .   [#]   .    .    .    -> mistake2
    round 3  .    .   [#]   .    .    -> mistake3
    round 4  .    .    .   [#]   .    -> mistake4
    round 5  .    .    .    .   [#]   -> mistake5
                               average -> honest estimate

  Each round, one chunk grades the rule built on the other four. Rotate until every chunk
  has graded once. Average the five mistakes. This gives an honest estimate without ever
  touching the hidden pile.

  !! WARN: PREPARE NUMBERS INSIDE EACH ROUND
     Any step that learns from the numbers (an average, a spread) must be done on each
     round's building portion only. Doing it on the whole working pile before splitting
     leaks grading-chunk numbers into the build step. A bound pipeline enforces this
     automatically (Part 2).

  That single discipline -- never let the drawer influence a choice -- is the spine the
  whole blog hangs on. Procedure: pick settings using rotating folds on the working pile
  -> refit the winner on all of the working pile -> read the hidden pile ONCE -> report
  that number, and live with it.

  >> NOTE: ISN'T CHECKING THE HIDDEN PILE PEEKING?
     No -- using it to pick among rules is peeking; using it ONCE to report the chosen
     rule's mistake is exactly what it is for. If the hidden-pile mistake comes out
     noticeably larger than the rotating-fold estimate, the rule (or the settings picked
     for it) memorised the working pile a little.


  ## The Whole Blog in Six Lines

  Everything above, boiled down to six lines. Pin them somewhere; every later post is one
  of these lines, slowed down and worked out in full.

    1. Find a rule that guesses right on rows never seen; measure on the hidden pile only.
    2. Ask-closest rule -- store the pile, look it up; put columns on the same ruler;
       pick k.
    3. Straight-stick rule -- d+1 dials, exact formula, one dot-product per guess.
    4. Measure mistake with RMSE (same units as price); the bar to beat is std(y).
    5. Pick settings by rotating folds, preparing numbers inside each round.
    6. Refit on all of working pile; read the hidden pile exactly once.


  ## Common Tripwires I Caught

    TRIPWIRE 1:  .shape is a PAIR, not a product
       WRONG: treat 20640 x 9 as 185760 (the product).
       RIGHT: df.shape returns (n_rows, n_cols) in that fixed order --
              (20640, 9) -- never their product.

    TRIPWIRE 2:  random_state must be the SAME number everywhere
       WRONG: use random_state=1 here and random_state=42 there.
       RIGHT: use RANDOM_STATE = 42 everywhere.  Same seed -> same shuffle
              -> your splits match the grader's splits.

    TRIPWIRE 3:  k (neighbour count) and fold count (grading rounds) are
                 unrelated
       WRONG: "k=25 with 5 folds sounds wrong."
       RIGHT: k is how many neighbours to ask; 5 folds is how many chunks
              to grade with.  Different jobs, different numbers.

    TRIPWIRE 4:  "Most negative" = MIN, not MAX
       WRONG: 0.04 looks bigger than 0.007, so pick -0.007 as the most negative.
       RIGHT: -0.04 is further below zero = SMALLER = most negative.
              min(dials, key=dials.get) returns the NAME of the smallest.

    TRIPWIRE 5:  The toolbox maximises, so it negates the mistake
       WRONG: read the raw scores as mistakes and pick the LOWEST (most negative) one.
       RIGHT: cross_val_score(..., scoring='neg_root_mean_squared_error')
              returns NEGATIVE numbers.  The "biggest" (closest to zero) is
              the smallest mistake.  Un-negate with -scores.mean().


  ## The Labels, Last

  Now -- and only now that you have built the thing -- here are the proper names. Every
  post on this blog ends this way: plain words on the left, the textbook term on the
  right. Practice it first; the label is just a handle to carry it by.

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    sheet of written-down numbers         dataset / design matrix X
    one measured column                   feature
    right-answer column                   target / label y
    working pile                          training set
    hidden pile                           test set
    store the pile, look it up            non-parametric / instance-based
    ask-closest rule                      k-nearest neighbours (KNN)
    straight-stick rule                   linear regression (OLS)
    dials                                 weights / coefficients beta
    fixed nudge                           intercept / bias beta0
    put columns on the same ruler         standardisation / normalisation
    leftover / mistake (RMSE)             residual / loss / error
    rotating folds                        k-fold cross-validation
    memorising the working pile           overfitting
    mistake on rows never seen            generalisation error

  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same six steps, spoken in Python.

  >> NEW TO PYTHON? Each named once:
       fetch_california_housing()          -- loads the sheet (no file needed)
       train_test_split()                  -- carves off the hidden 20%
       StandardScaler()                    -- puts every column on the same ruler
       KNeighborsRegressor(n_neighbors=k)  -- the ask-closest machine
       LinearRegression()                  -- the straight-stick machine
       Pipeline([...])                     -- binds the ruler step to the rule, so each
                                              round prepares its own numbers
       KFold()                             -- cuts the working pile into the rotating chunks
       cross_val_score()                   -- rotating-folds scorer (negated RMSE with the
                                              scoring used below)
       np.sqrt()                           -- square root
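
  Several of the tripwires above are really about reading Python's output, so here is a
  minimal sketch of three of them before the full script. Everything in it is a stand-in
  invented for illustration: the empty `sheet`, the `dials` values and the `neg_scores`
  numbers are hypothetical, not results from the real data.

```python
import numpy as np

# Stand-in for the full sheet: 20,640 rows, 8 measured columns + 1 price column.
sheet = np.zeros((20640, 9))
n_rows, n_cols = sheet.shape        # .shape is a pair, in (rows, columns) order
print(n_rows, n_cols)

# Hypothetical dial values, made up for this sketch.
dials = {"rooms": -0.04, "income": 0.8, "age": -0.007}
most_negative = min(dials, key=dials.get)   # NAME of the smallest value
print(most_negative)

# Hypothetical negated-RMSE averages, one per candidate k.
neg_scores = {5: -0.66, 25: -0.64, 50: -0.65}
best_k = max(neg_scores, key=neg_scores.get)   # closest to zero = smallest mistake
best_rmse = -neg_scores[best_k]                # un-negate to read it as a mistake
print(best_k, best_rmse)
```

  The full six-step script follows.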

    import numpy as np
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split, cross_val_score, KFold
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline

    RANDOM_STATE = 42   # same seed everywhere so your splits match mine

    # --- load the sheet ---
    data = fetch_california_housing()
    X, y = data.data, data.target     # X is (20640, 8), y is (20640,)
    print(X.shape, y.shape)

    # --- carve off the hidden 20% ---
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE)
    print(f"working pile: {X_train.shape[0]} rows")
    print(f"hidden pile:  {X_test.shape[0]} rows")

    # --- the always-average fool (baseline) ---
    # spread of the working-pile prices = the fool's RMSE on the working pile
    baseline_rmse = np.std(y_train)
    print(f"baseline RMSE (fool): {baseline_rmse:.4f}")

    # --- pick k by rotating folds (never touch the hidden pile here) ---
    cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
    cv_rmse = {}
    for k in [5, 10, 25, 50]:
        pipe = Pipeline([("scale", StandardScaler()),
                         ("knn",   KNeighborsRegressor(n_neighbors=k))])
        scores = cross_val_score(pipe, X_train, y_train,
                                 cv=cv,
                                 scoring="neg_root_mean_squared_error")
        # scores are negative -- closest to zero is the smallest mistake
        cv_rmse[k] = -scores.mean()
        print(f"k={k:3d}  CV RMSE = {cv_rmse[k]:.4f}")
    best_k = min(cv_rmse, key=cv_rmse.get)   # k with the smallest CV RMSE

    # --- ask-closest rule: refit the winner on all of the working pile ---
    knn_pipe = Pipeline([
        ("scale",  StandardScaler()),
        ("knn",    KNeighborsRegressor(n_neighbors=best_k))
    ])
    knn_pipe.fit(X_train, y_train)

    # --- straight-stick rule ---
    lr_pipe = Pipeline([
        ("scale", StandardScaler()),
        ("lr",    LinearRegression())
    ])
    lr_pipe.fit(X_train, y_train)

    # --- open the drawer once, at the very end ---
    knn_preds = knn_pipe.predict(X_test)
    knn_rmse  = np.sqrt(np.mean((y_test - knn_preds)**2))
    print(f"KNN RMSE (test, k={best_k}): {knn_rmse:.4f}")
    lr_preds = lr_pipe.predict(X_test)
    lr_rmse  = np.sqrt(np.mean((y_test - lr_preds)**2))
    print(f"Linear RMSE (test): {lr_rmse:.4f}")
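
  To check the pencil work from Rule 1, here is a small sketch that recomputes the three
  gaps with NumPy and then averages the closest prices. The (rooms, income) pairs come
  from the tables in that section; the three prices are made up purely so there is
  something to average, and `ask_closest` is a helper name invented for this sketch.

```python
import numpy as np

# (rooms, income) for the new row and the three working rows from the Rule 1 tables.
new_row = np.array([4.0, 3.0])
working = np.array([
    [5.0, 3.0],   # first working row
    [8.0, 9.0],   # second working row
    [4.0, 7.0],   # the YOUR TURN row
])
# Hypothetical prices (in $100,000s), made up for this sketch only.
prices = np.array([1.4, 2.8, 0.9])

# gap(a, b) = sqrt( sum_j (a_j - b_j)^2 ), for every working row at once.
gaps = np.sqrt(((working - new_row) ** 2).sum(axis=1))
print("gaps:", np.round(gaps, 2))   # expect 1, about 7.21, and 4

def ask_closest(k):
    """Average the prices of the k smallest-gap working rows."""
    nearest = np.argsort(gaps)[:k]
    return prices[nearest].mean()

print("k=1 guess:", ask_closest(1))   # copies the first working row's price
print("k=2 guess:", ask_closest(2))   # averages the first and third rows

# Memorising check: measured against its own pile, every working row's closest
# row is itself (gap 0), so at k=1 the rule copies each row's own price.
pairwise = np.sqrt(((working[:, None, :] - working[None, :, :]) ** 2).sum(axis=2))
self_guess = prices[pairwise.argmin(axis=1)]
print("k=1 on its own pile:", self_guess)   # identical to prices
```

  The last check is the zero-mistake trap from "The One Rule: Hide Some Rows" in
  miniature: a perfect score on the working pile says nothing about new rows.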

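  The RMSE tables can be checked the same way. This sketch uses only the four people
  (plus person E) from "Measuring How Wrong: RMSE"; the function name `rmse` is my own
  choice for the sketch. It also shows that the fool's RMSE and the spread of the truth
  column are one number, since np.std divides by n by default.

```python
import numpy as np

def rmse(y, yhat):
    """Miss -> square -> mean -> root."""
    return np.sqrt(np.mean((y - yhat) ** 2))

truth = np.array([1.4, 2.8, 0.9, 3.5])   # persons A to D
guess = np.array([1.2, 3.1, 1.0, 3.2])   # our rule's guesses

rule_rmse = rmse(truth, guess)           # about 0.24
fool_rmse = rmse(truth, truth.mean())    # the fool shouts the average: about 1.05
spread = np.std(truth)                   # same recipe, so the same number
print(rule_rmse, fool_rmse, spread)

# YOUR TURN: person E walks in (truth 2.0, guess 2.4).
rule_rmse_5 = rmse(np.append(truth, 2.0), np.append(guess, 2.4))   # about 0.28
print(rule_rmse_5)

# MAE for comparison: the plain average of absolute leftovers.
mae = np.mean(np.abs(truth - guess))
print(mae)
```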

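  Finally, a rough sketch of the straight-stick rule on a tiny pile, assuming one
  standard way to get the dials: NumPy's least-squares solver, np.linalg.lstsq. The five
  rows and the "true" dials are invented so the recovery can be checked; Part 3's own
  derivation may present the exact formula differently.

```python
import numpy as np

# Hypothetical working pile: 5 rows, 2 measured columns (rooms, income).
X = np.array([[4.0, 3.0], [5.0, 3.0], [8.0, 9.0], [4.0, 7.0], [6.0, 5.0]])

# Leading column of 1s, so the nudge b0 rides along as just another dial.
X1 = np.column_stack([np.ones(len(X)), X])

# Prices built from known dials (b0=0.5, b1=0.1, b2=0.3), made up for the check.
true_beta = np.array([0.5, 0.1, 0.3])
y = X1 @ true_beta

# Solve for the d+1 dials in one shot -- no searching, no loop over candidates.
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.round(beta, 4))

# From here on the pile is not needed: each guess is a single dot-product.
new_row = np.array([1.0, 4.0, 3.0])    # 1 for the nudge, then rooms=4, income=3
guess = new_row @ beta                 # 0.5 + 0.1*4 + 0.3*3 = 1.8
print(guess)
```
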
----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 1 -- Predicting House Prices):
    Part 1 (this post) .
    Part 2 -- Ask-Closest Rule .
    Part 3 -- Straight-Stick Rule

----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================