==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  CHAPTER 1 . PREDICTING HOUSE PRICES . PART 2 OF 3
  Ask the Closest Rows: Gap Formula, Same-Ruler Rule, and How Slow It Gets
  Posted: 2026-06-01 . Author: Rahul Rai . Tags: knn, distance, non-parametric, complexity
  ============================================================================================

  PATH . post 2 of 28
    <- prev:  House Prices 1: Guessing Prices End to End
       next:  House Prices 3: The Straight-Stick Rule ->

  Here is the oldest trick in the book, and one you already use every day: to guess
  something about a stranger, look at the people most like them. Want to know what a house
  is worth? Find the houses most similar and see what they sold for. That is the whole
  idea, and it barely needs a machine.

  The ask-closest rule guesses by averaging the k closest working-pile right-answers.
  Nothing is built ahead of time -- the whole pile is kept on a shelf and looked up each
  time. Three honest steps -- measure the gap, pick the closest, average their answers.
  But two quiet details decide whether it works at all: put every column on the same
  ruler, and choose k with care. Get those wrong and the rule falls apart; get them right
  and it is hard to beat.


  ## Working Pile and Hidden Pile

    WORKING PILE (rows with answers)        HIDDEN PILE (answers withheld)
    rooms income age pop | price            rooms income age pop | price
    -----------------------------           -----------------------------
      4    3.2   6   320 |  3.4               3    3.0   5  410 |  ?
      8    1.5   1   880 |  1.5              ...                 |  ?

  Gaps are measured from the measured columns only. Hidden right-answers are read once,
  at scoring -- never during guessing.


  ## Measuring the Gap

  Straight-line gap over all measured columns:

      gap(a, b) = sqrt( sum_j (a_j - b_j)^2 )

  This is p = 2 of the general rule ( sum_j |a_j - b_j|^p )^(1/p). At p = 1 it counts
  straight city-block steps (add up the plain gaps column by column, no squaring or root --
  like walking a grid of streets, never cutting diagonally). Worked pair -- new row (3, 2, 5, 4):

    row A: (4, 2, 6, 3)   gaps (1, 0, 1, 1)   gap = sqrt(1+0+1+1)    = sqrt(3)   ~= 1.73
    row B: (8, 9, 1, 9)   gaps (5, 7, 4, 5)   gap = sqrt(25+49+16+25)= sqrt(115) ~= 10.72

  >> YOUR TURN
     Same new row (3, 2, 5, 4).  A row C sits at (3, 4, 5, 2) (made-up).  Compute its
     gap before reading on.

     check your slate:  gaps 3-3 = 0, 2-4 = -2, 5-5 = 0, 4-2 = 2;  squares 0, 4, 0, 4;
     sum 0 + 4 + 0 + 4 = 8;  gap = sqrt(8) ~= 2.83.  Row C lands between A (sqrt(3) ~=
     1.73) and B (sqrt(115) ~= 10.72) -- a middling neighbour, nowhere near the closest.


  ## Put Columns on the Same Ruler First

  IN HAND: one gap rule -- subtract column by column, square, add, root -- worked on a
  new row (3, 2, 5, 4): row A sat at sqrt(1+0+1+1) = sqrt(3) ~= 1.73, row B at
  sqrt(25+49+16+25) = sqrt(115) ~= 10.72, so A is the neighbour.  This section adds the
  trap hiding inside that sum: columns measured on different rulers.

  Here is the trap that sinks most first attempts. The gap adds up squared per-column
  differences -- and a column measured in thousands (people) simply drowns out one
  measured in single digits (rooms). Not because population matters more, but because its
  numbers are bigger. An accident of units masquerades as importance. On the raw
  California pile, the proud "8-column" rule is secretly a 1-column rule that only ever
  listens to population.

    raw columns:
      rooms  ~   2 - 12     <- tiny gaps
      people ~ 100 - 35000  <- huge gaps  <- this column runs the whole show

  Fix: shift each column to zero average and stretch it to spread of 1, using numbers
  from the working pile only:

      z_j = (x_j - mu_j) / sd_j

  A concrete 3-row, 2-column example, by pencil.  Working pile has 3 rows:

    row       rooms (x1)    people (x2)
    ------------------------------------
      P           4           320
      Q           5           210
      R           6           890

    Column averages (mu):
      mu1 = (4 + 5 + 6) / 3     = 15/3 = 5
      mu2 = (320 + 210 + 890) / 3 = 1420/3 ~ 473.3

    Column spreads (sd -- how far each column's numbers typically sit from its average):
      sd1 = sqrt( ((4-5)^2 + (5-5)^2 + (6-5)^2) / 3 )
          = sqrt( (1 + 0 + 1) / 3 )  = sqrt(2/3)  ~ 0.816
      sd2 = sqrt( ((320-473.3)^2 + (210-473.3)^2 + (890-473.3)^2) / 3 )
          = sqrt( (23509 + 69389 + 173609) / 3 )
          = sqrt( 266507 / 3 ) = sqrt(88836) ~ 298.1

    Now shift each row:
      row P:  z1 = (4 - 5) / 0.816    = -1.225
              z2 = (320 - 473.3) / 298.1 = -0.514
      row Q:  z1 = (5 - 5) / 0.816    = 0
              z2 = (210 - 473.3) / 298.1 = -0.884
      row R:  z1 = (6 - 5) / 0.816    = 1.225
              z2 = (890 - 473.3) / 298.1 = 1.398

    After scaling, every column is centred at 0 and stretches to roughly -1.4 to +1.4.
    Neither column dominates the gap by accident of its raw unit size.

  Compute mu_j, sd_j on the working pile; apply the same constants to the hidden pile.
  After this, every column contributes equally to the gap. (Squeezing to [0,1] instead of
  unit spread is an alternative that works the same way.)

  !! WARN: LEARN THE RULER ON THE WORKING PILE ONLY
     Computing averages and spreads on the whole pile before splitting leaks hidden-pile
     numbers into the build step. Learn on working, apply to both. Inside rotating folds,
     relearn per round -- the bound pipeline does this.


  ## Guessing One Row

    1. measure gap to every working-pile row          O(n*d)
    2. pick the k smallest gaps                        O(n) with a heap
    3. average those k right-answers  ->  guess

  An optional refinement weights closer rows more -- inverse-gap weights w_i = 1/gap_i:

      guess = ( sum_i w_i * y_i ) / ( sum_i w_i )

  Equal weights (w_i = 1) recover the plain average.


  ## How Slow Is It

  The ask-closest rule is generous at study time and stingy at guess time -- it does no
  work up front, then pays for it on every single question. The pile is stored, never
  compressed, so every guess recomputes all gaps from scratch.

    100 rows = 80 working + 20 hidden
    work per hidden row  = 80 * d  gap computations
    total work           = 20 * 80 * d
    time grows like people times columns

  Count that bill in pencil strokes.  In the toy above, one gap over d = 4 columns costs
  4 subtractions, 4 squarings, 3 additions, 1 root = 12 strokes; 20 hidden rows x 80
  working rows = 1,600 gaps, and 1,600 x 12 = 19,200 strokes -- one clerk's morning.  On
  the full pile (working = 20,640 x 8/10 = 16,512 rows; hidden = 20,640 - 16,512 = 4,128),
  one guess over 8 columns costs 8 + 8 + 7 + 1 = 24 strokes per row, 16,512 x 24 = 396,288
  in all -- and grading every hidden row runs to 4,128 x 396,288 = 1,635,876,864 strokes.
  The room of clerks does NOT finish by lunch.

  Sorting shelves (KD-tree, Ball-tree) cut average guess time to roughly
  "time grows like columns times log of people" when
  columns are few, but degrade toward brute force as columns pile up. Here is why:

    few columns:   nearby rows are truly nearby  <- sorting shelf helps
    many columns:  all rows sit about the same gap away
                   "closest" stops meaning anything


  ## Worked by Hand

  IN HAND: the gap rule (subtract column by column, square, add, root) and the same-ruler
  fix z = (x - mu) / sd, with mu and sd learned on the working pile only.  Each guess
  costs one gap per working row, then the average of the k closest right-answers.  This
  section adds the whole rule, run end to end on a pile small enough for one slate.

  One column (income) only, so the same-ruler step is moot here. New row at income 5.0,
  k = 3:

    working pile (income, price):
      (5.6, 3.4)  (4.0, 2.7)  (3.8, 2.4)  (2.5, 1.5)  (7.3, 3.5)

    gaps from 5.0:
      |5.0 - 5.6| = 0.6 <-   |5.0 - 4.0| = 1.0 <-   |5.0 - 3.8| = 1.2 <-
      |5.0 - 2.5| = 2.5      |5.0 - 7.3| = 2.3

    3 closest right-answers: 3.4, 2.7, 2.4
    guess = (3.4 + 2.7 + 2.4) / 3 = 8.5 / 3 ~= 2.83

  >> YOUR TURN
     Same five-row pile.  A new row walks in at income 3.0 (made-up), k = 3.  Work the
     gaps, pick the 3 closest, average their prices.

     check your slate:  gaps |3.0-5.6| = 2.6, |3.0-4.0| = 1.0, |3.0-3.8| = 0.8,
     |3.0-2.5| = 0.5, |3.0-7.3| = 4.3;  3 smallest gaps 0.5, 0.8, 1.0 -> prices 1.5,
     2.4, 2.7;  guess = (1.5 + 2.4 + 2.7) / 3 = 6.6 / 3 = 2.2.  Move the question and
     the guess moves with it -- the rule simply follows whoever its neighbours are.


  ## k Is the Stiff-vs-Jumpy Dial

    k = 1       asks 1 row     -> copies its noise exactly
                zero mistake on working pile
                bad on new rows   (too jumpy)

    k = large   asks many rows -> smooths everything out
                always a bit wrong, everywhere  (too stiff)
                at k = n_working, every guess = average price

    right k  ->  somewhere in between -- found by rotating folds


  ## Picking k Without Touching the Hidden Pile

  IN HAND: the full rule, run once by hand -- five working rows, new row at income 5.0,
  three closest answers 3.4, 2.7, 2.4, guess (3.4 + 2.7 + 2.4) / 3 = 8.5 / 3 ~= 2.83 --
  and one loose knob, k: small k too jumpy, large k too stiff.  This section adds the
  honest way to set that knob.

  How do you find the right k without ever opening the hidden pile? The same rotating-folds
  trick from Part 1: try each candidate k, score it by rotating folds on the working pile,
  and keep the winner. Because k sets how jumpy the rule is, the fold scores fall to a
  sweet spot and then climb again -- and that dip is your k. The hunt itself, in code, is
  waiting at the end of the post.


  ## Origin

  The rule feels too simple to be serious, so it is worth knowing it has a pedigree.
  Evelyn Fix and Joseph Hodges wrote it down in 1951. Sixteen years later Thomas Cover and
  Peter Hart proved something startling about it: as the pile grows without bound, even
  the crude one-closest-row version makes at most TWICE as many mistakes as the best
  possible rule on Earth. Ask a single neighbour and you are already halfway to perfect.
  Not bad for a rule that does no thinking at all.

    1. Guess = average of the k closest working right-answers (optionally weighted).
    2. The gap adds squared per-column differences -- put columns on the same ruler,
       learning it on the working pile only.
    3. No building ahead; each guess costs O(n*d) and stores the whole pile.
    4. k sets the stiff-vs-jumpy balance; pick it by rotating folds.
     5. When columns pile up, close stops meaning close.


  ## Common Tripwires I Caught

    TRIPWIRE 1:  Shape slot -- rows vs columns
       WRONG: X.shape[1] gives rows.
       RIGHT: shape = (rows, cols).  X.shape[0] = rows, X.shape[1] = columns.
              Counting features = columns = X.shape[1].

    TRIPWIRE 2:  Bracket shape -- list vs bag
       WRONG: X = df[{'a','b','c'}]  (curly brackets = a bag, unordered).
       RIGHT: X = df[['a','b','c']]  (square brackets = a list, ordered).
              Outer [] = fetch from the sheet.  Inner [] = the list of names.

    TRIPWIRE 3:  random_state must be the SAME number everywhere
       WRONG: random_state=1 in one place and RANDOM_STATE in another.
       RIGHT: RANDOM_STATE = 42 everywhere.  Different numbers give different
              splits, so the hidden tests fail.

    TRIPWIRE 4:  k (neighbour count) and fold count (grading rounds) are
                 unrelated
       WRONG: "k=25 with 5 folds -- that can't be right."
       RIGHT: k=n_neighbors.  5 folds = n_splits.  Totally different knobs.

    TRIPWIRE 5:  The toolbox maximises, so it negates the mistake
       WRONG: pick the k with the highest raw score.
       RIGHT: cross_val_score returns NEGATIVE numbers (it maximises).
              "Biggest" = closest to zero = smallest mistake.
              Un-negate with -scores.mean().

    TRIPWIRE 6:  min(cv, key=cv.get) vs plain min(cv)
       WRONG: plain min(cv) on a shelf of {k: mistake} returns the smallest
              KEY (k=1), not the value (mistake).
       RIGHT: min(cv, key=cv.get) returns the KEY with the smallest VALUE.
              Get the name of the winner, not the smallest name alphabetically.


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  Everything above was pencil and pictures. Here is the same machine in Python, using the
  scikit-learn toolbox -- two snippets: the rule itself, and the hunt for the best k.

  >> NEW TO PYTHON? Each piece below, named once:
       from X import Y    -- fetch a ready-made tool named Y out of toolbox X
       name = value       -- store a value under a name
       thing.do(stuff)    -- tell an object to do something (a "method")
       def name(args):    -- define your own reusable step (a "function")
       [a, b, c]          -- a list: an ordered row of things
       {k: v for ...}     -- a dict in one line: a labelled shelf, name -> value

  Bind the ruler-learner to the ask-closest rule, so the ruler is relearned on each
  round's building portion and never on the grading chunk:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error

    knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
    knn.fit(X_train, y_train)          # stores pile; learns mu, sd on working rows
    pred = knn.predict(X_test)         # same ruler -> measure gaps -> average

    rmse = mean_squared_error(y_test, pred) ** 0.5   # ** 0.5 means square root
    print(round(rmse, 3))

  And the hunt for the best k -- try each candidate, score it by rotating folds, keep the
  winner. Passing the whole bound rule to the checker relearns the ruler each round, so
  nothing leaks:

    from sklearn.model_selection import KFold, cross_val_score

    def fold_mistake(k, X_train, y_train):
        rule = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
        kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
        scores = cross_val_score(rule, X_train, y_train, cv=kf,
                                 scoring='neg_root_mean_squared_error')
        return round(-scores.mean(), 3)

    candidates = [1, 3, 5, 7, 9, 15, 25]
    cv = {k: fold_mistake(k, X_train, y_train) for k in candidates}   # k -> its mistake
    best_k = min(cv, key=cv.get)    # the k whose mistake is smallest

  !! WARN: THREE THINGS THAT TRIP PEOPLE UP
     (1) k (how many neighbours) and the fold count (how many grading rounds) are
         unrelated -- k=25 with 5 folds is fine.
     (2) The toolbox maximises its score, so it negates the mistake; un-negate with
         -scores.mean().
     (3) min(cv, key=cv.get) returns the key with the smallest value; plain min(cv)
         returns the smallest key -- wrong answer.

----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 1 -- Predicting House Prices):
    Part 1 -- The Full Picture .
    Part 2 (this post) .
    Part 3 -- Straight-Stick Rule

  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
  home . source on GitHub
==============================================================================================