Question Charts: Building a Tree by Hand

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  CHAPTER 5 . QUESTION CHARTS AND COMMITTEES . PART 1 OF 3
  Question Charts: Building a Tree by Hand
  Posted: 2026-06-07 . Author: Rahul Rai . Tags: decision-tree, regression, classification
  ============================================================================================

  PATH . post 13 of 28
    <- prev:  Chapter 4, Part 3: The Dial by Hand
       next:  Chapter 5, Part 2: The Mixing Ruler ->

  The straight-stick rule lived by dials -- ten multipliers and a nudge. Every guess was
  a weighted sum. This post throws away the dials entirely and builds a machine with no
  formula, no multipliers, and no algebra. Just a sheet, a pencil, and one repeated
  question: "what is the CLEANEST place to draw a line?"

  A question chart is a stack of yes/no questions. A new patient answers them top to
  bottom and lands in one final pile, where the guess sits waiting. The whole engine is
  built by splitting people into ever-smaller groups until each group's answers are tight
  enough that their average is a good guess.


  ## The Sheet: Two New Piles

    diabetes -- 442 people, 10 body measurements, one answer
    patient   age   bmi   bp    s1 ... s6  |  answer
    --------------------------------------------------
     #1      0.04  0.06  0.02  ...         |   151    <- disease score one YEAR later
     #2     -0.01 -0.05 -0.03  ...         |    75
     ...                                   |   ...
    +--------------- 10 columns -----------+|  the answer

    cancer -- 569 lumps, 30 measurements, one answer
    lump   radius  texture  perimeter ...  |  target
    --------------------------------------------------
     #1     17.9    10.4    122.8    ...    |    0     <- sick(0) or well(1)
     #2     20.6    17.8    132.9    ...    |    0
     ...                                   |   ...

  Same job as always: chop 75/25, cover the 25%, build on the 75%, grade once.


  ## BUILD vs USE -- Two Different Times

  This is where the first confusion lives. There are two SEPARATE phases, and mixing
  them caused me grief for an hour:

    phase 1 . BUILD . happens ONCE on the whole study pile at once
        split 331 people into piles, pile by pile, choosing questions
        PEOPLE get split here -- the entire crowd, not one at a time

    phase 2 . USE . happens later, one patient at a time
        a new patient walks the finished questions top to bottom
        no splitting, no choosing -- just follow the existing signs

  During BUILD, splitting and choosing are interleaved -- you cannot separate them.
  During USE, the chart is frozen and the patient just walks through it.

    BUILD:  choose split <-> walk one level <-> choose split <-> ...  (interleaved)
    USE:    chart frozen -> walk each patient top -> bottom in one go  (walk only)


  ## Inside the Build Hunt, Nothing Hidden

  IN HAND: two sheets, each chopped 75/25 -- diabetes: 442 people x 3/4 = 331.5, call
  it 331 on the study pile, 442 - 331 = 111 sealed for the exam.  Two phases kept
  apart: BUILD splits the whole crowd once; USE walks one patient through the frozen
  chart.  This section adds: what BUILD actually does, line by line, nothing hidden.

  Everything below = what one line, tree_reg.fit(X_train, y_train), does inside.
  I rebuilt it from a blank sheet. No black box.

  Pick a column. Say bmi. I see 331 bmi values.

    First, sort the column and find the midpoints.

      sort the 331 bmi values smallest to largest.
      between each pair of neighbours write their MIDPOINT (pairwise mean).
      331 values -> 330 midpoints.  each midpoint is a CUT to try.

    Take the first midpoint and draw a line across the whole sheet.

      everyone with bmi BELOW the midpoint -> left pile
      everyone with bmi ABOVE             -> right pile

      left pile answers:  {150, 170, 130, ...}   mean = some number
      right pile answers: {80, 90, 100, ...}     mean = some other number

      these two means are the TWO GUESSES.  the left pile's guess = left mean.
      the right pile's guess = right mean.

    Now measure how WRONG those guesses are.

      for EVERY person:  miss = (their answer - their pile's mean)
      square each miss.  add them ALL up.

      that sum = this midpoint's BADNESS.  one number.

      330 midpoints -> 330 badness numbers.  the SMALLEST = bmi's champion.

  !! WARN: THE SUM IS NOT AN RMSE
     There is no root, no division.  Just sum of squared misses.  A huge number.
     You are comparing 330 huge numbers and picking the smallest.  The magnitude does
     not matter -- only the ORDER.

    Repeat that for all 10 columns.

      age: sort, 330 midpoints, 330 badness numbers -> age's champion
      bp:  sort, ...                                -> bp's champion
      ...all 10 columns

    Now pick the champion of champions.

      10 column-champions, each with its own badness.  all in the SAME units
      (sum of squared misses over the same 331 people), so they are FAIR to
      compare.

      the SMALLEST of the 10 = the chart's FIRST question.
      (say bmi at midpoint 0.05 wins.)

    Clerk bill for this FIRST question: 330 cuts x 10 columns = 3,300 cuts, each
    graded by all 331 people (a miss and a squaring apiece) = 3,300 x 331 =
    1,092,300 squarings.  The clerks have it by tea; you read one champion table.

    So draw the line.

      go back to the sheet.  everyone with bmi <= 0.05 goes to pile A.
      everyone above goes to pile B.

    Then recurse.

      inside pile A (say 131 people):
        run the ENTIRE hunt again -- sort, midpoints, badness, champion, champion
        of champions -- on just THESE 131 people.  this pile's champion becomes
        its own question.

      inside pile B (200 people):
        same hunt.  its own champion.  a different question from pile A's.

      keep cracking each new pile the same way -- until a pile is too small or
      all its answers match.  then STOP.

    Finally, the final pile's guess.

      each final pile: average its answer column -> one number.
      that is the guess for any new person who lands in it.


  ## A Worked Example: 4 People, 1 Column

  A sheet with 4 people, 1 measurement, 1 answer.  Every number visible.  Every
  cut's arithmetic done with pencil and paper.  No calculator needed.

    person   bmi   age   answer
    ----------------------------
     A      0.18   55     151
     B      0.06   35     121
     C      0.04   25      97
     D      0.12   45     135

  First, sort by bmi and find midpoints.

    sorted:   C(0.04,97)  B(0.06,121)  D(0.12,135)  A(0.18,151)
    +- between C&B -+  +- between B&D -+  +- between D&A -+
    midpoint:  0.05             0.09             0.15

  3 midpoints -> 3 cuts to try.

  Try the cut at 0.05.  Below = {C}.  Above = {B, D, A}.

    below:  one person (97) -> mean = 97  (no arithmetic needed)
    above:  121 + 135 = 256.  256 + 151 = 407.  407 / 3 = 135.666...
            call it 135.7 (close enough to rank against other cuts)

    Now measure each miss.  "Squared" = multiply the number by itself.
    A minus times a minus makes a plus (negative * negative = positive).

    C:  97 - 97  =    0   0 * 0     =    0
    B:  121 - 135.7 = -14.7  14.7 * 14.7
        14 * 14 = 196.  14 * 0.7 = 9.8 (done twice = 19.6).  0.7 * 0.7 = 0.49
        Total: 196 + 19.6 + 0.49 = 216.09.  Call it 216.
    D:  135 - 135.7 = -0.7  0.7 * 0.7 = 0.49.  Call it 0.5.
    A:  151 - 135.7 =  15.3  15.3 * 15.3
        15 * 15 = 225.  15 * 0.3 = 4.5 (done twice = 9).  0.3 * 0.3 = 0.09
        Total: 225 + 9 + 0.09 = 234.09.  Call it 234.

    sum = 0 + 216 + 0.5 + 234 = 450.5

  Try the cut at 0.09.  Below = {C, B}.  Above = {D, A}.

    below:  97 + 121 = 218.  218 / 2 = 109.
    above:  135 + 151 = 286.  286 / 2 = 143.

    C:  97 - 109 = -12.  -12 * -12 = 144  (minus * minus = plus; 12*12=144)
    B:  121 - 109 =  12.   12 *  12 = 144
    D:  135 - 143 =  -8.   -8 *  -8 =  64  (8*8=64)
    A:  151 - 143 =   8.    8 *   8 =  64

    sum = 144 + 144 + 64 + 64 = 416   <- CHAMPION (smallest)

  Try the cut at 0.15.  Below = {C, B, D}.  Above = {A}.

    below:  97 + 121 = 218.  218 + 135 = 353.  353 / 3 = 117.666...  call it 117.7.
    above:  one person (151) -> mean = 151.  (no arithmetic needed)

    C:  97 - 117.7 = -20.7  20.7 * 20.7
        20 * 20 = 400.  20 * 0.7 = 14 (done twice = 28).  0.7 * 0.7 = 0.49
        Total: 400 + 28 + 0.49 = 428.49.  Call it 429.
    B:  121 - 117.7 = 3.3  3.3 * 3.3
        3 * 3 = 9.  3 * 0.3 = 0.9 (done twice = 1.8).  0.3 * 0.3 = 0.09
        Total: 9 + 1.8 + 0.09 = 10.89.  Call it 11.
    D:  135 - 117.7 = 17.3  17.3 * 17.3
        17 * 17 = 289.  17 * 0.3 = 5.1 (done twice = 10.2).  0.3 * 0.3 = 0.09
        Total: 289 + 10.2 + 0.09 = 299.29.  Call it 299.
    A:  151 - 151 = 0.  0 * 0 = 0.

  Champion table:

    cut at    sum of squared misses
    --------------------------------
    0.05      451.5
    0.09      416    <- smallest -> first question
    0.15      739

  That is bmi's champion.  But there is also age, and age deserves its own hunt.

  >> YOUR TURN
     A made-up cut splits four answers into below {100, 120} and above {150, 170}.
     Work its badness (squared misses around each side's own mean).

     check your slate:  below mean = (100 + 120) / 2 = 220 / 2 = 110;  misses
     100 - 110 = -10 (sq 100), 120 - 110 = 10 (sq 100).  above mean = (150 + 170) / 2
     = 320 / 2 = 160;  misses 150 - 160 = -10 (sq 100), 170 - 160 = 10 (sq 100).
     badness = 100 + 100 + 100 + 100 = 400.

  >> YOUR TURN
     A leaf at the bottom of the chart holds three people with answers 130, 140, 180.
     What single number does the chart guess for anyone who lands there?

     check your slate:  guess = mean = (130 + 140 + 180) / 3 = 450 / 3 = 150.  A leaf
     always guesses the average of the answers that fell into it -- one number for
     everyone in that room.

  ## Now Add Age: The Second Column

  IN HAND: bmi's champion -- the cut at 0.09, badness 144 + 144 + 64 + 64 = 416,
  beating its neighbours at 0.05 and 0.15.  But one column's champion is not yet
  the chart's first question.  This section adds: age's own hunt, then the champion
  of champions.

  Sort age: 25, 35, 45, 55.  Midpoints: 30, 40, 50.

  Cut at 30:
    Below: C (97) -> mean 97.  miss = 0.
    Above: B(121), D(135), A(151) -> mean 135.7.
    Same squared misses as the bmi 0.05 cut: 0 + 216 + 0.5 + 234 = 450.5.

  Cut at 40:
    Below: C(97), B(121) -> mean 109.
      C: 97-109=-12 -> 144.  B: 121-109=12 -> 144.
    Above: D(135), A(151) -> mean 143.
      D: 135-143=-8 -> 64.  A: 151-143=8 -> 64.
    Sum = 144+144+64+64 = 416.

  Cut at 50:
    Below: C(97), B(121), D(135) -> mean 117.7.
      C: 429.  B: 11.  D: 299.
    Above: A(151) -> mean 151.  miss=0.
    Sum = 429+11+299+0 = 739.

  Age's champion: cut at 40, badness = 416.

  Champion-of-champions table:

      column   champion cut   badness
      -------------------------------
      bmi          0.09         416
      age            40         416

  Tie at 416.  Arbitrary rule: bmi wins because it came first.
  First question: "bmi > 0.09?"  -> below goes to one room, above to another.

  Each room then runs its OWN hunt on its own people -> a different question.

  !! WARN: THE SUM IS NOT AN RMSE
     There is no root, no division.  Just the raw sum of squared misses.
     A huge number that only exists to COMPARE cuts.  RMSE would multiply
     every number by the same constant (1/4, then root).  Order wouldn't
     change.  Waste of ink.

  !! WARN: THE THREE CUTS COMPETE FAIRLY
     Each cut is scored against the SAME 4 people.  The sums are in the
     same units.  451.5 and 416 and 739 are all "total wrongness if we
     draw the line here."  The smallest wins.

    !! WARN: THE CHAMPION IS LOCAL, NOT GLOBAL
       The 0.09 cut is bmi's champion.  Age's champion (40, badness 416)
       ties it exactly.  Column competition -- the champion of champions
       -- happens only when the column champions race each other.  A tie
       means someone picks arbitrarily.


  ## The Corrections I Caught

  Building this by hand exposed five wrong pictures I would have carried forever:

    WRONG: "the tree asks one patient all its questions."
    RIGHT: the BUILD splits the whole study CROWD at once.
           a patient walks the finished chart only in USE.

    WRONG: "bmi is used once, then another column takes over."
    RIGHT: the same column can be reused at a deeper level with a TIGHTER cut.
           bmi > 0.05 at level 1, then bmi > 0.12 at level 3.  nothing "used up."

    WRONG: "all rooms at the same depth ask the same question."
    RIGHT: each room hires its own doctor.  same menu of candidates, but each
           room scores them against ITS OWN crowd -> different winners.

    WRONG: "one doctor asks bmi AND age at once."
    RIGHT: always one column, one cutoff per split.  you can ask bmi HERE and
           age in the NEXT room -- but never both squashed into one question.

    WRONG: "max depth = number of columns."
    RIGHT: depth = how many questions deep.  columns can be reused, so depth can
           exceed column count.  the real ceiling is PEOPLE running out (a pile
           of 1 cannot be split).

  ** KEY: THE TREE HAS NO FORMULA AND NO DIALS
     There are no multipliers anywhere.  No sum of weighted columns.  No algebra.
     Just questions and a number at each end.  The "flexibility knob" is DEPTH: how
     many questions deep before you hit an answer.


  ## The Leaf -- What Sits at the Bottom

  A final pile (a leaf) is not always one person.  It can hold several people with
  identical (or nearly identical) measurements but DIFFERENT answers.  No question
  separates them, so they sit together:

    leaf with 5 people, targets {180, 120, 150, 90, 160}
    no question can split these
    guess for this room = (180+120+150+90+160) / 5 = 140

    a new patient who lands here gets 140 -- "people like you averaged 140."

    leaf with 1 person, target 173
    guess = 173   <- memorised that one person

  Both are the same rule: guess = mean of the leaf's answers.  One person -> average
  of one -> itself.  Why the mean?  Because it is the single number that drives
  sum(miss^2) to its smallest -- the same "bottom of the bowl" from Chapter 4.


  ## Depth: The Stiff-vs-Jumpy Dial of Trees

    depth 1  (one question only):    2 guesses       too stubborn
    depth 3  (three layers):         up to 8 guesses
    depth 10 (ten layers):           up to 1024       too jumpy, memorises

    study-error:   high ---------> ~0      (deep memorises everything)
    exam-error:    high ---V--- high       (both ends bad)
                      sweet spot = lowest exam-error

  With max_depth=None, the tree keeps slicing until every leaf holds one person --
  perfectly memorised, panics on new data.  The DEPTH SWEEP tries each ceiling
  [1, 2, 3, 4, 5, 6, 8, 10, None] and picks the one with lowest exam-error.

  Two stop signs -- the tree halts at whichever fires first:

    STOP #1:  you hit your ceiling           (max_depth)
    STOP #2:  a branch has ~1 person         (can't split one person)

    max_depth = 1024 on 331 people?  people run out at ~12.  you never reach 1024.
    max_depth = None?  same thing.   both mean "no real ceiling."


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    question chart                        decision tree
    the whole build hunt (.fit)           recursive binary splitting / CART
    one yes/no question per room          a split / a node
    the final pile at the bottom          a leaf / terminal node
    the guess (mean of targets)           the predicted value
    badness (sum of squared misses)       the residual sum of squares (RSS) for a split
    midpoint between sorted values        candidate split threshold
    champion of champions                 the best split (lowest impurity / RSS)
    depth                                 max_depth hyperparameter
    the chart is lopsided                 the tree is not balanced
    same menu, different winner           greedy, locally optimal splitting


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  The entire build hunt -- sort, midpoints, badness, champion, recurse -- is one line:

  >> NEW TO PYTHON? Each named once:
       DecisionTreeRegressor()  -- an empty question-chart machine for number answers
       .fit(X, y)               -- run the whole build hunt on the study pile
       .predict(X)              -- walk each row through the frozen chart, read guesses

    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    tree_reg = DecisionTreeRegressor(random_state=42)
    tree_reg.fit(X_train, y_train)                    # the whole BUILD hunt
    y_pred = tree_reg.predict(X_test)                 # USE: walk exam patients
    mse = mean_squared_error(y_test, y_pred)
    r2  = r2_score(y_test, y_pred)

    # depth sweep: try each ceiling, pick lowest exam-error
    depths = [1, 2, 3, 4, 5, 6, 8, 10, None]
    for d in depths:
        t = DecisionTreeRegressor(max_depth=d, random_state=42)
        t.fit(X_train, y_train)
        print(d, mean_squared_error(y_test, t.predict(X_test)))
    # pick the depth with smallest test MSE

  !! WARN: EVERY ROOM'S QUESTION IS FOUND BY BRUTE FORCE
     The machine does not guess "try bmi."  It tries EVERY column times EVERY
     midpoint, measures the badness of EACH, and keeps the one with the smallest
     number.  No astrology, no guessing -- just the one that measures cleanest.
     The cost: for each room, (columns x midpoints) trials.  That adds up.


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 5 -- Question Charts and Committees):
    Part 1 (this post) .
    Part 2 -- The Mixing Ruler .
    Part 3 -- Committees

  Previous chapter: Chapter 4 -- Humble Dials
  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
  home . source on GitHub
==============================================================================================