==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------
CHAPTER 5 . QUESTION CHARTS AND COMMITTEES . PART 1 OF 3
Question Charts: Building a Tree by Hand
Posted: 2026-06-07 . Author: Rahul Rai . Tags: decision-tree, regression, classification
============================================================================================
The straight-stick rule lived by dials -- ten multipliers and a nudge. Every guess was
a weighted sum. This post throws away the dials entirely and builds a machine with no
formula, no multipliers, and no algebra. Just a sheet, a pencil, and one repeated
question: "what is the CLEANEST place to draw a line?"
A question chart is a stack of yes/no questions. A new patient answers them top to
bottom and lands in one final pile, where the guess sits waiting. The whole engine is
built by splitting people into ever-smaller groups until each group's answers are tight
enough that their average is a good guess.
## The Sheet: Two New Piles
diabetes -- 442 people, 10 body measurements, one answer
patient    age     bmi     bp     s1 ... s6  | answer
--------------------------------------------------
#1         0.04    0.06    0.02   ...        | 151   <- disease score one YEAR later
#2        -0.01   -0.05   -0.03   ...        | 75
...                                          | ...
+--------------- 10 columns ----------------+| the answer
cancer -- 569 lumps, 30 measurements, one answer
lump    radius   texture   perimeter   ...  | target
--------------------------------------------------
#1      17.9     10.4      122.8       ...  | 0   <- sick(0) or well(1)
#2      20.6     17.8      132.9       ...  | 0
...                                         | ...
Same job as always: chop 75/25, cover the 25%, build on the 75%, grade once.
## BUILD vs USE -- Two Different Times
This is where the first confusion lives. There are two SEPARATE phases, and mixing
them caused me grief for an hour:
phase 1 . BUILD . happens ONCE, on the whole study pile at once
    split 331 people into piles, pile by pile, choosing questions
    PEOPLE get split here -- the entire crowd, not one at a time
phase 2 . USE . happens later, one patient at a time
    a new patient walks the finished questions top to bottom
    no splitting, no choosing -- just follow the existing signs
During BUILD, splitting and choosing are interleaved -- you cannot separate them.
During USE, the chart is frozen and the patient just walks through it.
BUILD: choose split <-> walk one level <-> choose split <-> ... (interleaved)
USE: chart frozen -> walk each patient top -> bottom in one go (walk only)
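The USE phase is small enough to sketch in a few lines of Python. This is an illustration only, not how any library stores its chart: the nested-dict layout, the `walk` helper, and every number inside `chart` are made up for the example.

```python
# A frozen chart as nested dicts (layout and numbers are hypothetical).
# A question node has: column, cut, below, above. A leaf has only: guess.
chart = {
    "column": "bmi", "cut": 0.05,
    "below": {"guess": 110.0},
    "above": {
        "column": "bp", "cut": 0.02,
        "below": {"guess": 160.0},
        "above": {"guess": 210.0},
    },
}

def walk(node, patient):
    """USE phase: follow the existing signs top to bottom; nothing is chosen here."""
    while "guess" not in node:
        side = "below" if patient[node["column"]] <= node["cut"] else "above"
        node = node[side]
    return node["guess"]

new_patient = {"bmi": 0.08, "bp": 0.01}
print(walk(chart, new_patient))  # bmi is above 0.05, then bp is below 0.02 -> 160.0
```

Notice that `walk` never looks at the study pile and never computes a mean. All of that work was finished during BUILD.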
## Inside the Build Hunt, Nothing Hidden
IN HAND: two sheets, each chopped 75/25 -- diabetes: 442 people x 3/4 = 331.5, call
it 331 on the study pile, 442 - 331 = 111 sealed for the exam. Two phases kept
apart: BUILD splits the whole crowd once; USE walks one patient through the frozen
chart. This section adds: what BUILD actually does, line by line, nothing hidden.
Everything below = what one line, tree_reg.fit(X_train, y_train), does inside.
I rebuilt it from a blank sheet. No black box.
Pick a column. Say bmi. I see 331 bmi values.
First, sort the column and find the midpoints.
sort the 331 bmi values smallest to largest.
between each pair of neighbours write their MIDPOINT (pairwise mean).
331 values -> 330 midpoints (fewer if two neighbours share a value: no gap, no cut).
each midpoint is a CUT to try.
Take the first midpoint and draw a line across the whole sheet.
everyone with bmi BELOW the midpoint -> left pile
everyone with bmi ABOVE -> right pile
left pile answers: {150, 170, 130, ...} mean = some number
right pile answers: {80, 90, 100, ...} mean = some other number
these two means are the TWO GUESSES. the left pile's guess = left mean.
the right pile's guess = right mean.
Now measure how WRONG those guesses are.
for EVERY person: miss = (their answer - their pile's mean)
square each miss. add them ALL up.
that sum = this midpoint's BADNESS. one number.
330 midpoints -> 330 badness numbers. the SMALLEST = bmi's champion.
!! WARN: THE SUM IS NOT AN RMSE
There is no root, no division. Just sum of squared misses. A huge number.
You are comparing 330 huge numbers and picking the smallest. The magnitude does
not matter -- only the ORDER.
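The one-column hunt above (sort, midpoints, two piles, badness, champion) can be sketched as follows. The helper names `badness` and `column_champion` and the four-person column are mine, chosen for illustration; this is a plain restatement of the steps, not the library's code.

```python
def badness(answers):
    """Sum of squared misses around the pile's own mean (no root, no division)."""
    if not answers:
        return 0.0
    mean = sum(answers) / len(answers)
    return sum((a - mean) ** 2 for a in answers)

def column_champion(values, answers):
    """Try every midpoint of one column; return (cut, badness) of the smallest."""
    distinct = sorted(set(values))  # repeated values leave no gap to cut
    midpoints = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    best = None
    for cut in midpoints:
        left = [a for v, a in zip(values, answers) if v <= cut]
        right = [a for v, a in zip(values, answers) if v > cut]
        total = badness(left) + badness(right)
        if best is None or total < best[1]:
            best = (cut, total)
    return best

# made-up column of four people (not from the diabetes sheet)
values = [3, 1, 4, 2]
answers = [30, 10, 32, 12]
print(column_champion(values, answers))  # (2.5, 4.0)
```

The cut at 2.5 wins because it leaves {10, 12} on one side and {30, 32} on the other, each pile only 1 away from its own mean.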
Repeat that for all 10 columns.
age: sort, 330 midpoints, 330 badness numbers -> age's champion
bp: sort, ... -> bp's champion
...all 10 columns
Now pick the champion of champions.
10 column-champions, each with its own badness. all in the SAME units
(sum of squared misses over the same 331 people), so they are FAIR to
compare.
the SMALLEST of the 10 = the chart's FIRST question.
(say bmi at midpoint 0.05 wins.)
Clerk bill for this FIRST question: 330 cuts x 10 columns = 3,300 cuts, each
graded by all 331 people (a miss and a squaring apiece) = 3,300 x 331 =
1,092,300 squarings. The clerks have it by tea; you read one champion table.
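The clerk bill is plain multiplication, and a few lines of Python confirm it. The variable names are mine, and the count assumes all 331 values in every column are distinct.

```python
people = 331
columns = 10

cuts_per_column = people - 1       # one midpoint between each pair of sorted neighbours
cuts = cuts_per_column * columns   # candidate cuts for the first question
squarings = cuts * people          # every cut is graded by every person

print(cuts, squarings)  # 3300 1092300
```

This is the naive bill, counted the way the clerks work. Software can usually get the same ranking more cheaply by sliding along the sorted column and updating running sums, but that shortcut is outside what this post builds by hand.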
So draw the line.
go back to the sheet. everyone with bmi <= 0.05 goes to pile A.
everyone above goes to pile B.
Then recurse.
inside pile A (say 131 people):
run the ENTIRE hunt again -- sort, midpoints, badness, champion, champion
of champions -- on just THESE 131 people. this pile's champion becomes
its own question.
inside pile B (200 people):
same hunt. its own champion. a different question from pile A's.
keep cracking each new pile the same way -- until a pile is too small or
all its answers match. then STOP.
Finally, the final pile's guess.
each final pile: average its answer column -> one number.
that is the guess for any new person who lands in it.
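Put together, the whole hunt (champion of champions, draw the line, recurse, stop, average) fits in one short sketch. Everything here is illustrative: the function names, the dict layout, the `min_people` stop rule, and the four-person pile are assumptions of mine, not the library's internals.

```python
def badness(answers):
    mean = sum(answers) / len(answers)
    return sum((a - mean) ** 2 for a in answers)

def best_question(rows, answers):
    """Champion of champions: the (badness, column, cut) with the smallest badness."""
    best = None
    for col in rows[0]:  # dicts keep insertion order, so a tie goes to the earlier column
        distinct = sorted({r[col] for r in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            cut = (lo + hi) / 2
            left = [a for r, a in zip(rows, answers) if r[col] <= cut]
            right = [a for r, a in zip(rows, answers) if r[col] > cut]
            total = badness(left) + badness(right)
            if best is None or total < best[0]:
                best = (total, col, cut)
    return best  # None when no column has two different values

def build(rows, answers, min_people=2):
    """BUILD phase: split the whole pile, then recurse into each new pile."""
    guess = sum(answers) / len(answers)
    too_small = len(rows) < min_people
    all_match = len(set(answers)) == 1
    question = None if (too_small or all_match) else best_question(rows, answers)
    if question is None:  # stop sign: this pile becomes a leaf
        return {"guess": guess, "people": len(rows)}
    _, col, cut = question
    go_left = [r[col] <= cut for r in rows]
    return {
        "column": col, "cut": cut,
        "below": build([r for r, g in zip(rows, go_left) if g],
                       [a for a, g in zip(answers, go_left) if g], min_people),
        "above": build([r for r, g in zip(rows, go_left) if not g],
                       [a for a, g in zip(answers, go_left) if not g], min_people),
    }

# made-up pile of four people with two columns
rows = [{"bmi": 0.01, "bp": 0.03}, {"bmi": 0.02, "bp": 0.01},
        {"bmi": 0.07, "bp": 0.02}, {"bmi": 0.08, "bp": 0.04}]
answers = [100, 100, 200, 220]
chart = build(rows, answers)
print(chart["column"], round(chart["cut"], 3))  # bmi 0.045
```

On this pile the first question is bmi at about 0.045. The pile below stops at once because both answers are 100, and the pile above is cracked once more into two single-person leaves guessing 200 and 220.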
## A Worked Example: 4 People, 1 Column
A sheet with 4 people, 2 measurements, 1 answer. This section hunts through bmi
only; age waits for the next section. Every number visible. Every cut's arithmetic
done with pencil and paper. No calculator needed.
person   bmi    age   answer
----------------------------
A        0.18   55    151
B        0.06   35    121
C        0.04   25     97
D        0.12   45    135
First, sort by bmi and find midpoints.
sorted:    C(0.04, 97)    B(0.06, 121)    D(0.12, 135)    A(0.18, 151)
between:          C & B          B & D           D & A
midpoint:         0.05           0.09            0.15
3 midpoints -> 3 cuts to try.
Try the cut at 0.05. Below = {C}. Above = {B, D, A}.
below: one person (97) -> mean = 97 (no arithmetic needed)
above: 121 + 135 = 256. 256 + 151 = 407. 407 / 3 = 135.666...
call it 135.7 (close enough to rank against other cuts)
Now measure each miss. "Squared" = multiply the number by itself.
A minus times a minus makes a plus (negative * negative = positive).
C: 97 - 97 = 0 0 * 0 = 0
B: 121 - 135.7 = -14.7 14.7 * 14.7
14 * 14 = 196. 14 * 0.7 = 9.8 (done twice = 19.6). 0.7 * 0.7 = 0.49
Total: 196 + 19.6 + 0.49 = 216.09. Call it 216.
D: 135 - 135.7 = -0.7 0.7 * 0.7 = 0.49. Call it 0.5.
A: 151 - 135.7 = 15.3 15.3 * 15.3
15 * 15 = 225. 15 * 0.3 = 4.5 (done twice = 9). 0.3 * 0.3 = 0.09
Total: 225 + 9 + 0.09 = 234.09. Call it 234.
sum = 0 + 216 + 0.5 + 234 = 450.5
Try the cut at 0.09. Below = {C, B}. Above = {D, A}.
below: 97 + 121 = 218. 218 / 2 = 109.
above: 135 + 151 = 286. 286 / 2 = 143.
C: 97 - 109 = -12. -12 * -12 = 144 (minus * minus = plus; 12*12=144)
B: 121 - 109 = 12. 12 * 12 = 144
D: 135 - 143 = -8. -8 * -8 = 64 (8*8=64)
A: 151 - 143 = 8. 8 * 8 = 64
sum = 144 + 144 + 64 + 64 = 416 <- CHAMPION (smallest)
Try the cut at 0.15. Below = {C, B, D}. Above = {A}.
below: 97 + 121 = 218. 218 + 135 = 353. 353 / 3 = 117.666... call it 117.7.
above: one person (151) -> mean = 151. (no arithmetic needed)
C: 97 - 117.7 = -20.7 20.7 * 20.7
20 * 20 = 400. 20 * 0.7 = 14 (done twice = 28). 0.7 * 0.7 = 0.49
Total: 400 + 28 + 0.49 = 428.49. Call it 429 (rounded up; only the ranking matters).
B: 121 - 117.7 = 3.3 3.3 * 3.3
3 * 3 = 9. 3 * 0.3 = 0.9 (done twice = 1.8). 0.3 * 0.3 = 0.09
Total: 9 + 1.8 + 0.09 = 10.89. Call it 11.
D: 135 - 117.7 = 17.3 17.3 * 17.3
17 * 17 = 289. 17 * 0.3 = 5.1 (done twice = 10.2). 0.3 * 0.3 = 0.09
Total: 289 + 10.2 + 0.09 = 299.29. Call it 299.
A: 151 - 151 = 0. 0 * 0 = 0.
sum = 429 + 11 + 299 + 0 = 739
Champion table:
cut at    sum of squared misses
--------------------------------
0.05      450.5
0.09      416     <- smallest -> first question
0.15      739
That is bmi's champion. But there is also age, and age deserves its own hunt.
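If you want to check the pencil work, the three bmi cuts can be recomputed without rounding. This is a small sketch using only the four people from the sheet; the `badness` helper and the `table` list are names I made up.

```python
people = {  # name: (bmi, answer), copied from the four-person sheet
    "A": (0.18, 151),
    "B": (0.06, 121),
    "C": (0.04, 97),
    "D": (0.12, 135),
}

def badness(answers):
    mean = sum(answers) / len(answers)
    return sum((a - mean) ** 2 for a in answers)

bmis = sorted(bmi for bmi, _ in people.values())
table = []
for lo, hi in zip(bmis, bmis[1:]):
    cut = (lo + hi) / 2
    below = [ans for bmi, ans in people.values() if bmi <= cut]
    above = [ans for bmi, ans in people.values() if bmi > cut]
    table.append((cut, badness(below) + badness(above)))

for cut, total in table:
    print(f"cut {cut:.2f}: badness {total:.2f}")
# cut 0.05: badness 450.67
# cut 0.09: badness 416.00
# cut 0.15: badness 738.67
```

The unrounded sums differ from the hand-worked 450.5 and 739 by a fraction, because the hand version rounded the means to one decimal. The ranking is untouched, and the ranking is all the hunt uses.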
>> YOUR TURN
A made-up cut splits four answers into below {100, 120} and above {150, 170}.
Work its badness (squared misses around each side's own mean).
check your slate: below mean = (100 + 120) / 2 = 220 / 2 = 110; misses
100 - 110 = -10 (sq 100), 120 - 110 = 10 (sq 100). above mean = (150 + 170) / 2
= 320 / 2 = 160; misses 150 - 160 = -10 (sq 100), 170 - 160 = 10 (sq 100).
badness = 100 + 100 + 100 + 100 = 400.
>> YOUR TURN
A leaf at the bottom of the chart holds three people with answers 130, 140, 180.
What single number does the chart guess for anyone who lands there?
check your slate: guess = mean = (130 + 140 + 180) / 3 = 450 / 3 = 150. A leaf
always guesses the average of the answers that fell into it -- one number for
everyone in that room.
## Now Add Age: The Second Column
IN HAND: bmi's champion -- the cut at 0.09, badness 144 + 144 + 64 + 64 = 416,
beating its neighbours at 0.05 and 0.15. But one column's champion is not yet
the chart's first question. This section adds: age's own hunt, then the champion
of champions.
Sort age: 25, 35, 45, 55. Midpoints: 30, 40, 50.
Cut at 30:
Below: C (97) -> mean 97. miss = 0.
Above: B(121), D(135), A(151) -> mean 135.7.
Same squared misses as the bmi 0.05 cut: 0 + 216 + 0.5 + 234 = 450.5.
Cut at 40:
Below: C(97), B(121) -> mean 109.
C: 97-109=-12 -> 144. B: 121-109=12 -> 144.
Above: D(135), A(151) -> mean 143.
D: 135-143=-8 -> 64. A: 151-143=8 -> 64.
Sum = 144+144+64+64 = 416.
Cut at 50:
Below: C(97), B(121), D(135) -> mean 117.7.
C: 429. B: 11. D: 299.
Above: A(151) -> mean 151. miss=0.
Sum = 429+11+299+0 = 739.
Age's champion: cut at 40, badness = 416.
Champion-of-champions table:
column   champion cut   badness
-------------------------------
bmi      0.09           416
age      40             416
Tie at 416. Arbitrary rule: bmi wins because it came first.
First question: "bmi > 0.09?" -> below goes to one room, above to another.
Each room then runs its OWN hunt on its own people -> a different question.
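The champion-of-champions step, tie included, can be sketched on the same four people. The helper names are mine, and the "first column wins" rule is simply this post's hand rule; as far as I know, scikit-learn shuffles the column order at each split, so its tie winner can depend on `random_state`.

```python
sheet = [
    {"bmi": 0.18, "age": 55, "answer": 151},  # A
    {"bmi": 0.06, "age": 35, "answer": 121},  # B
    {"bmi": 0.04, "age": 25, "answer": 97},   # C
    {"bmi": 0.12, "age": 45, "answer": 135},  # D
]

def badness(answers):
    mean = sum(answers) / len(answers)
    return sum((a - mean) ** 2 for a in answers)

def column_champion(rows, col):
    """Smallest (badness, cut) over every midpoint of one column."""
    values = sorted({r[col] for r in rows})
    trials = []
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        below = [r["answer"] for r in rows if r[col] <= cut]
        above = [r["answer"] for r in rows if r[col] > cut]
        trials.append((badness(below) + badness(above), cut))
    return min(trials)

champions = {col: column_champion(sheet, col) for col in ("bmi", "age")}
# min() keeps the first of equal items, so the earlier column wins a tie
winner = min(champions, key=lambda col: champions[col][0])
print(champions, winner)
```

Both columns sort the four people in the same order here, so every cut makes the same two piles, and the tie at 416 is exact rather than approximate.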
!! WARN: THE SUM IS NOT AN RMSE
There is no root, no division. Just the raw sum of squared misses, a
number that only exists to COMPARE cuts. RMSE would divide every sum
by the same 4 and then take the root. Neither step can reorder the
cuts. Waste of ink.
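A quick check that the root and the division really cannot reorder the cuts, using the hand-worked sums for the three bmi cuts. The dictionary and variable names are mine.

```python
import math

badness = {0.05: 450.5, 0.09: 416, 0.15: 739}  # hand-worked sums, 4 people each
people = 4

# RMSE = root of (sum of squared misses / head count)
rmse = {cut: math.sqrt(total / people) for cut, total in badness.items()}

by_badness = sorted(badness, key=badness.get)
by_rmse = sorted(rmse, key=rmse.get)
print(by_badness)  # [0.09, 0.05, 0.15]
print(by_rmse)     # [0.09, 0.05, 0.15]
```

This works because every cut is graded by the same number of people. Dividing by a shared head count and taking a root both keep bigger numbers bigger.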
!! WARN: THE THREE CUTS COMPETE FAIRLY
Each cut is scored against the SAME 4 people. The sums are in the
same units. 450.5 and 416 and 739 are all "total wrongness if we
draw the line here." The smallest wins.
!! WARN: THE CHAMPION IS LOCAL, NOT GLOBAL
The 0.09 cut is bmi's champion. Age's champion (40, badness 416)
ties it exactly. Column competition -- the champion of champions
-- happens only when the column champions race each other. A tie
means someone picks arbitrarily.
## The Corrections I Caught
Building this by hand exposed five wrong pictures I would have carried forever:
WRONG: "the tree asks one patient all its questions."
RIGHT: the BUILD splits the whole study CROWD at once.
a patient walks the finished chart only in USE.
WRONG: "bmi is used once, then another column takes over."
RIGHT: the same column can be reused at a deeper level with a TIGHTER cut.
bmi > 0.05 at level 1, then bmi > 0.12 at level 3. nothing "used up."
WRONG: "all rooms at the same depth ask the same question."
RIGHT: each room hires its own doctor. same menu of candidates, but each
room scores them against ITS OWN crowd -> different winners.
WRONG: "one doctor asks bmi AND age at once."
RIGHT: always one column, one cutoff per split. you can ask bmi HERE and
age in the NEXT room -- but never both squashed into one question.
WRONG: "max depth = number of columns."
RIGHT: depth = how many questions deep. columns can be reused, so depth can
exceed column count. the real ceiling is PEOPLE running out (a pile
of 1 cannot be split).
** KEY: THE TREE HAS NO FORMULA AND NO DIALS
There are no multipliers anywhere. No sum of weighted columns. No algebra.
Just questions and a number at each end. The "flexibility knob" is DEPTH: how
many questions deep before you hit an answer.
## The Leaf -- What Sits at the Bottom
A final pile (a leaf) is not always one person. It can hold several people with
identical (or nearly identical) measurements but DIFFERENT answers. No question
separates them, so they sit together:
leaf with 5 people, targets {180, 120, 150, 90, 160}
no question can split these
guess for this room = (180+120+150+90+160) / 5 = 140
a new patient who lands here gets 140 -- "people like you averaged 140."
leaf with 1 person, target 173
guess = 173 <- memorised that one person
Both are the same rule: guess = mean of the leaf's answers. One person -> average
of one -> itself. Why the mean? Because it is the single number that drives
sum(miss^2) to its smallest -- the same "bottom of the bowl" from Chapter 4.
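The "bottom of the bowl" claim is easy to test by brute force on the five-person leaf: try every whole-number guess from 90 to 180 and see which one gives the smallest sum of squared misses. A minimal sketch; the helper name `total_sq_miss` is mine.

```python
leaf = [180, 120, 150, 90, 160]

def total_sq_miss(guess):
    """Sum of squared misses if the whole leaf is given this one guess."""
    return sum((answer - guess) ** 2 for answer in leaf)

mean = sum(leaf) / len(leaf)
best = min(range(90, 181), key=total_sq_miss)

print(mean, best, total_sq_miss(best))            # 140.0 140 5000
print(total_sq_miss(139), total_sq_miss(141))     # 5005 5005
```

Stepping one unit away from the mean in either direction costs the same extra 5, which is the bowl shape: lowest at the mean, rising evenly on both sides.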
## Depth: The Stiff-vs-Jumpy Dial of Trees
depth 1  (one question only):  2 guesses             too stubborn
depth 3  (three layers):       up to 8 guesses
depth 10 (ten layers):         up to 1024 guesses    too jumpy, memorises
study-error:  high ---------> ~0      (deep memorises everything)
exam-error:   high ---V--- high       (both ends bad)
                      sweet spot = lowest exam-error
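With max_depth=None, the tree keeps slicing until every leaf holds one person (or a
few people whose answers all match, or whom no question can separate): perfectly
memorised, panics on new data. The DEPTH SWEEP tries each ceiling
[1, 2, 3, 4, 5, 6, 8, 10, None] and picks the one with lowest exam-error. Strictly,
choosing the ceiling by exam-error is a peek at the sealed pile and bends the
"grade once" rule; a cleaner sweep grades each ceiling on a mock exam carved out of
the study pile and opens the real exam only for the winner.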
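A depth sweep that never touches the exam pile can be sketched with cross-validation, which cuts several mock exams out of the study pile. The 5 folds, the split seed 42, and the choice of `load_diabetes` as the sheet are assumptions of mine; the post does not state them.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)  # seed is an assumption

depths = [1, 2, 3, 4, 5, 6, 8, 10, None]
scores = {}
for d in depths:
    tree = DecisionTreeRegressor(max_depth=d, random_state=42)
    # 5 mock exams from the study pile only; sklearn reports NEGATIVE mse
    neg_mse = cross_val_score(tree, X_train, y_train, cv=5,
                              scoring="neg_mean_squared_error")
    scores[d] = -neg_mse.mean()

best_depth = min(scores, key=scores.get)
print(best_depth, scores[best_depth])
```

Only after `best_depth` is chosen would you fit one final tree on the whole study pile and grade it on `X_test`, once.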
Two stop signs -- the tree halts at whichever fires first:
STOP #1: you hit your ceiling (max_depth)
STOP #2: a branch has ~1 person (can't split one person)
max_depth = 1024 on 331 people? people run out at ~12. you never reach 1024.
max_depth = None? same thing. both mean "no real ceiling."
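The two stop signs can be turned into a rough ceiling on the number of final piles: guesses double with each level, but there can never be more piles than people. A small arithmetic sketch; the function name `most_guesses` is mine.

```python
def most_guesses(depth, people):
    """Upper bound on final piles: doubling per level, capped by head count."""
    return min(2 ** depth, people)

for depth in (1, 3, 10):
    print(depth, 2 ** depth, most_guesses(depth, 331))
# 1 2 2
# 3 8 8
# 10 1024 331
```

A perfectly even chart would have room for all 331 people by depth 9, since 2 ** 9 = 512. Real charts are lopsided, so some branches keep going after others have stopped, which is why the depth where people actually run out is larger than 9.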
## The Labels, Last
Plain term used above                 Standard label
-----------------------------------   ------------------------------------------
question chart                        decision tree
the whole build hunt (.fit)           recursive binary splitting / CART
one yes/no question per room          a split / a node
the final pile at the bottom          a leaf / terminal node
the guess (mean of targets)           the predicted value
badness (sum of squared misses)       the residual sum of squares (RSS) for a split
midpoint between sorted values        candidate split threshold
champion of champions                 the best split (lowest impurity / RSS)
depth                                 max_depth hyperparameter
the chart is lopsided                 the tree is not balanced
same menu, different winner           greedy, locally optimal splitting
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
The entire build hunt -- sort, midpoints, badness, champion, recurse -- is one line,
tree_reg.fit(X_train, y_train). The rest is grading and the depth sweep.
>> NEW TO PYTHON? Each named once:
    DecisionTreeRegressor() -- an empty question-chart machine for number answers
    .fit(X, y)              -- run the whole build hunt on the study pile
    .predict(X)             -- walk each row through the frozen chart, read guesses
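from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# the sheet, chopped 75/25 (the split seed 42 is an assumption; any seed works)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)          # the whole BUILD hunt
y_pred = tree_reg.predict(X_test)       # USE: walk exam patients
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("no ceiling:", mse, r2)

# depth sweep: try each ceiling, print its exam-error
depths = [1, 2, 3, 4, 5, 6, 8, 10, None]
for d in depths:
    t = DecisionTreeRegressor(max_depth=d, random_state=42)
    t.fit(X_train, y_train)
    print(d, mean_squared_error(y_test, t.predict(X_test)))
# pick the depth with smallest test MSE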
!! WARN: EVERY ROOM'S QUESTION IS FOUND BY BRUTE FORCE
The machine does not guess "try bmi." It tries EVERY column times EVERY
midpoint, measures the badness of EACH, and keeps the one with the smallest
number. No astrology, no guessing -- just the one that measures cleanest.
The cost: for each room, (columns x midpoints) trials. That adds up.
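To see which questions the brute-force hunt actually settled on, scikit-learn can print a fitted chart as text with `export_text`. This is a sketch under my own assumptions: a depth ceiling of 2 to keep the printout short, split seed 42, and the built-in `load_diabetes` sheet.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)  # seed is an assumption

# a deliberately shallow chart: at most 2 questions deep, so at most 4 leaves
small = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_train, y_train)

chart_text = export_text(small, feature_names=list(data.feature_names))
print(chart_text)
```

Each indented line is one question of the form "column <= cut", and each "value" line is a leaf's guess, the mean of the study-pile answers that landed there.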
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================