==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------
CHAPTER 5 . QUESTION CHARTS AND COMMITTEES . PART 3 OF 3
Committees: Bagging, Random Forest, and Boosting
Posted: 2026-06-07 . Author: Rahul Rai . Tags: bagging, random-forest, boosting, ensembles
============================================================================================
PATH . post 15 of 28 (end of the supervised-learning book)
<- prev: Chapter 5, Part 2: The Mixing Ruler
next: Chapter 6, Part 1: Looking at a Sheet With No Answers ->
Parts 1 and 2 built one question chart and tamed it with pruning. One chart is jumpy
-- it memorises quirks of whichever crowd it studied. This post builds TWO HUNDRED
charts and makes them vote. The average of two hundred jumpy guesses is eerily steady
-- the same "wisdom of the crowds" that makes a county-fair ox-weight guess land near
the truth when you poll the whole crowd.
Three committees, in order of cleverness:
bagging = deal 200 piles WITH REPEATS, build 200 trees, AVERAGE
random forest = bagging + HIDE COLUMNS at each cut
boosting = a LINE of fixers, each one repairs the last one's leftovers
Each builds on the previous. The ideas are small. The power is in the pile-up.
## Bagging: Deal, Build, Average
One tree is jumpy. Why? Because it was trained on THIS crowd. A different crowd
would give a different tree. The cure: make MANY crowds, build one tree per crowd,
and average the guesses.
The problem: you only have one study pile. You cannot visit the hospital 200 times.
So you FAKE it -- the same trick from
Chapter 4, Part 2 (Bootstrap):
pull a person, write them down, PUT THEM BACK
repeat until the fake pile is the same size as the real one
-> some people drawn twice, some never drawn at all
That fake pile is a STICKY NOTE. Deal 200 sticky notes. Build one tree on each.
sticky note 1 (80 people, some repeated) --build--> tree 1 --> guess 150
sticky note 2 (80 people, different mix) --build--> tree 2 --> guess 162
...
sticky note 200 --build--> tree 200 --> guess 158
bagging's answer for this patient = AVERAGE(150, 162, ..., 158) = ~154
!! WARN: STICKY NOTE != TREE
The sticky note is a PILE of people (deals, votes nothing).
The tree is BUILT from that pile (the cut-lines, casts the vote).
sticky note (pile) --build--> tree (cut-lines) --> one guess into the average
The sticky note's only job: be a slightly-different pile (because of repeats),
so its tree becomes a slightly-different voter. If all 200 piles were identical,
the 200 trees would be identical and averaging would do nothing.
The randomness IS the point.
** KEY: ONE TREE = JUMPY GARBAGE. 200 TREES AVERAGED = STEADY AND TRUSTWORTHY.
The jumps point in different directions and cancel when averaged. This is the
"wisdom of the crowds." No single tree improved -- only the average did.
## Vote Counting: One Patient's 200 Votes
"Average 200 guesses" is abstract. Let me show the actual vote sheet for one
patient:
patient X: bmi=0.06, age=0.04, bp=0.02, ...
tree # guess for patient X
------------------------------
1 151
2 134
3 163
...
200 142
bagging's guess = (151 + 134 + 163 + ... + 142) / 200 = ~148
One patient, 200 trees, 200 votes. One average. That's the final guess.
>> YOUR TURN
A tiny committee of 5 trees votes 150, 160, 140, 155, 145 for one patient
(made-up). Work the committee's guess.
check your slate: sum = 150 + 160 + 140 + 155 + 145 = 750; guess = 750 / 5 =
150. The average steadies the five jumpy votes into one -- and 200 votes
steady it far more.
!! WARN: 80 (STICKY NOTE SIZE) != 200 (TREES) != 442 (TOTAL PEOPLE)
Each sticky note holds ~80 people drawn WITH REPLACEMENT from 354.
Each sticky note builds exactly ONE tree.
200 sticky notes -> 200 trees -> 200 votes per patient.
The sticky note size (80) affects how diverse the trees are, NOT
how many votes each patient gets. Every tree votes once per patient,
regardless of how many people were on its sticky note.
The name: bagging = Bootstrap AGGregating. Two wrappers around one tree:
BOOTSTRAP = the sticky-note dealing (with put-back) -> makes 200 different piles
AGGREGATE = ask all 200 trees, AVERAGE the answers
## The Free Exam (OOB) -- Again
IN HAND: 200 trees, each built from a sticky-note pile dealt WITH REPEATS, all voting
once per patient and averaged into one steady guess. This section harvests the people
each deal happened to leave out -- a free exam, no rows held back.
>> YOUR TURN
"About 37% sit out each deal" -- check it for a tiny pile of n = 4 dealt with
repeats. The chance one named person is missed in all 4 draws is (1 - 1/4)^4.
check your slate: (1 - 1/4)^4 = (3/4)^4 = (3 x 3 x 3 x 3)/(4 x 4 x 4 x 4) =
81 / 256 ~= 0.316. About 32% sit out even at n = 4; as the pile grows the
fraction settles toward 1/e ~= 0.37.
The ~37% of people left out of each sticky note were never seen by that tree. Grade
it on those left-out people = an honest, FREE exam. Same trick as Chapter 4's
out-of-bag error.
tree #1: trained on sticky note #1
~37% of people not on that note -> grade tree #1 on them -> one error
...
tree #200: same idea
average those free-exam errors = OOB score
Spread out on paper, the OOB looks like this:
sticky note missed people tree guesses OOB error
------------------------------------------------------
1 p4, p6 yes 42
2 p1, p3, p7 yes 38
3 p2, p5 yes 45
... ... ... ...
200 p3, p8 yes 40
OOB = (42 + 38 + 45 + ... + 40) / 200
You get an honest error WITHOUT touching the sealed 25% exam pile.
## Number Trees vs Bin Trees
Trees work for BOTH answer types. Bagging uses whichever fits:
number answer (diabetes): each leaf guesses its MEAN -> bagging AVERAGES 200 guesses
bin answer (cancer): each leaf guesses MAJORITY -> bagging takes MAJORITY VOTE
The lab's bagging stage uses the diabetes/number tree -> 200 trees -> average.
## Random Forest: Bagging + Hide Columns
Bagging's flaw: all 200 trees look at the SAME 10 columns at each cut. If bmi is
the strongest column, it wins the top cut of nearly every tree -> the trees RHYME.
Averaging rhyming trees helps less than averaging truly different ones.
The fix: at each cut, roll dice and let the tree see only a random HANDFUL of columns.
bagging cut: may try all 10 columns -> champion of 10
forest cut: rolls dice, sees only ~3 columns -> champion of 3
cut 1: dice pick {bmi, bp, s5} -> ignore age, weight HERE
cut 2: dice RE-ROLL {age, weight, s1} -> bmi ignored HERE
cut 3: dice re-roll {weight, bmi, s3} -> ...
"ignored" = only for the cut where it wasn't drawn
next cut -> fresh dice -> different columns on offer
Over a whole tree (and 300 trees), every column gets many chances. No single strong
column can dominate every cut of every tree.
!! WARN: THE CUT HIDES COLUMNS, NOT PEOPLE
All people are still split up/down. Nobody is hidden. The knob limits which
COLUMNS the cut may consider when choosing its question.
The knob is max_features -- how many columns each cut may peek at:
1.0 = all columns (= plain bagging, trees rhyme)
0.5 = half the columns
'sqrt' = about sqrt(10) ~= 3 columns (the default, a small handful)
Fewer columns per cut -> trees differ more -> averaging steadier -> better exam error.
But too few -> each tree is too weak. Sweep all three and let the exam-error pick.
** KEY: FOREST = BAGGING + ONE KNOB (max_features)
The only real news is that each cut sees fewer columns. Everything else --
bootstrap piles, build trees, average guesses, OOB exam -- is identical to bagging.
!! WARN: WHY FOREST USES MORE TREES THAN BAGGING
Bagging used 200 trees. Forest uses 300. Why?
bagging: each tree sees all 10 columns at every cut.
bmi dominates -> trees rhyme. You don't need as many
trees because each one is already strong on its own.
forest: each tree sees only ~3 random columns per cut.
Each tree is WEAKER (it ignores 7 columns per cut).
You need MORE trees to compensate for the weakness.
More trees = more votes = the average steadies out.
200 vs 300 is just a knob (n_estimators). The idea: weaker trees
need more friends to reach the same steadiness. If you hid even
fewer columns (max_features=0.3), you'd need even more trees.
## Feature Importances: Which Columns Leaned Hardest
After building 300 trees, you can ask: "which columns did the most work?" Each time a
column's cut drops the badness, it gets credit. Add up the credit across all 300 trees
= feature_importances_. Rank them: the column at the top was leaned on the hardest.
This is not the same as "which column matters most in nature" -- it's "which column the
trees used the most." A correlated stand-in might steal credit from the true cause.
## Boosting: A Line of Fixers
Bagging and forests build trees IN PARALLEL on different piles. Boosting builds them
IN A LINE -- each new tree fixes the LEFTOVER mistakes of the one before:
tree 1: guess the answer -> leftovers = answer - guess
tree 2: guess the LEFTOVERS from #1 -> new leftovers
tree 3: guess the new leftovers -> newer leftovers
...
tree 300: by now the leftovers are tiny
final guess = tree1 + tree2 + tree3 + ... + tree300
(each tree adds its small correction)
Each tree is TINY -- usually max_depth=1 (a single question, called a STUMP). One
stump is weak. Three hundred stumps, each patching the previous, become strong.
The knob is learning_rate -- how much to trust each new fixer:
learning_rate = 1.0 -> trust fully -> learns fast, may overfit
learning_rate = 0.1 -> trust 10% -> slower, steadier
learning_rate = 0.03 -> trust 3% -> slow and safe
final guess = tree1 + lr*tree2 + lr*tree3 + ... + lr*tree300
To pick the best (learning_rate, n_trees): hold out 30% of the study pile as
validation, watch the validation error at each step (staged_predict), pick the
pass where validation error bottomed out.
!! WARN: WATCH THE VALIDATION CURVE, NOT THE TRAINING CURVE
Training error keeps dropping (the line of fixers keeps fixing its own pile).
Validation error drops, bottoms out, then RISES -- that bottom is where to stop.
Beyond it, each new stump is memorising study-pile noise.
## Boosting: The Leftover Arithmetic
"Guess the leftovers" is abstract. Let me show the actual table:
4 patients, 3 stumps. Each stump is max_depth=1 (one question -- a STUMP).
person answer stump1_guess leftover1 stump2_guess leftover2 stump3_guess leftover3
------- ------- ------------- ---------- ------------- ---------- ------------- ----------
A 151 140 11 8 3 2 1
B 75 80 -5 -4 -1 0 -1
C 141 130 11 8 3 2 1
D 121 130 -9 -7 -2 -2 0
round 1: stump1 is built on the REAL answers -> guess1
leftover1 = answer - guess1
round 2: stump2 is built on the LEFTOVER1 column -> guess2
leftover2 = leftover1 - guess2
round 3: stump3 is built on the LEFTOVER2 column -> guess3
leftover3 = leftover2 - guess3 (approaching zero)
final guess = stump1 + lr*stump2 + lr*stump3 + ...
= 140 + 0.1*8 + 0.1*2 + ...
Without the learning rate (lr=1.0):
patient A: 140 + 8 + 2 = 150 <- overshot! Real answer is 151, now 150.
The stumps jumped too hard and landed past the target.
With lr=0.1:
patient A: 140 + 0.1*8 + 0.1*2 = 140 + 0.8 + 0.2 = 141 <- under, but steady.
The small step means each stump takes a tiny bite instead of a huge one.
The line moves toward the answer more slowly, but it does not overshoot.
After each round, the leftovers shrink. The stumps focus on what the previous
ones missed. By round 3, the leftovers are near zero -- the line has learned
the pattern.
!! WARN: YOU DO NOT SORT THE LEFTOVER COLUMN
Each round, the tree looks at the 10 body measurements (bmi, age, bp, ...)
and asks "which column gives the cleanest cut on the CURRENT target?"
The target changes (answer -> leftover1 -> leftover2), but the 10 columns
stay in the SAME order. You sort bmi the same way every round. The
leftover column is not one of the 10 measurements -- it is the current
target. There is nothing to sort.
## The Learning Curve: Where to Stop
Each stump reduces study error. But at some point, stumps start memorising
noise. Hold out 30% of the study pile as validation and watch the error:
pass validation MSE direction
----------------------------------------
0 40.2 baseline (mean only)
10 34.1 down
20 31.3 down
30 29.0 down
40 28.2 bottom <- BEST (n_estimators=40)
50 28.5 up <- overfitting starts
100 30.1 up
300 32.0 up
Training error keeps dropping forever. Validation error drops, bottoms
out, then rises. The bottom is where you stop. Beyond it, each new
stump is memorising study-pile noise while harming exam performance.
## Interpretation 1: Scramble (Which Column Mattered)
You have 300 frozen trees. The exam pile is sealed. True MSE = 25.
First, grade the forest on the real exam pile -> true MSE = 25 (baseline).
Then pick one column -- say "radius." Shuffle its values across the 88 exam
people. Swap p1's radius with p7's. p7's with p3's. The radius VALUES stay
the same, but they're attached to wrong people.
Now feed the SCRAMBLED exam pile to the same 300 frozen trees. Each person
walks each tree. Average 300 guesses per person. Compare to real answers
-> new MSE.
So if the new MSE jumped UP, that column mattered.
WHY DOES SCRAMBLE WORK?
Trees don't change. The exam pile changes. A person with scrambled
radius answers the tree's cuts wrong and lands in a wrong leaf.
Not all trees get fooled. Only trees that USED "radius" in their cuts
send the person to a different leaf. Trees that only asked about bmi
or age give the same guess as before. The average shifts, but not to
pure garbage.
Concrete example -- one person, one tree:
tree #42 has a cut: "radius > 20?"
p1's REAL radius = 15 -> answers NO -> goes LEFT -> leaf guess = 140
p1's SCRAMBLED radius = 22 (from p7) -> answers YES -> goes RIGHT -> leaf guess = 190
p1 with fake radius 22 lands in the wrong crowd. Leaf guess 190
is the average of people who REALLY had radius > 20. p1 doesn't
belong there. The guess is wrong for p1.
But not every tree uses radius. Here is what one person walks
across three different trees:
tree #1 cut: "radius > 20?" p1 says fake 22 -> YES -> wrong leaf FOOLED
tree #2 cut: "bmi > 0.05?" never asks radius -> same leaf SAFE
tree #3 cut: "radius > 15?" p1 says fake 22 -> YES -> wrong leaf FOOLED
Trees that used radius are fooled. Trees that never asked about radius
give the same guess as before. Across 300 trees, enough used radius
that the average shifts -- the MSE jumps up.
Repeat for every column. Build a table:
column true MSE scrambled MSE jump
----------------------------------------------
radius 25 45 +20 <- mattered
texture 25 38 +13 <- mattered some
smoothness 25 26 + 1 <- didn't matter
perimeter 25 42 +17 <- mattered
The jump = how much the honest score worsens when that column is destroyed.
!! WARN: SCRAMBLE DOES NOT REBUILD THE FOREST
The 300 trees were built ONCE, before any scrambling. Scramble happens
on the EXAM pile only. Trees stay frozen. You don't rebuild. You
don't re-bootstrap. You don't re-fit. Just shuffle and re-grade.
!! WARN: THE JUMP IS NOT "CAUSATION"
If radius and perimeter are correlated, scrambling one might make the
other look more important than it is. The jump tells you "how much the
forest relied on this column" -- not "how much this column matters in
the real world."
## Interpretation 2: Slide (How the Column Pushes the Guess)
Scramble tells you WHICH column mattered. Slide tells you HOW it pushes
the guess. The push is called a Partial Dependence Plot (PDP).
First, pick one column -- say "bmi." Keep all other 9 columns at their real
values.
Then give EVERY person the SAME fake bmi value. Feed to 300 frozen trees.
Average 300 guesses per person. Record the average guess across all 88 people.
Now change the fake bmi to the next value, and repeat.
Finally, plot fake bmi (x) vs average guess (y).
A concrete tree to see how this works (simplified to 2 cuts):
bmi > 0.045 ?
/ \
YES NO
| |
bmi > 0.075? age > 30?
/ \ / \
YES NO YES NO
| | | |
200 180 120 100
Slide: fake bmi = 0.01 for ALL 88 people.
person fake_bmi path guess
--------------------------------------------------
p1 0.01 bmi>0.045? NO -> age>30? NO 100
p2 0.01 NO -> age>30? YES 120
p3 0.01 NO -> age>30? YES 120
...
p88 0.01 NO -> age>30? NO 100
SAME fake bmi (0.01). DIFFERENT guesses (100 or 120)
because age splits them further at the second level.
Now fake bmi = 0.06 for ALL 88 people:
person fake_bmi path guess
--------------------------------------------------
p1 0.06 bmi>0.045? YES -> bmi>0.075? NO 180
p2 0.06 YES -> bmi>0.075? NO 180
...
p88 0.06 YES -> bmi>0.075? NO 180
This tree only has one bmi question below 0.045, so everyone
with fake bmi=0.06 lands at guess 180.
Now repeat for fake bmi values [0.01, 0.03, 0.06, 0.09]:
fake bmi avg guess across 88 people
----------------------------------------
0.01 (100 + 120 + 120 + ... + 100) / 88 = 115
0.03 same as 0.01 (still below 0.045) = 115
0.06 everyone above 0.045 -> all 180 = 180
0.09 same as 0.06 (still above 0.045) = 180
Plot:
guess
^
180 ======================== <- PDP (one line, average of all 88)
|
115 ========================
+-----+--------+--------+---> bmi
0.01 0.06 0.09
PDP mixes people. It shows the crowd average. ICE (Individual Conditional
Expectation) shows EACH person's line:
guess
^
180 ------------------------ <- p2,p3 (age>30, jump together)
| /
120 ------/------------------ <- p2,p3 start here
| /
100 -/---------------------- <- p1,p88 (age<=30, start lower)
+-----+--------+--------+---> bmi
0.01 0.06 0.09
p1: 100 --------------- 180 (jumps at bmi>0.045)
p2: 120 --------------- 180 (already higher at low bmi)
p88: 100 --------------- 180 (same as p1)
All 88 lines jump at the SAME fake bmi (the cut point 0.045).
But some start at 100, some at 120. ICE shows this spread.
PDP hides it by averaging into one smooth line.
!! WARN: PDP HIDES THE MESS. ICE SHOWS IT.
PDP = one line, average of all people at each fake value.
ICE = 88 lines, one per person, their individual guess at each fake value.
If the tree had more cuts, different people would jump at different fake
values -- ICE lines spread apart. PDP smooths them into a single curve.
!! WARN: THE COUNT IS FAKE COLUMN VALUES, NOT TREES
For each fake bmi value: 88 people each walk 300 trees -> 88 averages.
You test 20 fake bmi values -> 20 points per person -> 88 lines.
The trees don't change. Only the fake column value changes.
## The Labels, Last
Plain term used above Standard label
----------------------------------- ------------------------------------------
sticky note / fake pile bootstrap sample (with replacement)
deal with repeats bootstrap / sampling with replacement
average all trees' guesses bagging / bootstrap aggregating
the ~37% left out out-of-bag (OOB) sample
the free exam OOB error estimate
hide columns per cut random subspace / max_features
the trees differ (decorrelated) decorrelated ensemble
line of fixers gradient boosting
a tiny tree (one stump) a weak learner / max_depth=1
leftovers from the last fixer residuals of the current ensemble
trust-each-fixer knob learning rate / shrinkage
where validation error bottoms out watch error after each stump / early stopping
scramble a column, measure drop permutation importance
slide one column, watch the guess partial dependence plot (PDP)
one line per person individual conditional expectation (ICE)
n_estimators number of trees in the ensemble
feature_importances_ mean decrease in impurity (MDI)
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
>> NEW TO PYTHON? Each named once:
BaggingRegressor() -- 200 trees, each on a bootstrap pile, averaged
RandomForestRegressor() -- bagging + hide columns (max_features)
GradientBoostingRegressor() -- line of fixers, each on the last's leftovers
permutation_importance() -- scramble a column, measure the drop
PartialDependenceDisplay -- the slide-a-column plot
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
# bagging (200 trees, bootstrap, OOB free exam)
bag = BaggingRegressor(
DecisionTreeRegressor(random_state=42),
n_estimators=200, bootstrap=True, oob_score=True, random_state=42)
bag.fit(X_train, y_train)
mse_bag = mean_squared_error(y_test, bag.predict(X_test))
oob_r2 = bag.oob_score_
# random forest (300 trees, hide columns, OOB)
rf = RandomForestRegressor(
n_estimators=300, max_features='sqrt',
bootstrap=True, oob_score=True, random_state=42)
rf.fit(X_train, y_train)
# rf.feature_importances_ = how much each column was leaned on
# boosting (300 stumps, learning_rate, staged validation curve)
gbr = GradientBoostingRegressor(
n_estimators=300, max_depth=1, learning_rate=0.1, random_state=42)
gbr.fit(X_tr, y_tr)
val_mses = [mean_squared_error(y_val, yp) for yp in gbr.staged_predict(X_val)]
best_iter = np.argmin(val_mses) + 1 # where validation bottomed out
# interpretation
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
# perm.importances_mean = how much each column's scramble hurt the score
# PartialDependenceDisplay.from_estimator(rf, X_test, [top_feature], kind='both')
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 5 -- Question Charts and Committees):
Part 1 -- Question Charts by Hand .
Part 2 -- The Mixing Ruler .
Part 3 (this post)
That closes the supervised-learning half of the book. Continue with the next chapter:
Chapter 6: Finding Patterns Without Answers. For quick lookups,
see the Appendix -- Classification Reference and the
Glossary / Decoder Ring.
<- Back to all posts
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
home . source on GitHub
==============================================================================================