Committees: Bagging, Random Forest, and Boosting

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  CHAPTER 5 . QUESTION CHARTS AND COMMITTEES . PART 3 OF 3
  Committees: Bagging, Random Forest, and Boosting
  Posted: 2026-06-07 . Author: Rahul Rai . Tags: bagging, random-forest, boosting, ensembles
  ============================================================================================

  PATH . post 15 of 28  (end of the supervised-learning book)
    <- prev:  Chapter 5, Part 2: The Mixing Ruler
       next:  Chapter 6, Part 1: Looking at a Sheet With No Answers ->

  Parts 1 and 2 built one question chart and tamed it with pruning.  One chart is jumpy
  -- it memorises quirks of whichever crowd it studied.  This post builds TWO HUNDRED
  charts and makes them vote.  The average of two hundred jumpy guesses is eerily steady
  -- the same "wisdom of the crowds" that makes a county-fair ox-weight guess land near
  the truth when you poll the whole crowd.

  Three committees, in order of cleverness:

    bagging       = deal 200 piles WITH REPEATS, build 200 trees, AVERAGE
    random forest = bagging + HIDE COLUMNS at each cut
    boosting      = a LINE of fixers, each one repairs the last one's leftovers

  Each builds on the previous.  The ideas are small.  The power is in the pile-up.


  ## Bagging: Deal, Build, Average

  One tree is jumpy.  Why?  Because it was trained on THIS crowd.  A different crowd
  would give a different tree.  The cure: make MANY crowds, build one tree per crowd,
  and average the guesses.

  The problem: you only have one study pile.  You cannot visit the hospital 200 times.
  So you FAKE it -- the same trick from
  Chapter 4, Part 2 (Bootstrap):

    pull a person, write them down, PUT THEM BACK
    repeat until the fake pile is the same size as the real one
    -> some people drawn twice, some never drawn at all

  That fake pile is a STICKY NOTE.  Deal 200 sticky notes.  Build one tree on each.

    sticky note 1 (80 people, some repeated)  --build-->  tree 1  -->  guess 150
    sticky note 2 (80 people, different mix)  --build-->  tree 2  -->  guess 162
    ...
    sticky note 200                           --build-->  tree 200 --> guess 158

    bagging's answer for this patient = AVERAGE(150, 162, ..., 158) = ~154

  !! WARN: STICKY NOTE != TREE
     The sticky note is a PILE of people (deals, votes nothing).
     The tree is BUILT from that pile (the cut-lines, casts the vote).

       sticky note (pile) --build--> tree (cut-lines) --> one guess into the average

     The sticky note's only job: be a slightly-different pile (because of repeats),
     so its tree becomes a slightly-different voter.  If all 200 piles were identical,
     the 200 trees would be identical and averaging would do nothing.
     The randomness IS the point.

  ** KEY: ONE TREE = JUMPY GARBAGE. 200 TREES AVERAGED = STEADY AND TRUSTWORTHY.
     The jumps point in different directions and cancel when averaged.  This is the
     "wisdom of the crowds."  No single tree improved -- only the average did.

  ## Vote Counting: One Patient's 200 Votes

  "Average 200 guesses" is abstract.  Let me show the actual vote sheet for one
  patient:

    patient X:  bmi=0.06, age=0.04, bp=0.02, ...

    tree #    guess for patient X
    ------------------------------
     1        151
     2        134
     3        163
    ...
     200      142

    bagging's guess = (151 + 134 + 163 + ... + 142) / 200 = ~148

  One patient, 200 trees, 200 votes.  One average.  That's the final guess.

  >> YOUR TURN
     A tiny committee of 5 trees votes 150, 160, 140, 155, 145 for one patient
     (made-up).  Work the committee's guess.

     check your slate:  sum = 150 + 160 + 140 + 155 + 145 = 750;  guess = 750 / 5 =
     150.  The average steadies the five jumpy votes into one -- and 200 votes
     steady it far more.

  !! WARN: 80 (STICKY NOTE SIZE) != 200 (TREES) != 442 (TOTAL PEOPLE)
     Each sticky note holds ~80 people drawn WITH REPLACEMENT from 354.
     Each sticky note builds exactly ONE tree.
     200 sticky notes -> 200 trees -> 200 votes per patient.
     The sticky note size (80) affects how diverse the trees are, NOT
     how many votes each patient gets.  Every tree votes once per patient,
     regardless of how many people were on its sticky note.

  The name: bagging = Bootstrap AGGregating.  Two wrappers around one tree:

    BOOTSTRAP  = the sticky-note dealing (with put-back)  -> makes 200 different piles
    AGGREGATE  = ask all 200 trees, AVERAGE the answers


  ## The Free Exam (OOB) -- Again

  IN HAND: 200 trees, each built from a sticky-note pile dealt WITH REPEATS, all voting
  once per patient and averaged into one steady guess.  This section harvests the people
  each deal happened to leave out -- a free exam, no rows held back.

  >> YOUR TURN
     "About 37% sit out each deal" -- check it for a tiny pile of n = 4 dealt with
     repeats.  The chance one named person is missed in all 4 draws is (1 - 1/4)^4.

     check your slate:  (1 - 1/4)^4 = (3/4)^4 = (3 x 3 x 3 x 3)/(4 x 4 x 4 x 4) =
     81 / 256 ~= 0.316.  About 32% sit out even at n = 4;  as the pile grows the
     fraction settles toward 1/e ~= 0.37.

  The ~37% of people left out of each sticky note were never seen by that tree.  Grade
  it on those left-out people = an honest, FREE exam.  Same trick as Chapter 4's
  out-of-bag error.

    tree #1:  trained on sticky note #1
              ~37% of people not on that note -> grade tree #1 on them -> one error
    ...
    tree #200: same idea

    average those free-exam errors = OOB score

  Spread out on paper, the OOB looks like this:

    sticky note   missed people   tree guesses   OOB error
    ------------------------------------------------------
       1          p4, p6          yes               42
       2          p1, p3, p7      yes               38
       3          p2, p5          yes               45
      ...          ...            ...              ...
      200         p3, p8          yes               40

    OOB = (42 + 38 + 45 + ... + 40) / 200

  You get an honest error WITHOUT touching the sealed 25% exam pile.


  ## Number Trees vs Bin Trees

  Trees work for BOTH answer types.  Bagging uses whichever fits:

    number answer (diabetes):  each leaf guesses its MEAN  -> bagging AVERAGES 200 guesses
    bin answer    (cancer):    each leaf guesses MAJORITY   -> bagging takes MAJORITY VOTE

  The lab's bagging stage uses the diabetes/number tree -> 200 trees -> average.


  ## Random Forest: Bagging + Hide Columns

  Bagging's flaw: all 200 trees look at the SAME 10 columns at each cut.  If bmi is
  the strongest column, it wins the top cut of nearly every tree -> the trees RHYME.
  Averaging rhyming trees helps less than averaging truly different ones.

  The fix: at each cut, roll dice and let the tree see only a random HANDFUL of columns.

    bagging cut:  may try all 10 columns -> champion of 10
    forest cut:   rolls dice, sees only ~3 columns -> champion of 3

    cut 1:  dice pick {bmi, bp, s5}      -> ignore age, weight HERE
    cut 2:  dice RE-ROLL {age, weight, s1} -> bmi ignored HERE
    cut 3:  dice re-roll {weight, bmi, s3} -> ...

    "ignored" = only for the cut where it wasn't drawn
    next cut -> fresh dice -> different columns on offer

  Over a whole tree (and 300 trees), every column gets many chances.  No single strong
  column can dominate every cut of every tree.

  !! WARN: THE CUT HIDES COLUMNS, NOT PEOPLE
     All people are still split up/down.  Nobody is hidden.  The knob limits which
     COLUMNS the cut may consider when choosing its question.

  The knob is max_features -- how many columns each cut may peek at:

    1.0    = all columns   (= plain bagging, trees rhyme)
    0.5    = half the columns
    'sqrt' = about sqrt(10) ~= 3 columns  (the default, a small handful)

  Fewer columns per cut -> trees differ more -> averaging steadier -> better exam error.
  But too few -> each tree is too weak.  Sweep all three and let the exam-error pick.

  ** KEY: FOREST = BAGGING + ONE KNOB (max_features)
     The only real news is that each cut sees fewer columns.  Everything else --
     bootstrap piles, build trees, average guesses, OOB exam -- is identical to bagging.

  !! WARN: WHY FOREST USES MORE TREES THAN BAGGING
     Bagging used 200 trees.  Forest uses 300.  Why?

       bagging:  each tree sees all 10 columns at every cut.
                 bmi dominates -> trees rhyme.  You don't need as many
                 trees because each one is already strong on its own.

       forest:   each tree sees only ~3 random columns per cut.
                 Each tree is WEAKER (it ignores 7 columns per cut).
                 You need MORE trees to compensate for the weakness.
                 More trees = more votes = the average steadies out.

     200 vs 300 is just a knob (n_estimators).  The idea: weaker trees
     need more friends to reach the same steadiness.  If you hid even
     fewer columns (max_features=0.3), you'd need even more trees.


  ## Feature Importances: Which Columns Leaned Hardest

  After building 300 trees, you can ask: "which columns did the most work?"  Each time a
  column's cut drops the badness, it gets credit.  Add up the credit across all 300 trees
  = feature_importances_.  Rank them: the column at the top was leaned on the hardest.

  This is not the same as "which column matters most in nature" -- it's "which column the
  trees used the most."  A correlated stand-in might steal credit from the true cause.


  ## Boosting: A Line of Fixers

  Bagging and forests build trees IN PARALLEL on different piles.  Boosting builds them
  IN A LINE -- each new tree fixes the LEFTOVER mistakes of the one before:

    tree 1:  guess the answer              -> leftovers = answer - guess
    tree 2:  guess the LEFTOVERS from #1   -> new leftovers
    tree 3:  guess the new leftovers       -> newer leftovers
    ...
    tree 300: by now the leftovers are tiny

    final guess = tree1 + tree2 + tree3 + ... + tree300
                  (each tree adds its small correction)

  Each tree is TINY -- usually max_depth=1 (a single question, called a STUMP).  One
  stump is weak.  Three hundred stumps, each patching the previous, become strong.

  The knob is learning_rate -- how much to trust each new fixer:

    learning_rate = 1.0   -> trust fully  -> learns fast, may overfit
    learning_rate = 0.1   -> trust 10%    -> slower, steadier
    learning_rate = 0.03  -> trust 3%     -> slow and safe

    final guess = tree1 + lr*tree2 + lr*tree3 + ... + lr*tree300

  To pick the best (learning_rate, n_trees): hold out 30% of the study pile as
  validation, watch the validation error at each step (staged_predict), pick the
  pass where validation error bottomed out.

  !! WARN: WATCH THE VALIDATION CURVE, NOT THE TRAINING CURVE
     Training error keeps dropping (the line of fixers keeps fixing its own pile).
     Validation error drops, bottoms out, then RISES -- that bottom is where to stop.
     Beyond it, each new stump is memorising study-pile noise.

  ## Boosting: The Leftover Arithmetic

  "Guess the leftovers" is abstract.  Let me show the actual table:

    4 patients, 3 stumps.  Each stump is max_depth=1 (one question -- a STUMP).

    person  answer  stump1_guess  leftover1  stump2_guess  leftover2  stump3_guess  leftover3
    ------- ------- ------------- ---------- ------------- ---------- ------------- ----------
       A      151       140          11          8             3          2            1
       B       75        80          -5         -4            -1          0           -1
       C      141       130          11          8             3          2            1
       D      121       130          -9         -7            -2         -2            0

    round 1:  stump1 is built on the REAL answers -> guess1
              leftover1 = answer - guess1

    round 2:  stump2 is built on the LEFTOVER1 column -> guess2
              leftover2 = leftover1 - guess2

    round 3:  stump3 is built on the LEFTOVER2 column -> guess3
              leftover3 = leftover2 - guess3  (approaching zero)

    final guess = stump1 + lr*stump2 + lr*stump3 + ...
                = 140    + 0.1*8    + 0.1*2    + ...

  Without the learning rate (lr=1.0):
    patient A: 140 + 8 + 2 = 150  <- overshot!  Real answer is 151, now 150.
    The stumps jumped too hard and landed past the target.

  With lr=0.1:
    patient A: 140 + 0.1*8 + 0.1*2 = 140 + 0.8 + 0.2 = 141  <- under, but steady.

  The small step means each stump takes a tiny bite instead of a huge one.
  The line moves toward the answer more slowly, but it does not overshoot.

  After each round, the leftovers shrink.  The stumps focus on what the previous
  ones missed.  By round 3, the leftovers are near zero -- the line has learned
  the pattern.

  !! WARN: YOU DO NOT SORT THE LEFTOVER COLUMN
     Each round, the tree looks at the 10 body measurements (bmi, age, bp, ...)
     and asks "which column gives the cleanest cut on the CURRENT target?"
     The target changes (answer -> leftover1 -> leftover2), but the 10 columns
     stay in the SAME order.  You sort bmi the same way every round.  The
     leftover column is not one of the 10 measurements -- it is the current
     target.  There is nothing to sort.

  ## The Learning Curve: Where to Stop

  Each stump reduces study error.  But at some point, stumps start memorising
  noise.  Hold out 30% of the study pile as validation and watch the error:

    pass       validation MSE    direction
    ----------------------------------------
       0         40.2            baseline (mean only)
      10         34.1            down
      20         31.3            down
      30         29.0            down
      40         28.2            bottom <- BEST (n_estimators=40)
      50         28.5            up     <- overfitting starts
     100         30.1            up
     300         32.0            up

    Training error keeps dropping forever.  Validation error drops, bottoms
    out, then rises.  The bottom is where you stop.  Beyond it, each new
    stump is memorising study-pile noise while harming exam performance.


  ## Interpretation 1: Scramble (Which Column Mattered)

  You have 300 frozen trees.  The exam pile is sealed.  True MSE = 25.

  First, grade the forest on the real exam pile -> true MSE = 25 (baseline).

  Then pick one column -- say "radius."  Shuffle its values across the 88 exam
  people.  Swap p1's radius with p7's.  p7's with p3's.  The radius VALUES stay
  the same, but they're attached to wrong people.

  Now feed the SCRAMBLED exam pile to the same 300 frozen trees.  Each person
  walks each tree.  Average 300 guesses per person.  Compare to real answers
  -> new MSE.

  So if the new MSE jumped UP, that column mattered.

  WHY DOES SCRAMBLE WORK?

    Trees don't change.  The exam pile changes.  A person with scrambled
    radius answers the tree's cuts wrong and lands in a wrong leaf.

    Not all trees get fooled.  Only trees that USED "radius" in their cuts
    send the person to a different leaf.  Trees that only asked about bmi
    or age give the same guess as before.  The average shifts, but not to
    pure garbage.

    Concrete example -- one person, one tree:

      tree #42 has a cut: "radius > 20?"
      p1's REAL radius = 15  -> answers NO  -> goes LEFT  -> leaf guess = 140
      p1's SCRAMBLED radius = 22 (from p7) -> answers YES -> goes RIGHT -> leaf guess = 190

      p1 with fake radius 22 lands in the wrong crowd.  Leaf guess 190
      is the average of people who REALLY had radius > 20.  p1 doesn't
      belong there.  The guess is wrong for p1.

    But not every tree uses radius.  Here is what one person walks
    across three different trees:

      tree #1 cut: "radius > 20?"    p1 says fake 22 -> YES -> wrong leaf     FOOLED
      tree #2 cut: "bmi > 0.05?"     never asks radius  -> same leaf          SAFE
      tree #3 cut: "radius > 15?"    p1 says fake 22 -> YES -> wrong leaf     FOOLED

    Trees that used radius are fooled.  Trees that never asked about radius
    give the same guess as before.  Across 300 trees, enough used radius
    that the average shifts -- the MSE jumps up.

  Repeat for every column.  Build a table:

    column      true MSE    scrambled MSE    jump
    ----------------------------------------------
    radius        25           45             +20   <- mattered
    texture       25           38             +13   <- mattered some
    smoothness    25           26             + 1   <- didn't matter
    perimeter     25           42             +17   <- mattered

    The jump = how much the honest score worsens when that column is destroyed.

  !! WARN: SCRAMBLE DOES NOT REBUILD THE FOREST
     The 300 trees were built ONCE, before any scrambling.  Scramble happens
     on the EXAM pile only.  Trees stay frozen.  You don't rebuild.  You
     don't re-bootstrap.  You don't re-fit.  Just shuffle and re-grade.

  !! WARN: THE JUMP IS NOT "CAUSATION"
     If radius and perimeter are correlated, scrambling one might make the
     other look more important than it is.  The jump tells you "how much the
     forest relied on this column" -- not "how much this column matters in
     the real world."


  ## Interpretation 2: Slide (How the Column Pushes the Guess)

  Scramble tells you WHICH column mattered.  Slide tells you HOW it pushes
  the guess.  The push is called a Partial Dependence Plot (PDP).

  First, pick one column -- say "bmi."  Keep all other 9 columns at their real
  values.

  Then give EVERY person the SAME fake bmi value.  Feed to 300 frozen trees.
  Average 300 guesses per person.  Record the average guess across all 88 people.

  Now change the fake bmi to the next value, and repeat.

  Finally, plot fake bmi (x) vs average guess (y).

  A concrete tree to see how this works (simplified to 2 cuts):

               bmi > 0.045 ?
              /             \
            YES              NO
             |                |
          bmi > 0.075?    age > 30?
           /      \        /      \
         YES       NO    YES       NO
          |         |      |         |
         200      180     120       100

  Slide: fake bmi = 0.01 for ALL 88 people.

    person  fake_bmi  path                      guess
    --------------------------------------------------
    p1      0.01      bmi>0.045? NO -> age>30? NO      100
    p2      0.01      NO -> age>30? YES                 120
    p3      0.01      NO -> age>30? YES                 120
    ...
    p88     0.01      NO -> age>30? NO                  100

    SAME fake bmi (0.01).  DIFFERENT guesses (100 or 120)
    because age splits them further at the second level.

  Now fake bmi = 0.06 for ALL 88 people:

    person  fake_bmi  path                      guess
    --------------------------------------------------
    p1      0.06      bmi>0.045? YES -> bmi>0.075? NO  180
    p2      0.06      YES -> bmi>0.075? NO              180
    ...
    p88     0.06      YES -> bmi>0.075? NO              180

    This tree only has one bmi question below 0.045, so everyone
    with fake bmi=0.06 lands at guess 180.

  Now repeat for fake bmi values [0.01, 0.03, 0.06, 0.09]:

    fake bmi    avg guess across 88 people
    ----------------------------------------
    0.01        (100 + 120 + 120 + ... + 100) / 88  =  115
    0.03        same as 0.01  (still below 0.045)   =  115
    0.06        everyone above 0.045 -> all 180      =  180
    0.09        same as 0.06  (still above 0.045)    =  180

  Plot:

    guess
      ^
    180 ========================  <- PDP (one line, average of all 88)
      |
    115 ========================
      +-----+--------+--------+---> bmi
          0.01     0.06     0.09

  PDP mixes people.  It shows the crowd average.  ICE (Individual Conditional
  Expectation) shows EACH person's line:

    guess
      ^
    180 ------------------------  <- p2,p3 (age>30, jump together)
      |       /
    120 ------/------------------  <- p2,p3 start here
      |     /
    100 -/----------------------  <- p1,p88 (age<=30, start lower)
      +-----+--------+--------+---> bmi
          0.01     0.06     0.09

    p1:  100 --------------- 180   (jumps at bmi>0.045)
    p2:  120 --------------- 180   (already higher at low bmi)
    p88: 100 --------------- 180   (same as p1)

    All 88 lines jump at the SAME fake bmi (the cut point 0.045).
    But some start at 100, some at 120.  ICE shows this spread.
    PDP hides it by averaging into one smooth line.

  !! WARN: PDP HIDES THE MESS.  ICE SHOWS IT.
     PDP = one line, average of all people at each fake value.
     ICE = 88 lines, one per person, their individual guess at each fake value.
     If the tree had more cuts, different people would jump at different fake
     values -- ICE lines spread apart.  PDP smooths them into a single curve.

  !! WARN: THE COUNT IS FAKE COLUMN VALUES, NOT TREES
     For each fake bmi value: 88 people each walk 300 trees -> 88 averages.
     You test 20 fake bmi values -> 20 points per person -> 88 lines.
     The trees don't change.  Only the fake column value changes.


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    sticky note / fake pile               bootstrap sample (with replacement)
    deal with repeats                     bootstrap / sampling with replacement
    average all trees' guesses            bagging / bootstrap aggregating
    the ~37% left out                     out-of-bag (OOB) sample
    the free exam                         OOB error estimate
    hide columns per cut                  random subspace / max_features
    the trees differ (decorrelated)       decorrelated ensemble
    line of fixers                        gradient boosting
    a tiny tree (one stump)               a weak learner / max_depth=1
    leftovers from the last fixer         residuals of the current ensemble
    trust-each-fixer knob                 learning rate / shrinkage
    where validation error bottoms out    watch error after each stump / early stopping
    scramble a column, measure drop       permutation importance
    slide one column, watch the guess     partial dependence plot (PDP)
    one line per person                   individual conditional expectation (ICE)
    n_estimators                          number of trees in the ensemble
    feature_importances_                  mean decrease in impurity (MDI)


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  >> NEW TO PYTHON? Each named once:
       BaggingRegressor()           -- 200 trees, each on a bootstrap pile, averaged
       RandomForestRegressor()      -- bagging + hide columns (max_features)
       GradientBoostingRegressor()  -- line of fixers, each on the last's leftovers
       permutation_importance()     -- scramble a column, measure the drop
       PartialDependenceDisplay     -- the slide-a-column plot

    from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.tree import DecisionTreeRegressor

    # bagging (200 trees, bootstrap, OOB free exam)
    bag = BaggingRegressor(
        DecisionTreeRegressor(random_state=42),
        n_estimators=200, bootstrap=True, oob_score=True, random_state=42)
    bag.fit(X_train, y_train)
    mse_bag = mean_squared_error(y_test, bag.predict(X_test))
    oob_r2  = bag.oob_score_

    # random forest (300 trees, hide columns, OOB)
    rf = RandomForestRegressor(
        n_estimators=300, max_features='sqrt',
        bootstrap=True, oob_score=True, random_state=42)
    rf.fit(X_train, y_train)
    # rf.feature_importances_  = how much each column was leaned on

    # boosting (300 stumps, learning_rate, staged validation curve)
    gbr = GradientBoostingRegressor(
        n_estimators=300, max_depth=1, learning_rate=0.1, random_state=42)
    gbr.fit(X_tr, y_tr)
    val_mses = [mean_squared_error(y_val, yp) for yp in gbr.staged_predict(X_val)]
    best_iter = np.argmin(val_mses) + 1   # where validation bottomed out

    # interpretation
    from sklearn.inspection import permutation_importance, PartialDependenceDisplay
    perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
    # perm.importances_mean = how much each column's scramble hurt the score
    # PartialDependenceDisplay.from_estimator(rf, X_test, [top_feature], kind='both')


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 5 -- Question Charts and Committees):
    Part 1 -- Question Charts by Hand .
    Part 2 -- The Mixing Ruler .
    Part 3 (this post)

  That closes the supervised-learning half of the book.  Continue with the next chapter:
  Chapter 6: Finding Patterns Without Answers.  For quick lookups,
  see the Appendix -- Classification Reference and the
  Glossary / Decoder Ring.
  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
  home . source on GitHub
==============================================================================================