==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  CHAPTER 5 . QUESTION CHARTS AND COMMITTEES . PART 2 OF 3
  The Mixing Ruler: Gini, Information Gain, and Pruning
  Posted: 2026-06-07 . Author: Rahul Rai . Tags: gini, information-gain, pruning, classification
  ============================================================================================

  PATH . post 14 of 28
    <- prev:  Chapter 5, Part 1: Question Charts by Hand
       next:  Chapter 5, Part 3: Committees ->

  Part 1 built the question chart for a NUMBER answer.  Each cut's badness was the sum of
  squared misses around the pile means -- a formula that made sense because the answer was
  a number you could average.  Now the answer is yes/no -- sick or well, 0 or 1 -- and
  averaging stops making sense.  You cannot take the mean of {yes, yes, no}.

  So the badness ruler must change.  The new ruler counts how MIXED each side is (how many
  yes vs how many no), and the chart picks the cut that leaves the sides LEAST mixed.
  That ruler is Gini impurity, and this post derives it by hand from four counts.

  The second half of this post solves a different problem: even a good tree grows too bushy
  and memorises noise.  Pruning snips the weak branches after the fact, and the alpha tax
  decides which branches to fire.


  ## The Classification Tree: One Change at the Bottom

  The build hunt is IDENTICAL to Part 1 -- sort, midpoints, champion, champion of
  champions, recurse.  Two tiny things change:

    NUMBER tree (Part 1):
      each final pile -> AVERAGE of the answers -> one number guess
      badness = sum of squared misses around that average

    YES/NO tree (this post):
      each final pile -> MAJORITY vote -> a bin (yes or no)
      badness = how MIXED each side is (Gini)

    same hunt, same recursion
    only the "ruler for badness" and the "guess at the bottom" differ


  ## From 4 Counts to 1 Number: The Gini Ruler

  At the first pairwise midpoint from the sorted column, draw a line:

    below:  3 yes, 1 no       (4 people)
    above:  1 yes, 4 no       (5 people)
                               ----------
                                9 people total.  4 counts.  that's ALL you need.

  First, write each count as a fraction (the "share"):

    below:  yes = 3 out of 4 = 3/4 .    no = 1 out of 4 = 1/4
    above:  yes = 1 out of 5 = 1/5 .    no = 4 out of 5 = 4/5

  Then compute each side's mixed-ness.

  Each side:  "how likely are two random grabs to DISAGREE?"

    mixed-ness = 1 - (share_yes)^2 - (share_no)^2
                   all            match             match

  For the BELOW side (4 people: 3 yes, 1 no):

    share_yes = 3/4 .  squared = 3/4 * 3/4 = (3*3)/(4*4) = 9/16
    share_no  = 1/4 .  squared = 1/4 * 1/4 = (1*1)/(4*4) = 1/16

    match = 9/16 + 1/16 = 10/16 = 5/8
    differ = 1 - 5/8 = 3/8 = 0.375

    (By pencil: split 8 into 8 eighths.  5/8 match, so 3/8 differ.)

  For the ABOVE side (5 people: 1 yes, 4 no):

    share_yes = 1/5 .  squared = 1/5 * 1/5 = (1*1)/(5*5) = 1/25
    share_no  = 4/5 .  squared = 4/5 * 4/5 = (4*4)/(5*5) = 16/25

    match = 1/25 + 16/25 = 17/25
    differ = 1 - 17/25 = 8/25 = 0.32

    (25/25 - 17/25 = 8/25.  8 / 25 = 0.32, since 25 * 0.32 = 8.)

  Then combine the two sides into one badness number, weighted by
  how many people are on each side.

    below has 4 people out of 9 total  -> weight = 4/9
    above has 5 people out of 9 total  -> weight = 5/9

    cut badness = (4/9)*(3/8) + (5/9)*(8/25)

    First term:  4/9 * 3/8 = (4*3)/(9*8) = 12/72 = 1/6
    Second term: 5/9 * 8/25 = (5*8)/(9*25) = 40/225 = 8/45

    Add:  1/6 + 8/45
          = 15/90 + 16/90    (common denominator: 90.  1/6=15/90, 8/45=16/90)
          = 31/90 ~ 0.344

  There it is.  4 counts -> one number (0.344) for this cut.

  >> YOUR TURN
     A different side holds 2 yes and 3 no (made-up).  Work its Gini on the slate.

     check your slate:  shares -- yes 2/5 = 0.4, no 3/5 = 0.6;  Gini = 1 - 0.4^2 -
     0.6^2 = 1 - 0.16 - 0.36 = 0.48.  Close to the half-half worst of 0.5 -- this
     side is badly mixed, a poor place to stop.

    pure side (all yes):  0       <- can't be cleaner
    half-half:            0.5     <- worst possible
    this cut:             0.344   <- somewhere in between

    each Gini costs only a handful of strokes, but the hunt is wide: 330 cuts x 10
    columns = 3,300 Gini numbers at THIS node alone, and a tree grows many nodes --
    clerk-work that stacks up fast as the chart deepens.
    each of the 330 cuts -> one number
    smallest number = this column's champion (the cleanest split)


  ## A Real Cancer Cut: Radius and Texture

  The earlier example used abstract "yes" and "no."  Here is the same ruler
  tied to the actual cancer sheet the lab works with.

  Four lumps, two measurements (radius and texture), one answer:

      lump   radius   texture   answer
      ---------------------------------
      p1      15       0.20      sick
      p2       8       0.10      well
      p3      22       0.30      sick
      p4      10       0.15      well

  The tree looks for the cleanest cut.  Try radius first.

  Cut radius > 12:

    Below: p2(well), p4(well)  ->  pure well  ->  GINI = 0 (weight 2/4)
    Above: p1(sick), p3(sick)  ->  pure sick  ->  GINI = 0 (weight 2/4)

    Weighted GINI = 0.5 x 0 + 0.5 x 0 = 0  <- perfect, no mixing left

  Cut radius > 15:

    Below: p2(well), p4(well), p1(sick)  ->  1 sick, 2 well
      share_sick = 1/3, share_well = 2/3
      GINI = 1 - (1/3)^2 - (2/3)^2 = 1 - 1/9 - 4/9 = 4/9 ~ 0.444
      Weight = 3/4 = 0.75

    Above: p3(sick)  ->  pure sick  ->  GINI = 0 (weight 1/4)

    Weighted GINI = 0.75 x 0.444 + 0.25 x 0 = 0.333

  Champion for radius: cut > 12 (GINI = 0).  Now try texture.

  Cut texture > 0.12:

    Below: p2(well)  ->  pure well  ->  GINI = 0 (weight 1/4)
    Above: p1(sick), p3(sick), p4(well)  ->  2 sick, 1 well
      share_sick = 2/3, share_well = 1/3
      GINI = 1 - (2/3)^2 - (1/3)^2 = 1 - 4/9 - 1/9 = 4/9 ~ 0.444
      Weight = 3/4 = 0.75

    Weighted GINI = 0.25 x 0 + 0.75 x 0.444 = 0.333

  Champion of champions:

      column     cut     weighted GINI
      ---------------------------------
      radius    > 12        0           <- champion
      texture   > 0.12      0.333

  First question: "radius > 12?"  -> pure sick piles and pure well piles.
  Then each pure pile stops (GINI = 0, nothing left to clean).

  The abstract "3 yes, 1 no" from earlier is the same math -- just wrapped
  in real column names from the cancer lab the machine actually touches.


  IN HAND: a cut turned 4 counts into one mixed-ness number -- below 3 yes 1 no gave
  0.375, above 1 yes 4 no gave 0.32, weighted into 0.344 -- where 0 is pure and 0.5 is
  half-half.  This section shows that 1 - p^2 - q^2 is no magic spell.

  >> YOUR TURN
     A cut drops the parent's mixing from 0.48 down to a weighted-children 0.30 (both
     made-up).  How much did the cut buy -- its information gain?

     check your slate:  gain = parent Gini - weighted child Gini = 0.48 - 0.30 = 0.18.
     A positive gain means the cut left the two sides cleaner than the parent;  the
     champion cut is the one with the biggest gain.

  ## Why That Formula Is Not Astrology

  The formula 1 - p^2 - q^2 looks pulled from thin air.  It isn't.  It counts
  something you can see with your hands.

  Gini = the chance two random grabs from a side DISAGREE.

  Using the below side: 3 yes, 1 no (4 people).  You reach into the pile,
  grab one person (write down yes or no), PUT THEM BACK, grab another.

    Chance BOTH yes?

      first grab:  "3 out of 4 are yes"  -> 3/4 chance
      second grab: same pile (you put the first back), same odds -> 3/4

      "OF" means MULTIPLY.  Three-quarters OF three-quarters:
        3/4 * 3/4 = (3*3)/(4*4) = 9/16

      Why multiply?  Because you need BOTH events to happen.
      "both yes" = "first yes AND second yes" = multiply the fractions.
      (By pencil: 3/4 of 3/4 is 9/16 of the whole.  You can draw a 4-by-4
      grid, shade 3 across and 3 down -- 9 shaded squares out of 16.)

    Chance BOTH no?

      1/4 * 1/4 = 1/16

    Chance they MATCH (both-yes OR both-no)?

      "OR" means ADD.  9/16 + 1/16 = 10/16 = 5/8

    Chance they DIFFER (don't match)?

      1 - 5/8 = 3/8 = 0.375

    That 0.375 is the Gini for this side.  No formula pulled from thin air --
    just counting what happens when you grab two people from the pile.

  Check the edges:

    pure side (all yes):  4 yes, 0 no
      both yes = 4/4 * 4/4 = 1.  both no = 0.  match = 1.  differ = 0.  OK.

    half-half:  2 yes, 2 no
      both yes = 2/4 * 2/4 = 4/16.  both no = 2/4 * 2/4 = 4/16.  match = 8/16 = 1/2.
      differ = 1 - 1/2 = 1/2 = 0.5.  OK.

  Gini is the DEFAULT ruler because it is fast (no logarithms).  The other option
  -- entropy (cross-entropy) -- uses a log instead of squares and often gives the
  same champion.  sklearn's default is criterion='gini' for speed.

    criterion='gini'     -> 1 - p^2 - q^2     fast (multiply and subtract)
    criterion='entropy'  -> -p log p - q log q slow (log is expensive)


  ## Information Gain: The Drop

  A cut's worth is how much it DROPS the mixing from parent to children:

    gain = Gini(parent) - weighted_Gini(children)

    parent:  whole pile, Gini = 0.49
    after cut:  left 0.375, right 0.32, weighted = 0.344
    gain = 0.49 - 0.344 = 0.146

  Higher gain = cleaner split.  The champion is the cut with the HIGHEST gain -- same
  as picking the LOWEST weighted Gini, just measured from the other direction.


  ## Stratify: Keep the Mix Even

  The cancer sheet is ~63% well, ~37% sick.  A plain random split might deal 71% well
  into the exam pile by luck -- unfair.  stratify=y_cls forces both piles to mirror the
  whole sheet's ratio:

    plain split (luck):          stratify split (forced):
    study  60% well              study  63% well   <- same as whole
    exam   71% well  <- skewed   exam   63% well   <- same as whole

  One flag in the split call.  No new idea -- just insurance against a lopsided deal.


  ## Pruning: The Alpha Tax

  A full-grown tree memorises every quirk.  The exam error is high.  Instead of capping
  depth (a blunt ceiling), pruning SNIPS weak branches after the tree is fully grown.

  The idea: charge a TAX per end-room (leaf).  A branch survives only if the wrongness
  it removes is worth more than the tax:

    chart's total cost  =  (total wrongness on study people)  +  alpha x (number of leaves)
                           +-- fit the data ---+                 +--- stay humble ---+

    alpha = 0     -> leaves are FREE  -> keep all  -> overgrown, memorises
    alpha small   -> cheap            -> snip only the weakest twigs
    alpha big     -> leaves expensive -> snip hard  -> down toward a stump

    a doctor splits a room into 2:
      wrongness dropped by 5400?   tax for +1 room = alpha
        if 5400 > alpha   -> keep the doctor   (earns its keep)
        if 5400 < alpha   -> FIRE him, merge rooms back   (not worth the tax)

  ** KEY: MAX_DEPTH IS A CEILING; CCP_ALPHA IS A TAX
     max_depth: chops ALL corridors at height N.  blunt, same haircut everywhere.
     ccp_alpha: snips weak twigs ANYWHERE, keeps strong deep branches.  smart, uneven.


  ## Where Does the Alpha Menu Come From?

  You do NOT guess random alpha values.  The tree hands you the exact list of alphas
  where a branch would snip:

    cost_complexity_pruning_path(X_train, y_train) -> ccp_alphas
    [0.0, 12.5, 80.0, ..., 1755]   <- the staircase of snip-points

  Between these alphas, nothing changes.  The tree pre-computes them.

  !! WARN: CCP_ALPHAS (THE LIST) vs CCP_ALPHA (THE KNOB)
     ccp_alphas = a list the machine hands you (all the meaningful snip-points)
     ccp_alpha  = a single knob you set on a new tree ("how hard to prune")
     Plural = the menu.  Singular = the setting.

  A REAL alpha menu from the diabetes tree:

    alpha       what it does                     leaves left
    -------------------------------------------------------------
    0.0         no snips, full-grown tree         331 leaves
    12.5        snipped 1 twig (drop < 12.5)      330
    80.0        a whole weak branch collapses     310
    204.7       another branch, ~20 leaves gone   290
    550.0       half the tree gone                160
    1755.0      everything collapsed to a stump     1

  Each alpha is ONE branch's saved-mess.  The tree pre-computes them by
  walking bottom-up: for each internal room, compute "how much wrongness
  does this room's children remove?"  That number = the alpha where that
  branch gets snipped.  The tree doesn't invent alphas -- it reads them
  off its own branches.

  !! WARN: SNIP != REBUILD
     The tree was ALREADY fully grown.  Pruning collapses rooms; it does
     NOT rebuild from scratch.  No new .fit.  Just room closure.

  !! WARN: ALPHA VALUES COME FROM THE TREE, NOT THE TEACHER
     You do not guess alpha=[0.1, 1, 10].  The tree hands you exact values
     where branches would fire.  Your job is to pick WHICH one works best
     on the exam -- not to make up alphas.


  ## The Nested Averages Ladder

  To pick the best alpha, the same 5-slice check from Chapter 4.  Three averages,
  stacked, each doing a different job:

    RUNG 1 -- one person's guess (mean of targets in his leaf):
      leaf holds {120, 140, 160}  -> guess = 140

    RUNG 2 -- one round's error (mean of misses over covered people):
      person  guess  truth  miss^2
        p1     140    150    100
        p2      90     80    100
        p3     210    220    100
      round error = mean = (100+100+100)/3 = 100

    RUNG 3 -- one alpha's score (mean of the 5 round-errors):
      round1  round2  round3  round4  round5
       100     120      90     110      80
      alpha's score = 500/5 = 100

    TOP -- pick best alpha (MIN over scores, NOT an average):
      alpha=12.5 -> 105
      alpha=80.0 -> 100   <-- smallest -> alpha_hat
      alpha=300  -> 140

    targets in a leaf  --avg-->  guess         (rung 1)
    misses over people --avg-->  round error   (rung 2)
    5 round-errors     --avg-->  alpha score   (rung 3)
    alpha scores       --MIN-->  alpha_hat     (top, not avg)


  ## Common Tripwires I Caught

  These are the exact wrong pictures I had to untangle -- each one cost me
  real time until I saw the concrete shape of the mistake:

    TRIPWIRE 1:  ccp_alphas (the list) == ccp_alpha (the knob)
       WRONG: they're the same thing, just abbreviated differently.
       RIGHT: ccp_alphas = a list of numbers the machine hands back (the menu).
              ccp_alpha  = one number YOU set on a new tree (the setting).
              Plural = menu.  Singular = knob.  Two different things.

    TRIPWIRE 2:  pruning_path needs a fitted tree first
       WRONG: you must .fit a tree, then call pruning_path on it.
       RIGHT: cost_complexity_pruning_path(X, y) builds its OWN tree
              internally just to read the menu.  No .fit needed.
              The call IS the builder.

    TRIPWIRE 3:  alpha values are invented by the teacher
       WRONG: someone picked [0, 0.01, 0.1, 1] as reasonable guesses.
       RIGHT: the tree computes each alpha from its own branches.
              Each alpha = one branch's saved-mess.  The teacher never
              touches the numbers.

    TRIPWIRE 4:  snip a branch = rebuild the tree
       WRONG: after pruning, the tree refits from scratch on the pruned
              structure.
       RIGHT: the tree was ALREADY fully grown.  ccp_alpha just collapses
              rooms.  No new .fit.  No rebuild.  Just closure.

    TRIPWIRE 5:  in boosting, you sort the leftover column each round
       WRONG: each round, you re-sort bmi/age/bp against the new leftover
              column to find the best cut.
       RIGHT: you sort bmi/age/bp the SAME way every round.  The column
              order never changes.  Only the TARGET column changes
              (answer -> leftover1 -> leftover2 -> ...).  Sorting the
              leftover column is a waste -- the tree asks "which column
              gives the cleanest cut", and the leftover column is not one
              of the 10 body measurements.

    TRIPWIRE 6:  Rung 2 and Rung 3 are the same type of average
       WRONG: they're both "mean of some numbers."
       RIGHT: Rung 2 averages misses ACROSS PEOPLE in one fold.
              Rung 3 averages fold-errors ACROSS FOLDS for one alpha.
              Different granularity, different job.  Rung 2 = per-fold,
              Rung 3 = cross-fold.  Mixing them is the classic cross-val
              confusion.


  ## Any Room Has a Wrongness -- Not Just Leaves

  "You only measure wrongness at the end -- how can you measure it in the middle?"

  Every room has a wrongness by asking "what if I stopped here?"  Its guess would be
  the mean of everyone in the room:

    middle room, 6 people, targets {100,120,140,160,180,200}
    if I STOP here:  guess = mean = 150
    wrongness = (100-150)^2 + (120-150)^2 + ... + (200-150)^2 = 7000

    after split:  {100,120,140} mean 120, wrong 800
                  {160,180,200} mean 180, wrong 800
    total = 1600

    drop = 7000 - 1600 = 5400   <- this doctor's worth

  Wrongness is NOT leaf-only.  Any room can be graded by "pretend it's a leaf."


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    mixed-ness / how mixed                Gini impurity
    the drop in mixing                    information gain
    share yes / share no                  class proportion / p, q
    pure side                             a node with Gini = 0 (homogeneous)
    half-half (worst)                     maximum Gini = 0.5 (for 2 classes)
    the 4 counts                          class counts per child node
    the weighted combine                  weighted average Gini of children
    the alpha tax                         cost-complexity pruning (ccp_alpha)
    the staircase of snip-points          the effective-alpha sequence
    fire a doctor                         prune a subtree
    keep the sick:well ratio even         stratified train/test split


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  >> NEW TO PYTHON? Each named once:
       DecisionTreeClassifier()     -- a question-chart machine for bin answers
       criterion='gini'             -- the mixing ruler (default)
       cost_complexity_pruning_path -- the staircase of alpha snip-points
       KFold                        -- the 5-slice splitter for the alpha check

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    # classification tree
    tree_cls = DecisionTreeClassifier(criterion='gini', random_state=42)
    tree_cls.fit(X_train_cls, y_train_cls)
    acc = accuracy_score(y_test_cls, tree_cls.predict(X_test_cls))

    # pruning: get alpha menu, 5-slice check, pick winner
    path = DecisionTreeRegressor(random_state=42).fit(X_train, y_train) \
           .cost_complexity_pruning_path(X_train, y_train)
    ccp_alphas = path.ccp_alphas   # the staircase

    from sklearn.model_selection import KFold
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    # for each alpha: 5-fold CV -> mean MSE -> pick argmin -> alpha_hat
    # then: tree_pruned = DecisionTreeRegressor(ccp_alpha=alpha_hat, ...)
    #       tree_pruned.fit(X_train, y_train)  <- rebuild with the tax


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 5 -- Question Charts and Committees):
    Part 1 -- Question Charts by Hand .
    Part 2 (this post) .
    Part 3 -- Committees

  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
  home . source on GitHub
==============================================================================================