==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------
CHAPTER 5 . QUESTION CHARTS AND COMMITTEES . PART 2 OF 3
The Mixing Ruler: Gini, Information Gain, and Pruning
Posted: 2026-06-07 . Author: Rahul Rai . Tags: gini, information-gain, pruning, classification
============================================================================================
PATH . post 14 of 28
<- prev: Chapter 5, Part 1: Question Charts by Hand
next: Chapter 5, Part 3: Committees ->
Part 1 built the question chart for a NUMBER answer. Each cut's badness was the sum of
squared misses around the pile means -- a formula that made sense because the answer was
a number you could average. Now the answer is yes/no -- sick or well, 0 or 1 -- and
averaging stops making sense. You cannot take the mean of {yes, yes, no}.
So the badness ruler must change. The new ruler counts how MIXED each side is (how many
yes vs how many no), and the chart picks the cut that leaves the sides LEAST mixed.
That ruler is Gini impurity, and this post derives it by hand from four counts.
The second half of this post solves a different problem: even a good tree grows too bushy
and memorises noise. Pruning snips the weak branches after the fact, and the alpha tax
decides which branches to fire.
## The Classification Tree: One Change at the Bottom
The build hunt is IDENTICAL to Part 1 -- sort, midpoints, champion, champion of
champions, recurse. Two tiny things change:
NUMBER tree (Part 1):
each final pile -> AVERAGE of the answers -> one number guess
badness = sum of squared misses around that average
YES/NO tree (this post):
each final pile -> MAJORITY vote -> a bin (yes or no)
badness = how MIXED each side is (Gini)
same hunt, same recursion
only the "ruler for badness" and the "guess at the bottom" differ
## From 4 Counts to 1 Number: The Gini Ruler
At the first pairwise midpoint from the sorted column, draw a line:
below: 3 yes, 1 no (4 people)
above: 1 yes, 4 no (5 people)
----------
9 people total. 4 counts. that's ALL you need.
First, write each count as a fraction (the "share"):
below: yes = 3 out of 4 = 3/4 . no = 1 out of 4 = 1/4
above: yes = 1 out of 5 = 1/5 . no = 4 out of 5 = 4/5
Then compute each side's mixed-ness.
Each side: "how likely are two random grabs to DISAGREE?"
mixed-ness = 1 - (share_yes)^2 - (share_no)^2
all match match
For the BELOW side (4 people: 3 yes, 1 no):
share_yes = 3/4 . squared = 3/4 * 3/4 = (3*3)/(4*4) = 9/16
share_no = 1/4 . squared = 1/4 * 1/4 = (1*1)/(4*4) = 1/16
match = 9/16 + 1/16 = 10/16 = 5/8
differ = 1 - 5/8 = 3/8 = 0.375
(By pencil: split 8 into 8 eighths. 5/8 match, so 3/8 differ.)
For the ABOVE side (5 people: 1 yes, 4 no):
share_yes = 1/5 . squared = 1/5 * 1/5 = (1*1)/(5*5) = 1/25
share_no = 4/5 . squared = 4/5 * 4/5 = (4*4)/(5*5) = 16/25
match = 1/25 + 16/25 = 17/25
differ = 1 - 17/25 = 8/25 = 0.32
(25/25 - 17/25 = 8/25. 8 / 25 = 0.32, since 25 * 0.32 = 8.)
Then combine the two sides into one badness number, weighted by
how many people are on each side.
below has 4 people out of 9 total -> weight = 4/9
above has 5 people out of 9 total -> weight = 5/9
cut badness = (4/9)*(3/8) + (5/9)*(8/25)
First term: 4/9 * 3/8 = (4*3)/(9*8) = 12/72 = 1/6
Second term: 5/9 * 8/25 = (5*8)/(9*25) = 40/225 = 8/45
Add: 1/6 + 8/45
= 15/90 + 16/90 (common denominator: 90. 1/6=15/90, 8/45=16/90)
= 31/90 ~ 0.344
There it is. 4 counts -> one number (0.344) for this cut.
>> YOUR TURN
A different side holds 2 yes and 3 no (made-up). Work its Gini on the slate.
check your slate: shares -- yes 2/5 = 0.4, no 3/5 = 0.6; Gini = 1 - 0.4^2 -
0.6^2 = 1 - 0.16 - 0.36 = 0.48. Close to the half-half worst of 0.5 -- this
side is badly mixed, a poor place to stop.
pure side (all yes): 0 <- can't be cleaner
half-half: 0.5 <- worst possible
this cut: 0.344 <- somewhere in between
each Gini costs only a handful of strokes, but the hunt is wide: 330 cuts x 10
columns = 3,300 Gini numbers at THIS node alone, and a tree grows many nodes --
clerk-work that stacks up fast as the chart deepens.
each of the 330 cuts -> one number
smallest number = this column's champion (the cleanest split)
## A Real Cancer Cut: Radius and Texture
The earlier example used abstract "yes" and "no." Here is the same ruler
tied to the actual cancer sheet the lab works with.
Four lumps, two measurements (radius and texture), one answer:
lump radius texture answer
---------------------------------
p1 15 0.20 sick
p2 8 0.10 well
p3 22 0.30 sick
p4 10 0.15 well
The tree looks for the cleanest cut. Try radius first.
Cut radius > 12:
Below: p2(well), p4(well) -> pure well -> GINI = 0 (weight 2/4)
Above: p1(sick), p3(sick) -> pure sick -> GINI = 0 (weight 2/4)
Weighted GINI = 0.5 x 0 + 0.5 x 0 = 0 <- perfect, no mixing left
Cut radius > 15:
Below: p2(well), p4(well), p1(sick) -> 1 sick, 2 well
share_sick = 1/3, share_well = 2/3
GINI = 1 - (1/3)^2 - (2/3)^2 = 1 - 1/9 - 4/9 = 4/9 ~ 0.444
Weight = 3/4 = 0.75
Above: p3(sick) -> pure sick -> GINI = 0 (weight 1/4)
Weighted GINI = 0.75 x 0.444 + 0.25 x 0 = 0.333
Champion for radius: cut > 12 (GINI = 0). Now try texture.
Cut texture > 0.12:
Below: p2(well) -> pure well -> GINI = 0 (weight 1/4)
Above: p1(sick), p3(sick), p4(well) -> 2 sick, 1 well
share_sick = 2/3, share_well = 1/3
GINI = 1 - (2/3)^2 - (1/3)^2 = 1 - 4/9 - 1/9 = 4/9 ~ 0.444
Weight = 3/4 = 0.75
Weighted GINI = 0.25 x 0 + 0.75 x 0.444 = 0.333
Champion of champions:
column cut weighted GINI
---------------------------------
radius > 12 0 <- champion
texture > 0.12 0.333
First question: "radius > 12?" -> pure sick piles and pure well piles.
Then each pure pile stops (GINI = 0, nothing left to clean).
The abstract "3 yes, 1 no" from earlier is the same math -- just wrapped
in real column names from the cancer lab the machine actually touches.
IN HAND: a cut turned 4 counts into one mixed-ness number -- below 3 yes 1 no gave
0.375, above 1 yes 4 no gave 0.32, weighted into 0.344 -- where 0 is pure and 0.5 is
half-half. This section shows that 1 - p^2 - q^2 is no magic spell.
>> YOUR TURN
A cut drops the parent's mixing from 0.48 down to a weighted-children 0.30 (both
made-up). How much did the cut buy -- its information gain?
check your slate: gain = parent Gini - weighted child Gini = 0.48 - 0.30 = 0.18.
A positive gain means the cut left the two sides cleaner than the parent; the
champion cut is the one with the biggest gain.
## Why That Formula Is Not Astrology
The formula 1 - p^2 - q^2 looks pulled from thin air. It isn't. It counts
something you can see with your hands.
Gini = the chance two random grabs from a side DISAGREE.
Using the below side: 3 yes, 1 no (4 people). You reach into the pile,
grab one person (write down yes or no), PUT THEM BACK, grab another.
Chance BOTH yes?
first grab: "3 out of 4 are yes" -> 3/4 chance
second grab: same pile (you put the first back), same odds -> 3/4
"OF" means MULTIPLY. Three-quarters OF three-quarters:
3/4 * 3/4 = (3*3)/(4*4) = 9/16
Why multiply? Because you need BOTH events to happen.
"both yes" = "first yes AND second yes" = multiply the fractions.
(By pencil: 3/4 of 3/4 is 9/16 of the whole. You can draw a 4-by-4
grid, shade 3 across and 3 down -- 9 shaded squares out of 16.)
Chance BOTH no?
1/4 * 1/4 = 1/16
Chance they MATCH (both-yes OR both-no)?
"OR" means ADD. 9/16 + 1/16 = 10/16 = 5/8
Chance they DIFFER (don't match)?
1 - 5/8 = 3/8 = 0.375
That 0.375 is the Gini for this side. No formula pulled from thin air --
just counting what happens when you grab two people from the pile.
Check the edges:
pure side (all yes): 4 yes, 0 no
both yes = 4/4 * 4/4 = 1. both no = 0. match = 1. differ = 0. OK.
half-half: 2 yes, 2 no
both yes = 2/4 * 2/4 = 4/16. both no = 2/4 * 2/4 = 4/16. match = 8/16 = 1/2.
differ = 1 - 1/2 = 1/2 = 0.5. OK.
Gini is the DEFAULT ruler because it is fast (no logarithms). The other option
-- entropy (cross-entropy) -- uses a log instead of squares and often gives the
same champion. sklearn's default is criterion='gini' for speed.
criterion='gini' -> 1 - p^2 - q^2 fast (multiply and subtract)
criterion='entropy' -> -p log p - q log q slow (log is expensive)
## Information Gain: The Drop
A cut's worth is how much it DROPS the mixing from parent to children:
gain = Gini(parent) - weighted_Gini(children)
parent: whole pile, Gini = 0.49
after cut: left 0.375, right 0.32, weighted = 0.344
gain = 0.49 - 0.344 = 0.146
Higher gain = cleaner split. The champion is the cut with the HIGHEST gain -- same
as picking the LOWEST weighted Gini, just measured from the other direction.
## Stratify: Keep the Mix Even
The cancer sheet is ~63% well, ~37% sick. A plain random split might deal 71% well
into the exam pile by luck -- unfair. stratify=y_cls forces both piles to mirror the
whole sheet's ratio:
plain split (luck): stratify split (forced):
study 60% well study 63% well <- same as whole
exam 71% well <- skewed exam 63% well <- same as whole
One flag in the split call. No new idea -- just insurance against a lopsided deal.
## Pruning: The Alpha Tax
A full-grown tree memorises every quirk. The exam error is high. Instead of capping
depth (a blunt ceiling), pruning SNIPS weak branches after the tree is fully grown.
The idea: charge a TAX per end-room (leaf). A branch survives only if the wrongness
it removes is worth more than the tax:
chart's total cost = (total wrongness on study people) + alpha x (number of leaves)
+-- fit the data ---+ +--- stay humble ---+
alpha = 0 -> leaves are FREE -> keep all -> overgrown, memorises
alpha small -> cheap -> snip only the weakest twigs
alpha big -> leaves expensive -> snip hard -> down toward a stump
a doctor splits a room into 2:
wrongness dropped by 5400? tax for +1 room = alpha
if 5400 > alpha -> keep the doctor (earns its keep)
if 5400 < alpha -> FIRE him, merge rooms back (not worth the tax)
** KEY: MAX_DEPTH IS A CEILING; CCP_ALPHA IS A TAX
max_depth: chops ALL corridors at height N. blunt, same haircut everywhere.
ccp_alpha: snips weak twigs ANYWHERE, keeps strong deep branches. smart, uneven.
## Where Does the Alpha Menu Come From?
You do NOT guess random alpha values. The tree hands you the exact list of alphas
where a branch would snip:
cost_complexity_pruning_path(X_train, y_train) -> ccp_alphas
[0.0, 12.5, 80.0, ..., 1755] <- the staircase of snip-points
Between these alphas, nothing changes. The tree pre-computes them.
!! WARN: CCP_ALPHAS (THE LIST) vs CCP_ALPHA (THE KNOB)
ccp_alphas = a list the machine hands you (all the meaningful snip-points)
ccp_alpha = a single knob you set on a new tree ("how hard to prune")
Plural = the menu. Singular = the setting.
A REAL alpha menu from the diabetes tree:
alpha what it does leaves left
-------------------------------------------------------------
0.0 no snips, full-grown tree 331 leaves
12.5 snipped 1 twig (drop < 12.5) 330
80.0 a whole weak branch collapses 310
204.7 another branch, ~20 leaves gone 290
550.0 half the tree gone 160
1755.0 everything collapsed to a stump 1
Each alpha is ONE branch's saved-mess. The tree pre-computes them by
walking bottom-up: for each internal room, compute "how much wrongness
does this room's children remove?" That number = the alpha where that
branch gets snipped. The tree doesn't invent alphas -- it reads them
off its own branches.
!! WARN: SNIP != REBUILD
The tree was ALREADY fully grown. Pruning collapses rooms; it does
NOT rebuild from scratch. No new .fit. Just room closure.
!! WARN: ALPHA VALUES COME FROM THE TREE, NOT THE TEACHER
You do not guess alpha=[0.1, 1, 10]. The tree hands you exact values
where branches would fire. Your job is to pick WHICH one works best
on the exam -- not to make up alphas.
## The Nested Averages Ladder
To pick the best alpha, the same 5-slice check from Chapter 4. Three averages,
stacked, each doing a different job:
RUNG 1 -- one person's guess (mean of targets in his leaf):
leaf holds {120, 140, 160} -> guess = 140
RUNG 2 -- one round's error (mean of misses over covered people):
person guess truth miss^2
p1 140 150 100
p2 90 80 100
p3 210 220 100
round error = mean = (100+100+100)/3 = 100
RUNG 3 -- one alpha's score (mean of the 5 round-errors):
round1 round2 round3 round4 round5
100 120 90 110 80
alpha's score = 500/5 = 100
TOP -- pick best alpha (MIN over scores, NOT an average):
alpha=12.5 -> 105
alpha=80.0 -> 100 <-- smallest -> alpha_hat
alpha=300 -> 140
targets in a leaf --avg--> guess (rung 1)
misses over people --avg--> round error (rung 2)
5 round-errors --avg--> alpha score (rung 3)
alpha scores --MIN--> alpha_hat (top, not avg)
## Common Tripwires I Caught
These are the exact wrong pictures I had to untangle -- each one cost me
real time until I saw the concrete shape of the mistake:
TRIPWIRE 1: ccp_alphas (the list) == ccp_alpha (the knob)
WRONG: they're the same thing, just abbreviated differently.
RIGHT: ccp_alphas = a list of numbers the machine hands back (the menu).
ccp_alpha = one number YOU set on a new tree (the setting).
Plural = menu. Singular = knob. Two different things.
TRIPWIRE 2: pruning_path needs a fitted tree first
WRONG: you must .fit a tree, then call pruning_path on it.
RIGHT: cost_complexity_pruning_path(X, y) builds its OWN tree
internally just to read the menu. No .fit needed.
The call IS the builder.
TRIPWIRE 3: alpha values are invented by the teacher
WRONG: someone picked [0, 0.01, 0.1, 1] as reasonable guesses.
RIGHT: the tree computes each alpha from its own branches.
Each alpha = one branch's saved-mess. The teacher never
touches the numbers.
TRIPWIRE 4: snip a branch = rebuild the tree
WRONG: after pruning, the tree refits from scratch on the pruned
structure.
RIGHT: the tree was ALREADY fully grown. ccp_alpha just collapses
rooms. No new .fit. No rebuild. Just closure.
TRIPWIRE 5: in boosting, you sort the leftover column each round
WRONG: each round, you re-sort bmi/age/bp against the new leftover
column to find the best cut.
RIGHT: you sort bmi/age/bp the SAME way every round. The column
order never changes. Only the TARGET column changes
(answer -> leftover1 -> leftover2 -> ...). Sorting the
leftover column is a waste -- the tree asks "which column
gives the cleanest cut", and the leftover column is not one
of the 10 body measurements.
TRIPWIRE 6: Rung 2 and Rung 3 are the same type of average
WRONG: they're both "mean of some numbers."
RIGHT: Rung 2 averages misses ACROSS PEOPLE in one fold.
Rung 3 averages fold-errors ACROSS FOLDS for one alpha.
Different granularity, different job. Rung 2 = per-fold,
Rung 3 = cross-fold. Mixing them is the classic cross-val
confusion.
## Any Room Has a Wrongness -- Not Just Leaves
"You only measure wrongness at the end -- how can you measure it in the middle?"
Every room has a wrongness by asking "what if I stopped here?" Its guess would be
the mean of everyone in the room:
middle room, 6 people, targets {100,120,140,160,180,200}
if I STOP here: guess = mean = 150
wrongness = (100-150)^2 + (120-150)^2 + ... + (200-150)^2 = 7000
after split: {100,120,140} mean 120, wrong 800
{160,180,200} mean 180, wrong 800
total = 1600
drop = 7000 - 1600 = 5400 <- this doctor's worth
Wrongness is NOT leaf-only. Any room can be graded by "pretend it's a leaf."
## The Labels, Last
Plain term used above Standard label
----------------------------------- ------------------------------------------
mixed-ness / how mixed Gini impurity
the drop in mixing information gain
share yes / share no class proportion / p, q
pure side a node with Gini = 0 (homogeneous)
half-half (worst) maximum Gini = 0.5 (for 2 classes)
the 4 counts class counts per child node
the weighted combine weighted average Gini of children
the alpha tax cost-complexity pruning (ccp_alpha)
the staircase of snip-points the effective-alpha sequence
fire a doctor prune a subtree
keep the sick:well ratio even stratified train/test split
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
>> NEW TO PYTHON? Each named once:
DecisionTreeClassifier() -- a question-chart machine for bin answers
criterion='gini' -- the mixing ruler (default)
cost_complexity_pruning_path -- the staircase of alpha snip-points
KFold -- the 5-slice splitter for the alpha check
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# classification tree
tree_cls = DecisionTreeClassifier(criterion='gini', random_state=42)
tree_cls.fit(X_train_cls, y_train_cls)
acc = accuracy_score(y_test_cls, tree_cls.predict(X_test_cls))
# pruning: get alpha menu, 5-slice check, pick winner
path = DecisionTreeRegressor(random_state=42).fit(X_train, y_train) \
.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas # the staircase
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# for each alpha: 5-fold CV -> mean MSE -> pick argmin -> alpha_hat
# then: tree_pruned = DecisionTreeRegressor(ccp_alpha=alpha_hat, ...)
# tree_pruned.fit(X_train, y_train) <- rebuild with the tax
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 5 -- Question Charts and Committees):
Part 1 -- Question Charts by Hand .
Part 2 (this post) .
Part 3 -- Committees
<- Back to all posts
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
home . source on GitHub
==============================================================================================