==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
CHAPTER 3 . SORTING INTO BINS . PART 4 OF 4
Picking Settings, Skewed Piles, and Averaging Many Classes
Posted: 2026-06-05 . Author: Rahul Rai . Tags: grid-search, precision-recall, class-imbalance
============================================================================================
PATH . post 9 of 28
So far the world has been tidy: one sheet of breast lumps, one clean yes-or-no question,
two groups of roughly equal size. Real problems are rarely so polite. This closing post
drags the series out into the messy world and arms you for four things it will throw at
you.
How do you tune a machine's settings without cheating by peeking at the exam? What do
you do when one bin outnumbers the other ninety-nine to one and your proud 99% accuracy
turns out to be worthless? Which ruler do you trust when an outlier wanders in? And how
do you grade a machine that is sorting into not two bins but ten? Answer these and you
have left the textbook behind and stepped onto the workshop floor.
## Settings I Pick By Hand (Hyperparameters)
Some numbers inside a machine are set by rolling downhill on the leftover -- the dials
beta in the S-curve machine. Others are decided BEFORE training starts and are never
touched by the leftover: the leash strength C, the neighbour-count k, the depth of a
decision tree. These are settings you pick by hand. The machine cannot tune them for
itself because the leftover doesn't flow back through them.
dials (beta): tuned by the machine as it trains <- the machine adjusts these
settings: chosen by you before training starts <- the machine never sees these
examples:
C in LogisticRegression(C=0.1)               <- leash strength
k in KNeighborsClassifier(n_neighbors=5)     <- how many neighbours to ask
n_estimators in RandomForestClassifier       <- how many trees to grow
## The Grid Hunt
You have several settings, each with several candidate values. A grid lists every
combination. You try each one and measure how well it does -- but you must not use the
sealed exam pile to measure. The fix is rotating folds: split the study pile into, say, 5
equal strips; for each combination, train on 4 strips, score on the 5th, rotate which
strip is left out, repeat, then average the 5 scores.
grid for C and k:
+-------+----+----+----+
| |C=.1|C=1 |C=10|
+-------+----+----+----+
| k = 3 | . | . | . | each cell = a combination
| k = 5 | . | . | . | each dot = average of 5 fold scores
| k = 7 | . | . | . | pick the cell with the best score
+-------+----+----+----+
Three things go into making a grid hunt trustworthy:
1. The range and spacing of each setting. If C should be 0.01 the grid must include
0.01; a grid of {1, 10, 100} misses it entirely. Both the candidate values and how
finely you space them are decisions you make up front.
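2. Rotating folds (cross-validation) to measure each combination reliably. A single
train/score split might be lucky or unlucky. Rotating over 5 strips gives 5 readings,
each on a strip the machine did not train on, and averaging them smooths out much of
that luck. (The readings are not fully independent: their training strips overlap.)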
3. A scoring rule that matches the actual goal: accuracy if all mistakes cost the same;
recall if missing sick people is catastrophic; F1 if both false alarms and misses
matter.
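Optional, for readers who like to see the rotation spelled out: a minimal sketch of
point 2 in plain Python. This is an illustration only, not scikit-learn's own splitter.
The helper name rotating_folds is made up here, and it assumes the row count divides
evenly into the strips.

```python
def rotating_folds(n_rows, n_strips=5):
    """Yield (train_rows, score_rows) index lists, one pair per rotation.

    Hypothetical helper for illustration; assumes n_rows divides evenly.
    """
    strip_size = n_rows // n_strips
    rows = list(range(n_rows))
    for i in range(n_strips):
        start, stop = i * strip_size, (i + 1) * strip_size
        score_rows = rows[start:stop]            # the strip left out this round
        train_rows = rows[:start] + rows[stop:]  # the other strips, used for training
        yield train_rows, score_rows


folds = list(rotating_folds(100, n_strips=5))
for round_no, (train_rows, score_rows) in enumerate(folds, start=1):
    print(f"round {round_no}: train on {len(train_rows)} rows, "
          f"score on rows {score_rows[0]}..{score_rows[-1]}")
```

Every row is scored exactly once and trained on four times, which is what lets the
average stand in for a single lucky or unlucky split.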
A concrete 5-fold example for C=0.1 vs C=1.0, by pencil. Study pile has
100 lumps. Split into 5 strips of 20 each.
strip C=0.1 recall C=1.0 recall
---------------------------------------
1 0.92 0.95
2 0.88 0.90
3 0.94 0.89
4 0.90 0.93
5 0.86 0.82
average: (0.92+0.88+0.94+0.90+0.86)/5 = 4.50/5 = 0.900 (C=0.1)
(0.95+0.90+0.89+0.93+0.82)/5 = 4.49/5 = 0.898 (C=1.0)
C=0.1 wins by 0.002. A single split could have pointed either
way: on strip 1 alone C=1.0 looks better (0.95 vs 0.92); on
strip 3 alone C=0.1 looks better (0.94 vs 0.89). The fold
average smooths out that luck. Note how small the margin is
next to the strip-to-strip swing: read this as a near-tie that
leans towards C=0.1 on recall, not a decisive win.
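Optional: the same pencil averages, checked by machine. The sketch below only re-adds
the numbers from the table above; it also prints the strip-to-strip spread (population
standard deviation), which is an extra this post did not compute by hand.

```python
from statistics import mean, pstdev

# recall per strip, copied from the pencil table above
recall_by_C = {
    0.1: [0.92, 0.88, 0.94, 0.90, 0.86],
    1.0: [0.95, 0.90, 0.89, 0.93, 0.82],
}

fold_mean = {C: mean(scores) for C, scores in recall_by_C.items()}
fold_spread = {C: pstdev(scores) for C, scores in recall_by_C.items()}

for C in recall_by_C:
    print(f"C={C}: mean recall {fold_mean[C]:.3f}, "
          f"strip-to-strip spread {fold_spread[C]:.3f}")

# the setting with the highest average recall across the five strips
best_C = max(fold_mean, key=fold_mean.get)
print("best C by mean recall:", best_C)
```

The spread for each setting is several times larger than the 0.002 gap between the two
averages, a useful reminder of how thin that margin is.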
>> YOUR TURN
A grid pairs 4 candidate C-values with 3 candidate k-values, and scores every
pair by 5-fold rotation (made-up). How many full model-fits is that?
check your slate: pairs = 4 x 3 = 12; fits = 12 x 5 folds = 60 complete fits.
Each fit is a whole downhill solve -- this is why a clerk-room, not a single
clerk, runs a grid hunt, and why the grid's size is a cost you choose.
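Optional: the slate count above, as a few lines of Python. The candidate values are
made-up placeholders; only how many there are matters for the count.

```python
from itertools import product

C_values = [0.01, 0.1, 1, 10]  # 4 made-up candidates
k_values = [3, 5, 7]           # 3 made-up candidates
n_folds = 5

pairs = list(product(C_values, k_values))  # every (C, k) combination
n_fits = len(pairs) * n_folds
print(len(pairs), "pairs x", n_folds, "folds =", n_fits, "fits")
```

One caveat if you later hand this to GridSearchCV: with its default refit=True it fits
the winning pair once more on the whole study pile, so the real tally is one higher.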
In code the whole hunt is a Pipeline -- scaler plus classifier -- handed to GridSearchCV;
that code is at the end of the post, and it carries one subtle trap worth stating now:
!! WARN: BUILD THE RULER INSIDE EACH FOLD, NOT BEFORE THE HUNT
A subtle leak: if you scale all of X_train ONCE up front and then pass that pre-scaled
pile to GridSearchCV, every validation strip has already helped shape the ruler (its
mean and spread). The strip you score on has leaked into the scaling, so the fold
scores come out optimistic -- even though the final exam pile (X_test) is still
sealed. The fix is to put the scaler INSIDE a Pipeline and pass raw X_train: now the
ruler is rebuilt from each fold's 4 training strips only, and the held-out strip is
measured by a ruler it never touched. Same peeking rule from Part 1, one level deeper.
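Optional: a small sketch of why the leak matters, using one made-up column of ten
study-pile values. The numbers are invented and deliberately exaggerated (the held-out
strip is unusually large) so the effect is easy to see; real leaks are usually subtler.

```python
from statistics import mean, pstdev

# made-up column of 10 study-pile values; the last 2 play the held-out strip
study = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.0, 15.0, 30.0, 34.0]
train_strips, held_out = study[:8], study[8:]

# leaky ruler: built from ALL study rows, held-out strip included
leaky_mean, leaky_spread = mean(study), pstdev(study)
# clean ruler: built from the training strips only
clean_mean, clean_spread = mean(train_strips), pstdev(train_strips)


def scale(values, centre, spread):
    """Apply a standard ruler (shift by centre, divide by spread)."""
    return [round((v - centre) / spread, 2) for v in values]


leaky_view = scale(held_out, leaky_mean, leaky_spread)
clean_view = scale(held_out, clean_mean, clean_spread)
print("held-out strip, leaky ruler:", leaky_view)
print("held-out strip, clean ruler:", clean_view)
```

Under the leaky ruler the held-out rows look far less unusual than they really are,
because they helped set the mean and spread they are being measured against.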
>> NOTE: THE EXAM PILE IS STILL SEALED
GridSearchCV only ever touches the study pile (X_train). The exam pile (X_test) is
opened exactly once at the very end to report the final score. If you tune settings on
the exam pile, you are leaking future information and the final score is optimistic --
the machine has already peeked.
## Two Kinds of Same-Ruler
Parts 1-3 used the standard ruler: shift each column to mean 0, divide by its spread.
There is a second popular ruler -- pinch-to-fit -- that squishes every column's values to
sit between 0 and 1, with the smallest value landing exactly on 0 and the largest on 1.
standard ruler: x~ = (x - mean) / spread -> centred at 0, spread ~= 1
pinch-to-fit: x~ = (x - min) / (max - min) -> bounded in [0, 1]
!! WARN: PINCH-TO-FIT IS FRAGILE AROUND OUTLIERS
One outlier stretches the range. Say house areas run 500-3000 sq ft but one mansion
hits 15000. The (max-min) denominator becomes 14500. Every other house gets squished
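into the bottom 17% of [0, 1]. The outlier sits at 1.0; the rest cluster near 0. The
ruler works mathematically but crams the normal values into a sliver of the range.
The standard ruler is pulled by the outlier too (it inflates the spread), but less
brutally: max is set by the single most extreme row no matter how many normal rows
there are, whereas mean and spread are averages over every row, so on a sheet with
hundreds of ordinary houses one mansion moves them only a little.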
area values: 500, 520, 3000, 15000
pinch-to-fit (max=15000, min=500):
500 -> 0.000
520 -> 0.001 <- normal house, near 0
3000 -> 0.172 <- large house, still near 0
15000 -> 1.000 <- outlier, fills the far end
standard ruler (mean = 4755, spread ~= 6002):
500 -> -0.71
520 -> -0.71 <- normal houses: within one spread of the centre
3000 -> -0.29
15000 -> 1.71 <- outlier: unusual, but not pinned to a hard ceiling
For the same-ruler rule (build from study only, apply to both), the same peeking logic
applies: compute min, max (or mean, spread) from the study pile only.
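Optional: both rulers applied to the four house areas by machine. This sketch assumes
"spread" means the population standard deviation (divide by n, not n-1), which is the
convention StandardScaler follows.

```python
from statistics import mean, pstdev

areas = [500, 520, 3000, 15000]  # the four house areas from the table above

# pinch-to-fit: shift by the min, divide by the range
lo, hi = min(areas), max(areas)
pinch = [(a - lo) / (hi - lo) for a in areas]

# standard ruler: shift by the mean, divide by the population spread
centre, spread = mean(areas), pstdev(areas)
standard = [(a - centre) / spread for a in areas]

print(f"mean = {centre:.0f}, spread = {spread:.0f}")
for a, p, s in zip(areas, pinch, standard):
    print(f"{a:>6} -> pinch-to-fit {p:.3f}   standard {s:+.2f}")
```

Both rulers are straight-line maps, so the ratio between any two gaps is identical
under each. What differs is the scale and the hard 0-to-1 walls: pinch-to-fit lets the
single most extreme row set the whole range.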
## When ALARM Matters More Than MISSED
Part 1 argued that in cancer screening, recall is the north star because a MISSED sick
person is catastrophic. In other settings, the ALARM is the catastrophe.
precision falls when ALARM is large <- false positives pile up
recall falls when MISSED is large <- false negatives pile up
cancer screening: MISSED = cancer goes untreated -> prioritise recall
legal evidence: ALARM = innocent person jailed -> prioritise precision
email spam: ALARM = real email deleted -> precision matters more
fraud detection: MISSED = fraud slips through -> loss -> recall matters
** KEY: PRECISION REDUCES ALARMS; RECALL REDUCES MISSES
Choosing which to prioritise is a business or clinical decision, not a machine
decision. The machine gives you a curve of deals (the trade curve from Part 2); you
pick the deal that matches what each kind of mistake actually costs.
In fraud detection, MISSED = a fraudulent transaction slips through and the business
absorbs the loss. ALARM = a legitimate transaction is blocked and the customer is
frustrated. For most fraud systems the financial loss of a missed fraud is worse than the
friction of a blocked legitimate purchase, so recall is the primary score -- but both
matter, and precision-recall curves (next) show the exact deal between them.
## Skewed Piles and the Precision-Recall Curve
The breast lump sheet leaned 63% well to 37% sick -- tilted, but you could still stand on
it. Now imagine the floor pitched almost vertical. A real fraud sheet might run 99.9%
honest and 0.1% crooked. That is a SKEWED pile, and on ground this steep the ordinary
scores quietly start lying to you.
skewed fraud sheet: 100000 rows
99900 normal transactions (class 0)
100 fraud transactions (class 1)
fool machine: always call "normal"
accuracy = 99900 / 100000 = 99.9% <- looks extraordinary
recall = 0 / 100 = 0.0% <- catches no fraud at all
The ROC curve (trade curve) also has a problem with skewed piles. FPR divides ALARM by
all truly-normal rows -- that denominator is 99900. Even if the machine raises 2000 false
alarms, FPR = 2000/99900 = 0.02, which looks small and makes the ROC curve bulge
optimistically. The 2000 blocked customers are invisible in that fraction.
ROC curve is blind to class imbalance because:
FPR = ALARM / (ALARM + CLEAR) <- denominator is huge; ALARM looks tiny
Precision-recall curve avoids this:
precision = CAUGHT / (CAUGHT + ALARM) <- no TN in sight; focuses on the sick pile
recall = CAUGHT / (CAUGHT + MISSED) <- also ignores TN
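Optional: the four scores side by side on the skewed fraud sheet. The 2000 false alarms
come from the paragraph above; the count of 80 frauds caught is an ASSUMPTION made up
for this sketch, since the post does not give one.

```python
# skewed fraud sheet from above: 99900 normal rows, 100 fraud rows
n_normal, n_fraud = 99_900, 100

ALARM = 2_000              # false alarms, the figure used above
CAUGHT = 80                # ASSUMPTION: made-up figure, not from the post
MISSED = n_fraud - CAUGHT  # frauds that slipped through
CLEAR = n_normal - ALARM   # normal rows correctly left alone

fpr = ALARM / (ALARM + CLEAR)
recall = CAUGHT / (CAUGHT + MISSED)
precision = CAUGHT / (CAUGHT + ALARM)
accuracy = (CAUGHT + CLEAR) / (n_normal + n_fraud)

print(f"accuracy  = {accuracy:.4f}")
print(f"FPR       = {fpr:.4f}")
print(f"recall    = {recall:.4f}")
print(f"precision = {precision:.4f}")
```

Under that assumption accuracy and FPR both look healthy, while precision shows that
fewer than 4 in every 100 flagged transactions are actually fraud.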
The precision-recall curve sweeps the same cutoff as the trade curve but plots precision
on the Y-axis and recall on the X-axis. A perfect machine hugs the top-right corner (high
precision AND high recall). A flat line at precision = (fraction of sick in the sheet)
is the baseline: what a machine scoring rows at random would earn. The always-shout-sick
fool sits at its right-hand end (recall 1, precision = fraction of sick).
precision
1 | # <- perfect (catch all, zero false alarms)
| ##
| ##
| - - - - <- baseline (= fraction of sick in the sheet)
0 +----------- recall
0 1
area under precision-recall curve -> average precision (AP)
** KEY: USE PRECISION-RECALL CURVES WHEN THE PILE IS SKEWED
When one class is rare (fraud, disease in a healthy population, defects on a line),
the precision-recall curve is more informative than the ROC curve because it does not
use the true-negative count at all. The ROC curve's FPR is diluted by the massive well
pile; the precision-recall curve ignores that pile entirely.
Computing that curve, and its area (the average precision), is three lines -- waiting in
the code at the end of the post.
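Optional: the curve swept by hand, in plain Python, on eight made-up rows. This is a
sketch of the idea rather than scikit-learn's implementation. It assumes no two rows
share a score, and it sums average precision step by step (each gain in recall times
the precision at that cutoff).

```python
# made-up scores (machine's probability of "sick") and true labels (1 = sick)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 1, 0, 0]

n_sick = sum(labels)
curve = []  # (cutoff, precision, recall), cutoff sliding from strict to lax
for cutoff in sorted(set(scores), reverse=True):
    caught = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    alarm = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    curve.append((cutoff, caught / (caught + alarm), caught / n_sick))

# average precision: each gain in recall, weighted by the precision at that cutoff
ap, prev_recall = 0.0, 0.0
for _, precision, recall in curve:
    ap += (recall - prev_recall) * precision
    prev_recall = recall

baseline = n_sick / len(labels)  # fraction of sick rows in the sheet
for cutoff, precision, recall in curve:
    print(f"cutoff {cutoff:.1f}: precision {precision:.2f}, recall {recall:.2f}")
print(f"AP = {ap:.3f}, baseline = {baseline:.3f}")
```

Only the cutoffs where recall actually rises add to the sum, so rows of false alarms
hurt by lowering the precision at the next catch.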
## Scoring With Three or More Bins
Cancer has subtypes. Handwritten digits have ten classes. Sentiment has three: positive,
neutral, negative. Now there are multiple bins and one machine must sort into all of
them. The four-box table expands into a K x K grid (one row and column per class). Each
class gets its own precision, recall, and F1. The question becomes: how do you average
those K scores into one number?
IN HAND: a three-class scorecard -- type A F1 0.89, type B 0.67, type C 0.34 -- with
class C holding only 10 rows against the others' 1000. This section asks how to roll
three F1 numbers into one, and shows the two answers disagree on purpose.
## Two Averaging Rules
three-class example: cancer type A, B, C
+---------+-----------+--------+------+
| class | precision | recall | F1 | n (study rows)
+---------+-----------+--------+------+
| type A | 0.90 | 0.88 | 0.89 | 1000
| type B | 0.70 | 0.65 | 0.67 | 1000
| type C | 0.30 | 0.40 | 0.34 | 10
+---------+-----------+--------+------+
TREAT-ALL-CLASSES-EQUAL averaging (macro): compute precision, recall, F1 per class, then
take the plain average across classes with equal weight.
macro F1 = (0.89 + 0.67 + 0.34) / 3 = 0.63
>> YOUR TURN
Three classes score F1 of 0.80, 0.60, 0.40 (made-up). Work the macro F1 -- the
equal-weight average.
check your slate: macro F1 = (0.80 + 0.60 + 0.40) / 3 = 1.80 / 3 = 0.60. Macro
weights a 10-row class the same as a 1000-row class, so a weak rare class drags
the score down on purpose -- that is the honest per-class picture.
class C (only 10 rows) gets the same weight as class A (1000 rows)
-> a low F1 on class C drags the average down strongly
-> honest picture of per-class performance regardless of class size
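Optional: the macro F1 from the three-class table, recomputed from the precision and
recall columns. To match the pencil work, each class's F1 is rounded to two decimals
before averaging; that rounding step is a choice made for this sketch.

```python
# per-class (precision, recall) from the three-class table above
table = {
    "type A": (0.90, 0.88),
    "type B": (0.70, 0.65),
    "type C": (0.30, 0.40),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# round to two decimals, as the table does, before taking the plain average
per_class_f1 = {name: round(f1(p, r), 2) for name, (p, r) in table.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for name, score in per_class_f1.items():
    print(f"{name}: F1 = {score:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
```

Averaging the unrounded F1 values instead gives about 0.636, so the second decimal
depends on when you round; the story (class C drags the average down) does not.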
COUNT-EVERY-LABEL averaging (micro): pool all CAUGHT, ALARM, and MISSED counts across
every class, then compute one precision, one recall, one F1 from those pooled totals.
micro pools raw counts:
total CAUGHT across A+B+C = 880 + 650 + 4 = 1534
total ALARM across A+B+C = 97 + 278 + 9 = 384
total MISSED across A+B+C = 120 + 350 + 6 = 476
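micro precision = 1534 / (1534 + 384) = 0.80
micro recall = 1534 / (1534 + 476) = 0.76
classes A and B (1000 rows each) dominate; class C (10 rows) barely registers
(these counts are made up class by class; on a real single-label sheet every ALARM
for one class is a MISSED for another, so the two totals match and micro precision
equals micro recall)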
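Optional: the pooled-count arithmetic by machine. The counts are the ones listed above;
the micro F1 line is an extra, computed from the same pooled totals with the standard
identity F1 = 2*CAUGHT / (2*CAUGHT + ALARM + MISSED).

```python
# per-class counts from the pooled-count lines above: (CAUGHT, ALARM, MISSED)
counts = {
    "type A": (880, 97, 120),
    "type B": (650, 278, 350),
    "type C": (4, 9, 6),
}

# pool every class's counts into three grand totals
caught = sum(c for c, _, _ in counts.values())
alarm = sum(a for _, a, _ in counts.values())
missed = sum(m for _, _, m in counts.values())

micro_precision = caught / (caught + alarm)
micro_recall = caught / (caught + missed)
micro_f1 = 2 * caught / (2 * caught + alarm + missed)

print("pooled CAUGHT, ALARM, MISSED:", caught, alarm, missed)
print(f"micro precision = {micro_precision:.2f}")
print(f"micro recall    = {micro_recall:.2f}")
print(f"micro F1        = {micro_f1:.2f}")
print(f"class C share of pooled CAUGHT = {counts['type C'][0] / caught:.2%}")
```

Class C contributes well under one percent of the pooled catches, which is why its
weak score barely moves the micro numbers.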
Method How it weights classes Use when
-------------------- -------------------------------- --------------------------
Macro (equal weight) every class counts the same all classes equally
important; or skewed pile
Micro (count-weight) large classes dominate overall accuracy on all
labels matters
** KEY: MACRO WHEN CLASSES ARE EQUALLY IMPORTANT OR PILE IS SKEWED
With a skewed pile, micro averaging is swamped by the majority class. If type C cancer
(10 patients) scores F1=0.34 and the machine is otherwise excellent on the common
types, micro averaging buries type C's failure. Macro averaging gives type C equal
weight and forces the failure to show. That is the right signal when you care about
catching every cancer type regardless of how rare it is.
One call -- classification_report -- prints every per-class score plus the macro and
weighted averages (code at the end). It hides one surprise worth flagging right here:
>> NOTE: WHY classification_report SHOWS "accuracy", NOT "micro avg"
On ordinary single-label data (each row has exactly one true class), every false
positive for one class is simultaneously a false negative for another -- the pooled
counts make micro precision, micro recall, and micro F1 all collapse to the same
number: plain accuracy. So classification_report prints an "accuracy" row instead of a
"micro avg" row. If you want the micro metric under its own name, compute it with
precision_recall_fscore_support(..., average='micro'). (A true "micro avg" row only
appears where the collapse doesn't happen: multi-label problems, or a report
restricted to a subset of the classes via labels=.)
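Optional: the collapse, checked on ten made-up single-label rows in plain Python. The
labels are invented for illustration; the point is only that the pooled ALARM and
MISSED totals come out equal, whatever the guesses are.

```python
# tiny made-up single-label sheet: every row has exactly one true class
y_true = ["A", "A", "A", "B", "B", "C", "C", "A", "B", "C"]
y_pred = ["A", "B", "A", "B", "C", "C", "A", "A", "B", "B"]

classes = sorted(set(y_true))
caught = alarm = missed = 0
for cls in classes:
    for truth, guess in zip(y_true, y_pred):
        caught += (guess == cls and truth == cls)
        alarm += (guess == cls and truth != cls)   # false positive for cls
        missed += (guess != cls and truth == cls)  # false negative for cls

accuracy = sum(t == g for t, g in zip(y_true, y_pred)) / len(y_true)
micro_precision = caught / (caught + alarm)
micro_recall = caught / (caught + missed)

print("pooled CAUGHT, ALARM, MISSED:", caught, alarm, missed)
print("accuracy:", accuracy)
print("micro precision:", micro_precision, " micro recall:", micro_recall)
```

Each wrong row is counted once as an ALARM (for the class guessed) and once as a
MISSED (for the true class), so the two denominators are the same number: the row count.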
## Summary: Which Tool for Which Problem
+------------------------------+------------------------------------------+
| Situation | Reach for |
+------------------------------+------------------------------------------+
| tuning C, k, depth | grid hunt + rotating folds (GridSearchCV |
| | with cv=5) scored on a relevant metric |
| columns on wild scales | standard ruler (mean 0, spread 1) |
| columns, but outliers present| standard ruler beats pinch-to-fit |
| balanced bins | ROC / AUC; accuracy OK baseline |
| skewed pile (fraud, rare dz) | precision-recall curve + average prec. |
| ALARM is the catastrophe | optimise precision |
| MISSED is the catastrophe | optimise recall |
| 3+ classes, all equally imp. | macro averaging |
| 3+ classes, size = importance| micro (or weighted) averaging |
+------------------------------+------------------------------------------+
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
Three pieces, in the order the post met them: the grid hunt (done safely, scaler inside
the pipeline), the precision-recall curve for skewed piles, and the many-bin scores.
>> NEW TO PYTHON? Each named once:
Pipeline([...]) -- chain steps so they are refit together, leak-free
{'clf__C': [...]} -- a dict of settings to try; 'clf__C' names a step's knob
d['key'] -- look a value up in a dict by its name
The grid hunt -- a Pipeline of scaler + classifier, handed to GridSearchCV. Pass RAW
X_train, so the scaler is rebuilt inside each fold and nothing leaks:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X_train, y_train: the raw (unscaled) study pile from the earlier parts

# scaler lives INSIDE the pipeline so it is rebuilt on each fold's
# training strips only -- never on the strip being scored
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])
param_grid = {'clf__C': [0.01, 0.1, 1, 10], 'clf__penalty': ['l2']}
gs = GridSearchCV(pipe, param_grid, cv=5, scoring='recall')
gs.fit(X_train, y_train)  # pass RAW X_train, not pre-scaled
best_C = gs.best_params_['clf__C']
The precision-recall curve for a skewed pile, and its area (average precision):
from sklearn.metrics import precision_recall_curve, average_precision_score
# gs is the fitted grid hunt from above; its best pipeline scales raw X_test itself
y_proba = gs.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)  # step-wise area under the curve
And the many-bin scores -- one report, plus the micro/macro numbers asked for by name:
from sklearn.metrics import classification_report, precision_recall_fscore_support
print(classification_report(y_test, y_pred,
target_names=['type A', 'type B', 'type C']))
# per-class precision/recall/F1, then: accuracy, macro avg, weighted avg
# note: for ordinary single-label multiclass, micro precision = micro recall
# = micro F1 = accuracy, so classification_report does NOT print a "micro avg"
# row -- it prints "accuracy" instead. To see the micro number by name, ask for it:
micro = precision_recall_fscore_support(y_test, y_pred, average='micro')
macro = precision_recall_fscore_support(y_test, y_pred, average='macro')
# micro[:3] == (accuracy, accuracy, accuracy); macro[:3] == the equal-weight blend
## Where That Leaves Us
And there the series rests. Look back at the road: we began by turning a sliding number
into a yes-or-no verdict, learned why a machine that catches nobody can still boast 63%
accuracy, slid the cutoff to see every bargain between lives and money at once, leashed
an over-confident machine and met a calmer one that draws its wall in a single stroke,
and finally walked out into the skewed, many-binned, outlier-ridden world where the
textbook scores stop being enough.
None of it required Greek you had not earned. That was the whole point. The labels --
logistic regression, ROC, regularisation, LDA, cross-validation -- were never the
understanding; they were handles screwed on at the end, once the thing was already built
and working in your hands. Keep that order and there is very little in this field you
cannot teach yourself.
## The Labels, Last
Plain term used above Standard label
----------------------------------- ------------------------------------------
setting I pick by hand hyperparameter
grid hunt grid search / GridSearchCV
rotating folds k-fold cross-validation
standard ruler standard scaling / StandardScaler
pinch-to-fit ruler min-max scaling / MinMaxScaler
skewed pile class imbalance / imbalanced dataset
precision-catch curve precision-recall curve
area under precision-recall curve average precision (AP)
treat-all-classes-equal averaging macro averaging
count-every-label averaging micro averaging
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 3 -- Sorting Into Bins):
Part 1 -- The S-Curve, the Four-Box Table .
Part 2 -- The Trade Curve .
Part 3 -- Leash and Cloud .
Part 4 (this post)
Appendix: Classification Reference -- all terms in one place
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================