==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------
CHAPTER 3 . SORTING INTO BINS . PART 1 OF 4
Sorting Into Bins: The S-Curve, the Four-Box Table, and Why Accuracy Lies
Posted: 2026-06-05 . Author: Rahul Rai . Tags: logistic-regression, classification, evaluation
============================================================================================
Every post so far ended with a number: a house worth 2.8, a car that does 18 to the
gallon. This one ends with something heavier -- a verdict. Picture the room. A doctor
has photographed the cells from a breast lump and measured thirty things about them:
size, texture, smoothness. The question on the table is not "how malignant?" It is the
one that keeps people up at night: malignant, OR NOT? Two bins. One answer. And a
wrong answer that can cost a life.
The machine still does what it always did -- it brews a sliding number inside. But that
number can no longer just be read aloud. It has to be bent into a chance, then chopped
into a label, and from here on every choice we make is haunted by which kind of mistake
we can least afford to make.
## From Number to Bin
previous posts (number): ----------*------> answer = 2.83
this post (bin): [ B = well ] [ M = sick ] answer = ONE box
The dial-adding machine can only produce a sliding number. To land in one of two bins it
needs two more steps: a SQUASH that bends any number into the range 0-1 (so it reads as a
chance), and a CUTOFF that chops that chance into a label.
columns -> *dials -> add -> SQUASH -> chance 0-1 -> CUTOFF -> [B] or [M]
## The Squash Curve
The squash curve has an S-shape. Any real number goes in; a chance between 0 and 1 comes
out. The formula:
sig(z) = 1 / (1 + e^-z)
where z is the dial sum: z = b0 + x1*b1 + x2*b2 + ... + x30*b30.
Where does that formula come from? It is not astrology -- derive it from ODDS.
Odds = chance for, divided by chance against:
chance 0.5 -> odds = 0.5 / 0.5 = 1 (even money)
chance 0.8 -> odds = 0.8 / 0.2 = 4 (4 to 1 on)
chance 0.9 -> odds = 0.9 / 0.1 = 9 (9 to 1 on)
Now the one modelling CHOICE in this whole machine: let each +1 step of the dial
sum z MULTIPLY the odds by a fixed amount (call it e, about 2.718 -- any fixed
multiplier gives the same S-shape; e is picked because its slopes come out clean).
So odds = e^z. At z = 0 the odds are e^0 = 1, even money. Each +1 of z nearly
triples the odds; each -1 cuts them to about a third.
Walk it back to a chance. If odds = (chance) / (1 - chance) = e^z, solve by pencil:
chance = e^z x (1 - chance)
chance = e^z - e^z x chance
chance + e^z x chance = e^z
chance x (1 + e^z) = e^z
chance = e^z / (1 + e^z)
Divide top and bottom by e^z:
chance = 1 / (1/e^z + 1) = 1 / (1 + e^-z) <- the squash curve
So the S-curve is nothing but "dial sum sets the odds, odds walked back to a
chance." Check the ends: z huge -> e^-z tiny -> chance near 1. z hugely
negative -> e^-z huge -> chance near 0. z = 0 -> 1/(1+1) = 0.5, even money.
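The pencil derivation above can be checked in a few lines of Python. This is a minimal sketch, not part of the original post: the function names `chance_from_odds` and `squash` are my own, and the z values are arbitrary test points. It computes the chance two ways (through the odds, and through the final formula) and confirms they agree.

```python
import math

def chance_from_odds(odds):
    """Walk odds back to a chance: chance = odds / (1 + odds)."""
    return odds / (1.0 + odds)

def squash(z):
    """The squash curve: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    odds = math.exp(z)                  # the modelling choice: odds = e^z
    via_odds = chance_from_odds(odds)   # the long way round
    direct = squash(z)                  # the finished formula
    assert abs(via_odds - direct) < 1e-12
    print(f"z={z:+.1f}  odds={odds:9.4f}  chance={direct:.4f}")
```

Both routes give the same chance at every test point, which is all the derivation claims.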
A concrete 4-person worked example, by pencil. Suppose we have only ONE column
(bmi) and 4 people, with made-up dials b0 = 1.2 and b1 = -30:
person   bmi (x)   dial sum z = b0 + b1*x
------   -------   -----------------------------------
A        0.04      1.2 - 30 * 0.04 = 1.2 - 1.2 =  0.0
B        0.06      1.2 - 30 * 0.06 = 1.2 - 1.8 = -0.6
C        0.12      1.2 - 30 * 0.12 = 1.2 - 3.6 = -2.4
D        0.18      1.2 - 30 * 0.18 = 1.2 - 5.4 = -4.2
Now squash each z through the curve:
person   z      sig(z) = 1 / (1 + e^-z)                chance
------   ----   ------------------------------------   ------
A         0.0   1 / (1 + e^0)   = 1 / (1 + 1)        = 0.500
B        -0.6   1 / (1 + e^0.6) = 1 / (1 + 1.822)    = 0.354
C        -2.4   1 / (1 + e^2.4) = 1 / (1 + 11.023)   = 0.083
D        -4.2   1 / (1 + e^4.2) = 1 / (1 + 66.686)   = 0.015
big positive z -> sig(z) near 1 -> "sick"
big negative z -> sig(z) near 0 -> "well"
z = 0 -> sig(0) = 0.5 -> fence
To compute e^0.6 by pencil: look up a table of exponentials, or note
e^0.5 ~= 1.649 and e^0.1 ~= 1.105, so e^0.6 = e^0.5 * e^0.1
~= 1.649 * 1.105 ~= 1.822. That is close enough for the picture.
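The four-person table can be reproduced without an exponential lookup. A small sketch, assuming the same made-up dials (b0 = 1.2, b1 = -30); the names `B0`, `B1`, `people`, and `chances` are mine, not the post's.

```python
import math

B0, B1 = 1.2, -30.0   # the made-up dials from the worked example

def squash(z):
    """The squash curve: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

people = {"A": 0.04, "B": 0.06, "C": 0.12, "D": 0.18}   # person -> bmi (x)

chances = {}
for name, x in people.items():
    z = B0 + B1 * x              # dial sum
    chances[name] = squash(z)    # squashed into a chance
    print(f"{name}  x={x:.2f}  z={z:5.1f}  chance={chances[name]:.3f}")
```

Rounded to three places, the chances should match the pencil column.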
>> YOUR TURN
A fifth person E walks in (made-up): bmi 0.10, same dials (b0 = 1.2, b1 = -30).
Compute the dial sum z and the squashed chance on your slate. Hint:
e^1.8 = e^0.6 * e^0.6 * e^0.6, and e^0.6 ~= 1.822 from the note above.
check your slate: z = 1.2 - 30 * 0.10 = 1.2 - 3.0 = -1.8; e^1.8 ~= 1.822 *
1.822 * 1.822 ~= 3.320 * 1.822 ~= 6.05; sig(-1.8) = 1 / (1 + 6.05) = 1 / 7.05
~= 0.142. Below the 0.5 cutoff, E is called well -- about a 14-in-100 chance
of sick.
At the default cutoff of 0.5: if the chance is >= 0.5, call sick; below 0.5, call well.
Patient A (chance 0.500) sits exactly on the fence, and because the rule says >= 0.5 the
tie goes to sick. Patients B, C, D are called well. The machine also does something
deeper: the linear sum z before the squash is the LOG-ODDS (the natural log of the odds,
where odds = chance-sick divided by chance-well):
log( P(sick|x) / P(well|x) ) = b0 + x^T beta
Each dial beta_j is the change in log-odds per one-unit step in its column, holding the
other 29 fixed. For person A, log-odds = 0 means P(sick) = P(well) = 0.5.
## Setting the Dials: What the Machine Minimises
In the straight-stick rule the machine minimised squared leftovers (MSE). Here the
answer is a label (0 or 1), so squared distance is the wrong ruler. Instead the machine
maximises the likelihood of the observed labels, which is the same as minimising a
CROSS-ENTROPY leftover:
L(beta) = -(1/n) sum [ y*log sig(z) + (1-y)*log(1-sig(z)) ]
For a lump that is truly sick (y=1): if the machine outputs chance ~= 1, log(~1) ~= 0 --
small leftover. If it outputs chance ~= 0 on a sick lump, the log of a near-zero number
plunges toward minus infinity, and the minus sign out front turns that into an enormous
leftover. The leftover punishes confidence in the wrong direction. There is no
closed-form minimiser; the dials are found by rolling downhill along the gradient:
gradient of L w.r.t. beta: (1/n) X^T ( sig(X*beta) - y )
each step: beta <- beta - a * gradient (a = step size)
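The leftover and the downhill step translate almost symbol for symbol into numpy. What follows is a sketch under stated assumptions, not the post's code: the six-lump toy pile is invented, the function names (`squash`, `leftover`, `gradient`) are mine, and the step size 0.5 and the 200 steps are arbitrary picks. The first column of X is all 1s so that beta[0] plays the role of b0.

```python
import numpy as np

def squash(z):
    return 1.0 / (1.0 + np.exp(-z))

def leftover(beta, X, y):
    """Cross-entropy: -(1/n) sum[ y*log(p) + (1-y)*log(1-p) ]."""
    p = squash(X @ beta)
    eps = 1e-12   # keeps log() away from exactly zero
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient(beta, X, y):
    """(1/n) X^T ( sig(X beta) - y )."""
    return X.T @ (squash(X @ beta) - y) / len(y)

# Toy pile (made-up, deliberately not perfectly separable).
X = np.array([[1.0, -1.5], [1.0, -0.5], [1.0, 0.5],
              [1.0, 1.5], [1.0, 0.2], [1.0, -0.2]])
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0])

beta = np.zeros(2)   # start with every dial at zero
a = 0.5              # step size (arbitrary choice)
start = leftover(beta, X, y)
for _ in range(200):
    beta = beta - a * gradient(beta, X, y)   # one downhill step
end = leftover(beta, X, y)
print(f"leftover before: {start:.4f}   after: {end:.4f}   dials: {beta}")
```

With all dials at zero every chance is 0.5, so the starting leftover is log(2); each step after that should only push it lower.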
Count the clerk-steps for ONE downhill step. Per lump: 30 multiplies and 30 adds
for the dial sum z, one squash and one subtraction, then 30 multiplies and 30 adds
for the gradient -- 122 strokes, call it 125 to keep the sums round. The Wisconsin sheet holds 569 lumps, so
one step costs 569 x 125 = 71,125 strokes, and a thousand steps run near 71 million.
That is why the clerks, not you, hold the pencils here.
In code it is the same two-step as before -- make an empty machine, show it the study
pile -- and that code waits at the end of the post.
>> NOTE: WHY SCALED INPUTS?
The gradient update moves every dial by the same step size times its column's values.
A column measured in thousands (area ~= 1000) takes dial steps about 10,000x larger than
one measured in tenths (smoothness ~= 0.1). The machine lurches around the large
column's dial and barely moves the small one's. Put every column on the same ruler
first -- mean 0, spread 1 -- and the step sizes become comparable.
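"Mean 0, spread 1" is two lines of arithmetic per column. A minimal sketch with a made-up four-row sheet (the values and the names `X`, `mean`, `spread`, `X_scaled` are illustrative, not from the post); `np.std` here is the population spread, the same one sklearn's StandardScaler uses.

```python
import numpy as np

# Made-up mini-sheet: column 0 is area-like (hundreds to thousands),
# column 1 is smoothness-like (around 0.1).
X = np.array([[1001.0, 0.118],
              [1326.0, 0.085],
              [ 386.0, 0.143],
              [1297.0, 0.100]])

mean = X.mean(axis=0)            # one mean per column
spread = X.std(axis=0)           # one spread per column
X_scaled = (X - mean) / spread   # every column now on the same ruler

print("raw column ranges:   ", np.ptp(X, axis=0))
print("scaled column ranges:", np.ptp(X_scaled, axis=0))
```

On a real sheet the mean and spread would be measured on the study pile only and then reused, unchanged, on the exam pile.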
## Encoding the Labels
A machine eats numbers, but the sheet holds letters: M (malignant) and B (benign). The
translation is a shelf: M->1, B->0. The bin you are HUNTING gets the 1 -- here that is
the sick lump, because recall, precision, and F1 all count the sick
detections. (The one-line shelf that does this M->1, B->0 swap is in the code at the end,
along with a warning about running it twice.)
## The Four-Box Table
IN HAND: a machine that adds 30 dialed columns into z, squashes z into a chance
(z = 0 gives 1 / (1 + 1) = 0.5, even money), and chops at a cutoff of 0.5; labels
on the shelf, M->1, B->0. This section adds: a table counting the four ways a
verdict can land.
Every exam lump lands in exactly one of four boxes based on truth and guess:
                 guess SICK(1)   guess WELL(0)
truth SICK(1)    CAUGHT        | MISSED          <- MISSED: deadly if large
truth WELL(0)    ALARM         | CLEAR           <- ALARM: wasteful if large
rows = the truth (sick / well)
columns = the guess (shouted sick / said well)
each box = a plain count
A concrete 4-person example, by pencil:
exam pile has 4 lumps: 2 sick (truth=1), 2 well (truth=0)
lump   truth   machine chance   cutoff 0.5 -> guess    box
----   -----   --------------   --------------------   ------
1      sick    0.92             >= 0.5 -> sick (1)     CAUGHT
2      sick    0.63             >= 0.5 -> sick (1)     CAUGHT
3      well    0.78             >= 0.5 -> sick (1)     ALARM
4      well    0.11             <  0.5 -> well (0)     CLEAR
Four-box table with counts:
              guess SICK    guess WELL
truth SICK    2 (CAUGHT)    0 (MISSED)    <- 2 truly sick
truth WELL    1 (ALARM)     1 (CLEAR)     <- 2 truly well
accuracy = (2 + 1) / 4 = 3/4 = 0.75
recall = 2 / (2 + 0) = 2/2 = 1.00 (caught every sick person)
precision = 2 / (2 + 1) = 2/3 = 0.67 (but 1 healthy person was alarmed)
F1 = 2*0.67*1.00 / (0.67 + 1.00) = 1.34 / 1.67 = 0.80
The machine caught 2/2 sick people (recall=1.00) but at the cost of 1 false
alarm (precision=0.67). The four-box table shows EXACTLY which mistake
happened: look at the bottom-left box (ALARM) and you see the cost in plain
light, not buried inside a single percentage.
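The same 4-lump pile can be counted by a few lines of plain Python. A sketch only: the helper name `four_boxes` is my own, and the cutoff rule is the post's (chance >= 0.5 means sick).

```python
def four_boxes(truth, guess):
    """Count CAUGHT, MISSED, ALARM, CLEAR from 0/1 truth and 0/1 guess."""
    pairs = list(zip(truth, guess))
    caught = sum(1 for t, g in pairs if t == 1 and g == 1)
    missed = sum(1 for t, g in pairs if t == 1 and g == 0)
    alarm = sum(1 for t, g in pairs if t == 0 and g == 1)
    clear = sum(1 for t, g in pairs if t == 0 and g == 0)
    return caught, missed, alarm, clear

truth = [1, 1, 0, 0]                  # lumps 1-4: sick, sick, well, well
chance = [0.92, 0.63, 0.78, 0.11]     # machine chance per lump
CUTOFF = 0.5
guess = [1 if c >= CUTOFF else 0 for c in chance]

caught, missed, alarm, clear = four_boxes(truth, guess)
accuracy = (caught + clear) / len(truth)
recall = caught / (caught + missed)
precision = caught / (caught + alarm)
f1 = 2 * precision * recall / (precision + recall)
print(caught, missed, alarm, clear)
print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} F1={f1:.2f}")
```

Changing `CUTOFF` and rerunning shows how lumps move between boxes while the truth column stays put.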
>> YOUR TURN
A bigger exam pile (made-up): 10 lumps land as CAUGHT 4, MISSED 1, ALARM 2,
CLEAR 3. Work all four scores on your slate before reading on.
check your slate: everyone = 4 + 1 + 2 + 3 = 10; accuracy = (4 + 3) / 10 =
7/10 = 0.70; recall = 4 / (4 + 1) = 4/5 = 0.80; precision = 4 / (4 + 2) =
4/6 ~= 0.67; F1 = 2 * 0.80 * 0.67 / (0.80 + 0.67) = 1.072 / 1.47 ~= 0.73.
This machine let 1 sick lump walk out the door -- recall 0.80 says so first.
CAUGHT is the ideal: a sick lump correctly flagged. MISSED is the catastrophe: cancer
goes home untreated. ALARM is the waste: a healthy person gets a scare and an
unnecessary biopsy. CLEAR is the other good outcome. Drawing this table from the guesses
-- and the easy-to-miss gotcha about which way round sklearn wants the arguments -- is in
the code at the end.
## Four Scores, Four Angles
IN HAND: a four-box table; on the 4-lump pile the boxes held CAUGHT 2, MISSED 0,
ALARM 1, CLEAR 1 -- 2 + 0 + 1 + 1 = 4, every lump counted once. This section
adds: four scores, each reading a different slice of those boxes.
No single score summarises what a bin-sorter actually does. The four scores each look at
a different slice of the four-box table:
accuracy = (CAUGHT + CLEAR) / everyone
recall = CAUGHT / (CAUGHT + MISSED) <- top row: share of sick caught
precision = CAUGHT / (CAUGHT + ALARM) <- left col: when we cry sick, real sick?
F1 = 2 * precision * recall / (precision + recall)
The same four scores in the usual shorthand (TP = CAUGHT, TN = CLEAR, FP = ALARM,
FN = MISSED, n = everyone):
Score       Formula (counts)    Punishes
---------   -----------------   ------------------------------------
accuracy    (TP + TN) / n       any wrong answer equally
recall      TP / (TP + FN)      MISSED (deadly)
precision   TP / (TP + FP)      ALARM (wasteful)
F1          2PR / (P + R)       precision (P) or recall (R) being low
All four come straight from the four-box counts; the code that reads them off is at the
end of the post.
## Why Accuracy Alone Lies
Now the most important paragraph in the post. The Wisconsin sheet has roughly 63%
benign lumps. So a machine that learns nothing whatsoever -- no dials, no squash, just
one lazy constant shout of "well, well, well" -- walks away with 63% accuracy. On paper
it looks like it is passing. In the exam room it is a catastrophe: it never once catches
a sick person. Accuracy counted every box the same, so a MISSED cancer weighed exactly
as much as a needless ALARM. But those two mistakes are not equal, and pretending they
are is how people get hurt.
fool machine: predict B for everything
accuracy = 0.63 <- looks decent
recall = 0.00 <- catches nobody
precision = N/A <- never shouts sick, no CAUGHT, undefined
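The fool machine is easy to build and score by hand. A sketch assuming the standard class counts of the Wisconsin diagnostic sheet, 357 benign and 212 malignant (569 in all); the variable names are mine.

```python
N_WELL, N_SICK = 357, 212              # benign and malignant counts, 569 in all

truth = [0] * N_WELL + [1] * N_SICK
guess = [0] * len(truth)               # the lazy constant shout: "well" every time

caught = sum(1 for t, g in zip(truth, guess) if t == 1 and g == 1)
missed = sum(1 for t, g in zip(truth, guess) if t == 1 and g == 0)
alarm = sum(1 for t, g in zip(truth, guess) if t == 0 and g == 1)
clear = sum(1 for t, g in zip(truth, guess) if t == 0 and g == 0)

accuracy = (caught + clear) / len(truth)
recall = caught / (caught + missed)
# Precision is 0/0 when the machine never shouts sick, so leave it undefined.
precision = caught / (caught + alarm) if (caught + alarm) > 0 else None

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}  precision={precision}")
```

All 212 sick lumps land in MISSED, yet the accuracy line still reads like a passing grade.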
** KEY: IN CANCER SCREENING, RECALL IS THE NORTH STAR
Missing a sick person is the catastrophe. You can tolerate some ALARM (extra biopsies
cost money, not lives). So the first score to check is recall -- how large a share of
the sick lumps does the machine catch? Accuracy tells you nothing about which mistake
you are making.
## Common Tripwires I Caught
These are the exact wrong pictures that cost me real time. Each one
stopped me cold until I drew the concrete shape of the mistake:
TRIPWIRE 1: Extra brackets around the scaler input -> 3D error
WRONG: scaler.fit([X_train])
RIGHT: scaler.fit(X_train)
The error says "Found array with dim 3. StandardScaler expected <= 2."
Wrapping X_train in [ ] adds a third layer. The scaler expects a
2D table (rows x columns), not a 2D table inside a list.
TRIPWIRE 2: Running the encoding cell twice -> map fails -> NaN -> crash
WRONG: run the M->1, B->0 cell a second time.
RIGHT: restart the kernel and run all cells top-to-bottom once.
First run: 'M' 'B' 'M' -> map works -> 1 0 1 (int64)
Second run: 1 0 1 -> map finds no 'M' or 'B' -> NaN -> float -> crash.
There is no shortcut after NaN has landed.
TRIPWIRE 3: Predicting on raw X instead of scaled X
WRONG: log_reg.predict(X_test)
RIGHT: log_reg.predict(X_test_scaled)
The machine set its dials on scaled numbers. Raw numbers send
it into a different space -- the dials misfire.
TRIPWIRE 4: confusion_matrix order -- truth first, guess second
WRONG: confusion_matrix(y_pred, y_test)
RIGHT: confusion_matrix(y_test, y_pred)
sklearn always wants: tool(truth, guess) -- same order as
accuracy_score, recall_score, etc. Swapping flips the grid.
TRIPWIRE 5: predict_proba column 0 vs column 1
WRONG: y_proba = model.predict_proba(...)[:, 0]
RIGHT: y_proba = model.predict_proba(...)[:, 1]
Column 0 = P(well). Column 1 = P(sick) -- the one you want.
Using column 0 makes sick people score LOW and well people HIGH,
which flips the ROC curve below the diagonal and drags AUC under 0.5.
TRIPWIRE 6: C=0.1 is heavy punishment, not light
WRONG: C small = light leash (sounds like "small = mild")
RIGHT: C = 1/lambda. Small C -> large lambda -> heavy squeeze.
C=0.1 -> lambda=10, a tight leash. C=1000 -> lambda=0.001, nearly free.
TRIPWIRE 7: Scaling and L2 are not the same thing
WRONG: "I scaled the columns, so the dials are already under control."
RIGHT: Scaling fixes the INPUTS (columns on one ruler). L2 fixes the
DIALS (no single dial dominates). Even after scaling, a dial can grow
huge if the machine over-trusts one column. Both are needed.
TRIPWIRE 8: predict vs predict_proba
WRONG: use predict for the ROC curve (gives only 0 and 1).
RIGHT: use predict_proba (gives the raw chance before any cutoff).
predict gives two unique values -> the ROC curve collapses to one operating point -> useless.
predict_proba gives a different number per row -> the full sweep.
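Tripwire 6 is the easiest of these to check with numbers. The sketch below is mine, not the post's: it assumes scikit-learn's bundled copy of the Wisconsin sheet (`load_breast_cancer`), scales the whole sheet at once purely for illustration, and uses the L2 length of the dial vector as a stand-in for "how tight the leash is". `max_iter=10000` is an arbitrary generous limit so the loose fit has room to settle.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # illustration only: no train/test split here

sizes = {}
for C in (0.1, 1000.0):
    model = LogisticRegression(C=C, max_iter=10000).fit(X_scaled, y)
    sizes[C] = float(np.linalg.norm(model.coef_))   # overall size of the 30 dials
    print(f"C={C:<7}  dial size (L2 norm) = {sizes[C]:.2f}")
```

The small-C fit should come out with much smaller dials than the large-C fit: small C is the tight leash.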
## F1: Why the Harmonic Mean
F1 is the harmonic mean of precision and recall, not the arithmetic mean. The harmonic
mean is dominated by whichever value is smaller. If recall = 0.98 but precision = 0.10,
the arithmetic average is 0.54 (looks okay); the harmonic mean F1 = 0.18 (correctly
punishes the terrible precision). F1 collapses toward zero when either score is near
zero -- you cannot paper over one bad score with one great score.
F1 = 2 * P * R / (P + R)
example: P=0.10, R=0.98
arithmetic mean = (0.10 + 0.98) / 2 = 0.54
harmonic mean = 2 * 0.10 * 0.98 / (0.10 + 0.98) = 0.18 <- honest
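The two averages side by side, as a sketch. The function names are my own; the zero guard is an assumption about how to treat the 0/0 case, chosen to match the idea that F1 collapses to zero when both scores do.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def f1_by_hand(p, r):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)

for p, r in [(0.10, 0.98), (0.67, 1.00), (0.90, 0.90)]:
    print(f"P={p:.2f} R={r:.2f}  arithmetic={arithmetic_mean(p, r):.2f}  "
          f"F1={f1_by_hand(p, r):.2f}")
```

When the two scores are equal the two means agree; the further apart they drift, the harder the harmonic mean leans toward the smaller one.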
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
Four short steps, start to finish: turn the M/B letters into 1/0, fit the S-curve machine
on scaled inputs, build the four-box table, then read the four scores off it.
>> NEW TO PYTHON? Each named once:
df['col'] -- pull one named column out of a table (a pandas DataFrame)
.map({...}) -- swap each value using a {from: to} shelf (a dict)
df.drop(columns=) -- a copy of the table with some columns removed
First, encode the answer column. The bin you are HUNTING gets the 1:
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
X = df.drop(columns='diagnosis') # 30 measured columns
y = df['diagnosis'] # 0 or 1
!! WARN: RUN THE MAP CELL ONLY ONCE
If you run the encoding cell a second time, the column already holds 0 and 1. The map
searches for 'M' and 'B' -- finds nothing -- returns all NaN. NaN is a float; every
downstream step breaks. Fix: restart the kernel and run top-to-bottom once. There is
no shortcut after the NaN has landed.
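The double-run failure can be reproduced on a three-row toy column. A sketch: the toy table is made up, and `encode_once` is a hypothetical guard of my own (not something the post uses) that skips the map when the column already holds only 0 and 1.

```python
import pandas as pd

SHELF = {"M": 1, "B": 0}

df = pd.DataFrame({"diagnosis": ["M", "B", "M"]})   # toy column, made-up

first = df["diagnosis"].map(SHELF)    # first run: letters -> 1 0 1
second = first.map(SHELF)             # second run: no 'M' or 'B' left -> all NaN
print(first.tolist(), first.dtype)
print(second.tolist(), second.dtype)

def encode_once(column):
    """Hypothetical guard: only map if the column is not already 0/1."""
    if set(column.unique()) <= {0, 1}:
        return column
    return column.map(SHELF)

safe = encode_once(encode_once(df["diagnosis"]))   # running it twice is now harmless
print(safe.tolist())
```

The guard only prevents the damage; once a column is already all NaN, the restart-and-rerun advice above still stands.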
Fit the machine on scaled inputs:
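from sklearn.linear_model import LogisticRegression
# X_train_scaled: the study pile after scaling (mean 0, spread 1 per column)
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)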
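Build the four-box table from truth and guess (the guesses have to exist first):
y_pred = log_reg.predict(X_test_scaled) # guesses on the scaled exam pile
conf_matrix = confusion_matrix(y_test, y_pred) # truth first, guess second
sns.heatmap(conf_matrix, annot=True, fmt='d') # needs: import seaborn as sns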
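grid shape:     guess well (col 0)   guess sick (col 1)
[[ CLEAR               ALARM  ]      <- truth well (row 0)
 [ MISSED              CAUGHT ]]     <- truth sick (row 1)
sklearn sorts the labels, so 0 (well) comes first: this is the pencil table with both
its rows and its columns in reverse order.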
!! WARN: TRUTH FIRST, GUESS SECOND
sklearn's confusion_matrix, accuracy_score, recall_score, and precision_score all
want (truth, guess) in that order. Swapping them transposes the grid: ALARM and MISSED
trade places, so recall and precision trade values too. Only accuracy survives the swap.
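What the swap does is visible on the 4-lump pile from the four-box section (truth sick, sick, well, well; guess sick, sick, sick, well). A small sketch using sklearn's own functions; the variable names `right` and `wrong` are mine.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_test = [1, 1, 0, 0]   # truth
y_pred = [1, 1, 1, 0]   # guess

right = confusion_matrix(y_test, y_pred)   # truth first, guess second
wrong = confusion_matrix(y_pred, y_test)   # arguments swapped

print("right:\n", right)
print("wrong:\n", wrong)
print("recall, right order:  ", recall_score(y_test, y_pred))
print("recall, swapped order:", recall_score(y_pred, y_test))
print("precision, right order:", precision_score(y_test, y_pred))
```

The swapped grid is the transpose of the right one, and the swapped "recall" is really the precision.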
Read the four scores off it:
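from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
y_pred = log_reg.predict(X_test_scaled) # guesses at the default 0.5 cutoff
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)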
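For anyone who wants the four steps in one runnable piece, here is a sketch under loudly stated assumptions. It uses scikit-learn's bundled copy of the Wisconsin sheet instead of the post's `df`, so the M/B map is replaced by a flip: that copy codes malignant as 0 and benign as 1, the opposite of the shelf used here. The 80/20 stratified split and `random_state=42` on the split are my own choices, not the post's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: the bundled dataset stands in for the post's CSV table.
X, target = load_breast_cancer(return_X_y=True)
y = 1 - target   # bundled copy has malignant = 0; flip so sick (M) = 1

# ASSUMPTION: split proportions and seed are illustrative.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Learn the ruler on the study pile only, then reuse it on the exam pile.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

y_pred = log_reg.predict(X_test_scaled)                  # labels, cutoff 0.5
y_proba = log_reg.predict_proba(X_test_scaled)[:, 1]     # column 1 = P(sick)

conf_matrix = confusion_matrix(y_test, y_pred)           # truth first
clear, alarm, missed, caught = conf_matrix.ravel()       # sklearn's row order

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"CAUGHT={caught} MISSED={missed} ALARM={alarm} CLEAR={clear}")
print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} F1={f1:.3f}")
```

The exact counts depend on the split, so read MISSED first and the accuracy last.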
## The Labels, Last
Plain term used above          Standard label
----------------------------   ------------------------------------------
sort into bins / bin-sorter    classification / classifier
S-curve yes/no guesser         logistic regression
squash curve                   sigmoid / logistic function
dial sum (z)                   log-odds / logit
cross-entropy leftover         binary cross-entropy / log-loss
CAUGHT                         true positive (TP)
ALARM                          false positive (FP)
MISSED                         false negative (FN)
CLEAR                          true negative (TN)
four-box table                 confusion matrix
recall                         sensitivity / true positive rate (TPR)
put on one shared ruler        standardisation / StandardScaler
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 3 -- Sorting Into Bins):
Part 1 (this post) .
Part 2 -- The Trade Curve .
Part 3 -- Leash and Cloud .
Part 4 -- Picking Settings, Skewed Piles
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================