Sorting Into Bins: The S-Curve, the Four-Box Table, and Why Accuracy Lies

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------

  CHAPTER 3 . SORTING INTO BINS . PART 1 OF 4
  Sorting Into Bins: The S-Curve, the Four-Box Table, and Why Accuracy Lies
  Posted: 2026-06-05 . Author: Rahul Rai . Tags: logistic-regression, classification, evaluation
  ============================================================================================

  PATH . post 6 of 28
    <- prev:  Grading 2: Reading the Dials
       next:  Sorting 2: The Trade Curve ->

  Every post so far ended with a number: a house worth 2.8, a car that does 18 to the
  gallon. This one ends with something heavier -- a verdict. Picture the room. A doctor
  has photographed the cells from a breast lump and measured thirty things about them:
  size, texture, smoothness. The question on the table is not "how malignant?" It is the
  one that keeps people up at night: malignant, OR NOT? Two bins. One answer. And a
  wrong answer that can cost a life.

  The machine still does what it always did -- it brews a sliding number inside. But that
  number can no longer just be read aloud. It has to be bent into a chance, then chopped
  into a label, and from here on every choice we make is haunted by which kind of mistake
  we can least afford to make.


  ## From Number to Bin

    previous posts (number):   ----------*------>   answer = 2.83

    this post (bin):           [ B = well ]  [ M = sick ]   answer = ONE box

  The dial-adding machine can only produce a sliding number. To land in one of two bins it
  needs two more steps: a SQUASH that bends any number into the range 0-1 (so it reads as a
  chance), and a CUTOFF that chops that chance into a label.

    columns -> *dials -> add -> SQUASH -> chance 0-1 -> CUTOFF -> [B] or [M]


  ## The Squash Curve

  The squash curve has an S-shape. Any real number goes in; a chance between 0 and 1 comes
  out. The formula:

      sig(z) = 1 / (1 + e^-z)

  where z is the dial sum: z = b0 + x1*b1 + x2*b2 + ... + x30*b30.

  Where does that formula come from?  It is not astrology -- derive it from ODDS.
  Odds = chance for, divided by chance against:

    chance 0.5  ->  odds = 0.5 / 0.5 = 1      (even money)
    chance 0.8  ->  odds = 0.8 / 0.2 = 4      (4 to 1 on)
    chance 0.9  ->  odds = 0.9 / 0.1 = 9      (9 to 1 on)

  Now the one modelling CHOICE in this whole machine: let each +1 step of the dial
  sum z MULTIPLY the odds by a fixed amount (call it e, about 2.718 -- any fixed
  multiplier gives the same S-shape; e is picked because its slopes come out clean).
  So odds = e^z.  At z = 0 the odds are e^0 = 1, even money.  Each +1 of z nearly
  triples the odds; each -1 cuts them to about a third.

  Walk it back to a chance.  If odds = (chance) / (1 - chance) = e^z, solve by pencil:

    chance        = e^z x (1 - chance)
    chance        = e^z - e^z x chance
    chance + e^z x chance = e^z
    chance x (1 + e^z)    = e^z
    chance        = e^z / (1 + e^z)

  Divide top and bottom by e^z:

    chance        = 1 / (1/e^z + 1)  =  1 / (1 + e^-z)        <- the squash curve

  So the S-curve is nothing but "dial sum sets the odds, odds walked back to a
  chance."  Check the ends: z huge -> e^-z tiny -> chance near 1.  z hugely
  negative -> e^-z huge -> chance near 0.  z = 0 -> 1/(1+1) = 0.5, even money.
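
  The derivation above can be checked by machine. This is a minimal sketch, not part of
  the original pencil work: the function names chance_from_odds and squash are made up for
  the illustration. It walks the long road (dial sum -> odds -> chance) and compares it
  with the closed form.

```python
import math

def chance_from_odds(z):
    """Dial sum -> odds -> chance, following the pencil derivation step by step."""
    odds = math.exp(z)            # modelling choice: each +1 of z multiplies the odds by e
    return odds / (1.0 + odds)    # solve odds = chance / (1 - chance) for chance

def squash(z):
    """The squash curve in its closed form, 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"z = {z:+.1f}   via odds = {chance_from_odds(z):.4f}   squash = {squash(z):.4f}")
```

  Both columns should agree to the last printed digit, with the ends hugging 0 and 1 and
  z = 0 landing on even money.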

  A concrete 4-person worked example, by pencil.  Suppose we have only ONE column
  (bmi) and 4 people:

    person   bmi (x)   dial sum z = b0 + b1*x
    ------------------------------------------
      A      0.04      1.2 - 30 * 0.04 = 1.2 - 1.2 = 0.0
      B      0.06      1.2 - 30 * 0.06 = 1.2 - 1.8 = -0.6
      C      0.12      1.2 - 30 * 0.12 = 1.2 - 3.6 = -2.4
      D      0.18      1.2 - 30 * 0.18 = 1.2 - 5.4 = -4.2

  Now squash each z through the curve:

    person   z       sig(z) = 1 / (1 + e^-z)             chance
    --------------------------------------------------------------
      A      0.0     1 / (1 + e^0)     = 1 / (1+1)       = 0.500
      B     -0.6     1 / (1 + e^0.6)   = 1 / (1+1.822)   = 0.354
      C     -2.4     1 / (1 + e^2.4)   = 1 / (1+11.023)  = 0.083
      D     -4.2     1 / (1 + e^4.2)   = 1 / (1+66.686)  = 0.015

    big positive z  ->  sig(z) near 1  ->  "sick"
    big negative z  ->  sig(z) near 0  ->  "well"
    z = 0           ->  sig(0) = 0.5   ->  fence

    To compute e^0.6 by pencil: look up a table of exponentials, or note
    e^0.5 ~= 1.649 and e^0.1 ~= 1.105, so e^0.6 = e^0.5 * e^0.1
    ~= 1.649 * 1.105 ~= 1.822.  That is close enough for the picture.
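
  The same two tables can be reproduced in a few lines of Python. A sketch only: the
  names B0, B1 and people are chosen here for the illustration, and the dials are the
  ones used in the table above (b0 = 1.2, b1 = -30).

```python
import math

B0, B1 = 1.2, -30.0   # the two dials from the worked table

def squash(z):
    return 1.0 / (1.0 + math.exp(-z))

people = {"A": 0.04, "B": 0.06, "C": 0.12, "D": 0.18}   # person -> bmi

chances = {}
for name, bmi in people.items():
    # round away float dust so person A sits at exactly 0.0 rather than -0.0
    z = round(B0 + B1 * bmi, 10) + 0.0
    chances[name] = squash(z)
    print(f"{name}   bmi = {bmi:.2f}   z = {z:+.1f}   chance = {chances[name]:.3f}")
```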

  >> YOUR TURN
     A fifth person E walks in (made-up): bmi 0.10, same dials (b0 = 1.2, b1 = -30).
     Compute the dial sum z and the squashed chance on your slate.  Hint:
     e^1.8 = e^0.6 * e^0.6 * e^0.6, and e^0.6 ~= 1.822 from the note above.

     check your slate:  z = 1.2 - 30 * 0.10 = 1.2 - 3.0 = -1.8;  e^1.8 ~= 1.822 *
     1.822 * 1.822 ~= 3.320 * 1.822 ~= 6.05;  sig(-1.8) = 1 / (1 + 6.05) = 1 / 7.05
     ~= 0.142.  Below the 0.5 cutoff, E is called well -- about a 14-in-100 chance
     of sick.

  At the default cutoff of 0.5: if the chance is >= 0.5, call sick; below 0.5, call well.
  Patient A (chance 0.500) sits exactly on the fence, and the >= rule tips a tie to sick.
  Patients B, C, D are called well.  The machine also does something deeper: the linear
  sum z before the squash is the LOG-ODDS (the natural log of the odds, where odds =
  chance-sick divided by chance-well):

      log( P(sick|x) / P(well|x) ) = b0 + x^T beta

  Each dial beta_j is the change in log-odds per one-unit step in its column, holding the
  other 29 fixed.  For person A, log-odds = 0 means P(sick) = P(well) = 0.5.
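
  A short sketch of the round trip, assuming nothing beyond the formulas above: squash a
  dial sum into a chance, take the log-odds of that chance, and the dial sum comes back.
  The helper names log_odds and verdict are invented for this example.

```python
import math

def squash(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(chance):
    """Natural log of chance-sick over chance-well."""
    return math.log(chance / (1.0 - chance))

def verdict(chance, cutoff=0.5):
    """1 = sick, 0 = well; a chance exactly on the fence counts as sick."""
    return 1 if chance >= cutoff else 0

for z in (0.0, -0.6, -2.4, -4.2):   # persons A to D
    p = squash(z)
    print(f"z = {z:+.1f}   chance = {p:.3f}   log-odds = {log_odds(p):+.3f}   verdict = {verdict(p)}")
```

  Because the log-odds of 0.5 is 0, chopping the chance at 0.5 is the same as asking
  whether z is at or above zero.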


  ## Setting the Dials: What the Machine Minimises

  In the straight-stick rule the machine minimised squared leftovers (MSE). Here the
  answer is a label (0 or 1), so squared distance is the wrong ruler. Instead the machine
  maximises the likelihood of the observed labels, which is the same as minimising a
  CROSS-ENTROPY leftover:

      L(beta) = -(1/n) sum [ y*log sig(z) + (1-y)*log(1-sig(z)) ]

  For a lump that is truly sick (y=1): if the machine outputs chance ~= 1, log(~1) ~= 0 --
  small leftover. If it outputs chance ~= 0 on a sick lump, the log of a near-zero number
  plunges toward minus infinity, and the minus sign out front turns that into an enormous
  leftover. The leftover punishes confidence in the wrong direction. There is no
  closed-form minimiser; the dials are found by rolling downhill, one gradient step at a time:

    gradient of L w.r.t. beta:   (1/n) X^T ( sig(X*beta) - y )

    each step:  beta <- beta - a * gradient   (a = step size)
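
  To see the leftover and the downhill step work together, here is a pure-Python sketch
  on a made-up pile of four lumps with ONE column. The data (xs, ys), the step size
  a = 0.5, and the 200 steps are all assumptions chosen for the illustration; a real
  library uses a cleverer solver than this plain loop.

```python
import math

def squash(z):
    return 1.0 / (1.0 + math.exp(-z))

def leftover(b0, b1, xs, ys):
    """Cross-entropy leftover L, averaged over the pile."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = squash(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(xs)

def downhill_step(b0, b1, xs, ys, a):
    """gradient = (1/n) X^T (sig(X*beta) - y), then beta <- beta - a * gradient."""
    n = len(xs)
    errs = [squash(b0 + b1 * x) - y for x, y in zip(xs, ys)]
    g0 = sum(errs) / n                              # the column of ones (for b0)
    g1 = sum(e * x for e, x in zip(errs, xs)) / n   # the single measured column
    return b0 - a * g0, b1 - a * g1

xs = [-1.2, -0.4, 0.3, 1.3]   # made-up column, already on the shared ruler
ys = [0, 1, 0, 1]             # made-up labels (1 = sick)

b0, b1 = 0.0, 0.0             # start with every dial at zero
history = [leftover(b0, b1, xs, ys)]
for _ in range(200):
    b0, b1 = downhill_step(b0, b1, xs, ys, a=0.5)
    history.append(leftover(b0, b1, xs, ys))

print(f"leftover: start {history[0]:.3f} -> end {history[-1]:.3f};  b0 = {b0:.2f}, b1 = {b1:.2f}")
```

  With all dials at zero every chance is 0.5, so the starting leftover is log 2, about
  0.693. Each step should nudge it lower.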

  Count the clerk-steps for ONE downhill step.  Per lump: 30 multiplies and 30 adds
  for the dial sum z, one squash and one subtraction, then 30 multiplies and 30 adds
  for the gradient -- call it 125 strokes.  The Wisconsin sheet holds 569 lumps, so
  one step costs 569 x 125 = 71,125 strokes, and a thousand steps run near 71 million.
  That is why the clerks, not you, hold the pencils here.
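
  The stroke count is plain arithmetic, so it can be tallied directly. A sketch under the
  post's own figures; the variable names are invented here. Note the exact per-lump tally
  is 122, which the post rounds up to 125.

```python
COLUMNS = 30   # measured columns per lump
LUMPS = 569    # rows on the Wisconsin sheet
STEPS = 1000   # downhill steps

dial_sum = COLUMNS + COLUMNS        # 30 multiplies + 30 adds
squash_and_subtract = 2             # one squash, one subtraction
gradient = COLUMNS + COLUMNS        # 30 multiplies + 30 adds

exact_per_lump = dial_sum + squash_and_subtract + gradient
rounded_per_lump = 125              # the post's round figure
per_step = LUMPS * rounded_per_lump

print(f"exact per lump: {exact_per_lump}, per step: {per_step:,}, per {STEPS} steps: {per_step * STEPS:,}")
```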

  In code it is the same two-step as before -- make an empty machine, show it the study
  pile -- and that code waits at the end of the post.

  >> NOTE: WHY SCALED INPUTS?
     The gradient update moves every dial by the same step size times its column's values.
     A column measured in thousands (area ~= 1000) takes dial steps about 10,000x larger
     than one measured in tenths (smoothness ~= 0.1). The machine lurches around the large
     column's dial and barely moves the small one's. Put every column on the same ruler
     first -- mean 0, spread 1 -- and the step sizes become comparable.
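
  A sketch of "mean 0, spread 1" using only the standard library. The helper name
  put_on_one_ruler and both columns of numbers are made up for the illustration; the
  spread used is the population standard deviation, which is assumed here to mirror what
  a scaler such as StandardScaler computes.

```python
import statistics

def put_on_one_ruler(column):
    """Shift to mean 0 and rescale to spread 1 (population standard deviation)."""
    mean = statistics.fmean(column)
    spread = statistics.pstdev(column)
    return [(value - mean) / spread for value in column]

area = [1001.0, 1326.0, 386.1, 1297.0]      # made-up values, in the hundreds to thousands
smoothness = [0.118, 0.085, 0.143, 0.100]   # made-up values near 0.1

area_scaled = put_on_one_ruler(area)
smoothness_scaled = put_on_one_ruler(smoothness)

for name, col in (("area", area_scaled), ("smoothness", smoothness_scaled)):
    print(f"{name:<10} mean = {statistics.fmean(col):+.3f}   spread = {statistics.pstdev(col):.3f}")
```

  After the rescale both columns live on the same ruler, so one step size suits both dials.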


  ## Encoding the Labels

  A machine eats numbers, but the sheet holds letters: M (malignant) and B (benign). The
  translation is a shelf: M->1, B->0. The bin you are HUNTING gets the 1 -- here that is
  the sick lump, because recall, precision, and F1 all count detections of whichever bin
  holds the 1. (The one-line shelf that does this M->1, B->0 swap is in the code at the end,
  along with a warning about running it twice.)


  ## The Four-Box Table

  IN HAND: a machine that adds 30 dialed columns into z, squashes z into a chance
  (z = 0 gives 1 / (1 + 1) = 0.5, even money), and chops at a cutoff of 0.5; labels
  on the shelf, M->1, B->0.  This section adds: a table counting the four ways a
  verdict can land.

  Every exam lump lands in exactly one of four boxes based on truth and guess:

                   guess SICK(1)    guess WELL(0)
    truth SICK(1)  CAUGHT         | MISSED         <- deadly if large
    truth WELL(0)  ALARM          | CLEAR          <- wasteful if large

    rows    = the truth     (sick / well)
    columns = the guess     (shouted sick / said well)
    each box = a plain count

  A concrete 4-person example, by pencil:

    exam pile has 4 lumps: 2 sick (truth=1), 2 well (truth=0)

    lump   truth   machine chance   cutoff 0.5 -> guess
    ----------------------------------------------------
    1      sick    0.92            >= 0.5 -> sick (1)     CAUGHT
    2      sick    0.63            >= 0.5 -> sick (1)     CAUGHT
    3      well    0.78            >= 0.5 -> sick (1)     ALARM
    4      well    0.11            <  0.5 -> well (0)     CLEAR

    Four-box table with counts:

                   guess SICK      guess WELL
    truth SICK     2 (CAUGHT)     0 (MISSED)      <- 2 truly sick
    truth WELL     1 (ALARM)      1 (CLEAR)        <- 2 truly well

    accuracy  = (2 + 1) / 4 = 3/4 = 0.75
    recall    = 2 / (2 + 0) = 2/2 = 1.00   (caught every sick person)
    precision = 2 / (2 + 1) = 2/3 = 0.67   (but 1 healthy person was alarmed)
    F1        = 2*0.67*1.00 / (0.67 + 1.00) = 1.34 / 1.67 = 0.80

    The machine caught 2/2 sick people (recall=1.00) but at the cost of 1 false
    alarm (precision=0.67).  The four-box table shows EXACTLY which mistake
    happened: look at the bottom-left box (ALARM) and you see the cost in plain
    light, not buried inside a single percentage.
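
  The 4-lump pile can be tallied by a small pure-Python sketch. The helper names
  (four_boxes, ratio, four_scores) are invented for this example and are not sklearn
  functions; a score whose bottom is zero comes back as None rather than a number.

```python
def four_boxes(truths, chances, cutoff=0.5):
    """Tally the four boxes; truth and guess both use 1 = sick, 0 = well."""
    boxes = {"CAUGHT": 0, "MISSED": 0, "ALARM": 0, "CLEAR": 0}
    for truth, chance in zip(truths, chances):
        guess = 1 if chance >= cutoff else 0
        if truth == 1:
            boxes["CAUGHT" if guess == 1 else "MISSED"] += 1
        else:
            boxes["ALARM" if guess == 1 else "CLEAR"] += 1
    return boxes

def ratio(top, bottom):
    """Plain division, or None when the bottom is zero (score undefined)."""
    return top / bottom if bottom else None

def four_scores(boxes):
    caught, missed = boxes["CAUGHT"], boxes["MISSED"]
    alarm, clear = boxes["ALARM"], boxes["CLEAR"]
    recall = ratio(caught, caught + missed)
    precision = ratio(caught, caught + alarm)
    f1 = None
    if recall is not None and precision is not None and precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": ratio(caught + clear, caught + missed + alarm + clear),
        "recall": recall,
        "precision": precision,
        "F1": f1,
    }

truths = [1, 1, 0, 0]                  # lumps 1-4: sick, sick, well, well
chances = [0.92, 0.63, 0.78, 0.11]     # machine chance for each lump
boxes = four_boxes(truths, chances)
scores = four_scores(boxes)

print(boxes)
for name, value in scores.items():
    print(f"{name:<9} = {value:.2f}")
```

  four_scores takes any dictionary of the four counts, so a pile given as counts alone can
  be checked the same way without listing every lump.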

  >> YOUR TURN
     A bigger exam pile (made-up): 10 lumps land as CAUGHT 4, MISSED 1, ALARM 2,
     CLEAR 3.  Work all four scores on your slate before reading on.

     check your slate:  everyone = 4 + 1 + 2 + 3 = 10;  accuracy = (4 + 3) / 10 =
     7/10 = 0.70;  recall = 4 / (4 + 1) = 4/5 = 0.80;  precision = 4 / (4 + 2) =
     4/6 ~= 0.67;  F1 = 2 * 0.80 * 0.67 / (0.80 + 0.67) = 1.072 / 1.47 ~= 0.73.
     This machine let 1 sick lump walk out the door -- recall 0.80 says so first.

  CAUGHT is the ideal: a sick lump correctly flagged. MISSED is the catastrophe: cancer
  goes home untreated. ALARM is the waste: a healthy person gets a scare and an
  unnecessary biopsy. CLEAR is the other good outcome. Drawing this table from the guesses
  -- and the easy-to-miss gotcha about which way round sklearn wants the arguments -- is in
  the code at the end.


  ## Four Scores, Four Angles

  IN HAND: a four-box table; on the 4-lump pile the boxes held CAUGHT 2, MISSED 0,
  ALARM 1, CLEAR 1 -- 2 + 0 + 1 + 1 = 4, every lump counted once.  This section
  adds: four scores, each reading a different slice of those boxes.

  No single score summarises what a bin-sorter actually does. The four scores each look at
  a different slice of the four-box table:

    accuracy  = (CAUGHT + CLEAR) / everyone
    recall    =  CAUGHT / (CAUGHT + MISSED)      <- top row: share of sick caught
    precision =  CAUGHT / (CAUGHT + ALARM)       <- left col: when we cry sick, real sick?
    F1        =  2 * precision * recall / (precision + recall)

    Score       Formula (counts)        Punishes
    ---------   ---------------------   ----------------------------
    accuracy    (TP + TN) / n           any wrong answer equally
    recall      TP / (TP + FN)          MISSED (deadly)
    precision   TP / (TP + FP)          ALARM (wasteful)
    F1          2PR / (P + R)           either precision or recall being low

    key:  TP = CAUGHT, FN = MISSED, FP = ALARM, TN = CLEAR, n = everyone,
          P = precision, R = recall

  All four come straight from the four-box counts; the code that reads them off is at the
  end of the post.


  ## Why Accuracy Alone Lies

  Now the most important paragraph in the post. The Wisconsin sheet has roughly 63%
  benign lumps. So a machine that learns nothing whatsoever -- no dials, no squash, just
  one lazy constant shout of "well, well, well" -- walks away with 63% accuracy. On paper
  it looks like it is passing. In the exam room it is a catastrophe: it never once catches
  a sick person. Accuracy counted every box the same, so a MISSED cancer weighed exactly
  as much as a needless ALARM. But those two mistakes are not equal, and pretending they
  are is how people get hurt.

    fool machine: predict B for everything
    accuracy  = 0.63   <- looks decent
    recall    = 0.00   <- catches nobody
    precision = N/A    <- never shouts sick: CAUGHT + ALARM = 0, so 0/0 is undefined

  ** KEY: IN CANCER SCREENING, RECALL IS THE NORTH STAR
     Missing a sick person is the catastrophe. You can tolerate some ALARM (extra biopsies
     cost money, not lives). So the first score to check is recall -- how large a share of
     the sick lumps does the machine catch? Accuracy tells you nothing about which mistake
     you are making.
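
  The fool machine's scorecard can be worked out from class counts alone. A sketch, with
  one labelled assumption: the split of the 569 lumps is taken as 357 benign and 212
  malignant, which matches the "roughly 63% benign" figure above.

```python
BENIGN, MALIGNANT = 357, 212   # assumed class counts; 357 + 212 = 569

# The fool says "well" for every lump, so it never shouts sick.
caught, missed = 0, MALIGNANT
alarm, clear = 0, BENIGN

accuracy = (caught + clear) / (caught + missed + alarm + clear)
recall = caught / (caught + missed)
shouts = caught + alarm
precision = caught / shouts if shouts else None   # 0/0 -> undefined

print(f"accuracy  = {accuracy:.2f}")
print(f"recall    = {recall:.2f}")
print(f"precision = {precision}")
```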


  ## Common Tripwires I Caught

  These are the exact wrong pictures that cost me real time.  Each one
  stopped me cold until I drew the concrete shape of the mistake:

    TRIPWIRE 1:  Extra brackets around the scaler input -> 3D error
       WRONG: scaler.fit([X_train])
       RIGHT: scaler.fit(X_train)
       The error says "Found array with dim 3. StandardScaler expected <= 2."
       Wrapping X_train in [ ] adds a third layer.  The scaler expects a
       2D table (rows x columns), not a 2D table inside a list.

    TRIPWIRE 2:  Running the encoding cell (Stage 2) twice -> map fails -> NaN -> crash
       WRONG: run the M->1, B->0 cell a second time.
       RIGHT: restart the kernel and run all cells top-to-bottom once.
       First run: 'M' 'B' 'M' -> map works -> 1 0 1 (int64)
       Second run: 1 0 1 -> map finds no 'M' or 'B' -> NaN -> float -> crash.
       There is no shortcut after NaN has landed.

    TRIPWIRE 3:  Predicting on raw X instead of scaled X
       WRONG: log_reg.predict(X_test)
       RIGHT: log_reg.predict(X_test_scaled)
       The machine set its dials on scaled numbers.  Raw numbers send
       it into a different space -- the dials misfire.

    TRIPWIRE 4:  confusion_matrix order -- truth first, guess second
       WRONG: confusion_matrix(y_pred, y_test)
       RIGHT: confusion_matrix(y_test, y_pred)
       sklearn always wants: tool(truth, guess) -- same order as
       accuracy_score, recall_score, etc.  Swapping transposes the grid:
       ALARM and MISSED trade places.

    TRIPWIRE 5:  predict_proba column 0 vs column 1
       WRONG: y_proba = model.predict_proba(...)[:, 0]
       RIGHT: y_proba = model.predict_proba(...)[:, 1]
       Column 0 = P(well).  Column 1 = P(sick) -- the one you want.
       Using column 0 makes sick people score LOW and well people HIGH,
       flipping the ROC curve below the diagonal and pushing AUC under 0.5.

    TRIPWIRE 6:  C=0.1 is heavy punishment, not light
       WRONG: C small = light leash (sounds like "small = mild")
       RIGHT: C = 1/lambda.  Small C -> large lambda -> heavy squeeze.
       C=0.1 -> lambda=10, a tight leash.  C=1000 -> lambda=0.001, nearly free.

    TRIPWIRE 7:  Scaling and L2 are not the same thing
       WRONG: "I scaled the columns, so the dials are already under control."
       RIGHT: Scaling fixes the INPUTS (columns on one ruler).  L2 fixes the
       DIALS (no single dial dominates).  Even after scaling, a dial can grow
       huge if the machine over-trusts one column.  Both are needed.

    TRIPWIRE 8:  predict vs predict_proba
       WRONG: use predict for the ROC curve (gives only 0 and 1).
       RIGHT: use predict_proba (gives the raw chance before any cutoff).
       predict gives two unique values -> ROC curve is one elbow between two corners -> useless.
       predict_proba gives a different number per row -> the full sweep.
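
  Tripwire 8 can be seen without any library. This sketch sweeps the cutoff by hand over
  the 4-lump pile from the four-box section; roc_points is a made-up helper for the
  illustration, not sklearn's roc_curve, and it reports (alarm rate, recall) pairs.

```python
def roc_points(truths, scores):
    """(alarm rate, recall) at every cutoff that changes a verdict, plus the empty corner."""
    sick = sum(truths)
    well = len(truths) - sick
    points = [(0.0, 0.0)]   # cutoff above every score: nobody is called sick
    for cutoff in sorted(set(scores), reverse=True):
        caught = sum(1 for t, s in zip(truths, scores) if t == 1 and s >= cutoff)
        alarm = sum(1 for t, s in zip(truths, scores) if t == 0 and s >= cutoff)
        points.append((alarm / well, caught / sick))
    return points

truths = [1, 1, 0, 0]
chances = [0.92, 0.63, 0.78, 0.11]                    # what predict_proba-style output looks like
labels = [1 if c >= 0.5 else 0 for c in chances]      # what predict-style output looks like

from_chances = roc_points(truths, chances)
from_labels = roc_points(truths, labels)
print("from chances:", from_chances)
print("from labels: ", from_labels)
```

  Hard labels leave only the two corners and a single point between them; the chances
  give one point per distinct value, which is the sweep the curve needs.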


  ## F1: Why the Harmonic Mean

  F1 is the harmonic mean of precision and recall, not the arithmetic mean. The harmonic
  mean is dominated by whichever value is smaller. If recall = 0.98 but precision = 0.10,
  the arithmetic average is 0.54 (looks okay); the harmonic mean F1 = 0.18 (correctly
  punishes the terrible precision). F1 collapses toward zero when either score is near
  zero -- you cannot paper over one bad score with one great score.

    F1 = 2 * P * R / (P + R)

    example:  P=0.10,  R=0.98
    arithmetic mean = (0.10 + 0.98) / 2 = 0.54
    harmonic mean   = 2 * 0.10 * 0.98 / (0.10 + 0.98) = 0.18   <- honest
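
  A few more pairs make the pattern plain. A minimal sketch; the function names and the
  extra (precision, recall) pairs beyond the post's own 0.10 / 0.98 example are made up.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def f1(p, r):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if p + r else 0.0

pairs = [(0.10, 0.98), (0.50, 0.50), (0.90, 0.90), (0.01, 1.00)]
for p, r in pairs:
    print(f"P = {p:.2f}  R = {r:.2f}   arithmetic = {arithmetic_mean(p, r):.2f}   F1 = {f1(p, r):.2f}")
```

  When the two scores agree, both means agree too; the gap opens only when one score is
  far below the other.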


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  Four short steps, start to finish: turn the M/B letters into 1/0, fit the S-curve machine
  on scaled inputs, build the four-box table, then read the four scores off it.

  >> NEW TO PYTHON? Each named once:
       df['col']          -- pull one named column out of a table (a pandas DataFrame)
       .map({...})        -- swap each value using a {from: to} shelf (a dict)
       df.drop(columns=)  -- a copy of the table with some columns removed

  First, encode the answer column. The bin you are HUNTING gets the 1:

    df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
    X = df.drop(columns='diagnosis')   # 30 measured columns
    y = df['diagnosis']                # 0 or 1

  !! WARN: RUN THE MAP CELL ONLY ONCE
     If you run the encoding cell a second time, the column already holds 0 and 1. The map
     searches for 'M' and 'B' -- finds nothing -- returns all NaN. NaN is a float; every
     downstream step breaks. Fix: restart the kernel and run top-to-bottom once. There is
     no shortcut after the NaN has landed.
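
  The double-run failure can be imitated without pandas. In this sketch map_like_pandas
  is a stand-in that assumes the behaviour described above (values missing from the shelf
  become NaN), and encode_guarded is a hypothetical helper, not a pandas function, that
  refuses to run on a column it does not recognise.

```python
import math

SHELF = {"M": 1, "B": 0}

def map_like_pandas(column, shelf):
    """Stand-in for mapping with a dict: values missing from the shelf become NaN."""
    return [shelf.get(value, math.nan) for value in column]

def encode_guarded(column, shelf):
    """Hypothetical helper: raise instead of silently producing NaN."""
    strangers = sorted({repr(v) for v in column if v not in shelf})
    if strangers:
        raise ValueError(f"values not on the shelf: {strangers}")
    return [shelf[v] for v in column]

letters = ["M", "B", "M"]
first = map_like_pandas(letters, SHELF)    # letters -> 1 / 0
second = map_like_pandas(first, SHELF)     # numbers are not on the shelf -> all NaN

print("first run: ", first)
print("second run:", second)
try:
    encode_guarded(first, SHELF)
except ValueError as err:
    print("guard stopped the second run:", err)
```

  A guard of this kind turns the silent NaN into a loud error at the line that caused it.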

  Fit the machine on scaled inputs:

    from sklearn.linear_model import LogisticRegression

    log_reg = LogisticRegression(random_state=42)
    log_reg.fit(X_train_scaled, y_train)   # study pile, already on the shared ruler

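  Build the four-box table from truth and guess (confusion_matrix comes from
  sklearn.metrics; sns is seaborn, brought in with: import seaborn as sns):

    y_pred = log_reg.predict(X_test_scaled)          # guesses at the default 0.5 cutoff
    conf_matrix = confusion_matrix(y_test, y_pred)   # truth first, guess second
    sns.heatmap(conf_matrix, annot=True, fmt='d')    # fmt='d' prints whole-number counts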

    grid shape:  [[ CLEAR  ALARM  ]    <- truth well  (row 0)
                  [ MISSED CAUGHT ]]   <- truth sick  (row 1)

  !! WARN: TRUTH FIRST, GUESS SECOND
     sklearn's confusion_matrix, accuracy_score, recall_score, and precision_score all
     want (truth, guess) in that order. Swapping them transposes the grid: ALARM and
     MISSED trade places, so recall and precision trade values too.

  Read the four scores off it:

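    from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

    y_pred = log_reg.predict(X_test_scaled)   # hard 0/1 guesses at the 0.5 cutoff
    accuracy  = accuracy_score(y_test, y_pred)
    recall    = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1        = f1_score(y_test, y_pred)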


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    sort into bins / bin-sorter           classification / classifier
    S-curve yes/no guesser                logistic regression
    squash curve                          sigmoid / logistic function
    dial sum (z)                          log-odds / logit
    cross-entropy leftover                binary cross-entropy / log-loss
    CAUGHT                                true positive (TP)
    ALARM                                 false positive (FP)
    MISSED                                false negative (FN)
    CLEAR                                 true negative (TN)
    four-box table                        confusion matrix
    recall                                sensitivity / true positive rate (TPR)
    put on one shared ruler               standardisation / StandardScaler

----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 3 -- Sorting Into Bins):
    Part 1 (this post) .
    Part 2 -- The Trade Curve .
    Part 3 -- Leash and Cloud .
    Part 4 -- Picking Settings, Skewed Piles

  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================