Picking Settings, Skewed Piles, and Averaging Many Classes

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================

  CHAPTER 3 . SORTING INTO BINS . PART 4 OF 4
  Posted: 2026-06-05 . Author: Rahul Rai . Tags: grid-search, precision-recall, class-imbalance
  ============================================================================================

  So far the world has been tidy: one sheet of breast lumps, one clean yes-or-no question,
  two groups of roughly equal size. Real problems are rarely so polite. This closing post
  drags the series out into the messy world and arms you for four things it will throw at
  you.

  How do you tune a machine's settings without cheating by peeking at the exam? What do
  you do when one bin outnumbers the other ninety-nine to one and your proud 99% accuracy
  turns out to be worthless? Which ruler do you trust when an outlier wanders in? And how
  do you grade a machine that is sorting into not two bins but ten? Answer these and you
  have left the textbook behind and stepped onto the workshop floor.


  ## Settings I Pick By Hand (Hyperparameters)

  Some numbers inside a machine are set by rolling downhill on the leftover -- the dials
  beta in the S-curve machine. Others are decided BEFORE training starts and are never
  touched by the leftover: the leash strength C, the neighbour-count k, the depth of a
  decision tree. These are settings you pick by hand. The machine cannot tune them for
  itself because the leftover doesn't flow back through them.

    dials (beta):   tuned by the machine as it trains    <- the machine adjusts these
    settings:       chosen by you before training starts <- the machine never adjusts these

    examples:
      C in LogisticRegression(C=0.1)                 <- leash strength
      k in KNeighborsClassifier(n_neighbors=5)       <- how many neighbours to ask
      n_estimators in RandomForestClassifier(...)    <- how many trees to grow


  ## The Grid Hunt

  You have several settings, each with several candidate values. A grid lists every
  combination. You try each one and measure how well it does -- but you must not use the
  sealed exam pile to measure. The fix is rotating folds: split the study pile into, say, 5
  equal strips; for each combination, train on 4 strips, score on the 5th, rotate which
  strip is left out, repeat, then average the 5 scores.

    grid for C and k:
    +-------+----+----+----+
    |       |C=.1|C=1 |C=10|
    +-------+----+----+----+
    | k = 3 | .  | .  | .  |   each cell = a combination
    | k = 5 | .  | .  | .  |   each dot = average of 5 fold scores
    | k = 7 | .  | .  | .  |   pick the cell with the best score
    +-------+----+----+----+

  Three things go into making a grid hunt trustworthy:

    1. The range and spacing of each setting. If C should be 0.01 the grid must include
       0.01; a grid of {1, 10, 100} misses it entirely. Both the candidate values and how
       finely you space them are decisions you make up front.
    2. Rotating folds (cross-validation) to measure each combination reliably. A single
       train/score split might be lucky or unlucky. Rotating over 5 strips gives 5
       readings on 5 different held-out strips and averages out that luck. (The readings
       are not fully independent, since their training strips overlap.)
    3. A scoring rule that matches the actual goal: accuracy if all mistakes cost the same;
       recall if missing sick people is catastrophic; F1 if both false alarms and misses
       matter.

  A concrete 5-fold example for C=0.1 vs C=1.0, by pencil.  Study pile has
  100 lumps.  Split into 5 strips of 20 each.

    strip   C=0.1 recall   C=1.0 recall
    ---------------------------------------
     1      0.92           0.95
     2      0.88           0.90
     3      0.94           0.89
     4      0.90           0.93
     5      0.86           0.82

    average:  (0.92+0.88+0.94+0.90+0.86)/5 = 4.50/5 = 0.900  (C=0.1)
              (0.95+0.90+0.89+0.93+0.82)/5 = 4.49/5 = 0.898  (C=1.0)

    C=0.1 wins by 0.002.  A single split could have pointed
    either way: strips 1, 2 and 4 alone favour C=1.0, while
    strips 3 and 5 alone favour C=0.1.  The fold average smooths
    out that luck.  Here the margin (0.002) is far smaller than
    the strip-to-strip swings, so treat the two settings as a
    near-tie on this sheet rather than a clear win for C=0.1.
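
  The pencil table above can be replayed in a few lines. This is a minimal sketch using
  only the fold scores printed in the table; the variable names are illustrative, not
  part of any library.

```python
# Fold scores copied from the pencil table above (recall per strip).
recall_by_setting = {
    0.1: [0.92, 0.88, 0.94, 0.90, 0.86],   # C=0.1
    1.0: [0.95, 0.90, 0.89, 0.93, 0.82],   # C=1.0
}

# Average the five strips for each setting.
mean_recall = {c: sum(s) / len(s) for c, s in recall_by_setting.items()}

# Which setting would a single split have picked, strip by strip?
winner_per_strip = [
    0.1 if a > b else 1.0
    for a, b in zip(recall_by_setting[0.1], recall_by_setting[1.0])
]

# The setting with the best fold average.
best_c = max(mean_recall, key=mean_recall.get)

for c, m in mean_recall.items():
    print(f"C={c}: mean recall = {m:.3f}")
print("winner on each strip alone:", winner_per_strip)
print("winner on the fold average:", best_c)
```

  The per-strip winners flip back and forth while the averages sit 0.002 apart, which is
  the reason to read the fold average and not any one strip.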

  >> YOUR TURN
     A grid pairs 4 candidate C-values with 3 candidate k-values, and scores every
     pair by 5-fold rotation (made-up).  How many full model-fits is that?

     check your slate:  pairs = 4 x 3 = 12;  fits = 12 x 5 folds = 60 complete fits.
     Each fit is a whole downhill solve -- this is why a clerk-room, not a single
     clerk, runs a grid hunt, and why the grid's size is a cost you choose.
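
  The same count can be sketched in code. Only the sizes (4 values, 3 values, 5 folds)
  come from the exercise; the candidate values themselves are hypothetical placeholders.

```python
from itertools import product

# Hypothetical candidate values; only the counts (4 and 3) come from the exercise.
c_values = [0.01, 0.1, 1, 10]
k_values = [3, 5, 7]
n_folds = 5

grid = list(product(c_values, k_values))   # every (C, k) pair
n_fits = len(grid) * n_folds               # one fit per pair per fold

print(len(grid), "pairs x", n_folds, "folds =", n_fits, "fits")
```

  GridSearchCV, with its default refit=True, then fits the winning pair once more on the
  whole study pile, so in practice expect one extra fit on top of this count.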

  In code the whole hunt is a Pipeline -- scaler plus classifier -- handed to GridSearchCV;
  that code is at the end of the post, and it carries one subtle trap worth stating now:

  !! WARN: BUILD THE RULER INSIDE EACH FOLD, NOT BEFORE THE HUNT
     A subtle leak: if you scale all of X_train ONCE up front and then pass that pre-scaled
     pile to GridSearchCV, every validation strip has already helped shape the ruler (its
     mean and spread). The strip you score on has leaked into the scaling, so the fold
     scores come out optimistic -- even though the final exam pile (X_test) is still
     sealed. The fix is to put the scaler INSIDE a Pipeline and pass raw X_train: now the
     ruler is rebuilt from each fold's 4 training strips only, and the held-out strip is
     measured by a ruler it never touched. Same peeking rule from Part 1, one level deeper.
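
  To see the leak in numbers, here is a small sketch on a made-up one-column study pile
  (every value below is hypothetical). It builds the ruler twice: once from all five
  strips, once from the four training strips only, and measures the held-out strip with
  each.

```python
from statistics import mean, pstdev

# Hypothetical one-column study pile, already cut into 5 strips of 4 rows.
strips = [
    [10.0, 12.0, 11.0, 13.0],
    [9.0, 14.0, 12.0, 10.0],
    [11.0, 13.0, 12.0, 12.0],
    [10.0, 11.0, 13.0, 12.0],
    [30.0, 28.0, 31.0, 29.0],   # the held-out strip happens to run high
]
held_out = strips[4]
training = [x for strip in strips[:4] for x in strip]
everything = training + held_out

def ruler(values):
    """Return (mean, spread): the two numbers the standard ruler stores."""
    return mean(values), pstdev(values)

def apply_ruler(values, mu, sigma):
    """Measure values with an already-built ruler."""
    return [(x - mu) / sigma for x in values]

leaky_mu, leaky_sigma = ruler(everything)   # built before the split: peeks
clean_mu, clean_sigma = ruler(training)     # built inside the fold: no peek

leaky_scaled = apply_ruler(held_out, leaky_mu, leaky_sigma)
clean_scaled = apply_ruler(held_out, clean_mu, clean_sigma)

print("leaky ruler:", round(leaky_mu, 2), round(leaky_sigma, 2))
print("clean ruler:", round(clean_mu, 2), round(clean_sigma, 2))
print("held-out strip, leaky:", [round(z, 2) for z in leaky_scaled])
print("held-out strip, clean:", [round(z, 2) for z in clean_scaled])
```

  Under the leaky ruler the held-out strip looks far more ordinary than it really is,
  because it helped set the mean and spread it is being measured against.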

  >> NOTE: THE EXAM PILE IS STILL SEALED
     GridSearchCV only ever touches the study pile (X_train). The exam pile (X_test) is
     opened exactly once at the very end to report the final score. If you tune settings on
     the exam pile, you are leaking future information and the final score is optimistic --
     the machine has already peeked.


  ## Two Kinds of Same-Ruler

  Parts 1-3 used the standard ruler: shift each column to mean 0, divide by its spread.
  There is a second popular ruler -- pinch-to-fit -- that squishes every column's values to
  sit between 0 and 1, with the smallest value landing on 0 and the largest on 1.

    standard ruler:  x~ = (x - mean) / spread        -> centred at 0, spread ~= 1
    pinch-to-fit:    x~ = (x - min)  / (max - min)   -> bounded in [0, 1]

  !! WARN: PINCH-TO-FIT IS FRAGILE AROUND OUTLIERS
     One outlier stretches the range. Say house areas run 500-3000 sq ft but one mansion
     hits 15000. The (max-min) denominator becomes 14500. Every other house gets squished
     into the bottom 17% of [0, 1]. The outlier sits at 1.0; the rest cluster near 0. The
     ruler works mathematically but leaves the normal values only a sliver of the range,
     because min and max are set by the most extreme rows alone. The standard ruler uses
     the mean and spread of every row, so on a large sheet one distant outlier moves it
     far less. It is not immune: on a tiny sheet the outlier still dominates the spread.

    area values:  500, 520, 3000, 15000

    pinch-to-fit (max=15000, min=500):
      500   ->  0.000
      520   ->  0.001      <- normal house, near 0
      3000  ->  0.172      <- large house, still near 0
      15000 ->  1.000      <- outlier, fills the far end

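    standard ruler (mean = 4755, spread ~= 6002):
      500   -> -0.71
      520   -> -0.71       <- normal houses sit just below 0, not pinned to an edge
      3000  -> -0.29
      15000 ->  1.71       <- outlier: unusual, but not pinned to the top of the scale

    (spread = population standard deviation; with only four rows the outlier
    still inflates it, so the normal houses stay bunched)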

  For the same-ruler rule (build from study only, apply to both), the same peeking logic
  applies: compute min, max (or mean, spread) from the study pile only.
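
  Both rulers are short enough to work in plain Python. A minimal sketch on the four
  areas from the example, assuming "spread" means the population standard deviation
  (divide by n), which is also how StandardScaler computes it:

```python
from statistics import mean, pstdev

areas = [500, 520, 3000, 15000]   # the four areas from the example above

# pinch-to-fit: (x - min) / (max - min)
lo, hi = min(areas), max(areas)
pinched = [(x - lo) / (hi - lo) for x in areas]

# standard ruler: (x - mean) / spread, with spread = population standard
# deviation (divide by n); that choice is an assumption of this sketch
mu, sigma = mean(areas), pstdev(areas)
standard = [(x - mu) / sigma for x in areas]

for x, p, z in zip(areas, pinched, standard):
    print(f"{x:>6}  pinch-to-fit {p:6.3f}   standard {z:6.2f}")
```

  On a real sheet, lo, hi, mu and sigma would be computed from the study pile only and
  then reused, unchanged, on the exam pile.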


  ## When ALARM Matters More Than MISSED

  Part 1 argued that in cancer screening, recall is the north star because a MISSED sick
  person is catastrophic. In other settings, the ALARM is the catastrophe.

    precision falls when ALARM is large   <- false positives pile up
    recall    falls when MISSED is large  <- false negatives pile up

    cancer screening:  MISSED = cancer goes untreated     -> prioritise recall
    legal evidence:    ALARM  = innocent person jailed    -> prioritise precision
    email spam:        ALARM  = real email deleted        -> precision matters more
    fraud detection:   MISSED = fraud slips through, loss -> recall matters

  ** KEY: PRECISION REDUCES ALARMS; RECALL REDUCES MISSES
     Choosing which to prioritise is a business or clinical decision, not a machine
     decision. The machine gives you a curve of deals (the trade curve from Part 2); you
     pick the deal that matches what each kind of mistake actually costs.

  In fraud detection, MISSED = a fraudulent transaction slips through and the business
  absorbs the loss. ALARM = a legitimate transaction is blocked and the customer is
  frustrated. For most fraud systems the financial loss of a missed fraud is worse than the
  friction of a blocked legitimate purchase, so recall is the primary score -- but both
  matter, and precision-recall curves (next) show the exact deal between them.
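
  One way to turn "what each mistake costs" into a choice of cutoff is to price the two
  mistakes and add them up. The sketch below is illustrative only: the deals and the
  prices are hypothetical numbers, not figures from this series.

```python
# Hypothetical deals read off a trade curve: each cutoff leaves some fraud
# MISSED and raises some false ALARMs. None of these numbers are from the post.
deals = [
    {"cutoff": 0.9, "missed": 60, "alarm": 20},
    {"cutoff": 0.7, "missed": 35, "alarm": 150},
    {"cutoff": 0.5, "missed": 15, "alarm": 900},
    {"cutoff": 0.3, "missed": 5, "alarm": 4000},
]

# Hypothetical prices per mistake, set by the business, not by the machine.
COST_PER_MISSED = 500.0   # loss absorbed when one fraud slips through
COST_PER_ALARM = 5.0      # friction of blocking one legitimate purchase

def total_cost(deal):
    """Price of one deal: missed frauds plus blocked legitimate purchases."""
    return deal["missed"] * COST_PER_MISSED + deal["alarm"] * COST_PER_ALARM

best = min(deals, key=total_cost)

for d in deals:
    print("cutoff", d["cutoff"], "-> cost", total_cost(d))
print("cheapest deal: cutoff", best["cutoff"])
```

  Change the two prices and the cheapest cutoff moves, which is the sense in which the
  choice belongs to the business and not to the machine.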


  ## Skewed Piles and the Precision-Recall Curve

  The breast lump sheet leaned 63% well to 37% sick -- tilted, but you could still stand on
  it. Now imagine the floor pitched almost vertical. A real fraud sheet might run 99.9%
  honest and 0.1% crooked. That is a SKEWED pile, and on ground this steep the ordinary
  scores quietly start lying to you.

    skewed fraud sheet: 100000 rows
      99900  normal transactions   (class 0)
        100  fraud transactions    (class 1)

    fool machine: always call "normal"
      accuracy = 99900 / 100000 = 99.9%   <- looks extraordinary
      recall   = 0 / 100 = 0.0%           <- catches no fraud at all

  The ROC curve (trade curve) also has a problem with skewed piles. FPR divides ALARM by
  all truly-normal rows -- that denominator is 99900. Even if the machine raises 2000 false
  alarms, FPR = 2000/99900 = 0.02, which looks small and makes the ROC curve bulge
  optimistically. The 2000 blocked customers are invisible in that fraction.

    ROC curve is blind to class imbalance because:
    FPR = ALARM / (ALARM + CLEAR)   <- denominator is huge; ALARM looks tiny

    Precision-recall curve avoids this:
    precision = CAUGHT / (CAUGHT + ALARM)  <- no TN in sight; focuses on the sick pile
    recall    = CAUGHT / (CAUGHT + MISSED) <- also ignores TN
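
  Putting the fraud sheet through these formulas makes the contrast concrete. The sheet
  sizes and the 2000 false alarms come from the text above; the assumption that the
  machine also catches all 100 frauds is added here purely for illustration.

```python
n_normal, n_fraud = 99_900, 100          # the skewed sheet from above

# The fool machine: always says "normal".
fool_accuracy = n_normal / (n_normal + n_fraud)
fool_recall = 0 / n_fraud

# A machine that raises 2000 false alarms. Assume (hypothetically, the post
# does not say) that it also catches all 100 frauds.
alarm, caught, missed = 2000, 100, 0
clear = n_normal - alarm                 # honest rows correctly left alone

fpr = alarm / (alarm + clear)
precision = caught / (caught + alarm)
recall = caught / (caught + missed)

print(f"fool: accuracy {fool_accuracy:.3f}, recall {fool_recall:.3f}")
print(f"FPR {fpr:.3f}  precision {precision:.3f}  recall {recall:.3f}")
```

  Even with every fraud caught, fewer than one flagged transaction in twenty is real
  fraud. FPR hides that; precision puts it on the page.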

  The precision-recall curve sweeps the same cutoff as the trade curve but plots precision
  on the Y-axis and recall on the X-axis. A perfect machine hugs the top-right corner (high
  precision AND high recall). A flat line at precision = (fraction of sick in the sheet)
  is the baseline: what a coin-flipping machine earns at any recall. The always-shout-sick
  fool sits at its right-hand end, at recall 1.

    precision
     1 | . . . . . # <- perfect: top-right (catch all, zero false alarms)
       | ##
       |   ####      <- typical machine: precision sags as recall climbs
       | - - - - - - <- baseline (= fraction of sick in the sheet)
     0 +------------ recall
       0            1

    area under precision-recall curve -> average precision (AP)
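
  The sweep itself is simple enough to do by hand. Below is a minimal sketch on a
  hypothetical five-row sheet; it assumes no two rows share a score, and it sums the
  area step by step (gain in recall times precision at that step), which is one common
  way to define average precision.

```python
def pr_points(y_true, scores):
    """Sweep the cutoff from the highest score down; return (recall, precision)
    after each row is admitted. Assumes no tied scores, to keep the sketch short."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    caught = alarm = 0
    points = []
    for i in order:
        if y_true[i] == 1:
            caught += 1
        else:
            alarm += 1
        points.append((caught / total_pos, caught / (caught + alarm)))
    return points

def average_precision(points):
    """Step-wise area: sum of (gain in recall) x (precision at that step)."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# Hypothetical toy sheet: 2 sick rows (1) among 5.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]

points = pr_points(y_true, scores)
ap = average_precision(points)
baseline = sum(y_true) / len(y_true)   # fraction of sick in the sheet

print(points)
print("AP =", round(ap, 3), " baseline =", baseline)
```

  The last point of the sweep admits every row, so it lands at recall 1 with precision
  equal to the baseline: the always-shout-sick fool.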

  ** KEY: USE PRECISION-RECALL CURVES WHEN THE PILE IS SKEWED
     When one class is rare (fraud, disease in a healthy population, defects on a line),
     the precision-recall curve is more informative than the ROC curve because it does not
     use the true-negative count at all. The ROC curve's FPR is diluted by the massive well
     pile; the precision-recall curve ignores that pile entirely.

  Computing that curve, and its area (the average precision), is three lines -- waiting in
  the code at the end of the post.


  ## Scoring With Three or More Bins

  Cancer has subtypes. Handwritten digits have ten classes. Sentiment has three: positive,
  neutral, negative. Now there are multiple bins and one machine must sort into all of
  them. The four-box table expands into a K x K grid (one row and column per class). Each
  class gets its own precision, recall, and F1. The question becomes: how do you average
  those K scores into one number?


  IN HAND: a three-class scorecard -- type A F1 0.89, type B 0.67, type C 0.34 -- with
  class C holding only 10 rows against the others' 1000.  This section asks how to roll
  three F1 numbers into one, and shows the two answers disagree on purpose.

  ## Two Averaging Rules

    three-class example: cancer type A, B, C
    +---------+-----------+--------+------+----------------+
    | class   | precision | recall |  F1  | n (study rows) |
    +---------+-----------+--------+------+----------------+
    | type A  |   0.90    |  0.88  | 0.89 |      1000      |
    | type B  |   0.70    |  0.65  | 0.67 |      1000      |
    | type C  |   0.30    |  0.40  | 0.34 |        10      |
    +---------+-----------+--------+------+----------------+

  TREAT-ALL-CLASSES-EQUAL averaging (macro): compute precision, recall, F1 per class, then
  take the plain average across classes with equal weight.

    macro F1 = (0.89 + 0.67 + 0.34) / 3 = 0.63

    class C (only 10 rows) gets the same weight as class A (1000 rows)
    -> a low F1 on class C drags the average down strongly
    -> honest picture of per-class performance regardless of class size

  >> YOUR TURN
     Three classes score F1 of 0.80, 0.60, 0.40 (made-up).  Work the macro F1 -- the
     equal-weight average.

     check your slate:  macro F1 = (0.80 + 0.60 + 0.40) / 3 = 1.80 / 3 = 0.60.  Macro
     weights a 10-row class the same as a 1000-row class, so a weak rare class drags
     the score down on purpose -- that is the honest per-class picture.

  COUNT-EVERY-LABEL averaging (micro): pool all CAUGHT, ALARM, and MISSED counts across
  every class, then compute one precision, one recall, one F1 from those pooled totals.

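    micro pools raw counts:
    total CAUGHT across A+B+C = 880 + 650 + 4   = 1534
    total ALARM  across A+B+C = 97  + 278 + 9   = 384
    total MISSED across A+B+C = 120 + 350 + 6   = 476

    micro precision = 1534 / (1534 + 384)  = 0.80
    micro recall    = 1534 / (1534 + 476)  = 0.76
    micro F1        = 2 x 1534 / (2 x 1534 + 384 + 476) = 0.78

    classes A and B (1000 rows each) dominate; class C (10 rows) barely registers

    (These per-class counts are rounded, made-up figures, so ALARM and MISSED do not
    balance. On a real single-label sheet every ALARM for one class is a MISSED for
    another, the two totals match, and micro precision = micro recall.)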
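
  Both averaging rules fit in a few lines. A minimal sketch that takes the per-class F1
  values from the table and the raw counts from the micro tally as given (the
  dictionary names are illustrative):

```python
# Per-class F1 from the table, and the raw counts from the micro tally above.
f1 = {"A": 0.89, "B": 0.67, "C": 0.34}
counts = {                      # (CAUGHT, ALARM, MISSED) per class
    "A": (880, 97, 120),
    "B": (650, 278, 350),
    "C": (4, 9, 6),
}

# Macro: plain average of the per-class scores, every class weighted equally.
macro_f1 = sum(f1.values()) / len(f1)

# Micro: pool the counts first, then compute one score from the totals.
caught = sum(c[0] for c in counts.values())
alarm = sum(c[1] for c in counts.values())
missed = sum(c[2] for c in counts.values())

micro_precision = caught / (caught + alarm)
micro_recall = caught / (caught + missed)
micro_f1 = 2 * caught / (2 * caught + alarm + missed)

print(f"macro F1 = {macro_f1:.2f}")
print(f"micro precision = {micro_precision:.2f}, recall = {micro_recall:.2f}, "
      f"F1 = {micro_f1:.2f}")
```

  Delete class C from both dictionaries and the micro numbers barely move, while the
  macro F1 jumps: that is the difference between the two rules in one experiment.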

    Method                 How it weights classes              Use when
    --------------------   --------------------------------    --------------------------
    Macro (equal weight)   every class counts the same         all classes equally
                                                               important; or skewed pile
    Micro (count-weight)   large classes dominate              overall accuracy on all
                                                               labels matters

  ** KEY: MACRO WHEN CLASSES ARE EQUALLY IMPORTANT OR PILE IS SKEWED
     With a skewed pile, micro averaging is swamped by the majority class. If type C cancer
     (10 patients) scores F1=0.34 and the machine is otherwise excellent on the common
     types, micro averaging buries type C's failure. Macro averaging gives type C equal
     weight and forces the failure to show. That is the right signal when you care about
     catching every cancer type regardless of how rare it is.

  One call -- classification_report -- prints every per-class score plus the macro and
  weighted averages (code at the end). It hides one surprise worth flagging right here:

  >> NOTE: WHY classification_report SHOWS "accuracy", NOT "micro avg"
     On ordinary single-label data (each row has exactly one true class), every false
     positive for one class is simultaneously a false negative for another -- the pooled
     counts make micro precision, micro recall, and micro F1 all collapse to the same
     number: plain accuracy. So classification_report prints an "accuracy" row instead of a
     "micro avg" row. If you want the micro metric under its own name, compute it with
     precision_recall_fscore_support(..., average='micro'). (A true "micro avg" row
     appears only for multi-label problems, or when you pass classification_report a
     subset of the classes via labels=, where the collapse doesn't happen.)
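
  The collapse is easy to check by counting. A small sketch on a hypothetical ten-row,
  single-label sheet (the labels below are made up), pooling CAUGHT, ALARM and MISSED
  across classes exactly as micro averaging does:

```python
# Hypothetical single-label sheet: every row has exactly one true class.
y_true = ["A", "A", "A", "B", "B", "C", "C", "A", "B", "C"]
y_pred = ["A", "A", "B", "B", "C", "C", "A", "A", "B", "B"]

classes = sorted(set(y_true) | set(y_pred))
caught = alarm = missed = 0
for cls in classes:
    for t, p in zip(y_true, y_pred):
        if p == cls and t == cls:
            caught += 1          # true positive for this class
        elif p == cls and t != cls:
            alarm += 1           # false positive for this class
        elif p != cls and t == cls:
            missed += 1          # false negative for this class

micro_precision = caught / (caught + alarm)
micro_recall = caught / (caught + missed)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(alarm, missed)                      # the two pooled totals match
print(micro_precision, micro_recall, accuracy)
```

  Each wrong row is counted once as an ALARM (for the class predicted) and once as a
  MISSED (for the true class), so the pooled totals must agree.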


  ## Summary: Which Tool for Which Problem

    +------------------------------+------------------------------------------+
    | Situation                    | Reach for                                |
    +------------------------------+------------------------------------------+
    | tuning C, k, depth           | grid hunt + rotating folds (GridSearchCV |
    |                              |  with cv=5) scored on a relevant metric  |
    | columns on wild scales       | standard ruler (mean 0, spread 1)        |
    | columns, but outliers present| standard ruler beats pinch-to-fit        |
    | balanced bins                | ROC / AUC; accuracy OK baseline          |
    | skewed pile (fraud, rare dz) | precision-recall curve + average prec.   |
    | ALARM is the catastrophe     | optimise precision                       |
    | MISSED is the catastrophe    | optimise recall                          |
    | 3+ classes, all equally imp. | macro averaging                          |
    | 3+ classes, size = importance| micro (or weighted) averaging            |
    +------------------------------+------------------------------------------+


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  Three pieces, in the order the post met them: the grid hunt (done safely, scaler inside
  the pipeline), the precision-recall curve for skewed piles, and the many-bin scores.

  >> NEW TO PYTHON? Each named once:
       Pipeline([...])     -- chain steps so they are refit together, leak-free
       {'clf__C': [...]}   -- a dict of settings to try; 'clf__C' names a step's knob
       d['key']            -- look a value up in a dict by its name

  The grid hunt -- a Pipeline of scaler + classifier, handed to GridSearchCV. Pass RAW
  X_train, so the scaler is rebuilt inside each fold and nothing leaks:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # scaler lives INSIDE the pipeline so it is rebuilt on each fold's
    # training strip only -- never on the strip being scored
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('clf',    LogisticRegression()),
    ])
    param_grid = {'clf__C': [0.01, 0.1, 1, 10], 'clf__penalty': ['l2']}
    gs = GridSearchCV(pipe, param_grid, cv=5, scoring='recall')
    gs.fit(X_train, y_train)          # pass RAW X_train, not pre-scaled
    best_C = gs.best_params_['clf__C']

  The precision-recall curve for a skewed pile, and its area (average precision):

    from sklearn.metrics import precision_recall_curve, average_precision_score

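    # gs is the fitted grid hunt from above; its best pipeline scales raw X_test
    # with the ruler learned on the study pile, so no pre-scaled copy is needed
    y_proba = gs.best_estimator_.predict_proba(X_test)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
    ap = average_precision_score(y_test, y_proba)   # step-wise area under the curve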

  And the many-bin scores -- one report, plus the micro/macro numbers asked for by name:

    from sklearn.metrics import classification_report, precision_recall_fscore_support

    print(classification_report(y_test, y_pred,
          target_names=['type A', 'type B', 'type C']))
    # per-class precision/recall/F1, then: accuracy, macro avg, weighted avg

    # note: for ordinary single-label multiclass, micro precision = micro recall
    # = micro F1 = accuracy, so classification_report does NOT print a "micro avg"
    # row -- it prints "accuracy" instead. To see the micro number by name, ask for it:
    micro = precision_recall_fscore_support(y_test, y_pred, average='micro')
    macro = precision_recall_fscore_support(y_test, y_pred, average='macro')
    # micro[:3] == (accuracy, accuracy, accuracy);  macro[:3] == the equal-weight blend


  ## Where That Leaves Us

  And there the series rests. Look back at the road: we began by turning a sliding number
  into a yes-or-no verdict, learned why a machine that catches nobody can still boast 63%
  accuracy, slid the cutoff to see every bargain between lives and money at once, leashed
  an over-confident machine and met a calmer one that draws its wall in a single stroke,
  and finally walked out into the skewed, many-binned, outlier-ridden world where the
  textbook scores stop being enough.

  None of it required Greek you had not earned. That was the whole point. The labels --
  logistic regression, ROC, regularisation, LDA, cross-validation -- were never the
  understanding; they were handles screwed on at the end, once the thing was already built
  and working in your hands. Keep that order and there is very little in this field you
  cannot teach yourself.


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    setting I pick by hand                hyperparameter
    grid hunt                             grid search / GridSearchCV
    rotating folds                        k-fold cross-validation
    standard ruler                        standard scaling / StandardScaler
    pinch-to-fit ruler                    min-max scaling / MinMaxScaler
    skewed pile                           class imbalance / imbalanced dataset
    precision-catch curve                 precision-recall curve
    area under precision-recall curve     average precision (AP)
    treat-all-classes-equal averaging     macro averaging
    count-every-label averaging           micro averaging

----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 3 -- Sorting Into Bins):
    Part 1 -- The S-Curve, the Four-Box Table .
    Part 2 -- The Trade Curve .
    Part 3 -- Leash and Cloud .
    Part 4 (this post)

----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================