Classification Lab: Loss, Leash, Grid, and All the Terms

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================

  APPENDIX . CLASSIFICATION REFERENCE
  Loss, Leash, Grid, and All the Terms
  Posted: 2026-06-06 . Author: Rahul Rai . Tags: classification, log-loss, lda, grid-search
  ============================================================================================


  This is the appendix to Chapter 3 (Sorting Into Bins). Unlike the chapters -- which draw
  each idea by hand first and keep the code at the end -- this is a flip-to reference, so
  concept and code sit side by side for quick lookup. It gathers the key classification
  ideas, from the four-part chapter and the Coursera assignment, into ten tight sections.
  Read the chapter first; come back here when you need to look something up. Plain language
  first, standard labels at the very bottom.


  ## 1. Why Cross-Entropy, Not MSE

  The straight-stick rule (linear regression) minimises squared leftovers: if the guess is
  2.3 and the truth is 2.0, the leftover is (2.3-2.0)^2 = 0.09. Fine for sliding numbers.
  For a bin-sorter the truth is 0 or 1 and the guess is a chance between 0 and 1. Squared
  distance is the wrong ruler here: the worst it can ever charge is 1, so being certain
  and bone-backwards is never punished much harder than being plainly wrong.

  Cross-entropy (log-loss) punishes backwards confidence without limit:

      L = -(1/n) sum [ y * log(p) + (1-y) * log(1-p) ]

  where p is the machine's chance output, y is the true label (0 or 1), and log is the
  natural log.

    truly sick (y=1), machine says p=0.99  ->  -log(0.99) ~=  0.01   tiny
    truly sick (y=1), machine says p=0.01  ->  -log(0.01) ~=  4.6    huge
    truly well (y=0), machine says p=0.01  ->  -log(0.99) ~=  0.01   tiny
    truly well (y=0), machine says p=0.99  ->  -log(0.01) ~=  4.6    huge

    work the big one on the slate:  log(0.01) = -log(100), and log(100) = 2 x log(10)
    ~= 2 x 2.303 = 4.606, so -log(0.01) ~= 4.6.  The near-miss: -log(0.99) ~= 0.01,
    since shaving 1% off 1 barely moves the log.

  >> YOUR TURN
     A truly-sick lump (y = 1) is handed chance p = 0.5 by a hedging machine.  Work its
     cross-entropy fine.

     check your slate:  fine = -log(p) = -log(0.5) = log(2) ~= 0.693.  A coin-flip
     hedge costs about 0.69 -- more than the confident-right 0.01, far less than the
     confident-wrong 4.6.  Hedging is punished gently;  being sure and wrong is not.

  The machine is penalised hardest for being CONFIDENT and WRONG. MSE would score the last
  case as (0.99-0)^2 = 0.99 x 0.99 = 0.9801, call it 0.98 -- large but bounded by 1. Log-loss
  grows without limit as the chance given to the true label shrinks toward 0, so a confident
  wrong answer always draws a strong push back toward the truth.
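
  The slate work above can be checked in a few lines. This is a minimal sketch using only
  the standard library; the helper names log_loss_one and squared_error_one are made up
  for illustration and score a single case, not a whole pile.

```python
import math

def log_loss_one(y, p):
    """Cross-entropy fine for one case: y is the true label (0 or 1), p the chance of 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error_one(y, p):
    """Squared leftover for the same case; it can never exceed 1."""
    return (p - y) ** 2

# (true label, machine's chance): confident-right, hedging, confident-wrong
cases = [(1, 0.99), (1, 0.5), (1, 0.01), (0, 0.01), (0, 0.99)]
for y, p in cases:
    print(f"y={y}  p={p:<4}  log-loss={log_loss_one(y, p):.3f}  "
          f"squared={squared_error_one(y, p):.4f}")
```

  The squared column tops out just under 1 while the log-loss column climbs past 4.6 for
  the same confident-wrong cases.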

  ** KEY: USE LOG-LOSS FOR CLASSIFICATION, MSE FOR REGRESSION
     sklearn's LogisticRegression minimises log-loss (plus the leash of section 2) by
     default. MSE on a chance output in [0,1] can never charge more than 1, so confident
     mistakes are under-punished compared with log-loss.


  ## 2. The Leash: L2 Regularisation and the C Parameter

  The default LogisticRegression() in sklearn is NOT a free machine -- it carries a leash
  built in:

      penalty='l2',  C=1.0   (the sklearn default)

  The leash adds a term to the loss that punishes large dials:

      total cost = log-loss + (1/C) * sum(beta_j^2)

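  C is the inverse of the leash tightness. Small C = tight leash = dials squeezed toward
  zero. Large C = loose leash = dials can grow freely. (sklearn's own bookkeeping differs
  by constant factors: it minimises C * summed log-loss + (1/2) * sum(beta_j^2). The trade
  between fit and squeeze is the same.)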

    C = 0.01   very tight -- dials squeezed hard, machine forced simple
    C = 0.1    tight
    C = 1.0    medium (sklearn default)
    C = 10     loose
    C = 1000   nearly free -- almost no squeeze

  >> YOUR TURN
     A machine has just two dials, 2 and 3, under a tight leash C = 0.1 (made-up).
     Work the leash's slice of the cost, (1/C) * sum(beta^2).

     check your slate:  1/C = 1/0.1 = 10;  sum of squares = 2^2 + 3^2 = 4 + 9 = 13;
     leash cost = 10 x 13 = 130.  Loosen the leash to C = 10 and the same dials cost
     only (1/10) x 13 = 1.3 -- the tight leash makes big dials hurt a hundred times more.
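
  The same slate check as code, a small sketch of the simplified leash term written above
  (not sklearn's internal objective). The function name leash_cost is made up for this
  example.

```python
def leash_cost(dials, C):
    """Leash slice of the cost: (1/C) * sum(beta_j^2)."""
    return (1.0 / C) * sum(b * b for b in dials)

dials = [2.0, 3.0]                 # the two made-up dials from the exercise
tight = leash_cost(dials, C=0.1)   # tight leash
loose = leash_cost(dials, C=10)    # loose leash

print(f"tight: {tight:.1f}  loose: {loose:.1f}  ratio: {tight / loose:.0f}")
```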

  !! WARN: C IS THE INVERSE OF LAMBDA
     In textbooks regularisation strength is written as lambda, and the leash term is
     lambda * sum(beta^2). sklearn inverts it: C = 1/lambda. More C means LESS squeeze.
     Easy to flip the direction when tuning.

  To remove the leash entirely:

      LogisticRegression(penalty=None)

  Adding L2 helps when the training pile is small relative to the number of columns, or
  when columns are correlated and dials can swing wildly without a check.


  ## 3. A Second Sorter: Linear Discriminant Analysis

  Logistic regression learns a boundary by iteratively minimising log-loss (gradient
  descent or a relative of it; sklearn's default solver is lbfgs). LDA takes a different
  road: it assumes each class is a cloud of points drawn from a Gaussian with a shared
  shape (covariance), computes the mean centre of each cloud, and places the boundary
  where the two clouds are equally likely to have produced a new point.

    logistic regression  -- learns boundary from data; no cloud-shape assumption
    LDA                  -- assumes Gaussian clouds, equal shape; boundary from cloud means

  The boundary LDA draws is a LINEAR wall -- the same kind as logistic regression -- but
  computed analytically, not by gradient descent. The wall normal direction is:

      w = S_W^-1 * (mu1 - mu0)

  where S_W is the within-class scatter (pooled covariance) and mu0, mu1 are the two class
  mean vectors. The wall's offset is:

      w0 = -1/2 * (mu0+mu1)^T * S_W^-1 * (mu1-mu0) + log(pi1/pi0)

  A new point x is called sick (class 1) when w^T x + w0 > 0; the wall is where that sum
  equals zero.

  The log(pi1/pi0) term is the LOG-PRIOR: if the sick pile is smaller (pi1 < pi0), the
  wall shifts toward the sick cloud -- the machine is already sceptical about sick cases.

  >> NOTE: EQUAL PRIORS PUTS THE WALL AT THE MIDPOINT
     If pi0 = pi1 = 0.5, the log-prior term vanishes and the wall sits exactly halfway
     between the two cloud centres. sklearn's default uses EMPIRICAL priors (class
     frequencies in the training data). Wisconsin data is ~63% benign, so the wall shifts
     off the midpoint.

      from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

      lda     = LinearDiscriminantAnalysis()                   # empirical priors, shifted wall
      lda_mid = LinearDiscriminantAnalysis(priors=[0.5, 0.5])  # equal priors, midpoint wall
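
  To see the prior shift in numbers, here is a sketch of the wall formulas for the
  simplest case: ONE column, where S_W collapses to a single pooled variance. The cloud
  centres and variance are made up, and lda_wall_1d is a hypothetical helper, not an
  sklearn function.

```python
import math

def lda_wall_1d(mu0, mu1, pooled_var, pi0, pi1):
    """One-column LDA: return (w, w0, wall), where wall is the x with w*x + w0 = 0."""
    w = (mu1 - mu0) / pooled_var                        # S_W^-1 * (mu1 - mu0)
    w0 = -0.5 * (mu0 + mu1) * w + math.log(pi1 / pi0)   # midpoint term + log-prior
    wall = -w0 / w
    return w, w0, wall

# Made-up clouds: well centred at 2.0, sick centred at 6.0, shared variance 1.0
_, _, wall_equal = lda_wall_1d(2.0, 6.0, 1.0, pi0=0.5, pi1=0.5)
_, _, wall_skewed = lda_wall_1d(2.0, 6.0, 1.0, pi0=0.63, pi1=0.37)

print(f"equal priors:  wall at {wall_equal:.3f}")
print(f"63/37 priors:  wall at {wall_skewed:.3f}")
```

  With equal priors the wall lands on the midpoint, 4.0. With the sick pile at 37% it
  slides a little toward the sick centre at 6.0, so a point has to sit deeper in the sick
  cloud before it is called sick.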


  ## 4. Settings vs Dials: Hyperparameters

  Every machine has two kinds of knobs:

    dials (parameters)     -- set BY THE MACHINE during training to fit the data
                              e.g. beta_0 ... beta_30 in logistic regression

    settings (hyperparams) -- set BY YOU before training; the machine never touches them
                              e.g. C in LogisticRegression, n_neighbors in KNN

  A setting controls HOW the machine learns, not WHAT it learns. You cannot find the best
  setting by watching the training pile -- the machine can always overfit if you give it
  enough slack. You find it by watching a HELD-OUT pile (validation fold).

    dials    ->  machine finds them   (by gradient descent, or analytically)
    settings ->  YOU find them        (by grid search over a validation set)


  ## 5. The Grid Hunt: Finding the Best Setting

  You want the best C. Candidates: [0.01, 0.1, 1, 10]. Do NOT just try each on the
  training pile -- the machine memorised that pile and will look better the looser the
  leash. Instead ROTATE: split the training pile into k folds (k=5 is common), train on
  k-1 folds, score on the left-out fold, rotate, repeat k times, average the scores.

    first  pick a range of candidate settings
    then   for each candidate, score it with k-fold cross-validation
    so     pick the candidate with the best mean score

  This is grid search. "Grid" because you can tune multiple settings at once -- a 2-D grid
  of C values crossed with penalty types, for example.

  !! WARN: SCALE INSIDE THE FOLD, NOT BEFORE
     If you StandardScale the whole training pile first, then pass it to GridSearchCV, each
     validation fold has already been seen by the scaler: its values fed into the mean and
     spread used to scale the training folds. That is leakage from the held-out data into
     training. Put the scaler INSIDE a Pipeline:

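          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import GridSearchCV
          from sklearn.pipeline import Pipeline
          from sklearn.preprocessing import StandardScaler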

          pipe = Pipeline([
              ('scaler', StandardScaler()),
              ('clf',    LogisticRegression()),
          ])
          param_grid = {'clf__C': [0.01, 0.1, 1, 10]}
          gs = GridSearchCV(pipe, param_grid, cv=5, scoring='recall')
          gs.fit(X_train, y_train)    # pass RAW X_train here, not pre-scaled

  Three things you must specify: (1) the range and spacing of candidates, (2) how many
  folds to rotate (k), (3) which score to optimise (recall? F1? AUC?).
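
  What the rotation does under the hood can be sketched by hand. This is an illustration
  only, assuming a single made-up column and a simple interleaved split with no shuffling;
  k_fold_indices is a hypothetical helper, and GridSearchCV's own splitting differs in
  detail. The point is where the scaler's mean and spread come from.

```python
import statistics

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each index lands in exactly one validation fold."""
    folds = [list(range(i, n, k)) for i in range(k)]   # interleaved split, no shuffling
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, val_idx

column = [3.0, 7.0, 1.0, 9.0, 4.0, 6.0, 2.0, 8.0, 5.0, 10.0]   # made-up single column
seen = []
for train_idx, val_idx in k_fold_indices(len(column), k=5):
    train = [column[j] for j in train_idx]
    # Scaler statistics come from the training folds ONLY (divide-by-n spread).
    mean, spread = statistics.mean(train), statistics.pstdev(train)
    # The held-out fold is transformed with those statistics, never used to compute them.
    val_scaled = [(column[j] - mean) / spread for j in val_idx]
    seen.extend(val_idx)
    print(f"held out {val_idx}: mean={mean:.2f} spread={spread:.2f} scaled={val_scaled}")
```

  Each rotation prints a slightly different mean and spread, because each one is fitted
  without the fold it is about to score.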


  ## 6. Pinch-to-Fit: Min-Max Scaling

  StandardScaler shifts each column to mean 0, spread 1. Min-max scaling squeezes each
  column into the range [0, 1] instead:

      x_scaled = (x - min(x)) / (max(x) - min(x))

  Every column is "pinched" to fit between 0 and 1, keeping the distribution shape but
  compressing the range.

    raw:      [ 100,  200,  400,  800 ]
    min=100, max=800, range=700
    scaled:   [ 0.0,  0.14, 0.43, 1.0 ]

  !! WARN: ONE OUTLIER SQUASHES EVERYTHING ELSE
     If a column has one value of 10000 and everything else sits between 1 and 100, the
     min-max range is ~9999. After scaling, the bulk of the data squeezes into 0 to 0.01 --
     a thin sliver. Standard scaling copes better, though the outlier still inflates its
     mean and spread.

    use STANDARD (mean 0, spread 1)  when: roughly bell-shaped, no extreme outliers
    use MIN-MAX  (0 to 1)            when: bounded range required, known clean limits,
                                           or feeding a neural network / image model
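
  Both the worked row and the outlier warning can be reproduced with a short sketch.
  min_max_scale is a made-up helper that assumes the column has at least two distinct
  values (otherwise the range is zero); the outlier column is invented for illustration.

```python
def min_max_scale(values):
    """Pinch a column into [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([100, 200, 400, 800])
print([round(v, 2) for v in scaled])   # [0.0, 0.14, 0.43, 1.0]

# One outlier at 10000 squashes everything else into a sliver near 0.
with_outlier = min_max_scale([1, 25, 50, 100, 10000])
print([round(v, 4) for v in with_outlier])
```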


  ## 7. The Lives-vs-Money Trade: Precision and Recall in Business

  Precision and recall pull in opposite directions. The business context decides which to
  favour.

    recall    = CAUGHT / (CAUGHT + MISSED)   -- share of truly sick cases found
    precision = CAUGHT / (CAUGHT + ALARM)    -- share of sick shouts that were real

    scenario A: cancer screening
      MISSED = cancer sent home untreated = catastrophic
      ALARM  = extra biopsy = costly but survivable
      -> maximise RECALL, tolerate lower precision

    scenario B: spam filter
      MISSED = spam in inbox = annoying
      ALARM  = good email blocked = catastrophic (missed invoice, job offer)
      -> maximise PRECISION, tolerate spam slipping through

    scenario C: fraud detection
      MISSED = fraud unblocked = costly to the bank
      ALARM  = good transaction blocked = customer complaint
      -> tune recall first, set a floor on precision

  Raising the cutoff (more sure before shouting sick) raises precision and drops recall.
  Lowering the cutoff raises recall and drops precision. F1 is the harmonic mean of the
  two -- it collapses toward zero if EITHER one is near zero.

  >> NOTE: USE F-BETA TO TILT THE TRADE
     F1 weights precision and recall equally. F-beta with beta > 1 weights recall more
     heavily. F2 (beta=2) treats recall as twice as important as precision; in the count
     formula each miss then weighs beta^2 = 4 times as much as a false alarm.
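
  A sketch of the three scores computed straight from the counts. The counts are made up
  and the function names are hypothetical. F-beta is written in its count form,
  (1 + beta^2) * CAUGHT / ((1 + beta^2) * CAUGHT + beta^2 * MISSED + ALARM), which shows
  that misses are weighted by beta squared.

```python
def precision_recall(caught, alarm, missed):
    """Precision and recall from the three counts that involve a sick shout or a sick case."""
    precision = caught / (caught + alarm)
    recall = caught / (caught + missed)
    return precision, recall

def f_beta(caught, alarm, missed, beta=1.0):
    """F-beta in count form; beta > 1 leans toward recall, beta < 1 toward precision."""
    b2 = beta * beta
    return (1 + b2) * caught / ((1 + b2) * caught + b2 * missed + alarm)

# Made-up counts: 40 caught, 10 false alarms, 20 missed
p, r = precision_recall(40, 10, 20)
f1 = f_beta(40, 10, 20, beta=1)
f2 = f_beta(40, 10, 20, beta=2)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} F2={f2:.3f}")
```

  Here recall (0.667) is the weaker of the two, so F2, which leans on recall, comes out
  below F1.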


  ## 8. The Trade Curve Revisited: When AUC Misleads

  The trade curve (ROC curve) sweeps the cutoff from 1 to 0 and plots (FPR, TPR) at each
  step. AUC is the area under that curve: 1.0 = perfect, 0.5 = coin flip.

  AUC is cutoff-independent and answers "how cleanly do the two groups separate?" It is the
  right score to COMPARE two machines before deciding where to set the cutoff.

  But AUC has a blind spot: it uses FPR = ALARM / all truly well as its x-axis. When the
  well pile is huge (a rare-disease screen: 1 sick per 100 well), FPR has a large
  denominator and stays small even when there are many alarms. The ROC curve looks
  optimistic. Precision -- CAUGHT / (CAUGHT + ALARM) -- tells a different story: most
  "sick" shouts are wrong.

  For SKEWED PILES, the Precision-Recall curve tells the truth:

    x-axis = RECALL     (how many sick cases found)
    y-axis = PRECISION  (of the sick shouts, how many were real)

  The PR curve ignores the true-negative count entirely, so it cannot be flattered by a
  large well pile. High area under the PR curve means: the machine finds sick cases AND its
  sick shouts are reliable.

  ** KEY: USE ROC/AUC FOR BALANCED CLASSES; USE PR CURVE FOR SKEWED CLASSES
     A machine with AUC 0.95 can have precision 0.10 on a 1:100 sick-to-well pile.
     PR curves surface this; ROC curves hide it.
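
  The blind spot is plain arithmetic. The sketch below fixes one point on the ROC curve
  (TPR 0.90, FPR 0.10, chosen for illustration) and asks what precision that same point
  gives on a balanced pile versus a 1:100 pile. precision_from_rates is a made-up helper.

```python
def precision_from_rates(tpr, fpr, n_sick, n_well):
    """Precision implied by one ROC point once the pile sizes are known."""
    caught = tpr * n_sick    # truly sick cases found
    alarm = fpr * n_well     # truly well cases wrongly shouted sick
    return caught / (caught + alarm)

balanced = precision_from_rates(0.90, 0.10, n_sick=1000, n_well=1000)
skewed = precision_from_rates(0.90, 0.10, n_sick=100, n_well=10000)
print(f"balanced pile: precision {balanced:.3f}")
print(f"1:100 pile:    precision {skewed:.3f}")
```

  The ROC point is identical in both runs, yet on the skewed pile fewer than one sick
  shout in ten is real.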


  ## 9. Skewed Piles: What Goes Wrong and How to Fix It

  Skewed (imbalanced) classes are the norm in real classification tasks:
  fraud (< 1% positive), rare disease (a few % positive), churn (10-20% positive).

  When the sick pile is tiny, accuracy flatters the lazy machine:

    pile: 95 well, 5 sick
    machine: shout well for everything
    accuracy  = 95/100 = 0.95   <- looks great
    recall    = 0/5    = 0.00   <- catches nobody
    precision = N/A             <- never shouted sick
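
  The lazy machine, counted out in plain Python as a quick check of the numbers above.
  The label lists are built to match the 95/5 pile; nothing here is fitted.

```python
y_true = [0] * 95 + [1] * 5    # 0 = well, 1 = sick
y_pred = [0] * 100             # the lazy machine shouts well every time

pairs = list(zip(y_true, y_pred))
caught = sum(1 for t, p in pairs if t == 1 and p == 1)
missed = sum(1 for t, p in pairs if t == 1 and p == 0)
alarm = sum(1 for t, p in pairs if t == 0 and p == 1)
correct = sum(1 for t, p in pairs if t == p)

accuracy = correct / len(y_true)
recall = caught / (caught + missed)
# Precision is undefined when the machine never shouts sick, so use None as a marker.
precision = caught / (caught + alarm) if (caught + alarm) > 0 else None

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision}")
```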

  Fixes to try when the pile is skewed:

    1. report recall and precision instead of accuracy
    2. use the PR curve instead of the ROC curve
    3. tune the cutoff: lower it to increase recall at the cost of precision
    4. oversample the rare class (SMOTE), undersample the common class, or
       use class_weight='balanced' in sklearn to upweight the rare class

      LogisticRegression(class_weight='balanced')
      -> internally scales the log-loss contribution of each class by
         n_samples / (n_classes * n_samples_per_class)

  The machine sees each rare-class mistake as proportionally more costly, so it stops
  defaulting to the common class.
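
  The balanced-weight formula above, applied to the 95/5 pile. This is a sketch of the
  formula only; balanced_weights is a made-up helper, not a call into sklearn.

```python
def balanced_weights(counts):
    """n_samples / (n_classes * n_samples_per_class) for each class."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

counts = {"well": 95, "sick": 5}
weights = balanced_weights(counts)
for label, w in weights.items():
    # weight * count is the class's total pull on the loss
    print(f"{label}: weight {w:.3f}, total pull {w * counts[label]:.1f}")
```

  Each sick case is weighted 10.0 against roughly 0.53 for a well case, so the two classes
  end up pulling on the loss with equal total weight (50 each).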


  ## 10. Counting Across Classes: Macro, Micro, Weighted

  When there are more than two bins -- say, tumour types A, B, C -- you get one precision
  and one recall per class. Three ways to collapse to a single number:

    MACRO:    compute the metric per class, average with EQUAL WEIGHT
              -> every class counts the same, rare and common alike

    MICRO:    pool all CAUGHT, ALARM, MISSED across classes first, THEN compute
              -> dominated by the biggest class; equals accuracy for precision/recall/F1

    WEIGHTED: average the per-class metrics, weighted by each class's count
              -> common classes count more, rare classes less

  Example with three classes, sizes 80, 15, 5:

              class A   class B   class C
    recall:     0.90      0.60      0.30
    count:        80        15         5

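    macro    = (0.90 + 0.60 + 0.30) / 3             = 0.60
    weighted = (0.90*80 + 0.60*15 + 0.30*5) / 100  = 0.825
    micro    = (all TP) / (all TP + all FN)         = 82.5 / 100 = 0.825  (dominated by class A)

  (The recalls are made up, which is why the pooled TP count is not a whole number.)

  macro treats a 5-sample class the same as an 80-sample class. Use it when all classes
  matter equally. Use weighted when you care more about getting the big classes right.
  Use micro when total correct counts are what matters. For RECALL the last two always
  coincide: recall times count is each class's true positives, so both reduce to total
  correct over total cases (accuracy). They part ways for precision and F1.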
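
  The three averages for the example table, as a short sketch. Pooled true positives are
  rebuilt as recall times count, which is why micro recall comes out equal to weighted
  recall here. The recalls are the made-up ones from the table.

```python
recalls = {"A": 0.90, "B": 0.60, "C": 0.30}
counts = {"A": 80, "B": 15, "C": 5}
total = sum(counts.values())

# MACRO: every class counts the same
macro = sum(recalls.values()) / len(recalls)

# WEIGHTED: per-class recall weighted by class size
weighted = sum(recalls[c] * counts[c] for c in recalls) / total

# MICRO: pool the true positives first, then divide by all actual cases
true_pos = {c: recalls[c] * counts[c] for c in recalls}
micro = sum(true_pos.values()) / total

print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f}")
```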

  !! WARN: classification_report DOES NOT PRINT "micro avg"
     For single-label classification, micro precision/recall/F1 all equal accuracy, so
     sklearn prints "accuracy" instead of a "micro avg" row. To get micro explicitly:

          from sklearn.metrics import precision_recall_fscore_support

          micro = precision_recall_fscore_support(y_test, y_pred, average='micro')
          macro = precision_recall_fscore_support(y_test, y_pred, average='macro')


  ## The Labels, Last

    Plain term used above                  Standard label
    ------------------------------------   -------------------------------------------
    cross-entropy fine                     binary cross-entropy / log-loss
    dial squeeze                           L2 regularisation / ridge penalty
    leash tightness (inverse)              C (regularisation parameter in sklearn)
    lambda                                 regularisation strength (C = 1/lambda)
    linear separator from cloud means      Linear Discriminant Analysis (LDA)
    log(pi1/pi0)                           log-prior ratio / class-balance offset
    within-class scatter                   S_W (pooled within-class covariance matrix)
    setting not learned by the machine     hyperparameter
    dial learned by the machine            parameter / coefficient / weight
    grid hunt over validation folds        grid search + cross-validation / GridSearchCV
    rotating folds                         k-fold cross-validation (cv=k)
    pinch-to-fit scaling                   min-max normalisation / MinMaxScaler
    recall matters more                    high-recall regime; use F-beta (beta > 1)
    precision matters more                 high-precision regime; use F-beta (beta < 1)
    curve of precision vs recall           Precision-Recall (PR) curve
    area under PR curve                    AUCPR / average precision score
    equal weight per class                 macro average
    count-weighted per class               weighted average
    pool all counts first                  micro average
    upweight the rare class                class_weight='balanced'

----------------------------------------------------------------------------------------------
  SEE ALSO (Chapter 3 -- Sorting Into Bins):
    Part 1 -- The S-Curve, the Four-Box Table .
    Part 2 -- The Trade Curve .
    Part 3 -- Leash and Cloud .
    Part 4 -- Picking Settings, Skewed Piles

----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers
==============================================================================================