==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
APPENDIX . CLASSIFICATION REFERENCE
Loss, Leash, Grid, and All the Terms
Posted: 2026-06-06 . Author: Rahul Rai . Tags: classification, log-loss, lda, grid-search
============================================================================================
This is the appendix to Chapter 3 (Sorting Into Bins). Unlike the chapters -- which draw
each idea by hand first and keep the code at the end -- this is a flip-to reference, so
concept and code sit side by side for quick lookup. It gathers the key classification
ideas, from the four-part chapter and the Coursera assignment, into ten tight sections.
Read the chapter first; come back here when you need to look something up. Plain language
first, standard labels at the very bottom.
## 1. Why Cross-Entropy, Not MSE
The straight-stick rule (linear regression) minimises squared leftovers: if the guess is
2.3 and the truth is 2.0, the leftover is (2.3-2.0)^2 = 0.09. Fine for sliding numbers.
For a bin-sorter the truth is 0 or 1 and the guess is a chance between 0 and 1. Squared
distance is the wrong ruler here -- its fine can never exceed 1, so a bone-backwards
answer costs barely more than a merely poor one.
Cross-entropy (log-loss) lets the fine for backwards confidence grow without limit:
L = -(1/n) sum [ y * log(p) + (1-y) * log(1-p) ]
where p is the machine's chance output and y is the true label (0 or 1).
truly sick (y=1), machine says p=0.99 -> -log(0.99) ~= 0.01 tiny
truly sick (y=1), machine says p=0.01 -> -log(0.01) ~= 4.6 huge
truly well (y=0), machine says p=0.01 -> -log(0.99) ~= 0.01 tiny
truly well (y=0), machine says p=0.99 -> -log(0.01) ~= 4.6 huge
work the big one on the slate: log(0.01) = -log(100), and log(100) = 2 x log(10)
~= 2 x 2.303 = 4.606, so -log(0.01) ~= 4.6. The near-miss: -log(0.99) ~= 0.01,
since shaving 1% off 1 barely moves the log.
>> YOUR TURN
A truly-sick lump (y = 1) is handed chance p = 0.5 by a hedging machine. Work its
cross-entropy fine.
check your slate: fine = -log(p) = -log(0.5) = log(2) ~= 0.693. A coin-flip
hedge costs about 0.69 -- more than the confident-right 0.01, far less than the
confident-wrong 4.6. Hedging is punished gently; being sure and wrong is not.
The machine is penalised hardest for being CONFIDENT and WRONG. MSE would score the
confident-wrong case (y=0, p=0.99) as (0.99-0)^2 = 0.99 x 0.99 = 0.9801, call it 0.98 --
large but bounded by 1. Log-loss grows without bound as p approaches the wrong extreme,
so the push away from a confident wrong answer never fades.
** KEY: USE LOG-LOSS FOR CLASSIFICATION, MSE FOR REGRESSION
sklearn's LogisticRegression minimises log-loss by default. MSE's fine is capped at 1 on
a chance output in [0,1], so confident wrong answers are under-punished, and through the
S-curve its gradient fades just when the machine is most wrong.
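The fines worked on the slate above can be checked in a few lines. This is a minimal sketch in plain Python; the helper names log_loss_one and squared_error_one are made up for this illustration and are not sklearn functions.

```python
import math

def log_loss_one(y, p):
    """Cross-entropy fine for one case: y is the true label (0 or 1), p the predicted chance of 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error_one(y, p):
    """Squared-distance fine for the same single case."""
    return (p - y) ** 2

# (true label, predicted chance of sick): the four table rows plus the coin-flip hedge
cases = [(1, 0.99), (1, 0.01), (0, 0.01), (0, 0.99), (1, 0.5)]
for y, p in cases:
    print(f"y={y} p={p:<4}  log-loss={log_loss_one(y, p):.3f}  squared={squared_error_one(y, p):.4f}")
```

The squared column never passes 1, while the log-loss column keeps climbing as the guess gets more confidently wrong.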
## 2. The Leash: L2 Regularisation and the C Parameter
The default LogisticRegression() in sklearn is NOT a free machine -- it carries a leash
built in:
penalty='l2', C=1.0 (the sklearn default)
The leash adds a term to the loss that punishes large dials:
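total cost = log-loss + (1/C) * sum(beta_j^2)
This is the shape of the cost, not sklearn's exact scaling: constant factors differ, but
the role of C is the same.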
C is the inverse of the leash tightness. Small C = tight leash = dials squeezed toward
zero. Large C = loose leash = dials can grow freely.
C = 0.01 very tight -- dials squeezed hard, machine forced simple
C = 0.1 tight
C = 1.0 medium (sklearn default)
C = 10 loose
C = 1000 nearly free -- almost no squeeze
>> YOUR TURN
A machine has just two dials, 2 and 3, under a tight leash C = 0.1 (made-up).
Work the leash's slice of the cost, (1/C) * sum(beta^2).
check your slate: 1/C = 1/0.1 = 10; sum of squares = 2^2 + 3^2 = 4 + 9 = 13;
leash cost = 10 x 13 = 130. Loosen the leash to C = 10 and the same dials cost
only (1/10) x 13 = 1.3 -- the tight leash makes big dials hurt a hundred times more.
!! WARN: C IS THE INVERSE OF LAMBDA
In textbooks regularisation strength is written as lambda, and the leash term is
lambda * sum(beta^2). sklearn inverts it: C = 1/lambda. More C means LESS squeeze.
Easy to flip the direction when tuning.
To remove the leash entirely:
LogisticRegression(penalty=None)
Adding L2 helps when the training pile is small relative to the number of columns, or
when columns are correlated and dials can swing wildly without a check.
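To see the leash at work, the sketch below first repeats the slate arithmetic with a made-up helper (leash_cost, in the simplified form used in this section), then fits sklearn's LogisticRegression at three values of C on a synthetic pile and measures how long the dial vector gets. The pile comes from make_classification with a fixed seed; it is an assumption for illustration, not the chapter's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def leash_cost(dials, C):
    """The leash's slice of the cost in the simplified form above: (1/C) * sum(beta^2)."""
    return sum(b ** 2 for b in dials) / C

print("tight leash:", leash_cost([2, 3], C=0.1))
print("loose leash:", leash_cost([2, 3], C=10))

# Made-up pile: 200 rows, 10 columns, fixed seed so the run repeats.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

norms = []
for C in [0.01, 0.1, 10.0]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms.append(float(np.linalg.norm(model.coef_)))   # length of the dial vector
    print(f"C={C:<5} dial length={norms[-1]:.3f}")
```

The dial length should grow as C grows: more C, less squeeze.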
## 3. A Second Sorter: Linear Discriminant Analysis
Logistic regression learns a boundary by gradient descent on log-loss. LDA takes a
different road: it assumes each class is a cloud of points drawn from a Gaussian with a
shared shape (covariance), computes the mean centre of each cloud, and places the boundary
where the two clouds are equally likely to have produced a new point.
logistic regression -- learns boundary from data; no cloud-shape assumption
LDA -- assumes Gaussian clouds, equal shape; boundary from cloud means
The boundary LDA draws is a LINEAR wall -- the same kind as logistic regression -- but
computed analytically, not by gradient descent. The wall normal direction is:
w = S_W^-1 * (mu1 - mu0)
where S_W is the within-class scatter (pooled covariance) and mu0, mu1 are the two class
mean vectors. The wall is placed at:
w0 = -1/2 * (mu0+mu1)^T * S_W^-1 * (mu1-mu0) + log(pi1/pi0)
The log(pi1/pi0) term is the LOG-PRIOR: if the sick pile is smaller (pi1 < pi0), the
wall shifts toward the sick cloud -- the machine is already sceptical about sick cases.
>> NOTE: EQUAL PRIORS PUTS THE WALL AT THE MIDPOINT
If pi0 = pi1 = 0.5, the log-prior term vanishes and the wall sits exactly halfway
between the two cloud centres. sklearn's default uses EMPIRICAL priors (class
frequencies in the training data). Wisconsin data is ~63% benign, so the wall shifts
off the midpoint.
lda = LinearDiscriminantAnalysis() # empirical priors, shifted wall
lda_mid = LinearDiscriminantAnalysis(priors=[0.5,0.5]) # equal priors, midpoint wall
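The two wall formulas can be worked by hand in numpy. The sketch below uses two made-up Gaussian clouds and a hypothetical helper lda_wall; it follows the formulas in this section directly and is not a copy of sklearn's internals, whose covariance estimate may be normalised differently.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two made-up clouds sharing one shape (covariance); class 0 = well, class 1 = sick.
shared_cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], shared_cov, size=300)
X1 = rng.multivariate_normal([2.0, 1.0], shared_cov, size=300)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Pooled within-class covariance: scatter around each class's own mean, divided by n - 2.
scatter = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
S_W = scatter / (len(X0) + len(X1) - 2)

def lda_wall(pi0, pi1):
    """Return (w, w0) for the wall w.x + w0 = 0; points with w.x + w0 > 0 go to class 1."""
    w = np.linalg.solve(S_W, mu1 - mu0)               # S_W^-1 (mu1 - mu0)
    w0 = -0.5 * (mu0 + mu1) @ w + np.log(pi1 / pi0)   # midpoint term plus log-prior
    return w, w0

w, w0_equal = lda_wall(0.5, 0.5)
_, w0_skewed = lda_wall(0.63, 0.37)   # sick pile smaller, as in the ~63% benign example

midpoint = (mu0 + mu1) / 2
print("midpoint score, equal priors: ", w @ midpoint + w0_equal)
print("midpoint score, skewed priors:", w @ midpoint + w0_skewed)
```

With equal priors the midpoint scores zero, so it sits on the wall. With the sick pile smaller, the midpoint scores negative and lands on the well side: the wall has moved toward the sick cloud.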
## 4. Settings vs Dials: Hyperparameters
Every machine has two kinds of knobs:
dials (parameters) -- set BY THE MACHINE during training to fit the data
e.g. beta_0 ... beta_30 in logistic regression
settings (hyperparams) -- set BY YOU before training; the machine never touches them
e.g. C in LogisticRegression, n_neighbors in KNN
A setting controls HOW the machine learns, not WHAT it learns. You cannot find the best
setting by watching the training pile -- the machine can always overfit if you give it
enough slack. You find it by watching a HELD-OUT pile (validation fold).
dials -> machine finds them (by gradient descent, or analytically)
settings -> YOU find them (by grid search over a validation set)
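In sklearn the split shows up in where each knob lives. A small sketch, assuming a synthetic pile from make_classification: the setting C is readable before training through get_params(), while the dials (coef_, intercept_) only exist after fit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up pile: 200 rows, 5 columns, fixed seed.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(C=0.1)   # setting: chosen by you, before training
print("setting C before fit:", model.get_params()["C"])
print("has dials before fit?", hasattr(model, "coef_"))

model.fit(X, y)                     # dials: found by the machine during training
print("dials after fit:", model.coef_.ravel(), model.intercept_)
print("setting C after fit: ", model.get_params()["C"])   # training never touches it
```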
## 5. The Grid Hunt: Finding the Best Setting
You want the best C. Candidates: [0.01, 0.1, 1, 10]. Do NOT just try each on the
training pile -- the machine memorised that pile and will look better the looser the
leash. Instead ROTATE: split the training pile into k folds (k=5 is common), train on
k-1 folds, score on the left-out fold, rotate, repeat k times, average the scores.
first pick a range of candidate settings
then for each candidate, score it with k-fold cross-validation
finally pick the candidate with the best mean score
This is grid search. "Grid" because you can tune multiple settings at once -- a 2-D grid
of C values crossed with penalty types, for example.
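The three steps can be written out as a plain loop before handing the job to a helper. This is a hand-rolled sketch on a synthetic pile (make_classification, fixed seed), scored on recall as one possible choice; the names candidates, mean_scores, and best_C are made up here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training pile with a fixed seed.
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = [0.01, 0.1, 1, 10]
mean_scores = {}
for C in candidates:
    # Scaler and classifier travel together, so each rotation fits the scaler
    # only on the k-1 folds it trains on.
    machine = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    fold_scores = cross_val_score(machine, X_train, y_train, cv=5, scoring="recall")
    mean_scores[C] = fold_scores.mean()
    print(f"C={C:<5} mean recall over 5 folds = {mean_scores[C]:.3f}")

best_C = max(mean_scores, key=mean_scores.get)
print("best C:", best_C)
```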
!! WARN: SCALE INSIDE THE FOLD, NOT BEFORE
If you StandardScale the whole training pile first, then pass it to GridSearchCV, each
validation fold was shaped by a scaler that already saw it. The mean and spread were
computed partly from the held-out rows, so information about them leaked into training
and the fold scores can come out too rosy. Put the scaler INSIDE a Pipeline:
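from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ('scaler', StandardScaler()),   # refit on each rotation's training folds only
    ('clf', LogisticRegression()),
])
param_grid = {'clf__C': [0.01, 0.1, 1, 10]}   # 'clf__' routes C to the 'clf' step
gs = GridSearchCV(pipe, param_grid, cv=5, scoring='recall')
gs.fit(X_train, y_train)   # pass RAW X_train here, not pre-scaled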
Three things you must specify: (1) the range and spacing of candidates, (2) how many
folds to rotate (k), (3) which score to optimise (recall? F1? AUC?).
## 6. Pinch-to-Fit: Min-Max Scaling
StandardScaler shifts each column to mean 0, spread 1. Min-max scaling squeezes each
column into the range [0, 1] instead:
x_scaled = (x - min(x)) / (max(x) - min(x))
Every column is "pinched" to fit between 0 and 1, keeping the distribution shape but
compressing the range.
raw: [ 100, 200, 400, 800 ]
min=100, max=800, range=700
scaled: [ 0.0, 0.14, 0.43, 1.0 ]
!! WARN: ONE OUTLIER SQUASHES EVERYTHING ELSE
If a column has one value of 10000 and everything else sits between 1 and 100, the
min-max range is ~9999. After scaling, the bulk of the data squeezes into 0 to 0.01 --
a thin sliver. Standard scaling is usually less distorted, but the outlier still inflates
its spread, so it is not immune either.
use STANDARD (mean 0, spread 1) when: roughly bell-shaped, no extreme outliers
use MIN-MAX (0 to 1) when: bounded range required, known clean limits,
or feeding a neural network / image model
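Both the worked column and the outlier warning can be checked with numpy. A minimal sketch: the first part applies the formula by hand and compares it with sklearn's MinMaxScaler; the second uses a made-up column of the values 1 to 100 plus a single 10000.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

raw = np.array([100.0, 200.0, 400.0, 800.0])
by_hand = (raw - raw.min()) / (raw.max() - raw.min())
by_sklearn = MinMaxScaler().fit_transform(raw.reshape(-1, 1)).ravel()
print(np.round(by_hand, 2), np.round(by_sklearn, 2))

# One made-up outlier: values 1..100 plus a single 10000.
with_outlier = np.append(np.arange(1.0, 101.0), 10000.0)
squeezed = (with_outlier - with_outlier.min()) / (with_outlier.max() - with_outlier.min())
bulk_max = squeezed[:-1].max()   # largest scaled value among the non-outlier rows
print("bulk of the data ends at:", bulk_max)
```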
## 7. The Lives-vs-Money Trade: Precision and Recall in Business
Precision and recall pull in opposite directions. The business context decides which to
favour.
recall = CAUGHT / (CAUGHT + MISSED) -- share of truly sick cases found
precision = CAUGHT / (CAUGHT + ALARM) -- share of sick shouts that were real
scenario A: cancer screening
MISSED = cancer sent home untreated = catastrophic
ALARM = extra biopsy = costly but survivable
-> maximise RECALL, tolerate lower precision
scenario B: spam filter
MISSED = spam in inbox = annoying
ALARM = good email blocked = catastrophic (missed invoice, job offer)
-> maximise PRECISION, tolerate spam slipping through
scenario C: fraud detection
MISSED = fraud unblocked = costly to the bank
ALARM = good transaction blocked = customer complaint
-> tune recall first, set a floor on precision
Raising the cutoff (more sure before shouting sick) raises precision and drops recall.
Lowering the cutoff raises recall and drops precision. F1 is the harmonic mean of the
two -- it collapses toward zero if EITHER one is near zero.
>> NOTE: USE F-BETA TO TILT THE TRADE
F1 weights precision and recall equally. F-beta with beta > 1 weights recall more
heavily. F2 (beta=2) treats recall as twice as important as precision; written in
counts, a miss weighs beta^2 = 4 times as much as a false alarm.
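Here is the trade worked from raw counts, as a sketch on ten made-up labels. The helpers f_beta and f_beta_counts are hypothetical names for the two standard ways of writing the score; sklearn's fbeta_score is printed alongside as a cross-check.

```python
from sklearn.metrics import fbeta_score

# Made-up labels: 1 = sick, 0 = well.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))   # true positives
missed = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))   # false negatives
alarm = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))    # false positives

recall = caught / (caught + missed)
precision = caught / (caught + alarm)

def f_beta(beta):
    """F-beta from precision and recall; beta > 1 tilts toward recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def f_beta_counts(beta):
    """The same score written in counts: a miss is weighted by beta^2, an alarm by 1."""
    b2 = beta ** 2
    return (1 + b2) * caught / ((1 + b2) * caught + b2 * missed + alarm)

print(f"recall={recall:.2f} precision={precision:.2f}")
for beta in (0.5, 1, 2):
    print(f"beta={beta}: F={f_beta(beta):.3f}  sklearn={fbeta_score(y_true, y_pred, beta=beta):.3f}")
```

Because recall (0.75) beats precision (0.50) on this pile, the score rises as beta grows and the tilt moves toward recall.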
## 8. The Trade Curve Revisited: When AUC Misleads
The trade curve (ROC curve) sweeps the cutoff from 1 to 0 and plots (FPR, TPR) at each
step. AUC is the area under that curve: 1.0 = perfect, 0.5 = coin flip.
AUC is cutoff-independent and answers "how cleanly do the two groups separate?" It is the
right score to COMPARE two machines before deciding where to set the cutoff.
But AUC has a blind spot: it uses FPR = ALARM / all truly well as its x-axis. When the
well pile is huge (a rare-disease screen: 1 sick per 100 well), FPR has a large
denominator and stays small even when there are many alarms. The ROC curve looks
optimistic. Precision -- CAUGHT / (CAUGHT + ALARM) -- tells a different story: most
"sick" shouts are wrong.
For SKEWED PILES, the Precision-Recall curve tells the truth:
x-axis = RECALL (how many sick cases found)
y-axis = PRECISION (of the sick shouts, how many were real)
The PR curve ignores the true-negative count entirely, so it cannot be flattered by a
large well pile. High area under the PR curve means: the machine finds sick cases AND its
sick shouts are reliable.
** KEY: USE ROC/AUC FOR BALANCED CLASSES; USE PR CURVE FOR SKEWED CLASSES
A machine with AUC 0.95 can have precision 0.10 on a 1:100 sick-to-well pile.
PR curves surface this; ROC curves hide it.
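The blind spot is easy to reproduce on made-up scores. The sketch below assumes a 1:100 pile whose scores are Gaussian, with the sick group shifted up by 2; the numbers are simulated, not taken from any real screen. It compares the ROC area with the area under the PR curve (sklearn's average_precision_score) and with precision at one arbitrary cutoff.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, roc_auc_score

rng = np.random.default_rng(0)
# Made-up skewed pile: 100 sick, 10000 well.
n_sick, n_well = 100, 10_000
y_true = np.concatenate([np.ones(n_sick, dtype=int), np.zeros(n_well, dtype=int)])
scores = np.concatenate([rng.normal(2.0, 1.0, n_sick), rng.normal(0.0, 1.0, n_well)])

auc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)      # step-wise area under the PR curve
shouts = (scores > 1.0).astype(int)               # one arbitrary cutoff
precision_at_cutoff = precision_score(y_true, shouts)

print(f"ROC AUC                 = {auc:.3f}")
print(f"average precision       = {ap:.3f}")
print(f"precision at cutoff 1.0 = {precision_at_cutoff:.3f}")
```

The ROC area should look healthy while the PR area and the precision at the cutoff come out far lower, which is the gap this section warns about.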
## 9. Skewed Piles: What Goes Wrong and How to Fix It
Skewed (imbalanced) classes are the norm in real classification tasks:
fraud (< 1% positive), rare disease (few % positive), churn (10-20% positive).
When the sick pile is tiny, accuracy flatters the lazy machine:
pile: 95 well, 5 sick
machine: shout well for everything
accuracy = 95/100 = 0.95 <- looks great
recall = 0/5 = 0.00 <- catches nobody
precision = N/A <- never shouted sick
Fixes to try when the pile is skewed:
1. report recall and precision instead of accuracy
2. use the PR curve instead of the ROC curve
3. tune the cutoff: lower it to increase recall at the cost of precision
4. oversample the rare class (SMOTE), undersample the common class, or
use class_weight='balanced' in sklearn to upweight the rare class
LogisticRegression(class_weight='balanced')
-> internally scales the log-loss contribution of each class by
n_samples / (n_classes * n_samples_per_class)
The machine sees each rare-class mistake as proportionally more costly, so it stops
defaulting to the common class.
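Both the lazy machine and the balanced weights can be checked on the 95/5 pile. A minimal sketch: the weights are worked by hand from the formula above and compared with sklearn's compute_class_weight helper.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.utils.class_weight import compute_class_weight

# The pile from above: 95 well (0), 5 sick (1).
y = np.array([0] * 95 + [1] * 5)

# The lazy machine: shout well for everything.
lazy = np.zeros_like(y)
accuracy = accuracy_score(y, lazy)
recall = recall_score(y, lazy)
print("accuracy:", accuracy)
print("recall:  ", recall)

# Balanced weights: n_samples / (n_classes * n_samples_per_class)
by_hand = len(y) / (2 * np.bincount(y))
by_sklearn = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print("weights (well, sick):", by_hand, by_sklearn)
```

Each sick row ends up weighted 19 times as heavily as a well row (10 against roughly 0.53), which exactly cancels the 95-to-5 head start of the common class.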
## 10. Counting Across Classes: Macro, Micro, Weighted
When there are more than two bins -- say, tumour types A, B, C -- you get one precision
and one recall per class. Three ways to collapse to a single number:
MACRO: compute the metric per class, average with EQUAL WEIGHT
-> every class counts the same, rare and common alike
MICRO: pool all CAUGHT, ALARM, MISSED across classes first, THEN compute
-> dominated by the biggest class; in single-label problems, micro precision,
recall, and F1 all equal accuracy
WEIGHTED: average the per-class metrics, weighted by each class's count
-> common classes count more, rare classes less
Example with three classes, sizes 80, 15, 5:
            class A   class B   class C
recall:       0.90      0.60      0.30
count:          80        15         5
macro = (0.90 + 0.60 + 0.30) / 3 = 0.60
weighted = (0.90*80 + 0.60*15 + 0.30*5) / 100 = (72 + 9 + 1.5) / 100 = 0.825
micro = (all TP) / (all TP + all FN) = (72 + 9 + 1.5) / 100 = 0.825
(for recall, micro always equals weighted; both are dominated by class A)
Macro treats a 5-sample class the same as an 80-sample class. Use it when all classes
matter equally. Use weighted when you care more about getting the big classes right.
Use micro when total correct counts are what matters.
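The three averages can be checked against sklearn's recall_score. In this sketch the class sizes are doubled to 160, 30, and 10 (an assumption made only so that the hit counts 144, 18, and 3 are whole numbers); the per-class recalls stay at 0.90, 0.60, and 0.30.

```python
from sklearn.metrics import recall_score

# Classes A, B, C coded as 0, 1, 2. Sizes doubled so every hit count is an integer.
sizes = {0: 160, 1: 30, 2: 10}
hits = {0: 144, 1: 18, 2: 3}    # recalls 0.90, 0.60, 0.30

y_true, y_pred = [], []
for label, n in sizes.items():
    wrong = (label + 1) % 3     # send every miss to some other class
    y_true += [label] * n
    y_pred += [label] * hits[label] + [wrong] * (n - hits[label])

macro = recall_score(y_true, y_pred, average="macro")
weighted = recall_score(y_true, y_pred, average="weighted")
micro = recall_score(y_true, y_pred, average="micro")
print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f}")
```

Macro gives every class one vote, so it sits well below the other two. For recall, weighted and micro land on the same number, because weighting each class's recall by its count just adds the hits back up.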
!! WARN: classification_report DOES NOT PRINT "micro avg"
For single-label classification, micro precision/recall/F1 all equal accuracy, so
sklearn prints "accuracy" instead of a "micro avg" row. To get micro explicitly:
from sklearn.metrics import precision_recall_fscore_support
micro = precision_recall_fscore_support(y_test, y_pred, average='micro')
macro = precision_recall_fscore_support(y_test, y_pred, average='macro')
## The Labels, Last
Plain term used above                 Standard label
------------------------------------  -------------------------------------------
cross-entropy leftover                binary cross-entropy / log-loss
dial squeeze                          L2 regularisation / ridge penalty
leash tightness (inverse)             C (regularisation parameter in sklearn)
lambda                                regularisation strength (C = 1/lambda)
linear separator from cloud means     Linear Discriminant Analysis (LDA)
log(pi1/pi0)                          log-prior ratio / class-balance offset
within-class scatter                  S_W (pooled within-class covariance matrix)
setting not learned by the machine    hyperparameter
dial learned by the machine           parameter / coefficient / weight
grid hunt over validation folds       grid search + cross-validation / GridSearchCV
rotating folds                        k-fold cross-validation (cv=k)
pinch-to-fit scaling                  min-max normalisation / MinMaxScaler
recall matters more                   high-recall regime; use F-beta (beta > 1)
precision matters more                high-precision regime; use F-beta (beta < 1)
curve of precision vs recall          Precision-Recall (PR) curve
area under PR curve                   AUCPR / average precision score
equal weight per class                macro average
count-weighted per class              weighted average
pool all counts first                 micro average
upweight the rare class               class_weight='balanced'
----------------------------------------------------------------------------------------------
SEE ALSO (Chapter 3 -- Sorting Into Bins):
Part 1 -- The S-Curve, the Four-Box Table .
Part 2 -- The Trade Curve .
Part 3 -- Leash and Cloud .
Part 4 -- Picking Settings, Skewed Piles
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================