==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------
CHAPTER 3 . SORTING INTO BINS . PART 3 OF 4
Leash and Cloud: L2 Punishment and the Two-Cloud Wall
Posted: 2026-06-05 . Author: Rahul Rai . Tags: l2-regularization, lda, generative-models
============================================================================================
An over-confident machine is a dangerous thing, and the S-curve machine from Part 1 --
one dial per column, multiply and add, squash the sum into a 0-to-1 chance -- has
a streak of it. It sets its dials by rolling downhill on the cross-entropy leftover
(the score that charges it for each lump by how confidently wrong its chance was), and
left to its own devices nothing stops a single dial from ballooning to 50 if that is
what makes the study pile fit. A dial that large means the machine has fallen for one
column and stopped listening to the other twenty-nine -- it aces the practice exam and
then freezes on a lump it has never seen. Confident, and wrong, which is the worst way
to be.
This post is about humbling it, two different ways. The first keeps the same machine but
puts it on a TIGHTER leash. The second throws the machine out entirely and tries a
completely different temperament -- one that never rolls downhill at all, but instead
steps back, looks at the SHAPE of the two groups, and draws the wall in a single stroke.
>> NOTE: THE PART 1 BASELINE ALREADY HAS A LEASH
Worth saying up front, because it surprises people: LogisticRegression() in sklearn is
NOT a free machine. Its defaults are penalty='l2', C=1.0 -- so the Part 1 baseline
already carries an L2 leash of medium strength. This post is therefore not "no leash
vs leash"; it is MEDIUM LEASH (C=1.0) vs TIGHTER LEASH (C=0.1). To see a genuinely
unleashed machine you would have to pass penalty=None explicitly.
## What a Loose Leash Lets Happen
default machine (C=1.0): medium leash -> dials kept modest, but a dominant
column can still pull its dial fairly large
truly free (penalty=None): no leash -> one dial might reach 40 or -60
patient row -> *dials -> add -> squash -> chance
if column 7's dial = 40:
a shift of 0.1 in column 7 swings the sum by 40 x 0.1 = 4.0
-> chance jumps from 0.5 to 0.98
the machine bets almost everything on one column
-> cocky on study pile, shaky on new lumps
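To put numbers on that swing, here is a minimal sketch in plain Python. The helper names (squash, chance_after_shift) are made up for illustration, and it assumes the lump starts at an even chance with every other column held still.

```python
import math

def squash(total):
    """The S-curve: turn a weighted sum into a chance between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-total))

def chance_after_shift(start_chance, dial, shift):
    """Chance after one column moves by `shift`, all other columns held fixed."""
    start_sum = math.log(start_chance / (1.0 - start_chance))  # undo the squash
    return squash(start_sum + dial * shift)

# A lump sitting at an even chance, one column nudged by 0.1:
big_dial = chance_after_shift(0.5, dial=40.0, shift=0.1)     # sum swings by 4.0
modest_dial = chance_after_shift(0.5, dial=2.0, shift=0.1)   # sum swings by 0.2
print(round(big_dial, 3), round(modest_dial, 3))
```

With a dial of 40 the tiny nudge is nearly decisive; with a dial of 2 the chance barely moves off the fence.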
So the question this post really asks is: the default C=1.0 leash is fine, but what
happens if we pull it TIGHTER? Do the dials shrink further, and does the machine get more
humble?
## A Leash on the Dials (L2 Penalty)
Add a price tag to large dials. Every extra unit of dial size costs something. The
machine now minimises two things at once: fit the study pile well AND keep the dials
small. The penalised objective is:
L_L2(beta) = L(beta) + lambda * sum_j beta_j^2
The first term is the cross-entropy from Part 1 -- the plain-fit score: for each lump,
take the chance the machine gave the TRUE bin, and fine it by -log of that chance, so a
confident wrong chance costs a lot and a hedged one costs little. The second is the sum
of squared dials times a strength lambda. When lambda is large, even a small dial^2 sum gets expensive,
and the machine is forced to shrink everything toward zero. The gradient update picks up
a pull-toward-zero term:
dL_L2/dbeta_j = (1/n) sum_i ( sig(z_i) - y_i ) x_ij + 2*lambda*beta_j
The term 2*lambda*beta_j pulls every dial toward zero at every step. Result: dials spread
across all 30 columns instead of concentrating on one.
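The update rule above can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not sklearn's solver: the data is made up, there is no intercept dial, and the function name fit_leashed, the step size, and the two lambda values are chosen only so the loop stays stable.

```python
import numpy as np

def fit_leashed(X, y, lam, steps=5000, rate=0.1):
    """Roll downhill on mean cross-entropy + lam * sum(beta**2). No intercept, for brevity."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(steps):
        chance = 1.0 / (1.0 + np.exp(-(X @ beta)))   # sig(z_i)
        grad_fit = X.T @ (chance - y) / n            # (1/n) sum_i (sig(z_i) - y_i) x_ij
        grad_leash = 2.0 * lam * beta                # the pull-toward-zero term
        beta -= rate * (grad_fit + grad_leash)
    return beta

# Made-up study pile: 200 lumps, 3 columns, only the first two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_sum = 3.0 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-true_sum))).astype(float)

loose = fit_leashed(X, y, lam=0.0)   # no leash
tight = fit_leashed(X, y, lam=1.0)   # leash on
print(np.abs(loose).mean(), np.abs(tight).mean())
```

The only difference between the two runs is the grad_leash line, so any shrinkage in the printed averages comes from the penalty alone.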
>> NOTE: BAYESIAN READING
Adding a squared-dial penalty is the same as placing a zero-mean Gaussian prior on
each dial and maximising the posterior instead of the likelihood. The prior says "a
dial as large as 40 is very surprising; please explain." The data can override this
prior if the evidence is strong enough; otherwise the dials stay modest.
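For readers who want the algebra behind that note, here is a sketch. It assumes L(beta) is the cross-entropy averaged over n lumps (matching the 1/n in the gradient above) and that each dial gets an independent Gaussian prior with mean 0 and variance tau^2. The symbol tau is introduced here for illustration and does not appear elsewhere in the post.

```latex
\begin{aligned}
\log p(\beta \mid \text{data})
  &= \log p(\text{data} \mid \beta) + \log p(\beta) + \text{const} \\
  &= -\,n\,L(\beta) \;-\; \frac{1}{2\tau^{2}} \sum_j \beta_j^{2} + \text{const}, \\[4pt]
\arg\max_\beta \, \log p(\beta \mid \text{data})
  &= \arg\min_\beta \Big[\, L(\beta) + \lambda \sum_j \beta_j^{2} \Big],
  \qquad \lambda = \frac{1}{2 n \tau^{2}} .
\end{aligned}
```

A narrow prior (small tau) gives a large lambda: the more surprised you would be by a big dial, the tighter the leash.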
## The C Parameter: Counter-Intuitive Direction
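sklearn spells the leash strength as C, not lambda. Up to a constant factor that depends
on how the fit term is scaled, they are reciprocals, and this post treats them as exactly so: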
C = 1 / lambda   <=>   lambda = 1 / C
C = 0.1   ->  lambda = 10      heavy penalty -- dials squeezed hard
C = 1.0   ->  lambda = 1       sklearn default
C = 1000  ->  lambda = 0.001   barely any penalty -- nearly free machine
check each on the slate: 1/0.1 = 10; 1/1.0 = 1; 1/1000 = 0.001
!! WARN: C=0.1 IS HEAVY PUNISHMENT, NOT LIGHT
C is the budget you give the machine to IGNORE the penalty. A small budget means
little room to ignore it -- a hard squeeze. A large C means a large "ignore" budget --
the penalty barely bites. "C small = heavy leash" is counter-intuitive until you
remember C = 1/lambda.
## Did the Dials Shrink?
IN HAND: one S-curve machine, its leftover now two-part -- plain fit plus a fine of
lambda times the sum of squared dials -- and the dictionary C = 1/lambda, so the tight
setting C = 0.1 means lambda = 1/0.1 = 10. This section adds the receipt: proof the
dials actually shrank.
Compare the average absolute dial size for the default leash (C=1.0) against the tighter
leash (C=0.1): if the tighter leash is doing its job, its average dial comes out smaller.
(The two-line check is in the code at the end of the post.)
A concrete 3-column example, by pencil. Suppose only 3 columns -- x1, x2, x3 --
and three machines trained on the same data:
dial      free (no leash)   default C=1.0   tight C=0.1
------------------------------------------------------------
x1 dial        +8.2             +3.1            +1.4
x2 dial        -5.7             -2.4            -0.9
x3 dial        +0.3             +0.2            +0.1
|dial| average:
free: (8.2 + 5.7 + 0.3) / 3 = 14.2 / 3 = 4.73
C=1.0: (3.1 + 2.4 + 0.2) / 3 = 5.7 / 3 = 1.90
C=0.1: (1.4 + 0.9 + 0.1) / 3 = 2.4 / 3 = 0.80
The free machine lets x1 balloon to 8.2 -- it bets heavily on one
column. C=1.0 pulls it to 3.1. C=0.1 pulls it to 1.4. The tighter
the leash, the more the machine spreads trust across all 3 columns.
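The pencil averages can be replayed in plain Python. The dial values are the illustrative ones from the table, not fitted numbers, and the two helper names are made up for this sketch. The plain (signed) average is included to show how opposite signs cancel.

```python
dials = {
    "free":  [8.2, -5.7, 0.3],
    "C=1.0": [3.1, -2.4, 0.2],
    "C=0.1": [1.4, -0.9, 0.1],
}

def average_size(values):
    """Average of absolute values: how hard the dials pull, sign ignored."""
    return sum(abs(v) for v in values) / len(values)

def plain_average(values):
    """Signed average: positive and negative dials cancel each other."""
    return sum(values) / len(values)

sizes = {name: average_size(v) for name, v in dials.items()}
for name, values in dials.items():
    print(name, round(sizes[name], 2), round(plain_average(values), 2))
```

For the free machine the signed average comes out under 1 even though two of its three dials are large, which is why the comparison uses sizes.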
>> YOUR TURN
Charge the FREE machine's dials (+8.2, -5.7, +0.3) the fine the tight leash
charges: lambda = 10 times the sum of squared dials. Work it on the slate
before reading on.
check your slate: 8.2^2 = 67.24; 5.7^2 = 32.49; 0.3^2 = 0.09; sum = 67.24 +
32.49 + 0.09 = 99.82; fine = 10 x 99.82 = 998.2. The tight machine's own dials
cost only 10 x (1.4^2 + 0.9^2 + 0.1^2) = 10 x (1.96 + 0.81 + 0.01) = 10 x 2.78 =
27.8. Ballooned dials cost about 36 times as much -- that is exactly the pressure
that makes the machine shrink them.
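The same slate work as a short Python sketch. The function name l2_fine is hypothetical, and lambda = 10 follows the post's C = 1/lambda dictionary.

```python
def l2_fine(dials, lam):
    """The leash's price: lam times the sum of squared dials."""
    return lam * sum(d * d for d in dials)

free_fine = l2_fine([8.2, -5.7, 0.3], lam=10)    # ballooned dials
tight_fine = l2_fine([1.4, -0.9, 0.1], lam=10)   # tight machine's own dials
ratio = free_fine / tight_fine
print(round(free_fine, 1), round(tight_fine, 1), round(ratio, 1))
```

Because the fine grows with the square of each dial, the x1 dial alone accounts for roughly two thirds of the free machine's bill.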
>> NOTE: WHY ABSOLUTE VALUE BEFORE AVERAGING
Dials can be positive or negative. A positive dial +3 and a negative dial -3 cancel to
zero in a plain average, making the machine look like it has no signal at all. Taking
the absolute value first measures the SIZE of each dial regardless of sign, then
averages those sizes. What you want to compare is pull strength, not direction.
## Scaling and the Leash Are Not the Same Fix
Scaling (put columns on one ruler): fixes the INPUTS
L2 leash: fixes the DIALS
Even after scaling, a dial can grow huge if the machine
over-trusts one column. You need both. They solve different problems.
Scaling makes the 30 input columns comparable before the machine sees them. The leash
limits how large any individual dial grows during the dial-setting step. Removing either
one leaves a different problem unsolved.
## A Completely Different Machine: The Two-Cloud Wall
IN HAND: the leashed S-curve machine -- fit plus lambda times squared dials, tight
setting C = 0.1 (lambda = 1/0.1 = 10) -- and the receipt that its dials shrank: average
size (1.4 + 0.9 + 0.1)/3 = 2.4/3 = 0.80 against the default's (3.1 + 2.4 + 0.2)/3 =
5.7/3 = 1.90. This section adds a second machine of the opposite temperament.
Everything so far kept the same machine and tightened its leash. Now we change the
machine itself. The S-curve machine is a fidgeter -- it inches toward the answer by trial
and error: adjust, check the leftover, adjust again. Linear discriminant analysis (LDA)
has the opposite personality. It does not fiddle at all. It stands back, studies the
SHAPE of the two groups of points, and lays down the wall between them in a single
confident stroke.
first split the study pile into sick rows and well rows
then compute 30 averages per group -> two centres in 30-column space
(60 averages total: 30 per class, 2 class centres)
then find the wall between the two centres
so a new lump -> which side of the wall? -> that is the label
well centre #----------+----------# sick centre
                       ^
                   wall here
## Fisher's Criterion: Where to Aim the Wall
A naive wall at the midpoint between the two centres works when both clouds are round.
Real clouds are stretched -- some directions have more spread than others, and the two
classes may share some of that stretch. Fisher's criterion asks: which direction w
maximises the RATIO of between-class spread to within-class spread?
maximise (w^T S_b w) / (w^T S_W w)
where S_b = (mu1 - mu0)(mu1 - mu0)^T is the between-class scatter and S_W is the pooled
within-class scatter. The solution is closed-form -- no rolling downhill required:
w is proportional to S_W^-1 (mu1 - mu0)
Meaning: take the difference between the two class centres, then reshape it by the inverse
of the pooled within-class spread matrix (a rescale and a rotation in one step). This
adjusts for the tilt and shape of the clouds -- if both clouds are elongated diagonally,
the wall tilts to match.
Where does the wall sit along that direction? If the two classes are equally common, it
sits exactly at the midpoint of the projected class means:
threshold = w^T (mu0 + mu1) / 2 (only when the two classes are equally common)
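A minimal NumPy sketch of the closed-form step, on made-up stretched clouds. The function names, the stretch matrix, and the cloud sizes are assumptions for illustration. It compares Fisher's ratio for w against the naive direction mu1 - mu0.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Return w = S_W^-1 (mu1 - mu0) plus the pieces used to build it."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter: each class's scatter around its own centre, added up.
    S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(S_W, mu1 - mu0)   # solves S_W w = mu1 - mu0 without forming the inverse
    return w, mu0, mu1, S_W

def fisher_ratio(w, mu0, mu1, S_W):
    """(w^T S_b w) / (w^T S_W w), with S_b = (mu1 - mu0)(mu1 - mu0)^T."""
    between = (w @ (mu1 - mu0)) ** 2
    within = w @ S_W @ w
    return between / within

rng = np.random.default_rng(1)
stretch = np.array([[2.0, 1.5], [0.0, 0.5]])          # made-up tilt shared by both clouds
X0 = rng.normal(size=(300, 2)) @ stretch
X1 = rng.normal(size=(300, 2)) @ stretch + np.array([1.0, 1.0])

w, mu0, mu1, S_W = fisher_direction(X0, X1)
naive = mu1 - mu0
threshold = w @ (mu0 + mu1) / 2                       # equal-priors wall position
print(fisher_ratio(w, mu0, mu1, S_W), fisher_ratio(naive, mu0, mu1, S_W))
```

On tilted clouds like these, the first printed ratio should come out at least as large as the second, since w is the direction that maximises it.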
A concrete 2-column, 4-person LDA walkthrough, by pencil.
Only 2 columns (bmi and bp) and 4 people:
person   bmi    bp     truth
-----------------------------
A        0.04   0.90   sick (1)
B        0.06   0.85   sick (1)
C        0.12   0.50   well (0)
D        0.18   0.45   well (0)
First, the averages (2 class means x 2 columns = 4 here, in place of the full sheet's 60):
sick class (A, B): mu1 = ( (0.04+0.06)/2 , (0.90+0.85)/2 ) = (0.05 , 0.875)
well class (C, D): mu0 = ( (0.12+0.18)/2 , (0.50+0.45)/2 ) = (0.15 , 0.475)
Then the midpoint between the two centres (the IMAGINARY point):
midpoint = ((0.05+0.15)/2 , (0.875+0.475)/2 ) = (0.10 , 0.675)
Then the difference between centres:
mu1 - mu0 = (0.05-0.15 , 0.875-0.475) = (-0.10 , 0.400)
Then the wall NORMAL vector w (ignoring S_W for this clean round-cloud
picture: when S_W is a multiple of the identity, w points along mu1 - mu0):
w = (-0.10 , 0.400)
Finally, project each person onto w and compare to the midpoint projection:
w^T midpoint = -0.10*0.10 + 0.400*0.675 = -0.01 + 0.270 = 0.260
A: w^T x = -0.10*0.04 + 0.400*0.90 = -0.004 + 0.360 = 0.356 > 0.260 -> sick
B: w^T x = -0.10*0.06 + 0.400*0.85 = -0.006 + 0.340 = 0.334 > 0.260 -> sick
C: w^T x = -0.10*0.12 + 0.400*0.50 = -0.012 + 0.200 = 0.188 < 0.260 -> well
D: w^T x = -0.10*0.18 + 0.400*0.45 = -0.018 + 0.180 = 0.162 < 0.260 -> well
All 4 classified correctly. The wall sits at 0.260 along w. New
lump with bmi=0.10, bp=0.70: w^T x = -0.01 + 0.28 = 0.270 > 0.260 -> sick.
The wall is the midpoint because both classes have equal counts here.
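The whole walkthrough fits in a few lines of plain Python. This sketch uses the same round-cloud shortcut (w = mu1 - mu0, S_W ignored) and the same four made-up people. The helper names centre and project are hypothetical.

```python
people = {
    "A": (0.04, 0.90, 1),
    "B": (0.06, 0.85, 1),
    "C": (0.12, 0.50, 0),
    "D": (0.18, 0.45, 0),
}

def centre(points):
    """Column-by-column average of a list of (bmi, bp) points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

sick = [(bmi, bp) for bmi, bp, truth in people.values() if truth == 1]
well = [(bmi, bp) for bmi, bp, truth in people.values() if truth == 0]
mu1, mu0 = centre(sick), centre(well)

w = tuple(a - b for a, b in zip(mu1, mu0))               # round-cloud shortcut: w = mu1 - mu0
midpoint = tuple((a + b) / 2 for a, b in zip(mu0, mu1))

def project(point):
    """w^T x: how far along the wall's normal a point sits."""
    return sum(wi * xi for wi, xi in zip(w, point))

wall = project(midpoint)
calls = {name: ("sick" if project((bmi, bp)) > wall else "well")
         for name, (bmi, bp, truth) in people.items()}
print(round(wall, 3), calls)
```

Any new lump can be scored the same way: call project on its (bmi, bp) pair and compare against wall.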
>> YOUR TURN
Same wall: w = (-0.10, 0.400), and a lump sits on the sick side when w^T x beats
0.260. A new lump walks in (made-up): bmi = 0.08, bp = 0.60. Score it.
check your slate: w^T x = -0.10 * 0.08 + 0.400 * 0.60 = -0.008 + 0.240 = 0.232.
0.232 < 0.260, so the lump falls on the WELL side of the wall -- called well.
But the two classes are usually NOT equally common, and that pure midpoint is a special
case. The full rule scores each new lump and adds a nudge for how common each class is:
score(x) = w^T x + w0,
w0 = -1/2 (mu0 + mu1)^T S_W^-1 (mu1 - mu0) + log(pi1 / pi0)
The first piece of w0 is the midpoint; the extra log(pi1/pi0) term slides the wall toward
the rarer class so the machine doesn't over-shout it. Here pi0 and pi1 are the class
frequencies (the priors).
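To see the prior nudge move a call, here is a sketch that reuses the two centres from the 4-person example. Two things are assumptions made only for this illustration: both clouds are taken to be round with the same variance (spread = 0.01) in every column, so S_W^-1 is just 1/spread times the identity, and the 63/37 split is borrowed from the Wisconsin sheet. The test lump is made up.

```python
import math

mu0, mu1 = (0.15, 0.475), (0.05, 0.875)   # well and sick centres from the walkthrough
spread = 0.01                              # ASSUMED shared variance per column, no tilt

# w = S_W^-1 (mu1 - mu0), with S_W^-1 = (1 / spread) * identity under the assumption above
w = tuple((b - a) / spread for a, b in zip(mu0, mu1))

def w0(pi0, pi1):
    """Midpoint piece plus the log-prior nudge."""
    midpoint_part = -0.5 * sum(wi * (a + b) for wi, a, b in zip(w, mu0, mu1))
    return midpoint_part + math.log(pi1 / pi0)

def score(x, pi0, pi1):
    """Positive score -> called sick; negative -> called well."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0(pi0, pi1)

lump = (0.10, 0.685)                       # made-up lump just on the sick side of the midpoint
equal = score(lump, pi0=0.5, pi1=0.5)
skewed = score(lump, pi0=0.63, pi1=0.37)
print(round(equal, 3), round(skewed, 3))
```

With equal priors the lump scores just above zero; once sick is treated as the rarer class the same lump scores below zero, because the wall has slid toward the sick centre.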
!! WARN: THE WALL IS NOT EXACTLY HALFWAY ON IMBALANCED DATA
The Wisconsin sheet is roughly 63% well and 37% sick -- not equal. sklearn's
LinearDiscriminantAnalysis() uses EMPIRICAL priors by default (the actual class
frequencies), so the boundary it fits carries the log(pi1/pi0) offset and sits OFF the
midpoint. If you place a wall at the pure midpoint and expect to reproduce sklearn's
predictions, you will be off. To get the clean halfway wall, force equal priors:
LinearDiscriminantAnalysis(priors=[0.5, 0.5]).
## Generative vs Discriminative
The S-curve machine and LDA arrive at the same final form -- a linear boundary through
the 30-column space -- but they derive it through completely different reasoning:
Property       S-curve machine              Two-cloud wall (LDA)
-------------  ---------------------------  -----------------------------------
Approach       models P(sick|x) directly    models P(x|sick) and P(sick)
Solution       iterative (roll downhill)    closed form (one shot)
Assumption     no distribution on x         Gaussian columns, equal spread
Breaks when    columns perfectly tangled    spread assumption badly violated
Works better   large pile, noisy columns    small pile, Gaussian columns
LDA is a GENERATIVE model: it imagines each class generates Gaussian-distributed data and
uses Bayes' theorem to flip to P(sick|x). Via that flip, P(sick|x) works out to
sig(w^T x + w0) -- the same S-curve form as logistic regression. Same decision-boundary
shape, different route.
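That flip can be checked numerically. The sketch below uses a single column to keep the algebra visible. The centres, shared spread, and prior are made-up numbers, and the function names are hypothetical. It computes P(sick|x) once by Bayes' theorem from two Gaussians and once by the S-curve formula.

```python
import math

def gaussian(x, mu, sd):
    """Density of a Gaussian with centre mu and spread sd at the point x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_by_bayes(x, mu0, mu1, sd, pi1):
    """P(sick|x) from the generative side: class densities times priors, normalised."""
    top = pi1 * gaussian(x, mu1, sd)
    return top / (top + (1 - pi1) * gaussian(x, mu0, sd))

def posterior_by_s_curve(x, mu0, mu1, sd, pi1):
    """P(sick|x) as sig(w*x + w0), with w and w0 built from the same ingredients."""
    w = (mu1 - mu0) / sd ** 2
    w0 = -0.5 * (mu0 + mu1) * w + math.log(pi1 / (1 - pi1))
    return 1.0 / (1.0 + math.exp(-(w * x + w0)))

settings = dict(mu0=0.0, mu1=2.0, sd=1.5, pi1=0.37)   # made-up one-column example
gaps = [abs(posterior_by_bayes(x, **settings) - posterior_by_s_curve(x, **settings))
        for x in (-3.0, -1.0, 0.0, 1.0, 2.5, 5.0)]
print(max(gaps))
```

The two routes agree to rounding error, but only because both classes share the same spread. Give each class its own sd and a squared term in x survives, so the boundary is no longer a straight wall.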
## Same Accuracy: What It Means
On the Wisconsin breast cancer sheet, LDA and the S-curve machine give nearly identical
accuracy. Two completely different approaches, same result. That is not a coincidence:
** KEY: AGREEMENT = CLEAN DATA, GENUINELY SEPARABLE CLASSES
When the two methods agree, the data is telling you the answer. A logistic machine
that maximises likelihood and an LDA that reads cloud shapes both find the same
dividing line because that line is clearly written in the data. If the sheet were
noisy or the two groups heavily overlapping, the two methods would diverge and their
disagreement would tell you something important: the boundary is ambiguous.
## Why Scaling Also Matters for LDA
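LDA computes S_W, the pooled within-class spread matrix. If column "area" runs in the
thousands while column "smoothness" runs in hundredths, the entries of S_W span many orders
of magnitude, which makes inverting it numerically touchy and leaves the entries of w on
wildly different scales. In exact arithmetic the wall LDA draws is the same either way (the
S_W^-1 factor undoes any rescaling of the columns), so the benefit is practical rather than
theoretical: putting every column on the same ruler before feeding LDA keeps
S_W well-conditioned and makes the entries of w comparable across columns.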
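A quick NumPy sketch of the conditioning point, on two made-up columns whose magnitudes only loosely imitate "area" and "smoothness". The numbers and the helper name pooled_within_scatter are assumptions for illustration, not values from the Wisconsin sheet.

```python
import numpy as np

def pooled_within_scatter(X, y):
    """S_W: each class's scatter around its own centre, summed over classes."""
    S_W = np.zeros((X.shape[1], X.shape[1]))
    for label in np.unique(y):
        rows = X[y == label]
        centred = rows - rows.mean(axis=0)
        S_W += centred.T @ centred
    return S_W

rng = np.random.default_rng(2)
y = np.repeat([0, 1], 100)
area = rng.normal(loc=600 + 300 * y, scale=200)            # made-up, hundreds to thousands
smoothness = rng.normal(loc=0.09 + 0.01 * y, scale=0.01)   # made-up, hundredths
X = np.column_stack([area, smoothness])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)            # every column on one ruler

raw_cond = np.linalg.cond(pooled_within_scatter(X, y))
scaled_cond = np.linalg.cond(pooled_within_scatter(X_scaled, y))
print(f"{raw_cond:.3g}  {scaled_cond:.3g}")
```

The condition number is the ratio of S_W's largest to smallest stretch. The closer it sits to 1, the less rounding error the inversion step amplifies.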
1. The free S-curve machine lets dials grow without limit; L2 adds a squared-dial price
that shrinks them toward zero.
2. C = 1/lambda: small C (say 0.1) means large lambda (10) -- heavy squeeze.
Counter-intuitive.
3. The leash fixes the dials; scaling fixes the inputs. They solve different problems; you
need both.
4. LDA reads the two cloud centres and their shared spread, then finds the best wall
in one closed-form step.
5. w prop. S_W^-1 (mu1 - mu0): the difference of centres, reshaped by the inverse
within-class spread.
6. LDA and the S-curve machine produce the same decision-boundary form; they derive it
from opposite directions -- generative vs discriminative.
7. When both machines agree on accuracy, the data is cleanly separable. Disagreement is
diagnostic.
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
Three small things: fit the tighter-leashed machine, check its dials really did shrink,
and fit the two-cloud wall (LDA) both ways -- empirical priors and forced-equal priors.
>> NEW TO PYTHON? Each named once:
np.abs(x) -- the size of each number, sign thrown away (NumPy)
np.mean(x) -- the average of a row of numbers
Tighter leash (C=0.1 is a HEAVY squeeze -- remember C = 1/lambda):
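from sklearn.linear_model import LogisticRegression
# X_train_scaled, X_test_scaled, y_train: the scaled split, assumed carried over from Part 1
log_reg_l2 = LogisticRegression(penalty='l2', C=0.1, random_state=42)
log_reg_l2.fit(X_train_scaled, y_train)
y_pred_l2 = log_reg_l2.predict(X_test_scaled)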
Did the dials shrink? Compare the average absolute dial size, default vs tighter:
import numpy as np
# log_reg_baseline: the default-leash machine, assumed already fitted in Part 1
avg_coef_baseline = np.mean(np.abs(log_reg_baseline.coef_))  # C=1.0 (default)
avg_coef_l2 = np.mean(np.abs(log_reg_l2.coef_))              # C=0.1 (tighter)
# expect: avg_coef_l2 < avg_coef_baseline (the tighter leash shrinks them further)
The two-cloud wall, both ways -- sklearn's default uses empirical priors, so force equal
priors if you want the exact halfway wall:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# sklearn default: empirical priors -> wall shifted off the midpoint
lda = LinearDiscriminantAnalysis()
lda.fit(X_train_scaled, y_train)
y_pred_lda = lda.predict(X_test_scaled)
# to reproduce the exact "halfway between the centres" wall:
lda_equal = LinearDiscriminantAnalysis(priors=[0.5, 0.5])
lda_equal.fit(X_train_scaled, y_train)
## The Labels, Last
Plain term used above                 Standard label
-----------------------------------   ----------------------------------------------
leash on the dials                    L2 regularisation / ridge penalty
dial-size price                       regularisation term lambda*sum(beta_j^2)
C (sklearn parameter)                 inverse regularisation strength (C = 1/lambda)
two-cloud midpoint wall               LDA (linear discriminant analysis)
cloud centre                          class mean mu_k
pooled within-class spread            within-class scatter matrix S_W
Fisher's criterion                    maximise (w^T S_b w)/(w^T S_W w)
models P(x|class)                     generative model
models P(class|x) directly            discriminative model
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================