==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------
CHAPTER 4 . HUMBLE DIALS AND WOBBLE BANDS . PART 1 OF 3
The Leash: Ridge, Lasso, and Humbling the Dials
Posted: 2026-06-07 . Author: Rahul Rai . Tags: ridge, lasso, regularisation, cross-validation
============================================================================================
PATH . post 10 of 28
<- prev: Chapter 3, Part 4: Picking Settings, Skewed Piles
next: Chapter 4, Part 2: One Dial Is a Lie ->
Back to guessing a number -- but this time the machine has a character flaw, and we are
going to fix it. In Chapter 1 the straight-stick rule set its dials to make the misses as
small as possible, and we let those dials spin as far as they liked. That freedom is the
flaw. Left off the leash, a dial will grow to a monster just to twitch through one more
point in the study pile -- and a machine that bends itself around every point it has seen
falls flat on its face the moment it meets someone new.
This chapter is about humility, taught two ways. Part 1 puts the dials on a LEASH so none
of them can grow into a tyrant. Part 2 asks a harder question -- how much should you even
TRUST a dial? -- and answers it by faking a hundred trips to the hospital. Part 3 lifts
the hood and shows where the dials come from in the first place.
## A New Sheet: The Diabetes Pile
diabetes -- 442 people, 10 body measurements, one answer
patient age bmi bp s1 ... s6 | answer
--------------------------------------------------
#1 0.04 0.06 0.02 ... | 151 <- disease score one YEAR later
#2 -0.01 -0.05 -0.03 ... | 75
... | ...
+--------------- 10 columns -----------+| the answer
Same job as Chapter 1: read a person's measurements, guess their number -- here a disease
progression score taken a year later. The straight-stick rule still applies, unchanged:
guess = (dial1 * age) + (dial2 * bmi) + ... + (dial10 * s6) + nudge
The dials are still whatever makes the total squared miss smallest. Nothing new yet. What
is new is that we no longer trust the machine to set those dials sensibly on its own.
## The Disease: Huge Dials That Memorise
Here is the failure, drawn. Give the dials free rein and they grow huge to chase every
little wiggle in the study pile:
small honest dial: huge memorising dial:
line follows the trend line whips up and down to touch every dot
. . . .
. . . \ . /\ . /
. . . . .\/ \/ .
smooth, travels well perfect on study, WILD on anyone new
A huge dial means the machine has fallen in love with one column and is over-trusting it
to nail the people it has already seen. It scores beautifully on the study pile, then
shatters on a new patient. That is memorising, not learning -- and we met its name in
Chapter 2: it is the jumpy half of the stiff-vs-jumpy trade. We want to FORBID huge dials.
## The Leash: Add a Fine
Until now the machine paid only for being wrong. Now we make it pay a second fine -- for
being immodest. It must drive down two things at once:
total wrongness + a * (a fine for big dials)
+- fit the data -+ +------- stay humble -------+
The letter a (the textbooks write the Greek "alpha") is a knob WE set: how hard to press
for humility.
a = 0 no fine -- the old free-spinning machine, back to memorising
a small gentle squeeze
a huge dials crushed toward nothing -- maybe too stiff now
There are two famous fines. On paper they look almost like twins. At the bottom, where it
counts, they behave completely differently -- and that one difference is the whole point
of this post.
## Ridge: The Smooth Squeeze
IN HAND: the diabetes sheet -- 442 people, 10 columns, one answer -- a straight-stick
guess built from 10 dials + 1 nudge = 11 numbers, and a bill with two charges: total
wrongness plus a times a fine for big dials. This section gives the fine its first
shape: the square.
Ridge's fine is the SQUARE of each dial, all of them added up:
make smallest: (add up every miss^2) + a * (add up every dial^2)
To feel what a square does, picture a tug-of-war over a single dial. Its usefulness pulls
it away from zero; the fine pulls it back toward zero. How hard does the fine pull? For a
square, the pull is 2*a*dial -- so as the dial gets SMALLER, the fine's grip gets weaker
too, fading to nothing at zero:
Ridge -- a smooth bowl
fine | . .
| . .
| . .
| .___. <- almost flat at the bottom: no force left near zero
+---------------------- dial size
0
So the tug-of-war settles at a dial that is SMALL but not zero. The fine drags every dial
inward, but near zero it runs out of strength, so no dial is ever shoved all the way to
exactly zero. Ridge shrinks the whole orchestra; it silences nobody.
And because that bowl is smooth, there is a clean one-step answer -- no searching. We
actually met it already, as a footnote in Chapter 1, Part 3:
the ridge rule beta = (X^T X + a*I)^-1 X^T y. The fine just adds a BUMP down the diagonal
before the solve. For one lonely dial it reads as the tidy picture:
Ridge dial = top / (bottom + a) bigger a -> bigger bottom -> smaller dial
What are "top" and "bottom"? Concrete numbers from the sheet:
top = how the column and the answer move together
= sum of (column value - column average) x (answer - answer average)
bottom = how much the column wiggles on its own
= sum of (column value - column average)^2
For ONE column (say bmi), ONE dial, on 4 people:
person bmi answer bmi-avg ans-avg (bmi-avg)*(ans-avg) (bmi-avg)^2
--------------------------------------------------------------------------------
A 0.04 97 -0.04 -31 1.24 0.0016
B 0.06 121 -0.02 -7 0.14 0.0004
C 0.12 135 0.04 7 0.28 0.0016
D 0.18 151 0.10 23 2.30 0.0100
avg: 0.08 126 ------ -------
top=3.96 bottom=0.0136
without ridge (a=0): dial = 3.96 / 0.0136 = 291
with ridge a=10: dial = 3.96 / (0.0136 + 10) = 3.96 / 10.0136 = 0.396
with ridge a=100: dial = 3.96 / (0.0136 + 100) = 3.96 / 100.0136 = 0.0396
a=0 gives the plain dial (huge). a=10 crushes it 735x smaller.
a=100 crushes it 7350x smaller. The bigger the a, the smaller the dial.
(check the crush: 0.396 x 735 = 291.06 ~ 291, and 0.0396 x 7350 = 291.06 ~ 291.)
The column-average and answer-average are computed over the 4 people. "bmi-avg"
means bmi minus the column's own average (0.08). "ans-avg" means answer minus
the answer's average (126). Multiplying those two gives the "top" piece for each
person. Adding up all four gives top = 3.96. Adding up the squares gives
bottom = 0.0136.
>> YOUR TURN
Same column (top = 3.96, bottom = 0.0136). Set the knob to a = 1 and work the
ridge dial on the slate.
check your slate: dial = top / (bottom + a) = 3.96 / (0.0136 + 1) = 3.96 /
1.0136 ~= 3.91. Even a gentle a = 1 drags the plain dial of 291 down to about
3.9 -- the bump on the bottom does nearly all the work.
** KEY: RIDGE SHRINKS ALL, ZEROES NONE, AND SOLVES IN ONE STEP
The fine's pull dies at zero, so every dial lands small-but-alive. The bump-on-the-
diagonal makes the solve exact -- and, as a bonus, fixes the "tangled columns" crash
from Chapter 1 Part 3, because a full diagonal can always be inverted.
## Lasso: The Sharp Snap
Lasso's fine is the ABSOLUTE size of each dial -- its size with the sign thrown away:
make smallest: (add up every miss^2) + a * (add up every |dial|)
Same tug-of-war, one key change: the fine's pull is now CONSTANT. For an absolute
size the pull is just a, no matter how tiny the dial gets. It does not fade near zero --
it keeps shoving at full strength right up to the edge:
Lasso -- a sharp corner
fine | \ /
| \ /
| \ /
| \ /
| \ /
| \ /
| V <- a sharp point: full force right up to zero
+---------------------- dial size
0
Now the ending changes. Even when a dial is already tiny, the fine still pushes with full
strength a. If that column's usefulness cannot beat a, the dial gets driven all the way to
exactly zero -- and PINNED there.
strong column (usefulness > a): survives, just shrunk
weak column (usefulness < a): SNAPPED to exactly 0 -> dropped from the sheet
So Lasso does more than shrink -- it DELETES useless columns by zeroing their dials. It
hands you back a shorter sheet and, with it, an answer to a question Ridge never touches:
which measurements actually mattered?
The price of that sharp corner: there is no clean formula. You cannot solve it in one
step. The machine has to loop:
1. start every dial at 0
2. nudge the bmi dial -- does its usefulness beat the fine? keep it : snap to 0
3. nudge the age dial -- same test
4. ... keep looping over all the dials until they stop moving
At the end you COUNT THE SURVIVORS -- the dials that did not snap to zero.
** KEY: LASSO SHRINKS, SNAPS WEAK DIALS TO ZERO, AND HAS NO FORMULA
Because the fine never lets up, Lasso zeroes out weak columns entirely and tells you
which ones earned their keep. The corner that buys that gift also costs the clean
solve: Lasso must loop until the dials stop moving.
## Ridge vs Lasso, Side by Side
Ridge Lasso
fine a * sum(dial^2) a * sum(|dial|)
shape smooth bowl sharp V-corner
pull near 0 fades to nothing stays at full strength
weak dials small but never 0 SNAP to exactly 0
columns keeps all 10 drops the useless ones
by hand yes: top/(bottom+a) no: must loop
gives you a humbler machine a humbler machine PLUS a shortlist
The ONLY real difference is square versus absolute -- and that single choice decides
whether the fine's pull dies at zero (Ridge: settle near zero) or stays alive (Lasso:
snap to zero). Everything else follows from it.
## Picking the Knob a -- Never Guess It
Both machines have the same knob a, and you do NOT eyeball it. You let the 5-slice check
from Chapter 1 do the choosing. Cut the study pile into 5 slices; cover one, build the
dials on the other four, guess the covered slice, write down the error; rotate until every
slice has been the covered one; average the five errors. That average is one value of a's
score. Do it for a whole ladder of a-values -- about fifty of them -- and a valley appears:
error | \ _/ too-small a -> still memorising
| \_ ___/ too-big a -> dials crushed, too stiff
| \__ __/
| \__.__/ <- the valley = the best a
+--------------------------- a (from tiny to huge)
Pick the a at the bottom of the valley. Then REBUILD the dials on the full study pile with
that winning a, and grade ONCE on the sealed exam pile. Same iron rule as always: never
pick a by peeking at the exam.
>> NOTE: THE CURVES ARE THE ANSWER, NOT A PLOT
The lab asks you to report the error-versus-a curve for both machines. The plot is
optional; the numbers behind it are the point. The winning a is wherever the curve
bottoms out -- the valley above, turned into a row of numbers.
>> YOUR TURN
One candidate a was scored by the 5-slice check; its five covered-slice errors
came out 52, 48, 55, 50, 45 (made-up). Turn them into that a's one score.
check your slate: sum = 52 + 48 + 55 + 50 + 45 = 250; score = 250 / 5 = 50.
That single 50 is one dot on the valley curve; the winning a is wherever the
row of such dots bottoms out.
## Common Tripwires I Caught
These are the exact wrong pictures I had to untangle -- each one cost me
real time until I saw the concrete shape of the mistake:
TRIPWIRE 1: 25% chop vs 2.5% chop
WRONG: the 95% band chops 25% off each end (like Lab 1's quartiles).
RIGHT: 95% means keep the middle 95% -> chop 2.5% off each end.
q0.025 and q0.975 (because 100% - 2.5% = 97.5%).
For 200 values: 2.5% of 200 = 5. Keep index 5 through 194
(counting from 0). Chop 5 smallest and 5 largest.
Reaching for 0.25 instead of 0.025 gives a band 10x too narrow.
TRIPWIRE 2: RidgeCV silently drops the error curve with a custom CV
WRONG: pass cv=kf to RidgeCV and later read ridge_cv.cv_values_.
RIGHT: RidgeCV only stores cv_values_ when using its own built-in
leave-one-out method. The moment you give it cv=kf (a custom
splitter), it stops storing those values. Later code that
reads ridge_cv.cv_values_ crashes with AttributeError.
Fix: do NOT pass cv=kf to RidgeCV. Pass store_cv_values=True
and let it use its default. (LassoCV has no such quirk.)
TRIPWIRE 3: Counting sticky notes vs measuring width
WRONG: a 95% band with 190 notes out of 200 is always the same
count, so the band tells you nothing.
RIGHT: the COUNT of notes is always 190. The WIDTH is the gap
between the number on the 5th note and the 195th note.
Two dials can both keep 190 notes but have wildly different
widths:
bmi-dial: 0.42 to 0.58 -> width 0.16 (tight, trustworthy)
s5-dial: -0.80 to +0.90 -> width 1.70 (wide, crosses 0, useless)
Same 190 notes. Different trust. Never confuse count with gap.
TRIPWIRE 4: The ~ symbol is the whole free exam
WRONG: np.isin marks the picked people. Grade on them.
RIGHT: np.isin gives True for picked people. The ~ flips to NOT.
Grade on the NOT-picked people. Without the ~, you grade
the machine on the very patients it just studied -- a fake,
over-confident score. One squiggle is the difference between
an honest score and a lie.
TRIPWIRE 5: LADDER vs EXACT dials
WRONG: the per-column slope (sum((x-xbar)(y-ybar)) / sum((x-xbar)^2))
is the true dial.
RIGHT: that slope is the correct dial ONLY when no column overlaps
another. Real columns always overlap (bmi, bp, blood serums
all drift together). The true dial needs the untangler
(X^T X)^-1 X^T y to share credit between columns that move
together. The simple slope is a LADDER (a crutch, a first
picture) -- name it that, never call it "the dial."
## The Labels, Last
Plain term used above Standard label
----------------------------------- ------------------------------------------
the leash / a fine for big dials regularisation
Ridge -- the square fine L2 regularisation / ridge regression
Lasso -- the absolute fine L1 regularisation / lasso regression
the knob a (alpha) regularisation strength (alpha / lambda)
bump on the diagonal (X^T X + alpha*I)^-1 X^T y
snap a weak dial to zero feature selection / sparsity
count the survivors number of non-zero coefficients
the 5-slice check k-fold cross-validation
the valley the CV error curve / bias-variance sweet spot
too stiff vs too jumpy the bias-variance trade-off
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
Two ready-made machines do all of this: RidgeCV and LassoCV both sweep a ladder of a's
with the 5-slice check and keep the winner. Scale the columns first, the same as every
chapter before.
>> NEW TO PYTHON? Each named once:
np.logspace(-3, 3, 50) -- 50 values from 10^-3 to 10^3, spaced evenly in powers
Model(...).fit(X, y) -- build the machine and set its dials on the study pile
model.coef_ != 0 -- a yes/no for each dial: did it survive? (True = non-zero)
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Ridge: sweep 50 a's, keep the valley. store_cv_values lets us read the curve later.
ridge_alphas = np.logspace(-3, 3, 50)
ridge_cv = RidgeCV(alphas=ridge_alphas, store_cv_values=True)
ridge_cv.fit(X_train_scaled, y_train)
ridge_best_alpha = float(ridge_cv.alpha_)
# Lasso: its own a-ladder, 5 folds, more loop-room (max_iter) for the snap to settle.
lasso_alphas = np.logspace(-3, 1, 50)
lasso_cv = LassoCV(alphas=lasso_alphas, cv=5, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)
lasso_best_alpha = float(lasso_cv.alpha_)
n_nonzero_lasso = int(np.sum(lasso_cv.coef_ != 0)) # the survivors
# the error-vs-a curves the lab asks for
ridge_cv_mse_mean = ridge_cv.cv_values_.mean(axis=0)
lasso_cv_mse_mean = lasso_cv.mse_path_.mean(axis=1)
!! WARN: RIDGECV QUIETLY DROPS THE CURVE IF YOU HAND IT A CUSTOM CV
RidgeCV will only save its error curve (cv_values_) when it uses its own built-in
leave-one-out check. The moment you pass it a custom splitter -- cv=kf -- it silently
stops storing those values, and the later line ridge_cv.cv_values_ crashes with an
AttributeError. The fix is to NOT pass cv=kf here; pass store_cv_values=True and let
RidgeCV use its default. (LassoCV has no such quirk -- cv=5 is fine there.)
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 4 -- Humble Dials and Wobble Bands):
Part 1 (this post) .
Part 2 -- One Dial Is a Lie .
Part 3 -- The Dial by Hand
Previous chapter: Chapter 3 -- Sorting Into Bins
<- Back to all posts
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
home . source on GitHub
==============================================================================================