==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------
CHAPTER 4 . HUMBLE DIALS AND WOBBLE BANDS . PART 2 OF 3
One Dial Is a Lie: Bootstrap, Wobble Bands, and the Free Exam
Posted: 2026-06-07 . Author: Rahul Rai . Tags: bootstrap, confidence-interval, out-of-bag
============================================================================================
Part 1 made the dials humble. This post makes them HONEST -- and the honesty stings,
because the first thing it admits is that a single dial value, reported on its own, is a
small lie.
Here is the lie. The machine studies my patients and announces a bmi-dial of 0.52. It
says it flatly, with a straight face, as if 0.52 were carved in stone. But I only had
THESE eighty-odd patients. Walk into the hospital on a different morning, scoop up eighty
DIFFERENT patients, and the dial comes out 0.44, or 0.58, or who knows. So what is 0.52
really worth? The number alone will not say. We need to know how much it would jump around
if the world had dealt us a slightly different crowd.
## I Cannot Visit the Hospital 200 Times
The honest fix would be to gather two hundred fresh crowds and watch the dial dance. I
cannot do that -- I have one pile of patients and that is all. So I FAKE it. I build new
crowds out of the pile I already own, by re-dealing it WITH REPEATS:
pull a patient, write them down, PUT THEM BACK in the bag
do this until the fake pile is the same size as the real one
-> some patients get drawn twice or three times; some never get drawn at all
rebuild the dials on this fake pile
repeat the whole performance 200 times -> 200 versions of every dial
Count the price in clerk-steps: the straight-stick rule did ONE full dial-solve; the
bootstrap does 200 of them, one per fake pile of the same 442 people. 200 complete
solves where there was one -- trivial for a room of tireless clerks, unthinkable for
a single pencil, which is the whole reason this trick waited for a crowd to share it.
This trick has a name we will earn at the end: the bootstrap. The whole engine is "put
them back." Drawing with replacement is what makes each fake pile genuinely different from
the last, and that difference is what mimics walking into the hospital on a new day.
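Here is a minimal sketch of ONE re-deal, assuming NumPy is to hand. The pile of ten patient numbers, the seed 0, and the names `pile`, `fake_pile` and `left_out` are all made up for illustration; none of them come from the real data.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for this sketch
pile = np.arange(10)             # made-up patient numbers 0..9

# One re-deal: draw as many patients as the pile holds, putting each one back.
fake_pile = rng.choice(pile, size=len(pile), replace=True)

drawn = np.unique(fake_pile)            # patients picked at least once
left_out = np.setdiff1d(pile, drawn)    # patients never picked in this deal

print("fake pile:", fake_pile)
print("left out :", left_out)
```

Run it with a few different seeds and the fake pile changes each time, which is the "new morning at the hospital" effect in miniature.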
WHY is the trick allowed? Name the one assumption it leans on. A fresh crowd would
be drawn from the world of all possible patients -- a bag I cannot reach. The pile I
own is the best stand-in for that bag I will ever hold: every patient in it WAS drawn
from that world, in the proportions the world coughed up. So drawing from my pile,
with repeats, imitates drawing from the world -- imperfectly, and the imperfection
shrinks as the pile grows. With 4 patients the fake crowds are a crude puppet of the
world; with 442 they are a decent one. That is the whole licence: the pile stands in
for the world because the world is where the pile came from. No theorem hides here --
if the pile was gathered crookedly (one clinic, one age group), the fakes inherit the
crookedness, and no amount of re-dealing washes it out.
>> NOTE: WHY ABOUT 37% SIT OUT EVERY DEAL
Putting each patient back means some never get picked. By pencil, for a pile of
80 people:
chance one patient is MISSED in a single draw = (80 - 1)/80 = 79/80
chance they are missed in ALL 80 draws = (79/80)^80
Computing (79/80)^80 by repeated squaring (pencil and paper):
(79/80)^2 = 6241/6400 ~ 0.975 square it
(79/80)^4 = 0.975 * 0.975 ~ 0.951 square again
(79/80)^8 = 0.951 * 0.951 ~ 0.904 and again
(79/80)^16 = 0.904 * 0.904 ~ 0.817 and again
(79/80)^32 = 0.817 * 0.817 ~ 0.668 and again
(79/80)^64 = 0.668 * 0.668 ~ 0.446 and again
80 draws = 64 + 16 (since 80 = 64 + 16)
So (79/80)^80 = (79/80)^64 * (79/80)^16
~ 0.446 * 0.817
~ 0.364
The pencil roundings drift a little low: carried to more digits, the answer is
about 0.366, so call it 37%. For any decent-sized pile the answer lands near
1/e ~ 0.37. So a little over a THIRD of the pile sits out every single
deal -- and those left-out people turn out to be a gift (see "The
Free Exam" below).
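The same sum can be checked without the repeated squaring. This is a small sketch using only Python's standard library; `chance_left_out` is a hypothetical helper name, and the pile sizes 4, 80 and 442 are simply the ones mentioned in this post.

```python
import math

def chance_left_out(n):
    """Chance that one given patient is never drawn in n draws with replacement."""
    return ((n - 1) / n) ** n

for n in (4, 80, 442):
    print(n, round(chance_left_out(n), 4))

print("1/e =", round(math.exp(-1), 4))
```

The value creeps up toward 1/e as the pile grows, which is why the "about 37%" rule holds for any pile that is not tiny.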
## The Wobble Band
IN HAND: one real pile, re-dealt WITH REPEATS into 200 fake piles (the pile stands in
for the world because the world is where the pile came from), each fake pile refit into
its own version of every dial -- so each dial now has 200 values, not one. This section
turns those 200 values into a trust range.
After 200 re-deals, one dial -- say bmi -- has 200 different values written down. Sort
them smallest to largest. Chop the extreme 2.5% off each end. What is left in the middle
is the 95% WOBBLE BAND:
0.38 0.40 0.42 ... 0.56 0.58 0.60
+chop-+ keep the middle 190 +chop-+
[ 0.42 --------- 0.58 ] <- the 95% band
The WIDTH of that band is exactly how much to trust the dial:
tight band, well away from 0 -> trust it
wide band, still clear of 0 -> real, but shaky
band CROSSES 0 -> might be nothing -- do not even trust its sign
That last line is the sharpest tool in the box. If a dial's band runs from -0.80 to +0.90,
the machine cannot even tell you whether that column pushes the answer UP or DOWN. The
honest report is: "this column might be doing nothing at all."
!! WARN: 95% MEANS CHOP 2.5% EACH END -- NOT 25%
It is dangerously easy to reach for the quartiles from way back in Chapter 1 and chop
25% off each end. That is the wrong band by a mile.
middle 50% (quartiles): chop 25% each end -> quantile 0.25 and 0.75
middle 95% (this band): chop 2.5% each end -> quantile 0.025 and 0.975
Use 0.025 and 0.975 (because 100% - 2.5% = 97.5%). Reach for 0.25 and 0.75 and your
band comes out far too narrow: shaky dials look steady, and bands that should cross
zero stop short of it. We want to be almost certain, not merely fairly sure.
By pencil, for 200 values:
2.5% of 200 = 200 * 2.5/100 = 200 * 0.025 = 5
So chop the 5 smallest values and the 5 largest. Keep index 5 through
194 (counting from 0). That's 200 - 5 - 5 = 190 values in the middle.
The number at index 5 (0.42) is the lower edge. The number at index 194
(0.58) is the upper edge. The band = 0.42 to 0.58.
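The chop itself can be sketched in a few lines, assuming NumPy. The 200 dial values below are made up (evenly spaced from 0.38 to 0.60 purely so the result is easy to check), and `percentile_band` is a hypothetical helper name, not a library function.

```python
import numpy as np

def percentile_band(values, keep=0.95):
    """Sort the values, chop an equal share off each end, return the two edges."""
    ordered = np.sort(np.asarray(values, dtype=float))
    chop = int(round(len(ordered) * (1 - keep) / 2))   # 5 when there are 200 values
    kept = ordered[chop:len(ordered) - chop]           # index 5 through 194 for 200 values
    return kept[0], kept[-1], len(kept)

dials = np.linspace(0.38, 0.60, 200)   # made-up stand-in for 200 re-dealt dials
lower, upper, n_kept = percentile_band(dials)
print(n_kept, round(lower, 3), round(upper, 3))
```

One caution: `np.quantile(values, 0.025)` interpolates between neighbouring values by default, so its edges can differ very slightly from this by-hand slice. With 200 values the difference is usually too small to matter.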
>> YOUR TURN
Suppose you ran 400 re-deals instead of 200 (made-up). How many dial values do
you chop off EACH end for the 95% band, and how many stay in the middle?
check your slate: 2.5% of 400 = 400 x 0.025 = 10; chop 10 smallest and 10
largest; 400 - 10 - 10 = 380 stay in the middle. More re-deals, a smoother
band -- but always 2.5% sliced off each tail, never 25%.
!! WARN: COUNTING NOTES IS NOT MEASURING WIDTH
Every dial's 95% band keeps the same NUMBER of values in the middle -- 190 of the 200,
always. That count tells you nothing. The trust is in the WIDTH: the gap between the
number at index 5 and the number at index 194 (the 6th and 195th notes).
bmi-dial: 0.42 to 0.58 -> width 0.16 (narrow, steady, trustworthy)
s5-dial: -0.80 to +0.90 -> width 1.70 (wide, crosses 0, untrustworthy)
>> YOUR TURN
Two more dials (made-up): age-dial band 0.20 to 0.60; bp-dial band -0.30 to
+0.50. Work each width, and say which you can trust.
check your slate: age width = 0.60 - 0.20 = 0.40, and the whole band sits ABOVE
zero -> trust it (a real push up). bp width = 0.50 - (-0.30) = 0.80, and the band
STRADDLES zero -> cannot even trust its sign; bp might be doing nothing at all.
Same 190 notes in both. Wildly different trust. Never confuse the count with the gap.
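The two checks, width and "does it straddle zero", fit in one tiny function. This is a sketch in plain Python; `judge_band` is a hypothetical name, and the four bands are the illustrative ones from this section, not fitted results.

```python
def judge_band(lower, upper):
    """Return the band's width and whether it straddles zero."""
    width = upper - lower
    crosses_zero = lower <= 0 <= upper
    return width, crosses_zero

bands = {
    "bmi": (0.42, 0.58),
    "s5": (-0.80, 0.90),
    "age": (0.20, 0.60),
    "bp": (-0.30, 0.50),
}

for name, (lo, hi) in bands.items():
    width, crosses = judge_band(lo, hi)
    verdict = "cannot trust the sign" if crosses else "sign is settled"
    print(f"{name}: width {width:.2f}, {verdict}")
```

Note that the function never looks at how many values sit inside the band. Only the two edges matter.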
## The Free Exam (Out-of-Bag)
Now the gift. Remember that ~37% of patients sit out of every deal. The machine never saw
them while it set its dials -- which makes them a perfect, honest, FREE EXAM. No need to
touch the sealed exam pile at all.
for each of the 200 re-deals:
build the dials on the PICKED patients
grade on the LEFT-OUT patients -> one honest error (RMSE)
-> 200 honest errors -> average them, and chop 2.5% each end for a band
The entire trick is grading on the left-out people, not on the ones the machine just
studied. Grade it on its own working crowd and the score is a fake -- flattering and
over-confident, the same self-graded-exam lie from Chapter 1. The left-out third keeps it
honest for free.
** KEY: THE BAND IS THE REPORT, NOT THE SINGLE NUMBER
One dial value is a point; the wobble band is the truth around it. A wide band on the
error means the machine itself is shaky -- it would guess differently if the patients
had been different. Reporting the band instead of the bare number is the difference
between confidence and honesty.
## The Tug-of-War Behind the Whole Chapter
Step all the way back. Everything in this chapter -- the leash in Part 1, the wobble band
here -- is one single trade-off, and it is easiest to see as two bad drivers:
STUBBORN driver (too stiff -- "bias"):
drives dead straight, ignores every curve in the road
-> never panics, but calmly drives into the wrong building
PANICKY driver (too jumpy -- "variance"):
memorises every pebble and parked car it saw on Monday
-> Tuesday the car has moved, it panics, drives into a lake
A free-spinning machine is the panicky driver. Crushing its dials with Ridge or Lasso
makes it a little more stubborn ON PURPOSE -- trading away the panic to land in the sweet
spot between the two. The knob alpha from Part 1 is exactly where you choose to sit on that
line; the 5-slice valley is how you find the sweet spot; and the wobble band here is how
you measure the panic that is left over.
## The Labels, Last
Plain term used above                Standard label
-----------------------------------  ------------------------------------------
re-deal with repeats                 bootstrap / sampling with replacement
200 versions of a dial               the bootstrap distribution
the 95% wobble band                  95% confidence interval
chop 2.5% each end                   the 2.5th and 97.5th percentiles
band crosses zero                    not statistically distinguishable from 0
the ~37% left out                    out-of-bag (OOB) sample
the free exam                        out-of-bag error estimate
stubborn driver / too stiff          high bias
panicky driver / too jumpy           high variance
the whole trade                      the bias-variance trade-off
## The Code, If You Want It
Nothing above needed a computer -- only pencils, clerks, and patience. This last
section is for the day you meet one: the same steps, spoken in Python.
Two small loops. The first re-deals the pile 200 times and collects every dial; the band
is just two quantiles of those. The second does the same but grades each deal on its
left-out patients.
>> NEW TO PYTHON? Each named once:
rng.choice(n, size=n, replace=True) -- draw n patients WITH repeats (the re-deal)
df.quantile(0.025) -- the value 2.5% of the way up the sorted pile
~np.isin(everyone, picked) -- the ~ flips True/False: here, "NOT picked"
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

RANDOM_STATE = 42   # assumed seed (the post never fixes one); any integer works

# --- wobble bands for the dials ---
def bootstrap_ols_coefficients(X_train_scaled, y_train, B=200, feature_names=None):
    X_arr, y_arr = np.asarray(X_train_scaled), np.asarray(y_train)
    n = len(X_arr)
    coefs = []
    rng = np.random.RandomState(RANDOM_STATE)
    for _ in range(B):
        idx = rng.choice(n, size=n, replace=True)   # one re-deal, with repeats
        model = LinearRegression().fit(X_arr[idx], y_arr[idx])
        coefs.append(model.coef_)
    # feature_names is optional; without it the columns are numbered 0, 1, 2, ...
    coef_bootstrap_df = pd.DataFrame(coefs, columns=feature_names)
    lower = coef_bootstrap_df.quantile(0.025)   # chop 2.5% off the bottom
    upper = coef_bootstrap_df.quantile(0.975)   # chop 2.5% off the top
    coef_ci_95 = pd.DataFrame({'lower': lower, 'upper': upper})
    return coef_bootstrap_df, coef_ci_95

# --- the free exam: out-of-bag error ---
def bootstrap_oob_rmse_ols(X_train_scaled, y_train, B=200):
    X_arr, y_arr = np.asarray(X_train_scaled), np.asarray(y_train)
    n = len(X_arr)
    rmses = []
    rng = np.random.RandomState(RANDOM_STATE)
    for _ in range(B):
        idx = rng.choice(n, size=n, replace=True)
        oob = ~np.isin(np.arange(n), idx)       # everyone NOT picked
        if oob.sum() == 0:                      # nobody left out: no exam this deal
            continue
        model = LinearRegression().fit(X_arr[idx], y_arr[idx])
        guesses = model.predict(X_arr[oob])     # grade on the LEFT-OUT
        rmses.append(np.sqrt(mean_squared_error(y_arr[oob], guesses)))
    rmses = np.array(rmses)
    rmse_oob_mean = float(rmses.mean())
    rmse_oob_ci95 = (float(np.quantile(rmses, 0.025)), float(np.quantile(rmses, 0.975)))
    return rmse_oob_mean, rmse_oob_ci95
!! WARN: THE ~ IS THE WHOLE FREE EXAM
np.isin marks True for everyone who WAS drawn into the deal. The ~ flips it, leaving
True for everyone who was NOT. Drop the ~ and you grade the machine on the very
patients it just studied -- the self-graded exam again, fake and over-confident. The
one little squiggle is the difference between an honest score and a lie.
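To watch the squiggle work on something small enough to check by eye, here is a sketch with six made-up patient numbers and one made-up re-deal. The names `everyone`, `picked`, `was_picked` and `left_out` are illustrative only.

```python
import numpy as np

everyone = np.arange(6)                  # six made-up patient numbers, 0..5
picked = np.array([0, 2, 2, 5, 5, 5])    # one made-up re-deal, repeats included

was_picked = np.isin(everyone, picked)   # True where the patient was drawn
left_out = ~was_picked                   # the ~ flips every True/False

print(everyone[was_picked])   # [0 2 5] -> the patients the machine studied
print(everyone[left_out])     # [1 3 4] -> the patients held back for the free exam
```

Patients 2 and 5 were drawn more than once, yet each appears only once in the "studied" list: the mask records WHO was picked, not how often.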
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 4 -- Humble Dials and Wobble Bands):
Part 1 -- The Leash
Part 2 (this post)
Part 3 -- The Dial by Hand
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
home . source on GitHub
==============================================================================================