==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand . est. 2026
==============================================================================================
Plain notes on how machine-learning rules actually work. Each one is stripped of
jargon, drawn first as a picture, solved by pencil, then written in code -- with the
textbook labels stuck on only at the very end. No hype, no frameworks, no JavaScript.
This is not a pile of loose articles -- it is a short BOOK in ten chapters (twenty-eight posts),
meant to be read top to bottom. Each post opens with a "post N of 28" path header and a
next-arrow, so you can read the whole thing front to back without ever hunting for what
comes next. Each chapter draws its ideas by hand first; the runnable Python is gathered
at the end of every post, and assumes almost no Python. Three appendices tie it together.
>> NEW HERE? Start at post 1 of 28:
Guessing House Prices, End to End
No prior knowledge required. Just follow the pencil, and follow the next-arrows.
>> THE 1950 CONTRACT: print these pages and work them with a pencil -- nothing in the
teaching ever needs a computer. Where arithmetic gets heavy, imagine a room of
tireless clerks; every cost is counted in clerk-steps, and every number is
recomputed where it is needed, never remembered. Full contract on the
about page.
>> SPEAK FLUENT BUZZWORD? The Glossary / Decoder Ring translates every
plain term on this blog into its textbook name -- and back again.
CHAPTER 1 . PREDICTING HOUSE PRICES -- supervised regression from scratch
-------------------------------------------------------------------------
Part 1 -- Guessing House Prices, End to End
post 1/28 . 2026-06-01 . part 1 of 3 . fundamentals, supervised-learning
The whole picture on the California housing pile: why 20% of rows are locked away
first, the ask-closest rule and the straight-stick rule introduced plainly, RMSE
against the always-average baseline, and picking settings by rotating folds.
Part 2 -- Ask the Closest Rows: Gap, Same-Ruler, and How Slow It Gets
post 2/28 . 2026-06-01 . part 2 of 3 . knn, complexity, scaling
The ask-closest rule from the gap formula up: why a column measured in thousands
runs the whole show and must be put on the same ruler, the cost per guess and why
close stops meaning close as columns pile up, k as the stiff-vs-jumpy dial.
Part 3 -- The Straight-Stick Rule: Dials, Leftovers, and Numerics
post 3/28 . 2026-06-01 . part 3 of 3 . linear-regression, least-squares, numerics
From the bowl-shaped squared leftover to the exact dials: a three-row pencil solve
of the flat-point equations, right-angle leftovers and the flat-shadow thrower, the
Gauss-Markov conditions, and why you cut the sheet (QR/SVD) rather than flip it.
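A minimal sketch of a three-row pencil solve, assuming one column plus a starting height (intercept). The three rows are made up for illustration and are not the post's own numbers; the function name is hypothetical too. It solves the two flat-point equations directly, which is fine on a sheet this small. On real sheets the advice above to cut (QR/SVD) rather than flip still applies.

```python
def straight_stick(xs, ys):
    """Solve the two flat-point (normal) equations for an intercept and one slope."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Three made-up rows (hypothetical data).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 5.0]

intercept, slope = straight_stick(xs, ys)

# Leftover = truth minus guess, one per row.
leftovers = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Right-angle check: the leftovers should sum to zero and have
# zero dot-product with the column itself.
print(intercept, slope)
print(sum(leftovers), sum(x * l for x, l in zip(xs, leftovers)))
```

Both printed checks should come out at zero up to rounding; that is what "right-angle leftovers" means in numbers.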
CHAPTER 2 . GRADING A GUESSER -- how good is the guess?
------------------------------------------------------
Part 1 -- Two Rulers for One Guess: MSE and R^2
post 4/28 . 2026-06-04 . part 1 of 2 . evaluation, mse, r-squared
The same straight-stick rule on a sheet of cars, graded two ways: MSE (RMSE with the
root left off) and R^2 -- a race against a fool who only ever shouts the average.
Why you divide by the whole wobble, why test R^2 can dip below zero, and the
stiff-vs-jumpy split of the miss-size.
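The race against the average-shouter can be sketched in a few lines. This assumes the usual definition, R^2 = 1 - (rule's squared misses) / (average-shouter's squared misses), with the average taken over the rows being graded. The numbers and the function name are made up for illustration.

```python
def r_squared(truth, guesses):
    """1 minus the rule's squared misses divided by the whole wobble."""
    average = sum(truth) / len(truth)
    rule_miss = sum((t - g) ** 2 for t, g in zip(truth, guesses))
    whole_wobble = sum((t - average) ** 2 for t in truth)
    return 1 - rule_miss / whole_wobble


truth = [10.0, 20.0, 30.0]          # hypothetical exam rows

good = r_squared(truth, [12.0, 19.0, 29.0])   # close guesses
fool = r_squared(truth, [20.0, 20.0, 20.0])   # always shouts the average
bad = r_squared(truth, [30.0, 20.0, 10.0])    # worse than the fool

print(good, fool, bad)
```

The last case is the point: a rule that misses by more than the average-shouter gets a negative R^2, which is how a test score can dip below zero.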
Part 2 -- Reading the Dials: What the Coefficients Say
post 5/28 . 2026-06-04 . part 2 of 2 . coefficients, interpretation, scaling
Each column keeps one dial -- plus pushes up, minus drags down. Zip the names back
on, find the most-negative (a MIN, not a MAX), and watch the trap: raw dials wear
each column's own units, so they are not comparable until the columns sit on one
shared ruler.
CHAPTER 3 . SORTING INTO BINS -- yes-or-no guessing, scored every way
--------------------------------------------------------------------
Part 1 -- The S-Curve, the Four-Box Table, and Why Accuracy Lies
post 6/28 . 2026-06-05 . part 1 of 4 . logistic-regression, classification, evaluation
The shift from "guess a number" to "guess a bin": the S-curve squash, the
cross-entropy leftover, and how any dial sum becomes a chance. Then the four-box
table of CAUGHT/ALARM/MISSED/CLEAR, four scores that read different corners of it,
and why a machine that catches nobody can still score 63% accuracy.
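A small sketch of the four-box table, assuming this reading of the labels: CAUGHT = sick and flagged, ALARM = well but flagged, MISSED = sick and not flagged, CLEAR = well and not flagged. The pile of 37 sick and 63 well is made up so that the arithmetic lands on 63%; it is not the post's own sheet.

```python
def four_box(truth, guess):
    """Count the four corners. 1 means sick (or flagged), 0 means well (or not flagged)."""
    boxes = {"CAUGHT": 0, "ALARM": 0, "MISSED": 0, "CLEAR": 0}
    for t, g in zip(truth, guess):
        if t == 1 and g == 1:
            boxes["CAUGHT"] += 1
        elif t == 0 and g == 1:
            boxes["ALARM"] += 1
        elif t == 1 and g == 0:
            boxes["MISSED"] += 1
        else:
            boxes["CLEAR"] += 1
    return boxes


def accuracy(boxes):
    return (boxes["CAUGHT"] + boxes["CLEAR"]) / sum(boxes.values())


def catch_rate(boxes):
    """Share of the truly sick that were caught (recall)."""
    return boxes["CAUGHT"] / (boxes["CAUGHT"] + boxes["MISSED"])


# Hypothetical pile: 37 sick, 63 well. The machine flags nobody.
truth = [1] * 37 + [0] * 63
guess = [0] * 100

boxes = four_box(truth, guess)
print(boxes, accuracy(boxes), catch_rate(boxes))
```

Accuracy only reads the CAUGHT and CLEAR corners, so a machine that never flags anyone is paid in full for every well row.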
Part 2 -- The Trade Curve: Sliding the Cutoff and What AUC Measures
post 7/28 . 2026-06-05 . part 2 of 4 . roc-curve, auc, cutoff
One cutoff judges the machine at one spot; slide it through every position and you
get the trade curve (ROC). The area beneath it -- AUC -- compresses the whole curve
to one number: pick a random sick lump and a random well lump; AUC is the chance the
machine scored the sick one higher.
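The "pick one sick, pick one well" reading of AUC can be checked by brute force: try every sick/well pair and count how often the sick one scored higher. A minimal sketch with made-up scores; counting a tie as half a win is a common convention and an assumption here.

```python
def auc_by_pairs(sick_scores, well_scores):
    """Chance that a random sick row outscores a random well row."""
    wins = 0.0
    for s in sick_scores:
        for w in well_scores:
            if s > w:
                wins += 1.0
            elif s == w:
                wins += 0.5   # tie counted as half a win (assumed convention)
    return wins / (len(sick_scores) * len(well_scores))


# Hypothetical machine scores.
sick = [0.9, 0.8, 0.4]
well = [0.7, 0.3, 0.2, 0.1]

auc = auc_by_pairs(sick, well)
print(auc)
```

Of the 12 pairs, only one (sick 0.4 against well 0.7) goes the wrong way, so the result is 11/12. No cutoff appears anywhere in the sum.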
Part 3 -- Leash and Cloud: L2 Punishment and the Two-Cloud Wall
post 8/28 . 2026-06-05 . part 3 of 4 . l2-regularization, lda
Two ways to humble an over-confident machine. L2: add a squared-dial price to the
leftover so no single dial dominates (C = 1/lambda -- small C is a heavy squeeze).
LDA: forget rolling downhill; read the two cloud centres and their shared spread,
then solve for the wall in one closed-form step.
Part 4 -- Picking Settings, Skewed Piles, and Averaging Many Classes
post 9/28 . 2026-06-05 . part 4 of 4 . grid-search, precision-recall, class-imbalance
Grid-hunting settings by rotating folds; why pinch-to-fit scaling breaks around
outliers; the precision-vs-recall trade in fraud detection; precision-recall curves
for when one class is 100x rarer; and macro vs micro averaging for 3+ bins.
CHAPTER 4 . HUMBLE DIALS AND WOBBLE BANDS -- regularisation, and how much to trust a dial
----------------------------------------------------------------------------------------
Part 1 -- The Leash: Ridge, Lasso, and Humbling the Dials
post 10/28 . 2026-06-07 . part 1 of 3 . ridge, lasso, regularisation
Free dials grow huge and memorise; put them on a leash. Ridge (the square fine) shrinks
all dials and zeroes none, solving in one step. Lasso (the absolute fine) snaps weak
dials to exactly zero and hands you a shortlist of columns. Pick the knob by the valley.
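Why one leash shrinks and the other snaps can be seen on a single dial. This sketch assumes the simplest case, where columns sit on one ruler and do not overlap, so each dial can be handled alone. Under that assumption, ridge minimises (1/2)(w - free)^2 + (knob/2) w^2 and lasso minimises (1/2)(w - free)^2 + knob * |w|. The function names are hypothetical.

```python
def ridge_shrink(free_dial, knob):
    """Square fine: every dial is scaled down, none reaches exactly zero."""
    return free_dial / (1.0 + knob)


def lasso_snap(free_dial, knob):
    """Absolute fine: pull the dial toward zero by `knob`, and stop at zero."""
    if free_dial > knob:
        return free_dial - knob
    if free_dial < -knob:
        return free_dial + knob
    return 0.0


# Hypothetical unleashed dials: one strong, one weak.
for free in (2.0, 0.3):
    print(free, ridge_shrink(free, 0.5), lasso_snap(free, 0.5))
```

The weak dial survives ridge as a smaller number but is set to exactly zero by lasso, which is where the shortlist of columns comes from.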
Part 2 -- One Dial Is a Lie: Bootstrap, Wobble Bands, and the Free Exam
post 11/28 . 2026-06-07 . part 2 of 3 . bootstrap, confidence-interval, out-of-bag
A single dial value hides how shaky it is. Fake 200 hospital visits by re-dealing the
pile with repeats, and read off a 95% wobble band per dial (chop 2.5% each end). The
~37% left out of each deal become a free, honest exam.
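Where the ~37% comes from: in a deal of n draws with repeats, one given row is skipped on every draw with chance (1 - 1/n)^n, which settles near 1/e as n grows. The sketch below computes that and checks it against one simulated deal. The pile size of 1000 and the seed are arbitrary choices, not the post's.

```python
import math
import random


def left_out_chance(n):
    """Chance that one given row is never picked in n draws with repeats."""
    return (1 - 1 / n) ** n


def left_out_fraction(n, seed=0):
    """Deal n rows with repeats once and report the share never picked."""
    rng = random.Random(seed)
    dealt = {rng.randrange(n) for _ in range(n)}
    return 1 - len(dealt) / n


print(left_out_chance(1000), math.exp(-1))
print(left_out_fraction(1000))
```

The rows that were never dealt are the free exam: the machine fitted on that deal has not seen them.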
Part 3 -- The Dial by Hand: Where the Dials Really Come From
post 12/28 . 2026-06-07 . part 3 of 3 . least-squares, collinearity, derivation
The whole machine on a blank sheet: the dials are (X^T X)^-1 X^T y, the untangler that
shares credit between overlapping columns. The one-column slope is a LADDER that matches
only when columns don't overlap -- name it out loud before you build anything on it.
CHAPTER 5 . QUESTION CHARTS AND COMMITTEES -- trees, pruning, and ensembles
--------------------------------------------------------------------------
Part 1 -- Question Charts: Building a Tree by Hand
post 13/28 . 2026-06-07 . part 1 of 3 . decision-tree, regression, classification
No dials, no formula: a machine that asks yes/no questions. Built from scratch --
sort each column, try every midpoint, pick the cleanest cut, recurse. Five wrong
pictures corrected, among them: one column per split, depth is people not columns,
build and use are separate phases, same column may recur at a tighter cut.
Part 2 -- The Mixing Ruler: Gini, Information Gain, and Pruning
post 14/28 . 2026-06-07 . part 2 of 3 . gini, information-gain, pruning
Flip to the cancer sick/well sheet: the badness ruler swaps from squared misses to
Gini (chance two random grabs disagree) -- derived by hand from four counts. Then
pruning: a tax per leaf snips weak branches after the tree is fully grown; the alpha
menu comes from the tree itself; the nested-averages ladder picks the best tax.
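A sketch of the mixing ruler from four counts, assuming the four counts are sick and well on each side of one cut. Gini for a pile is the chance that two grabs (with the first put back) disagree, 1 - p_sick^2 - p_well^2. The counts are made up for illustration.

```python
def gini(sick, well):
    """Chance two random grabs from this pile disagree."""
    total = sick + well
    p_sick = sick / total
    p_well = well / total
    return 1 - p_sick ** 2 - p_well ** 2


def split_gini(left_sick, left_well, right_sick, right_well):
    """Gini of a cut: each side's Gini, weighted by how many rows it holds."""
    left = left_sick + left_well
    right = right_sick + right_well
    total = left + right
    return (left / total) * gini(left_sick, left_well) \
        + (right / total) * gini(right_sick, right_well)


# Hypothetical cut: left pile 40 sick / 10 well, right pile 5 sick / 45 well.
before = gini(45, 55)
after = split_gini(40, 10, 5, 45)
print(before, after)
```

The cut is worth making when the weighted Gini after it is lower than the Gini before it; the cleanest cut is the one with the lowest weighted value.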
Part 3 -- Committees: Bagging, Random Forest, and Boosting
post 15/28 . 2026-06-07 . part 3 of 3 . bagging, random-forest, boosting, ensembles
One chart memorises and panics; 200 averaged charts are eerily steady. Bagging
re-deals with repeats and averages 200 trees (OOB free exam built in). Random forest
hides a handful of columns at each cut so the trees truly differ. Boosting chains
stumps in a line, each fixing the last one's leftovers. Two ways to interrogate the
black box: scramble a column (permutation importance) and slide a column (PDP/ICE).
APPENDICES . FLIP-TO REFERENCES -- every term in one place
----------------------------------------------------------
A. Classification Reference -- Loss, Leash, Grid, and All the Terms
reference . 2026-06-06 . companion to Chapter 3 . log-loss, lda, grid-search, pr-curves
All classification concepts from Chapter 3 in one flip-to reference: cross-entropy vs
MSE, the C parameter, LDA with priors, hyperparameters vs dials, GridSearchCV fold
safety, min-max scaling, precision vs recall in business, ROC vs PR curves, skewed
piles, and macro/micro/weighted averaging -- plain language first, standard labels
last. Read Chapter 3 first; use this as a reference when you need to look something
up quickly.
B. Distance and Clustering Reference -- Rulers, Traps, and Ethics
reference . 2026-06-09 . companion to Chapter 6 . hamming, mahalanobis, missing-data, ethics
The loose ends Chapter 6 does not stop for: the k you pick, two more rulers (Hamming for
words, Mahalanobis for stretched many-walled rooms), the missing-data traps, why crush a
room at all, and the part the lab skips -- the ethics of sorting PEOPLE into piles (bias,
privacy, transparency) and what segmentation is really for.
C. LSTM From Pencil -- RNN and LSTM From Scratch, Nothing But a Pencil
reference . 2026-06-14 . companion to Chapter 10 . lstm, rnn, by-hand, kata
A KATA: one clean, self-contained walk through the RNN and the LSTM by pencil, made to be
redone on a blank sheet again and again until your hand knows it. Words into 32-number
rows, grid-times-row by hand, the RNN's one-memory recipe and why it FADES (a theorem),
then the two-memory fix -- a silent uncrushed vault A and a spoken readout B, four machines
(FRESH + keep/admit/show), the combine worked in real numbers (5.0 -> 4.62 -> 0.70) -- plus
a struggle log of every trap and a one-word end-to-end recitation. No computer, no code.
CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS -- unsupervised learning
----------------------------------------------------------------------
Part 1 -- Looking at a Sheet With No Answers: Means, Distance, and the Ruler Problem
post 16/28 . 2026-06-09 . part 1 of 6 . eda, distance, standardization, unsupervised
No answer column, no right or wrong -- just 50 states and 3 measurements. The mean
and spread of each column, the ruler problem (Assault in hundreds drowns UrbanPop
in tens), two gap rulers (straight-line Euclidean and city-block Manhattan), the
sheet of gaps, and the fix: put every column on the same ruler and measure again.
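The two gap rulers and the same-ruler fix, as a minimal sketch. The two columns below are made-up stand-ins (one in hundreds, one in tens), not the actual state sheet, and the function names are hypothetical. "Same ruler" here means subtract the column mean and divide by the column spread.

```python
import math


def straight_line_gap(a, b):
    """Euclidean: square each column's difference, add, take the root."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block_gap(a, b):
    """Manhattan: add up each column's absolute difference."""
    return sum(abs(x - y) for x, y in zip(a, b))


def same_ruler(column):
    """Shift to mean 0 and scale to spread 1."""
    mean = sum(column) / len(column)
    spread = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / spread for v in column]


# Hypothetical columns: one measured in hundreds, one in tens.
big = [100.0, 200.0, 300.0]
small = [50.0, 60.0, 70.0]

raw_rows = list(zip(big, small))
fair_rows = list(zip(same_ruler(big), same_ruler(small)))

print(straight_line_gap(raw_rows[0], raw_rows[1]))    # dominated by the big column
print(straight_line_gap(fair_rows[0], fair_rows[1]))  # both columns count equally
```

On the raw sheet the small column barely moves the gap; after the fix each column contributes the same amount.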
Part 2 -- The Strongest Direction: Crushing a Many-Wall Room Into a Flat Page (PCA)
post 17/28 . 2026-06-09 . part 2 of 6 . pca, dimensionality-reduction, visualization
13 chemical measurements = 13 walls. You cannot draw a 13-wall room. Shine a
flashlight, trace the longest shadow (PC1), then the next at a right angle (PC2).
A 2-column by-hand worked example, the recipe (loadings), how much each shadow
carries (PVE), choosing how many to keep (80% threshold), and blowing the shadow
back up (reconstruction error).
Part 3 -- Grouping by Nearest Centre: K-Means From a Blank Sheet
post 18/28 . 2026-06-09 . part 3 of 6 . k-means, clustering, unsupervised
The first machine that actually GROUPS: pick K, then loop two moves -- assign every dot
to its nearest centre, move each centre to its pile's mean. Tightness as the score, the
unlucky-start trap, and picking K by the elbow.
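The two-move loop, sketched on dots along a single line so it can be followed by pencil. The dots and starting centres are made up; keeping an empty pile's centre where it was is one simple choice among several, and an assumption here.

```python
def k_means_1d(dots, centres, max_rounds=100):
    """Loop: assign each dot to its nearest centre, then move centres to pile means."""
    centres = list(centres)
    piles = [[] for _ in centres]
    for _ in range(max_rounds):
        # Move 1: every dot joins the pile of its nearest centre.
        piles = [[] for _ in centres]
        for d in dots:
            nearest = min(range(len(centres)), key=lambda i: abs(d - centres[i]))
            piles[nearest].append(d)
        # Move 2: each centre moves to the mean of its pile (stays put if empty).
        new_centres = [sum(p) / len(p) if p else c for p, c in zip(piles, centres)]
        if new_centres == centres:
            break   # nothing moved, so the loop has settled
        centres = new_centres
    return centres, piles


def tightness(centres, piles):
    """Total squared gap from each dot to its own centre (lower is tighter)."""
    return sum((d - c) ** 2 for c, p in zip(centres, piles) for d in p)


dots = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]          # hypothetical dots
centres, piles = k_means_1d(dots, centres=[1.0, 2.0])
print(centres, piles, tightness(centres, piles))
```

Even from a poor start (both centres in the left clump) this run settles on centres 2 and 11. That is not guaranteed in general, which is the unlucky-start trap.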
Part 4 -- The Family Tree: Hierarchical Clustering and the Dendrogram
post 19/28 . 2026-06-09 . part 4 of 6 . hierarchical-clustering, dendrogram, linkage
No K up front: start with everyone alone and marry the two closest groups over and over,
recording each wedding's height. Read the family tree, cut it wherever you like, and the
cut decides how many groups. Single, complete, average, and Ward linkage compared.
Part 5 -- Both Tools on NCI60: PCA and Clustering on Real Gene Data
post 20/28 . 2026-06-09 . part 5 of 6 . pca, clustering, nci60, case-study
All three tools on one hard sheet: 64 samples, 6,830 genes -- more columns than rows.
Crush with PCA, carve with K-means and the family tree, then unseal the cancer types and
grade the blind groups. Peeking at labels is a scorecard, never an input.
Part 6 -- Filling the Blanks: Recommender Systems and Matrix Factorisation
post 21/28 . 2026-06-09 . part 6 of 6 . recommender-systems, matrix-factorisation
A sheet that is mostly holes: users x movies, a rating only where someone watched. Split
the holey grid into two skinny full sheets of hidden tastes and flavours -- PCA's
scores-times-recipe shape again. Then the lab end to end: humble each movie column, drop
lazy zeros, and run rebuild-restore rounds until the marks freeze -- past the hand recipe
that died of its own echo.
CHAPTER 7 . BUILDING A NEURAL NETWORK FROM SCRATCH -- deep learning, forward then backward
-----------------------------------------------------------------------------------------
Part 1 -- Stacked Rooms and One Walk by Hand: How a Network Computes a Guess
post 22/28 . 2026-06-11 . part 1 of 2 . neural-network, deep-learning, relu, sigmoid
A fully self-contained start -- no prior posts needed. One room of clerks draws straight
lines; stacked rooms draw anything, but only if a non-linear bend (the zero-out rule,
ReLU) breaks the collapse between them. The forward pass walked by hand: three-cut split,
humbling the columns, every multiplication of one patient shown, the S-curve derived from
odds, and cross-entropy loss. Ends with a built network whose dials have not yet moved.
Part 2 -- Rolling Downhill by Hand: How a Network Learns
post 23/28 . 2026-06-11 . part 2 of 2 . backpropagation, gradient-descent, adam, dropout
The dials finally learn. Backpropagation worked by hand on a one-input, one-hidden,
one-output network: the chain rule link by link, the (guess - truth) shortcut derived,
every slope checked against a brute-force wiggle, then the error sent one room further
back through the dead-clerk gate. Learning rate, Adam, dropout as co-adaptation, the
overflow clip, three real debugging mistakes, and the end-to-end run (0.974 exam).
CHAPTER 8 . KEEPING A NETWORK HONEST -- the fight against memorising
-------------------------------------------------------------------
Five Machines Against Memorising: A Tax, a Coffee Break, a Fire Alarm, and a Humbler
post 24/28 . 2026-06-12 . overfitting, regularisation, dropout, batchnorm
A network that learns too well memorises the study pile's freckles and flunks new
patients. One plain machine catches the disease on a sheet of clothing photos; four cures
each fight it differently -- a tax on big dials (L2), sending clerks home (dropout), a fire
alarm that stops at the valley floor (early stopping), and a humbler between floors (batch
norm). Softmax and its ten chances derived by hand, then all five judged on one sealed
pile, with the study-minus-practice gap naming the biggest memoriser.
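Softmax in a few lines: raise e to each score, then divide by the total so the ten chances add to 1. Subtracting the largest score first is a standard guard against overflow and does not change the result. The ten scores below are made up.

```python
import math


def softmax(scores):
    """Turn any list of scores into chances that sum to 1."""
    top = max(scores)
    lifted = [math.exp(s - top) for s in scores]   # largest score becomes e^0 = 1
    total = sum(lifted)
    return [v / total for v in lifted]


# Hypothetical scores for ten bins.
scores = [2.0, 1.0, 0.1, -1.0, 0.0, 0.5, -0.5, 1.5, -2.0, 0.3]
chances = softmax(scores)

print(chances)
print(sum(chances))
```

The bin with the highest score always gets the highest chance, and adding the same constant to every score leaves the chances unchanged.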
CHAPTER 9 . MACHINES THAT LOOK AT PICTURES -- convolution, by hand
-----------------------------------------------------------------
Part 1 -- A Magic Paper Slid Over a Photo: How a Picture Network Sees
post 25/28 . 2026-06-12 . part 1 of 2 . cnn, convolution, pooling, computer-vision
Flatten a photo and you destroy every "next to" -- so this machine keeps the picture folded
and slides a tiny 3x3 window (a magic paper of nine dials) across it, lighting up where its
shape sits. Built one worker at a time with every wrong picture corrected: the edge-finder
that scores 60 on an edge and 0 on flat grey, 27 dials across three colours, 32 inspectors
into 32 score-sheets, the shrink boss that keeps the loudest of each 2x2, the second floor
drilling through all 32, and the punchline -- 262,208 of 282,250 dials hide in one Dense
floor because the small papers are reused everywhere. (Colour photos, CIFAR-10.)
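The dial count can be re-added by hand. The layout below is an assumption, not taken from the post: 32x32 colour photos, a second floor of 64 papers, edges padded so only the shrink boss halves the sheet (32 to 16 to 8), a Dense floor of 64 clerks, and 10 output clerks. It is one layout that reproduces both quoted totals.

```python
def paper_floor_dials(paper_h, paper_w, sheets_in, inspectors):
    """Each inspector owns one paper per incoming sheet, plus one offset dial."""
    return paper_h * paper_w * sheets_in * inspectors + inspectors


def dense_floor_dials(numbers_in, clerks):
    """Each clerk owns one dial per incoming number, plus one offset dial."""
    return numbers_in * clerks + clerks


floor_1 = paper_floor_dials(3, 3, 3, 32)     # 27 dials per inspector across three colours
floor_2 = paper_floor_dials(3, 3, 32, 64)    # drills through all 32 score-sheets (64 is assumed)

side = 32 // 2 // 2                          # two 2x2 shrinks: 32 -> 16 -> 8 (assumed padding)
flattened = side * side * 64                 # numbers handed to the Dense floor

dense = dense_floor_dials(flattened, 64)     # assumed 64 clerks
output = dense_floor_dials(64, 10)           # ten bins for CIFAR-10

total = floor_1 + floor_2 + dense + output
print(floor_1, floor_2, dense, output, total)
```

The two paper floors together hold fewer than 20,000 dials because each small paper is reused at every position; the Dense floor gets no such reuse.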
Part 2 -- The Deep Factory: Humbler, Send-Home, and the Confusion Sheet
post 26/28 . 2026-06-13 . part 2 of 2 . cnn, batch-norm, dropout, confusion-matrix
The simple factory lands ~70%. Three layers of armour push it further: a humbler steadies
every inspector's 65,536 numbers to middle 0, scatter 1 (arithmetic done by hand, full
clerk count); send-home zeroes a random 25% of sheet entries to break secret teams (one
coin per entry, off at exam time); the deep factory inserts both in the right order --
floor → humbler → boss → send-home. Then the confusion sheet: sort 10,000 test cards
into 10 true-animal piles, sort each pile by guess, and the biggest off-diagonal cell
names cat↔dog as the most-confused pair. Ends with reading the magic-paper shape (3,3,3,32)
directly from the trained floor.
CHAPTER 10 . MACHINES THAT READ WORDS -- recurrent networks, by hand
-------------------------------------------------------------------
Part 1 -- Words Into a Machine: The Notepad and the Walking Worker
post 27/28 . 2026-06-13 . part 1 of 2 . rnn, embedding, nlp, sequence-models
A photo was already numbers; a review is WORDS, and a factory can only multiply numbers.
Number the 10,000 commonest words by frequency, pad every review to 100, then a NOTEPAD
swaps each word-number for a learned 32-number note (so "boring" and "dull" drift close,
and "not" can carry a flip). One walking worker reads the 100 notes IN ORDER, reusing one
dial-set to rewrite a 32-number memory -- the RNN cell, walked by hand word by word -- and
a final S-curve clerk reads the last memory for thumbs up or down. (IMDB sentiment.)
Part 2 -- The Two-Memory Worker: How an LSTM Remembers Far-Back Words
post 28/28 . 2026-06-13 . part 2 of 2 . lstm, bilstm, sequence-models
The plain worker crushes its one memory every word, so word 1 fades to nothing by word 90.
The fix, derived by hand: carry TWO memories -- a long keep (A, never crushed, free to grow
past 1) and a spoken recent (B) -- and let three 0..1 voters decide what to keep, admit, and
speak. That is the LSTM (the old RNN cell plus three dimmer knobs), shown with a full
real-number combine. Then the BiLSTM reads the review in both directions and glues the two
memories, followed by an honest four-way comparison and a blunt note: the fade is a
theorem, the exact wiring is an engineered choice.
----------------------------------------------------------------------------------------------
the method: ban the jargon . draw it first . solve by pencil . then write the code
built with: pandas, scikit-learn, NumPy, and a pencil . pure HTML+CSS, no JavaScript
(c) 2026 Rahul Rai . no cookies, no trackers . source on GitHub
==============================================================================================