==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  Plain notes on how machine-learning rules actually work. Each one is stripped of
  jargon, drawn first as a picture, solved by pencil, then written in code -- with the
  textbook labels stuck on only at the very end. No hype, no frameworks, no JavaScript.

  This is not a pile of loose articles -- it is a short BOOK in ten chapters (twenty-eight posts),
  meant to be read top to bottom. Each post opens with a "post N of 28" path header and a
  next-arrow, so you can read the whole thing front to back without ever hunting for what
  comes next. Each chapter draws its ideas by hand first; the runnable Python is gathered
  at the end of every post, and assumes almost no Python. Three appendices tie it together.

  >> NEW HERE? Start at post 1 of 28:
     Guessing House Prices, End to End
     No prior knowledge required. Just follow the pencil, and follow the next-arrows.

  >> THE 1950 CONTRACT: print these pages and work them with a pencil -- nothing in the
     teaching ever needs a computer.  Where arithmetic gets heavy, imagine a room of
     tireless clerks; every cost is counted in clerk-steps, and every number is
     recomputed where it is needed, never remembered.  Full contract on the
     about page.

  >> SPEAK FLUENT BUZZWORD? The Glossary / Decoder Ring translates every
     plain term on this blog into its textbook name -- and back again.


  CHAPTER 1 . PREDICTING HOUSE PRICES -- supervised regression from scratch
  -------------------------------------------------------------------------

  Part 1 -- Guessing House Prices, End to End
    post 1/28 . 2026-06-01 . part 1 of 3 . fundamentals, supervised-learning
    The whole picture on the California housing pile: why 20% of rows are locked away
    first, the ask-closest rule and the straight-stick rule introduced plainly, RMSE
    against the always-average baseline, and picking settings by rotating folds.

  Part 2 -- Ask the Closest Rows: Gap, Same-Ruler, and How Slow It Gets
    post 2/28 . 2026-06-01 . part 2 of 3 . knn, complexity, scaling
    The ask-closest rule from the gap formula up: why a column measured in thousands
    runs the whole show and must be put on the same ruler, the cost per guess and why
    close stops meaning close as columns pile up, k as the stiff-vs-jumpy dial.
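
    A minimal sketch of the same-ruler point, not the post's own code. The five-row sheet
    and the helper name nearest_to_first are made up for illustration: a dollars column in
    the thousands decides the raw gap, and the nearest row changes once both columns are
    rescaled.

```python
import numpy as np

# Hypothetical sheet (values invented): column 0 = income in dollars, column 1 = rooms.
sheet = np.array([
    [30000.0, 2.0],   # row 0: the row we want a neighbour for
    [30200.0, 8.0],   # row 1: almost the same income, very different rooms
    [34000.0, 2.0],   # row 2: same rooms, income 4000 higher
    [38000.0, 5.0],
    [26000.0, 3.0],
])

def nearest_to_first(rows):
    """Index of the row with the smallest straight-line gap to row 0."""
    gaps = np.sqrt(((rows[1:] - rows[0]) ** 2).sum(axis=1))
    return int(np.argmin(gaps)) + 1

# Raw columns: the dollars column runs the whole show.
raw_nearest = nearest_to_first(sheet)

# Same ruler: subtract each column's mean, divide by its spread.
same_ruler = (sheet - sheet.mean(axis=0)) / sheet.std(axis=0)
scaled_nearest = nearest_to_first(same_ruler)

print("nearest on raw columns:   row", raw_nearest)
print("nearest on the same ruler: row", scaled_nearest)
```

    On raw columns the six-room difference barely registers; on the same ruler it counts
    for more than the 4000-dollar difference does.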

  Part 3 -- The Straight-Stick Rule: Dials, Leftovers, and Numerics
    post 3/28 . 2026-06-01 . part 3 of 3 . linear-regression, least-squares, numerics
    From the bowl-shaped squared leftover to the exact dials: a three-row pencil solve
    of the flat-point equations, right-angle leftovers and the flat-shadow thrower, the
    Gauss-Markov conditions, and why you cut the sheet (QR/SVD) rather than flip it.


  CHAPTER 2 . GRADING A GUESSER -- how good is the guess?
  ------------------------------------------------------

  Part 1 -- Two Rulers for One Guess: MSE and R^2
    post 4/28 . 2026-06-04 . part 1 of 2 . evaluation, mse, r-squared
    The same straight-stick rule on a sheet of cars, graded two ways: MSE (RMSE with the
    root left off) and R^2 -- a race against a fool who only ever shouts the average.
    Why you divide by the whole wobble, why test R^2 can dip below zero, and the
    stiff-vs-jumpy split of the miss-size.
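
    Both rulers fit in a few lines. This is a sketch on three made-up numbers, not the
    post's car sheet; the function names are illustrative. It shows the always-average
    fool scoring exactly zero and a worse-than-average guesser going negative.

```python
import numpy as np

def mse(truth, guess):
    """Mean squared miss: RMSE with the root left off."""
    return float(np.mean((truth - guess) ** 2))

def r_squared(truth, guess):
    """1 minus (squared leftover) / (whole wobble around the average)."""
    leftover = np.sum((truth - guess) ** 2)
    wobble = np.sum((truth - truth.mean()) ** 2)
    return float(1.0 - leftover / wobble)

truth = np.array([10.0, 20.0, 30.0])          # invented answers
good = np.array([12.0, 19.0, 29.0])           # close guesses
fool = np.full(3, truth.mean())               # always shouts the average
bad = np.array([30.0, 10.0, 10.0])            # worse than the fool

print(mse(truth, good), r_squared(truth, good))
print(r_squared(truth, fool))
print(r_squared(truth, bad))
```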

  Part 2 -- Reading the Dials: What the Coefficients Say
    post 5/28 . 2026-06-04 . part 2 of 2 . coefficients, interpretation, scaling
    Each column keeps one dial -- plus pushes up, minus drags down. Zip the names back
    on, find the most-negative (a MIN, not a MAX), and watch the trap: raw dials wear
    each column's own units, so they are not comparable until the columns sit on one
    shared ruler.


  CHAPTER 3 . SORTING INTO BINS -- yes-or-no guessing, scored every way
  --------------------------------------------------------------------

  Part 1 -- The S-Curve, the Four-Box Table, and Why Accuracy Lies
    post 6/28 . 2026-06-05 . part 1 of 4 . logistic-regression, classification, evaluation
    The shift from "guess a number" to "guess a bin": the S-curve squash, the
    cross-entropy leftover, and how any dial sum becomes a chance. Then the four-box
    table of CAUGHT/ALARM/MISSED/CLEAR, four scores that read different corners of it,
    and why a machine that catches nobody can still score 63% accuracy.

  Part 2 -- The Trade Curve: Sliding the Cutoff and What AUC Measures
    post 7/28 . 2026-06-05 . part 2 of 4 . roc-curve, auc, cutoff
    One cutoff judges the machine at one spot; slide it through every position and you
    get the trade curve (ROC). The area beneath it -- AUC -- compresses the whole curve
    to one number: pick a random sick lump and a random well lump; AUC is the chance the
    machine scored the sick one higher.
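
    The pair-picking reading of AUC can be counted directly. A sketch under assumptions:
    the six scores are invented, the name auc_by_pairs is illustrative, and a tied pair
    is counted as half a win (a common convention, not something the blurb states).

```python
import numpy as np

def auc_by_pairs(scores, is_sick):
    """Share of (sick, well) pairs where the sick lump got the higher score."""
    scores = np.asarray(scores, dtype=float)
    is_sick = np.asarray(is_sick, dtype=bool)
    sick = scores[is_sick]
    well = scores[~is_sick]
    wins = (sick[:, None] > well[None, :]).sum()
    ties = (sick[:, None] == well[None, :]).sum()   # assumption: a tie is half a win
    return float((wins + 0.5 * ties) / (len(sick) * len(well)))

scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]   # invented machine scores
is_sick = [1, 1, 1, 0, 0, 0]              # invented truth

auc = auc_by_pairs(scores, is_sick)
print(auc)   # 8 of the 9 pairs are ranked correctly
```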

  Part 3 -- Leash and Cloud: L2 Punishment and the Two-Cloud Wall
    post 8/28 . 2026-06-05 . part 3 of 4 . l2-regularization, lda
    Two ways to humble an over-confident machine. L2: add a squared-dial price to the
    leftover so no single dial dominates (C = 1/lambda -- small C is a heavy squeeze).
    LDA: forget rolling downhill; read the two cloud centres and their shared spread,
    then solve for the wall in one closed-form step.

  Part 4 -- Picking Settings, Skewed Piles, and Averaging Many Classes
    post 9/28 . 2026-06-05 . part 4 of 4 . grid-search, precision-recall, class-imbalance
    Grid-hunting settings by rotating folds; why pinch-to-fit scaling breaks around
    outliers; the precision-vs-recall trade in fraud detection; precision-recall curves
    for when one class is 100x rarer; and macro vs micro averaging for 3+ bins.


  CHAPTER 4 . HUMBLE DIALS AND WOBBLE BANDS -- regularisation, and how much to trust a dial
  ----------------------------------------------------------------------------------------

  Part 1 -- The Leash: Ridge, Lasso, and Humbling the Dials
    post 10/28 . 2026-06-07 . part 1 of 3 . ridge, lasso, regularisation
    Free dials grow huge and memorise; put them on a leash. Ridge (the square fine) shrinks
    all dials and zeroes none, solving in one step. Lasso (the absolute fine) snaps weak
    dials to exactly zero and hands you a shortlist of columns. Pick the knob by the valley.

  Part 2 -- One Dial Is a Lie: Bootstrap, Wobble Bands, and the Free Exam
    post 11/28 . 2026-06-07 . part 2 of 3 . bootstrap, confidence-interval, out-of-bag
    A single dial value hides how shaky it is. Fake 200 hospital visits by re-dealing the
    pile with repeats, and read off a 95% wobble band per dial (chop 2.5% each end). The
    ~37% left out of each deal become a free, honest exam.
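
    A sketch of the re-deal on invented data (one column, one slope dial), assuming 200
    re-deals as in the blurb. It reads the band off the sorted slopes and checks the
    left-out share against (1 - 1/n)^n, which is where the ~37% comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented pile: one column x and an answer y with a true slope of 2.
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + rng.normal(0, 3.0, size=n)

slopes, left_out = [], []
for _ in range(200):
    picks = rng.integers(0, n, size=n)                # re-deal WITH repeats
    slope, _intercept = np.polyfit(x[picks], y[picks], deg=1)
    slopes.append(slope)
    left_out.append(1.0 - np.unique(picks).size / n)  # share of rows never dealt

# 95% wobble band: chop 2.5% off each end of the re-dealt slopes.
low, high = np.percentile(slopes, [2.5, 97.5])

# Chance a given row is missed by all n draws.
expected_left_out = (1 - 1 / n) ** n

print("slope band:", low, high)
print("left out on average:", np.mean(left_out), "expected:", expected_left_out)
```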

  Part 3 -- The Dial by Hand: Where the Dials Really Come From
    post 12/28 . 2026-06-07 . part 3 of 3 . least-squares, collinearity, derivation
    The whole machine on a blank sheet: the dials are (X^T X)^-1 X^T y, the untangler that
    shares credit between overlapping columns. The one-column slope is a LADDER that matches
    only when columns don't overlap -- name it out loud before you build anything on it.
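
    A sketch of the untangler next to the one-column ladder, on invented columns a and b
    that overlap by construction. It solves the flat-point equations with
    np.linalg.solve rather than forming the inverse, which gives the same dials as
    (X^T X)^-1 X^T y here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Invented overlapping columns: b is mostly a copy of a.
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)
y = 3.0 * a + 1.0 * b          # no noise, so the true dials are exactly 3 and 1

# Full solve: a column of ones for the offset, then a and b.
X = np.column_stack([np.ones(n), a, b])
dials = np.linalg.solve(X.T @ X, X.T @ y)

# One-column ladder: slope of y on a alone = cov(a, y) / var(a).
ladder = np.cov(a, y, bias=True)[0, 1] / np.var(a)

print("untangled dials:", dials)   # close to [0, 3, 1]
print("ladder slope on a:", ladder)  # near 3.8: it also pockets b's credit
```

    The ladder lands near 3 + 1 * 0.8 because a carries b's overlapping share with it;
    it would match the dial only if a and b did not overlap.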


  CHAPTER 5 . QUESTION CHARTS AND COMMITTEES -- trees, pruning, and ensembles
  --------------------------------------------------------------------------

  Part 1 -- Question Charts: Building a Tree by Hand
    post 13/28 . 2026-06-07 . part 1 of 3 . decision-tree, regression, classification
    No dials, no formula: a machine that asks yes/no questions. Built from scratch --
    sort each column, try every midpoint, pick the cleanest cut, recurse. Five wrong
    pictures corrected, among them: one column per split, depth is people not columns,
    build and use are separate phases, same column may recur at a tighter cut.

  Part 2 -- The Mixing Ruler: Gini, Information Gain, and Pruning
    post 14/28 . 2026-06-07 . part 2 of 3 . gini, information-gain, pruning
    Flip to the cancer sick/well sheet: the badness ruler swaps from squared misses to
    Gini (chance two random grabs disagree) -- derived by hand from four counts. Then
    pruning: a tax per leaf snips weak branches after the tree is fully grown; the alpha
    menu comes from the tree itself; the nested-averages ladder picks the best tax.
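
    The mixing ruler from four counts, as a sketch. The counts (40/10 on the left, 5/45 on
    the right) are invented, and the function names are illustrative. Gini here is the
    chance two grabs with replacement disagree, 2p(1 - p).

```python
def gini(sick, well):
    """Chance that two random grabs (with replacement) from the pile disagree."""
    total = sick + well
    if total == 0:
        return 0.0
    p = sick / total
    return 2.0 * p * (1.0 - p)

def split_gini(left_sick, left_well, right_sick, right_well):
    """Size-weighted average of the two child piles' Gini."""
    left_n = left_sick + left_well
    right_n = right_sick + right_well
    total = left_n + right_n
    return (left_n / total) * gini(left_sick, left_well) \
        + (right_n / total) * gini(right_sick, right_well)

before = gini(45, 55)                 # the unsplit pile
after = split_gini(40, 10, 5, 45)     # the same rows after one cut
print(before, after, before - after)  # the drop is what the cut bought
```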

  Part 3 -- Committees: Bagging, Random Forest, and Boosting
    post 15/28 . 2026-06-07 . part 3 of 3 . bagging, random-forest, boosting, ensembles
    One chart memorises and panics; 200 averaged charts are eerily steady. Bagging
    re-deals with repeats and averages 200 trees (OOB free exam built in). Random forest
    hides a handful of columns at each cut so the trees truly differ. Boosting chains
    stumps in a line, each fixing the last one's leftovers. Two ways to interrogate the
    black box: scramble a column (permutation importance) and slide a column (PDP/ICE).


  APPENDICES . FLIP-TO REFERENCES -- every term in one place
  ----------------------------------------------------------

  A. Classification Reference -- Loss, Leash, Grid, and All the Terms
    reference . 2026-06-06 . companion to Chapter 3 . log-loss, lda, grid-search, pr-curves
    All classification concepts from Chapter 3 in one flip-to reference: cross-entropy vs
    MSE, the C parameter, LDA with priors, hyperparameters vs dials, GridSearchCV fold
    safety, min-max scaling, precision vs recall in business, ROC vs PR curves, skewed
    piles, and macro/micro/weighted averaging -- plain language first, standard labels
    last. Read Chapter 3 first; use this as a reference when you need to look something
    up quickly.

  B. Distance and Clustering Reference -- Rulers, Traps, and Ethics
    reference . 2026-06-09 . companion to Chapter 6 . hamming, mahalanobis, missing-data, ethics
    The loose ends Chapter 6 does not stop for: the k you pick, two more rulers (Hamming for
    words, Mahalanobis for stretched many-walled rooms), the missing-data traps, why crush a
    room at all, and the part the lab skips -- the ethics of sorting PEOPLE into piles (bias,
    privacy, transparency) and what segmentation is really for.

  C. LSTM From Pencil -- RNN and LSTM From Scratch, Nothing But a Pencil
    reference . 2026-06-14 . companion to Chapter 10 . lstm, rnn, by-hand, kata
    A KATA: one clean, self-contained walk through the RNN and the LSTM by pencil, made to be
    redone on a blank sheet again and again until your hand knows it. Words into 32-number
    rows, grid-times-row by hand, the RNN's one-memory recipe and why it FADES (a theorem),
    then the two-memory fix -- a silent uncrushed vault A and a spoken readout B, four machines
    (FRESH + keep/admit/show), the combine worked in real numbers (5.0 -> 4.62 -> 0.70) -- plus
    a struggle log of every trap and a one-word end-to-end recitation. No computer, no code.


  CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS -- unsupervised learning
  ----------------------------------------------------------------------

  Part 1 -- Looking at a Sheet With No Answers: Means, Distance, and the Ruler Problem
    post 16/28 . 2026-06-09 . part 1 of 6 . eda, distance, standardization, unsupervised
    No answer column, no right or wrong -- just 50 states and 3 measurements.  The mean
    and spread of each column, the ruler problem (Assault in hundreds drowns UrbanPop
    in tens), two gap rulers (straight-line Euclidean and city-block Manhattan), the
    sheet of gaps, and the fix: put every column on the same ruler and measure again.

  Part 2 -- The Strongest Direction: Crushing a Many-Wall Room Into a Flat Page (PCA)
    post 17/28 . 2026-06-09 . part 2 of 6 . pca, dimensionality-reduction, visualization
    13 chemical measurements = 13 walls.  You cannot draw a 13-wall room.  Shine a
    flashlight, trace the longest shadow (PC1), then the next at a right angle (PC2).
    A 2-column by-hand worked example, the recipe (loadings), how much each shadow
    carries (PVE), choosing how many to keep (80% threshold), and blowing the shadow
    back up (reconstruction error).

  Part 3 -- Grouping by Nearest Centre: K-Means From a Blank Sheet
    post 18/28 . 2026-06-09 . part 3 of 6 . k-means, clustering, unsupervised
    The first machine that actually GROUPS: pick K, then loop two moves -- assign every dot
    to its nearest centre, move each centre to its pile's mean. Tightness as the score, the
    unlucky-start trap, and picking K by the elbow.
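
    The two moves as a sketch on six invented dots. Assumptions: the starting centres are
    handed in rather than drawn at random, and the loop runs a fixed number of rounds
    instead of testing for a freeze. The name k_means is illustrative.

```python
import numpy as np

def k_means(dots, centres, rounds=20):
    """Loop the two moves: assign to nearest centre, move centre to pile mean."""
    centres = centres.astype(float).copy()
    for _ in range(rounds):
        # Move 1: gap from every dot to every centre, keep the nearest.
        gaps = np.linalg.norm(dots[:, None, :] - centres[None, :, :], axis=2)
        piles = gaps.argmin(axis=1)
        # Move 2: each centre jumps to the mean of its pile (empty piles stay put).
        for k in range(len(centres)):
            if np.any(piles == k):
                centres[k] = dots[piles == k].mean(axis=0)
    # Tightness: total squared gap from each dot to its own centre.
    tightness = float(((dots - centres[piles]) ** 2).sum())
    return piles, centres, tightness

dots = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                 [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
start = dots[[0, 3]]                      # K = 2, one start in each clump

piles, centres, tightness = k_means(dots, start)
print(piles, centres, tightness)
```

    Rerunning with both starts inside one clump is a quick way to see the
    unlucky-start trap on a larger sheet.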

  Part 4 -- The Family Tree: Hierarchical Clustering and the Dendrogram
    post 19/28 . 2026-06-09 . part 4 of 6 . hierarchical-clustering, dendrogram, linkage
    No K up front: start with everyone alone and marry the two closest groups over and over,
    recording each wedding's height. Read the family tree, cut it wherever you like, and the
    cut decides how many groups. Single, complete, average, and Ward linkage compared.

  Part 5 -- Both Tools on NCI60: PCA and Clustering on Real Gene Data
    post 20/28 . 2026-06-09 . part 5 of 6 . pca, clustering, nci60, case-study
    All three tools on one hard sheet: 64 samples, 6,830 genes -- more columns than rows.
    Crush with PCA, carve with K-means and the family tree, then unseal the cancer types and
    grade the blind groups. Peeking at labels is a scorecard, never an input.

  Part 6 -- Filling the Blanks: Recommender Systems and Matrix Factorisation
    post 21/28 . 2026-06-09 . part 6 of 6 . recommender-systems, matrix-factorisation
    A sheet that is mostly holes: users x movies, a rating only where someone watched. Split
    the holey grid into two skinny full sheets of hidden tastes and flavours -- PCA's
    scores-times-recipe shape again. Then the lab end to end: humble each movie column, drop
    lazy zeros, and run rebuild-restore rounds until the marks freeze -- past the hand recipe
    that died of its own echo.


  CHAPTER 7 . BUILDING A NEURAL NETWORK FROM SCRATCH -- deep learning, forward then backward
  -----------------------------------------------------------------------------------------

  Part 1 -- Stacked Rooms and One Walk by Hand: How a Network Computes a Guess
    post 22/28 . 2026-06-11 . part 1 of 2 . neural-network, deep-learning, relu, sigmoid
    A fully self-contained start -- no prior posts needed. One room of clerks draws straight
    lines; stacked rooms draw anything, but only if a non-linear bend (the zero-out rule,
    ReLU) breaks the collapse between them. The forward pass walked by hand: three-cut split,
    humbling the columns, every multiplication of one patient shown, the S-curve derived from
    odds, and cross-entropy loss. Ends with a built network whose dials have not yet moved.

  Part 2 -- Rolling Downhill by Hand: How a Network Learns
    post 23/28 . 2026-06-11 . part 2 of 2 . backpropagation, gradient-descent, adam, dropout
    The dials finally learn. Backpropagation worked by hand on a one-input, one-hidden,
    one-output network: the chain rule link by link, the (guess - truth) shortcut derived,
    every slope checked against a brute-force wiggle, then the error sent one room further
    back through the dead-clerk gate. Learning rate, Adam, dropout as co-adaptation, the
    overflow clip, three real debugging mistakes, and the end-to-end run (0.974 exam).
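
    A sketch of the chain-rule slopes checked against a brute-force wiggle, on a
    one-input, one-hidden, one-output network. The dial values and the input are
    invented, and the function names are illustrative; the hidden clerk uses the
    zero-out rule and the output uses the S-curve with cross-entropy.

```python
import math

def forward(x, w1, b1, w2, b2):
    z1 = w1 * x + b1
    h = max(0.0, z1)                        # zero-out rule (ReLU)
    z2 = w2 * h + b2
    guess = 1.0 / (1.0 + math.exp(-z2))     # S-curve
    return z1, h, guess

def loss(x, truth, w1, b1, w2, b2):
    _, _, guess = forward(x, w1, b1, w2, b2)
    return -(truth * math.log(guess) + (1 - truth) * math.log(1 - guess))

def slopes(x, truth, w1, b1, w2, b2):
    z1, h, guess = forward(x, w1, b1, w2, b2)
    d_z2 = guess - truth                    # the (guess - truth) shortcut
    d_w2, d_b2 = d_z2 * h, d_z2
    d_h = d_z2 * w2                         # error sent one room further back
    d_z1 = d_h if z1 > 0 else 0.0           # dead-clerk gate
    return d_z1 * x, d_z1, d_w2, d_b2       # order: w1, b1, w2, b2

params = (0.5, 0.1, -0.7, 0.2)              # invented dials: w1, b1, w2, b2
x, truth = 2.0, 1.0
by_chain = slopes(x, truth, *params)

# Brute-force wiggle: nudge one dial up and down, watch the loss move.
eps = 1e-6
by_wiggle = []
for i in range(4):
    up, down = list(params), list(params)
    up[i] += eps
    down[i] -= eps
    by_wiggle.append((loss(x, truth, *up) - loss(x, truth, *down)) / (2 * eps))

for name, c, w in zip(["w1", "b1", "w2", "b2"], by_chain, by_wiggle):
    print(name, c, w)
```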


  CHAPTER 8 . KEEPING A NETWORK HONEST -- the fight against memorising
  -------------------------------------------------------------------

  Five Machines Against Memorising: A Tax, a Coffee Break, a Fire Alarm, and a Humbler
    post 24/28 . 2026-06-12 . overfitting, regularisation, dropout, batchnorm
    A network that learns too well memorises the study pile's freckles and flunks new
    patients. One plain machine catches the disease on a sheet of clothing photos; four cures
    each fight it differently -- a tax on big dials (L2), sending clerks home (dropout), a fire
    alarm that stops at the valley floor (early stopping), and a humbler between floors (batch
    norm). Softmax and its ten chances derived by hand, then all five judged on one sealed
    pile, with the study-minus-practice gap naming the biggest memoriser.
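
    Softmax turning ten raw scores into ten chances, as a sketch. The scores are invented;
    subtracting the largest score first is a standard guard against overflow and does not
    change the answer.

```python
import numpy as np

def softmax(scores):
    """Ten scores in, ten chances out (all positive, summing to 1)."""
    shifted = scores - np.max(scores)   # guard against overflow; result is unchanged
    powers = np.exp(shifted)
    return powers / powers.sum()

# Invented raw scores for ten clothing bins.
scores = np.array([2.0, 1.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
chances = softmax(scores)

print(chances)
print(chances.sum())
```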


  CHAPTER 9 . MACHINES THAT LOOK AT PICTURES -- convolution, by hand
  -----------------------------------------------------------------

  Part 1 -- A Magic Paper Slid Over a Photo: How a Picture Network Sees
    post 25/28 . 2026-06-12 . part 1 of 2 . cnn, convolution, pooling, computer-vision
    Flatten a photo and you destroy every "next to" -- so this machine keeps the picture folded
    and slides a tiny 3x3 window (a magic paper of nine dials) across it, lighting up where its
    shape sits. Built one worker at a time with every wrong picture corrected: the edge-finder
    that scores 60 on an edge and 0 on flat grey, 27 dials across three colours, 32 inspectors
    into 32 score-sheets, the shrink boss that keeps the loudest of each 2x2, the second floor
    drilling through all 32, and the punchline -- 262,208 of 282,250 dials hide in one Dense
    floor because the small papers are reused everywhere. (Colour photos, CIFAR-10.)
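
    The dial count can be tallied by arithmetic alone. This sketch ASSUMES a layout the
    blurb does not spell out: 32x32 colour photos, two 3x3 floors (32 then 64 inspectors)
    that keep the picture size, a 2x2 shrink boss after each, a Dense floor of 64 clerks,
    and 10 outputs. The 64s are assumptions; under them the totals match the figures above.

```python
def conv_dials(paper_h, paper_w, sheets_in, inspectors):
    """One magic paper per inspector, drilled through every input sheet, plus one offset each."""
    return paper_h * paper_w * sheets_in * inspectors + inspectors

def dense_dials(numbers_in, clerks):
    """Every clerk keeps one dial per incoming number, plus one offset."""
    return numbers_in * clerks + clerks

floor_1 = conv_dials(3, 3, 3, 32)     # 27 dials across three colours, per inspector
floor_2 = conv_dials(3, 3, 32, 64)    # assumed 64 inspectors drilling through all 32 sheets

side = 32 // 2 // 2                   # two 2x2 shrink bosses: 32 -> 16 -> 8
flat = side * side * 64               # numbers handed to the Dense floor
dense = dense_dials(flat, 64)         # assumed 64 Dense clerks
out = dense_dials(64, 10)             # ten CIFAR-10 bins

total = floor_1 + floor_2 + dense + out
print(floor_1, floor_2, dense, out, total)
print(f"{dense / total:.1%} of the dials sit in the Dense floor")
```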

  Part 2 -- The Deep Factory: Humbler, Send-Home, and the Confusion Sheet
    post 26/28 . 2026-06-13 . part 2 of 2 . cnn, batch-norm, dropout, confusion-matrix
    The simple factory lands ~70%. Three layers of armour push it further: a humbler steadies
    every inspector's 65,536 numbers to middle 0, scatter 1 (arithmetic done by hand, full
    clerk count); send-home zeroes a random 25% of sheet entries to break secret teams (one
    coin per entry, off at exam time); the deep factory inserts both in the right order --
    floor → humbler → boss → send-home. Then the confusion sheet: sort 10,000 test cards
    into 10 true-animal piles, sort each pile by guess, and the biggest off-diagonal cell
    names cat↔dog as the most-confused pair. Ends with reading the magic-paper shape (3,3,3,32)
    directly from the trained floor.


  CHAPTER 10 . MACHINES THAT READ WORDS -- recurrent networks, by hand
  -------------------------------------------------------------------

  Part 1 -- Words Into a Machine: The Notepad and the Walking Worker
    post 27/28 . 2026-06-13 . part 1 of 2 . rnn, embedding, nlp, sequence-models
    A photo was already numbers; a review is WORDS, and a factory can only multiply numbers.
    Number the 10,000 commonest words by frequency, pad every review to 100, then a NOTEPAD
    swaps each word-number for a learned 32-number note (so "boring" and "dull" drift close,
    and "not" can carry a flip). One walking worker reads the 100 notes IN ORDER, reusing one
    dial-set to rewrite a 32-number memory -- the RNN cell, walked by hand word by word -- and
    a final S-curve clerk reads the last memory for thumbs up or down. (IMDB sentiment.)

  Part 2 -- The Two-Memory Worker: How an LSTM Remembers Far-Back Words
    post 28/28 . 2026-06-13 . part 2 of 2 . lstm, bilstm, sequence-models
    The plain worker crushes its one memory every word, so word 1 fades to nothing by word 90.
    The fix, derived by hand: carry TWO memories -- a long keep (A, never crushed, free to grow
    past 1) and a spoken recent (B) -- and let three 0..1 voters decide what to keep, admit, and
    speak. That is the LSTM (the old RNN cell plus three dimmer knobs), shown with a full
    real-number combine. Then the BiLSTM reads the review in both directions and glues the
    two memories. The post closes with an honest four-way comparison and a blunt note: the
    fade is a theorem, the exact wiring is an engineered choice.


----------------------------------------------------------------------------------------------
  the method: ban the jargon . draw it first . solve by pencil . then write the code
  built with: pandas, scikit-learn, NumPy, and a pencil . pure HTML+CSS, no JavaScript
  follow along: subscribe via RSS . dark mode follows your system setting
  (c) 2026 Rahul Rai . no cookies, no trackers . source on GitHub
==============================================================================================