Five Machines Against Memorising: A Tax, a Coffee Break, a Fire Alarm, and a Humbler

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================

  CHAPTER 8 . KEEPING A NETWORK HONEST
  Posted: 2026-06-12 . Author: Rahul Rai . Tags: overfitting, regularisation, dropout, batchnorm
  ============================================================================================

  Chapter 7 built a network and taught its dials to learn. This chapter is about the thing
  that goes wrong NEXT -- the network learns too well, memorises the study pile down to its
  freckles, and then flunks on every photo it has not seen before. We build one plain machine
  that catches this disease, then four machines that each carry a different cure, and we judge
  all five honestly on a pile none of them was allowed to touch.

  If you have not read Chapter 7, here is all you need. A network is rooms of clerks. A clerk
  multiplies each number it receives by a learned DIAL, adds them with a NUDGE, and either
  bends the result at zero (the zero-out rule, ReLU, between rooms) or squashes it into a
  probability (the S-curve, at the exit). The dials learn by rolling downhill: compute how
  wrong the guess was, send that error backward, nudge every dial a little. That is the whole
  machine. This chapter changes nothing about it -- it only adds guards against memorising.


  ## A Disease Called Memorising

  Here is the disease in one picture. Train a plain network for thirty passes over the study
  pile and plot two scores each pass: how well it does on the study pile it learns from, and
  how well it does on a held-aside practice pile it only watches.

    score
    1.0 |                          study-score --> 0.99   keeps climbing
        |                 ________________
    0.9 |        _______/   practice-score --> peaks ~0.88, then SAGS
        |    ___/        \______
        +-----+-----+-----+-----+----- pass
              5     10    15    30

  For the first ten passes both climb together -- the network is learning real patterns that
  hold on both piles. Then they split. The study-score keeps rising toward a near-perfect 0.99,
  while the practice-score peaks and starts sinking. That split is the disease. The network
  has stopped learning what a coat looks like in general and started memorising the exact
  freckles of THESE study coats -- creases, lighting, sensor noise -- details the practice
  coats do not share. The gap between the two scores is the size of the rot.

  Everything in this chapter is a way to keep that gap small.


  ## Sheet of Clothes, and Goal

  A new sheet, not Chapter 7's tumours. Each row is one photograph of a piece of clothing,
  28 pixels across by 28 pixels tall -- 784 little grey numbers, each from 0 (black) to 255
  (white). The answer is one integer from 0 to 9, naming the bin:

    0 T-shirt   1 Trouser   2 Pullover   3 Dress   4 Coat
    5 Sandal    6 Shirt     7 Sneaker    8 Bag     9 Ankle-boot

  Two differences from Chapter 7 worth naming now. First, the answer is no longer yes/no --
  it is one of TEN bins, so the exit needs ten chances instead of one (the next section
  builds that). Second, the inputs are raw 0-to-255 greys, so we humble them the easy way:
  divide every pixel by 255, landing every number between 0 and 1. Then cut the pile the
  usual three ways -- study 60%, practice 20%, sealed exam 20%.
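  The humbling and the three-way cut can be sketched in a few lines of plain Python. This is
  a minimal sketch, not the lab's actual loader: the names humble_pixels and cut_three_ways
  are made up for illustration, and the cut assumes the rows were already shuffled.

```python
def humble_pixels(photo):
    """Scale raw greys (0..255) into 0..1 by dividing every pixel by 255."""
    return [pixel / 255.0 for pixel in photo]


def cut_three_ways(rows, study=0.6, practice=0.2):
    """Cut rows into study / practice / sealed-exam piles (60/20/20 by default).

    Assumes the rows are already shuffled; the sealed pile is whatever remains.
    """
    n_study = round(len(rows) * study)
    n_practice = round(len(rows) * practice)
    return (rows[:n_study],
            rows[n_study:n_study + n_practice],
            rows[n_study + n_practice:])


photo = [0, 51, 255]             # three made-up pixels, not a real photo
print(humble_pixels(photo))      # [0.0, 0.2, 1.0]

study_pile, practice_pile, sealed_pile = cut_three_ways(list(range(10)))
print(len(study_pile), len(practice_pile), len(sealed_pile))   # 6 2 2
```

  Dividing by 255 needs no pile at all, which is why it is safe to do before the cut.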

  The goal: build five machines that sort a photo into its bin, and find which one memorises
  least.


  ## Baseline Machine With No Defence

  Start with the plain machine, no cure at all. Four rooms of clerks:

    784 numbers in
      -> room 1: 256 clerks, zero-out rule
      -> room 2: 128 clerks, zero-out rule
      -> room 3:  64 clerks, zero-out rule
      -> exit:    10 clerks, softmax (ten chances -- built next section)

  Do not take the dial count on trust -- compute it. Each clerk has one dial per incoming
  number plus one nudge:

    room 1: 784 x 256 + 256 = 200,704 + 256 = 200,960
    room 2: 256 x 128 + 128 =  32,768 + 128 =  32,896
    room 3: 128 x  64 +  64 =   8,192 +  64 =   8,256
    exit:    64 x  10 +  10 =     640 +  10 =     650
    -----------------------------------------------------
    total                                     = 242,762 dials and nudges

  Check the total by adding the four: 200,960 + 32,896 = 233,856; + 8,256 = 242,112;
  + 650 = 242,762. Nearly a quarter of a million dials, and about 83% of them (200,960 of
  242,762) live in room 1, wired to the 784 raw pixels. A machine with that many free dials and
  only a few thousand study photos is exactly the kind that memorises. Good -- that is the
  patient we want to cure.
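  The same count can be checked mechanically. A small sketch, assuming only the room sizes
  given above; count_dials is a name invented here, not part of any library.

```python
def count_dials(room_sizes):
    """Count dials and nudges for a stack of fully wired rooms.

    room_sizes lists the inputs first, then the clerks per room,
    e.g. [784, 256, 128, 64, 10].
    """
    per_room = []
    for incoming, clerks in zip(room_sizes, room_sizes[1:]):
        # one dial per incoming number for every clerk, plus one nudge per clerk
        per_room.append(incoming * clerks + clerks)
    return per_room, sum(per_room)


per_room, total = count_dials([784, 256, 128, 64, 10])
print(per_room)   # [200960, 32896, 8256, 650]
print(total)      # 242762
```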


  ## Softmax: Turning Ten Scores Into Ten Chances

  IN HAND: baseline machine built (784->256->128->64->10). Dial count: 200,960+32,896+8,256
  +650 = 242,762. The exit has 10 clerks producing raw scores -- any real numbers.
  This section turns those ten raw scores into ten chances that add to exactly 1.

  The exit has ten clerks now, one per bin. Each produces a raw score -- any number. We need
  to turn ten raw scores into ten CHANCES that are all between 0 and 1 and add up to exactly
  1, so they read as "a 66% chance it is a T-shirt, 25% a trouser, 9% a pullover."

  The S-curve from Chapter 7 squashed ONE score into ONE chance. For ten competing bins we
  need its big brother, softmax. The recipe: raise e to the power of each score (which makes
  every one positive), then divide each by the total.

    chance for bin i = e^(score i) / (sum of e^(score) over all bins)

  Work a small one by hand -- three bins with scores 2, 1, 0:

    e^2 = 7.389      (e is about 2.71828, and e x e = 7.389)
    e^1 = 2.718
    e^0 = 1.000
    total = 7.389 + 2.718 + 1.000 = 11.107

    chance bin A = 7.389 / 11.107 = 0.665
    chance bin B = 2.718 / 11.107 = 0.245
    chance bin C = 1.000 / 11.107 = 0.090

  Check they add to 1: 0.665 + 0.245 + 0.090 = 1.000. The biggest score (2) won the biggest
  chance (0.665), the smallest score (0) the smallest -- and raising e to each power before
  dividing exaggerates the gaps, so a clear winner pulls well ahead. That exaggeration is the
  point: softmax is decisive, not wishy-washy.
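  The recipe above, as a sketch in plain Python. Subtracting the biggest score before raising
  e to each power is a common guard against overflow that is not part of the hand recipe; it
  leaves the chances unchanged because the same factor cancels top and bottom.

```python
import math


def softmax(scores):
    """Turn raw scores into chances that are positive and add up to 1."""
    biggest = max(scores)                       # overflow guard; does not change the result
    powers = [math.exp(s - biggest) for s in scores]
    total = sum(powers)
    return [p / total for p in powers]


chances = softmax([2, 1, 0])                    # the three-bin example worked by hand
print([round(c, 3) for c in chances])           # [0.665, 0.245, 0.09]
print(round(sum(chances), 6))                   # 1.0
```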

  The wrongness ruler changes to match. For ten bins it is: take the chance the machine gave
  to the TRUE bin, and score -ln(that chance). If the true answer is bin 7 and the machine
  gave bin 7 a chance of 0.6, the wrongness is -ln(0.6) = 0.511. Gave it 0.9 instead? -ln(0.9)
  = 0.105, much smaller. Gave it a miserable 0.1? -ln(0.1) = 2.303, much larger. The machine
  is graded only on how much faith it put in the right answer. (The standard name carries the
  word "sparse" because the answer key is a single integer -- 7 -- not a row of ten 0/1 marks.)
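  A sketch of that ruler, using the three chances from the paragraph above. The helper spread
  is invented here purely to build a row of ten chances around a chosen value; only the chance
  on the true bin affects the result.

```python
import math


def wrongness(chances, true_bin):
    """-ln of the chance given to the true bin; the other chances are ignored."""
    return -math.log(chances[true_bin])


def spread(true_bin, chance, bins=10):
    """Made-up row of chances: `chance` on the true bin, the rest shared evenly."""
    rest = (1.0 - chance) / (bins - 1)
    return [chance if b == true_bin else rest for b in range(bins)]


for chance in (0.6, 0.9, 0.1):
    print(chance, round(wrongness(spread(7, chance), 7), 3))
# 0.6 0.511
# 0.9 0.105
# 0.1 2.303
```

  Note that the answer key passed in is the single integer 7, not a row of ten marks.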

  Try this: a machine faces two bins with scores 3 and 1. What chance does softmax give the
  first bin? (You will need e^3 and e^1; e^3 = e^2 x e = 7.389 x 2.718.)

  ...

  e^3 = 7.389 x 2.718 = 20.08 from these rounded values; with more digits of e it is 20.09.
  e^1 = 2.718. total = 20.09 + 2.718 = 22.81. chance first bin = 20.09 / 22.81 = 0.881. The
  first bin's score was only 2 higher, but softmax hands it an 88% chance -- the exponential
  stretch at work.


  ## Watching It Rot

  Run the baseline for thirty passes and the disease from the opening picture appears on
  schedule. The study-score marches toward 0.99; the practice-score peaks somewhere in the
  high 0.80s and then sags. The machine is memorising. Now the four cures. Each one is a
  different answer to the same question: how do you stop a machine with 242,762 dials from
  bending itself around noise?


  ## Cure 1 -- A Tax on Big Dials (L2)

  The first cure starts from a clue: a memorising machine needs HUGE, precise dials. To bend
  sharply around one freckle in one photo, some dial has to crank to an extreme value. A
  machine that only draws smooth, general rules keeps its dials modest. So: make big dials
  expensive.

  Add a tax to the wrongness. Every dial pays a fine equal to its own value squared, times a
  small rate (0.001 is common):

    dial of 50:   fine = 50 x 50 x 0.001 = 2.5        (enormous -- a 50 dial is punished hard)
    dial of 0.5:  fine = 0.5 x 0.5 x 0.001 = 0.00025  (a rounding error -- small dials are free)

  Squaring is what makes the tax bite the big ones: doubling a dial quadruples its fine. The
  machine now faces a trade. A big dial lowers the wrongness on a few memorised photos, but it
  costs tax on every single pass. Unless that dial earns its keep across the whole pile, the
  tax wins and the dial shrinks. The result is a machine of humble dials -- a smoother rule
  that generalises better. The standard name is L2 regularisation, and the rate (0.001) is a
  knob you pick, not a dial the machine learns.


  ## Cure 2 -- Sending Clerks Home (Dropout)

  Chapter 7 met dropout as a coffee break. Here is the disease it cures, stated sharply,
  because it is subtler than "big dials."

  Picture two clerks on a floor who have struck a private deal. Clerk 5 always shouts +10;
  clerk 6, whom the next floor reads together with clerk 5, always shouts -10. Their sum is
  +10 + (-10) = 0, which happens to look perfect to the floor above, so the wrongness never
  complains and their dials never get corrected. They have co-adapted -- their dials tuned
  to lean on each other rather than each standing on its own -- into a useless pair that
  survives only because they are always present together. Several such secret teams form,
  and the machine leans on them instead of learning honest, standalone features.

  Dropout breaks the deals. On each training pass, flip a weighted coin for every clerk and
  send a fraction of them home -- their output forced to zero for that pass. With a rate of
  0.3, about 30 of every 100 clerks sit out. The moment clerk 6 is sent home, clerk 5's +10 is
  no longer cancelled: the sum is +10, not 0, the lie is exposed, the wrongness finally
  registers, and the dials get fixed. Because no clerk can count on its partner being present,
  every clerk is forced to be useful on its own.

  Two things to keep straight. The home-sending happens ONLY during study; when the machine
  takes the practice or sealed exam, every clerk reports for duty. And the rate (0.3) is a
  knob you choose. Send too few home and the deals survive; send too many and the floor is too
  crippled each pass to learn anything.
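  A sketch of the coin flip, using the clerk 5 and clerk 6 pair. One detail goes beyond the
  text: the clerks who stay are scaled up by 1 / (1 - rate), which is what Keras's Dropout
  layer does so the expected total is the same with or without the coffee break. The function
  name and the seeded generator are assumptions made for illustration.

```python
import random


def dropout(outputs, rate=0.3, training=True, rng=random):
    """Send roughly `rate` of the clerks home during study; nobody at exam time."""
    if not training:
        return list(outputs)                 # exam: every clerk reports for duty
    keep = 1.0 - rate
    # survivors are scaled by 1/keep (an addition to the text, see the note above)
    return [o / keep if rng.random() >= rate else 0.0 for o in outputs]


floor = [10.0, -10.0]                        # clerk 5 and clerk 6 from the text
print(sum(dropout(floor, training=False)))   # 0.0 -- at exam time the pair still cancels

rng = random.Random(0)                       # seeded only so the sketch is repeatable
for _ in range(5):
    print(dropout(floor, rate=0.3, rng=rng)) # on some passes one clerk is zeroed
```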


  ## Cure 3 -- A Fire Alarm That Stops the Clock (Early Stopping)

  The first two cures change the machine. This one changes nothing about the machine -- it
  just knows when to quit.

  Recall the disease: the practice-score peaks, then sags. Equivalently, the practice-pile
  WRONGNESS (call it val-loss -- the average -ln cost over the practice pile) falls, bottoms
  out, and then climbs as memorising sets in. The best machine is the one at the bottom of
  that valley. Early stopping watches the val-loss every pass, remembers the lowest seen, and
  quits when it has gone too long without a new low.

  Watch it run. The "patience" is set to 5 -- five passes with no new low triggers the alarm:

    pass:      1     2     3     4     5     6     7     8
    val-loss: .50   .42   .39   .41   .40   .43   .44   .45
              low   low   LOW   +1    +2    +3    +4    +5  -> ALARM

  Pass 3 set the lowest val-loss, 0.39. Passes 4 through 8 each failed to beat it -- that is
  five strikes -- so the alarm fires after pass 8 and training stops. Then the essential last
  step: rewind the dials back to where they were at pass 3, the bottom of the valley. (The
  standard name for that rewind is restore-best-weights, and forgetting it is a classic
  mistake -- you keep pass 8's worse dials instead of pass 3's best.) You set a high
  ceiling of passes -- say 100 -- knowing the alarm will almost always stop you long before.
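  The alarm's bookkeeping, as a sketch over a list of val-losses. It only reports which pass
  was best and when the alarm fires; a real run would also keep a copy of the dials at the
  best pass so they can be rewound. fire_alarm is a name made up for this sketch.

```python
def fire_alarm(val_losses, patience=5):
    """Return (best_pass, alarm_pass); alarm_pass is None if the alarm never fires."""
    best_loss = float("inf")
    best_pass = None
    strikes = 0
    for pass_number, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # a new low: remember it and reset the strike count
            best_loss, best_pass, strikes = loss, pass_number, 0
        else:
            strikes += 1
            if strikes >= patience:
                return best_pass, pass_number
    return best_pass, None


run = [.50, .42, .39, .41, .40, .43, .44, .45]   # the run traced above
print(fire_alarm(run))                            # (3, 8)
```

  The first number is the pass to rewind to, the second is the pass at which training stops.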

  Try this: same alarm, patience 5, but a new run with val-losses .60 .55 .50 .48 .47 .49 .52
  .50 .51 .53. Which pass set the best, and at which pass does the alarm fire?

  ...

  The lowest is 0.47 at pass 5. Passes 6 through 10 all fail to beat it -- five strikes -- so
  the alarm fires after pass 10, and the dials rewind to pass 5.


  ## Cure 4 -- A Humbler Between Floors (Batch Normalisation)

  The last cure fixes a problem you would not guess at: the floors of the network keep moving
  the goalposts on each other.

  As room 1's dials get tuned, the numbers it hands room 2 swing wildly from pass to pass --
  one pass they are around 2, 5, 3; a few passes later, after the dials shifted, they are 600,
  900, 700; later still, 0.01, 0.03. Room 2 is trying to learn on ground that keeps tilting
  under it. It never gets steady footing.

  The fix: after each hidden floor, humble the hand-off the same way we humble raw inputs --
  but do it freshly for each handful of photos. Work one by hand. Say one clerk, shown a
  handful of five photos, hands the next floor one number per photo: 10, 20, 30, 40, 50.

    middle  = (10 + 20 + 30 + 40 + 50) / 5 = 150 / 5 = 30
    diffs   = 10-30, 20-30, 30-30, 40-30, 50-30 = -20, -10, 0, 10, 20
    squared = 400, 100, 0, 100, 400
    average = (400 + 100 + 0 + 100 + 400) / 5 = 1000 / 5 = 200
    scatter = root of 200 = 14.14
    humbled = -20/14.14, -10/14.14, 0, 10/14.14, 20/14.14 = -1.41, -0.71, 0, 0.71, 1.41

  Now the hand-off has middle 0 and scatter 1, steady every pass no matter how the dials
  below shifted. Room 2 gets firm footing.

  One refinement so the humbling is not a straitjacket: each floor also gets two rescue dials,
  a STRETCH (starts at 1) and a SHIFT (starts at 0), and the final hand-off is
  humbled x stretch + shift. If the floor decides the raw, un-humbled numbers were actually
  better, it can learn to stretch and shift its way back. A small bonus falls out for free:
  because each handful of 32 photos has a slightly different middle, the humbling jiggles a
  touch from handful to handful, and that mild random jiggle is itself a weak dose of
  anti-memorising. The standard name is batch normalisation. One caution: never place it after
  the exit -- it would wreck the ten softmax chances.
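  The hand calculation above, with the stretch and shift dials included, can be sketched as
  follows. Two simplifications are assumed: library versions add a tiny constant under the
  root so a handful of identical numbers does not divide by zero, and at exam time they use
  averages remembered from study rather than the current handful. Neither appears here.

```python
import math


def humble(handful, stretch=1.0, shift=0.0):
    """Humble one clerk's outputs across a handful of photos, then stretch and shift."""
    middle = sum(handful) / len(handful)
    average_sq = sum((x - middle) ** 2 for x in handful) / len(handful)
    scatter = math.sqrt(average_sq)           # would be zero if every number were equal
    return [stretch * (x - middle) / scatter + shift for x in handful]


print([round(h, 2) for h in humble([10, 20, 30, 40, 50])])
# [-1.41, -0.71, 0.0, 0.71, 1.41]
```

  With stretch at 1 and shift at 0 the output is the plain humbled hand-off; the floor is free
  to learn other values for both.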

  Try this: one clerk, shown four photos, hands the next floor these four numbers, one per
  photo: 2, 4, 6, 8. Humble them by hand.

  ...

  middle = (2+4+6+8)/4 = 20/4 = 5. diffs = -3, -1, 1, 3. squared = 9, 1, 1, 9.
  average = 20/4 = 5. scatter = root of 5 = 2.236. humbled = -3/2.236, -1/2.236, 1/2.236,
  3/2.236 = -1.34, -0.45, 0.45, 1.34. Middle 0, scatter 1.


  ## One Honest Judge (Breaking the Sealed Pile)

  IN HAND: five machines trained on the study pile -- baseline, L2 tax, dropout, early-
  stopping, batch-norm. Practice pile was watched every pass and steered the fire alarm.
  This section picks the only pile that gives a clean, honest grade.

  Now grade all five machines: the baseline plus the four cures. But on which pile?

  Not the study pile -- every machine memorised pieces of it; scoring there is a memory test.
  Not even the practice pile, and here is the subtle part: the practice pile was PEEKED at
  thirty times (we plotted its score every pass), and for the early-stopping machine it
  actively steered when training halted. A pile you have looked at thirty times and used to
  make decisions is no longer a clean surprise. It is faintly tainted.

  The sealed exam pile is the only clean judge. It was looked at zero times -- sealed since
  the very first cut, never plotted, never used to choose a knob. Open it once now, score
  every machine on it, and that single number is the honest verdict. Whichever machine scores
  highest on the sealed pile wins. (In practice the four cures usually beat the baseline, and
  which cure wins depends on the sheet -- there is no permanent champion.)

  The rule this whole chapter rests on: a pile you make decisions with is a pile you have
  spent. Keep one pile sealed and unspent, or you will have no honest judge left at the end.


  ## Naming the Biggest Liar (the Memorising Gap)

  One more number tells you something the sealed score alone cannot: WHICH machine memorised
  least, regardless of who scored highest. For each machine, take its final study-score and
  subtract its final practice-score.

    big gap   = the machine learned the study pile far better than the practice pile = memoriser
    small gap = the machine treated both piles alike = honest, whatever its raw score

  A machine can score high AND have a big gap (it learned a lot, but memorised a lot too); or
  score modestly with a tiny gap (it learned less, but everything it learned was real). The
  smallest gap names the least-memorising machine -- usually one of the four cures, and often
  dropout or batch-norm. Two numbers, read together -- the sealed score and the gap -- tell
  the whole story: how good, and how honest.
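  A sketch of the gap calculation. The scores below are hypothetical, chosen only to show the
  arithmetic (the baseline pair echoes the opening picture); they are not measured results,
  and the machine names are placeholders.

```python
def least_memorising(scores):
    """scores maps machine name -> (final study-score, final practice-score)."""
    gaps = {name: study - practice for name, (study, practice) in scores.items()}
    return min(gaps, key=gaps.get), gaps


# Hypothetical numbers for illustration only -- not results from the lab.
scores = {
    "baseline": (0.99, 0.88),
    "some cure": (0.92, 0.89),
}
winner, gaps = least_memorising(scores)
print(winner)                                           # some cure
print({name: round(g, 2) for name, g in gaps.items()})  # {'baseline': 0.11, 'some cure': 0.03}
```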


  ## Common Tripwires

  Real struggles from running this lab, not invented ones.

  Humbling the wrong pile. Divide pixels by 255 -- fine, that is fixed and needs no pile. But
  if you ever compute a middle or scatter for humbling, compute it on the STUDY pile only. Use
  all the data and the sealed pile leaks into training, exactly as in Chapter 7.

  Leaving dropout on at exam time. If your practice score jumps around wildly every time you
  evaluate, dropout is still sending clerks home during grading. It must run only during
  study; every clerk reports for the exam. Most frameworks switch this automatically -- if you
  grade by hand, switch it off yourself.

  Forgetting to rewind after the fire alarm. Early stopping that stops at pass 8 but keeps
  pass 8's dials has thrown away the whole point. Rewind to the best pass (pass 3 in the
  example). The flag is restore-best-weights; set it.

  Batch-norm after the exit. Humbling the ten softmax scores destroys them -- they no longer
  add to 1, no longer read as chances. Humble between hidden floors only, never after the last.

  Judging on a spent pile. The most expensive mistake: reporting the practice score as your
  final grade. You peeked at it thirty times. Report the sealed pile, opened once.


  ## Standard Names for Everything Above

    Plain term used above                 Standard label
    ------------------------------------  -------------------------------------------
    memorising                            overfitting
    study / practice / sealed exam        train / validation / test
    ten scores into ten chances           softmax
    wrongness for ten bins                sparse categorical cross-entropy
    a tax on big dials                    L2 regularisation / weight decay
    the tax rate (0.001)                  the regularisation strength (lambda)
    sending clerks home                   dropout
    secret-team disease                   co-adaptation
    fire alarm that stops the clock       early stopping
    passes-with-no-new-low limit          patience
    rewind to the best pass               restore_best_weights
    humbler between floors                batch normalisation
    stretch and shift rescue dials        gamma (scale) and beta (shift)
    study-score minus practice-score      the generalisation gap


  ## Code, If You Want It

  Nothing above needed a computer: every dial count, every wrongness step, and every cure
  was pencil and arithmetic. This section is for the day you meet one.

  Five machines: the baseline plus one per cure. They share the same skeleton -- only the
  guard differs. Below, the baseline and the four cures, each as a small change to it.

  >> NEW TO PYTHON? Each named once:
       x / 255.0                          -- humble raw 0-255 greys into 0-1
       Dense(256, activation='relu')      -- a room of 256 zero-out clerks
       Dense(10, activation='softmax')    -- the ten-chance exit
       kernel_regularizer=l2(0.001)       -- Cure 1: the tax on big dials
       Dropout(0.3)                       -- Cure 2: send 30% of clerks home each pass
       EarlyStopping(patience=5, restore_best_weights=True)  -- Cure 3: the fire alarm
       BatchNormalization()               -- Cure 4: the humbler between floors
       sparse_categorical_crossentropy    -- wrongness when the answer is one integer

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
    from tensorflow.keras.regularizers import l2
    from tensorflow.keras.callbacks import EarlyStopping

    # x_train etc. are 28x28 photos flattened to 784 and divided by 255 (humbled to 0..1),
    # then cut 60/20/20 into study (x_train), practice (x_val), sealed exam (x_test).

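    def compile_and_report(model):
        # Despite the name, this only compiles; the reporting happens at evaluate() below.
        # Same optimiser, wrongness ruler and score for all five, so only the guard differs.
        model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
        return model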

    # --- Baseline: no defence ---
    baseline = compile_and_report(Sequential([
        Dense(256, activation='relu', input_shape=(784,)),
        Dense(128, activation='relu'),
        Dense(64,  activation='relu'),
        Dense(10,  activation='softmax'),
    ]))

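    # --- Cure 1: a tax on big dials (L2) ---
    # Same 256/128/64 skeleton as the baseline, so the tax is the only difference.
    l2_model = compile_and_report(Sequential([
        Dense(256, activation='relu', input_shape=(784,), kernel_regularizer=l2(0.001)),
        Dense(128, activation='relu', kernel_regularizer=l2(0.001)),
        Dense(64,  activation='relu', kernel_regularizer=l2(0.001)),
        Dense(10,  activation='softmax'),
    ]))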

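    # --- Cure 2: send clerks home (Dropout) ---
    # Same 256/128/64 skeleton as the baseline, with a coffee break after each hidden room.
    dropout_model = compile_and_report(Sequential([
        Dense(256, activation='relu', input_shape=(784,)),
        Dropout(0.3),
        Dense(128, activation='relu'),
        Dropout(0.3),
        Dense(64,  activation='relu'),
        Dropout(0.3),
        Dense(10,  activation='softmax'),
    ]))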

    # --- Cure 3: the fire alarm (Early Stopping) ---
    early = compile_and_report(Sequential([
        Dense(256, activation='relu', input_shape=(784,)),
        Dense(128, activation='relu'),
        Dense(64,  activation='relu'),
        Dense(10,  activation='softmax'),
    ]))
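    alarm = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    # early.fit(..., epochs=100, callbacks=[alarm])   # 100 is a ceiling; the alarm stops it
    # val_loss exists only if fit() is also handed the practice pile, e.g.
    # validation_data=(x_val, y_val); without it the alarm has nothing to watch.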

    # --- Cure 4: the humbler between floors (BatchNorm) ---
    batchnorm_model = compile_and_report(Sequential([
        Dense(256, activation='relu', input_shape=(784,)),
        BatchNormalization(),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dense(64,  activation='relu'),
        BatchNormalization(),
        # no BatchNorm after the exit -- it would wreck the ten softmax chances
        Dense(10,  activation='softmax'),
    ]))

    # --- The one honest grade: open the sealed pile ONCE, per machine ---
    # loss, acc = model.evaluate(x_test, y_test, verbose=0)
    # The generalisation gap = final train accuracy - final val accuracy (smaller = more honest).


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 8 -- Keeping a Network Honest):
    Part 1 (this post)

----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai
==============================================================================================