==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------

  CHAPTER 9 . MACHINES THAT LOOK AT PICTURES . PART 2 OF 2
  The Deep Factory: Humbler, Send-Home, and the Confusion Sheet
  Posted: 2026-06-13 . Author: Rahul Rai . Tags: cnn, batch-norm, dropout, confusion-matrix
  ============================================================================================

  Part 1 built the SIMPLE factory: two floors of inspectors with magic papers, one shrink
  boss each, ironed flat, sorted by two clerk floors. Trained for five loops it lands around
  60% on CIFAR-10. This post adds the armour that lets a network keep climbing where the
  plain one stalls: AUGMENTATION (jiggled copies of each photo), a HUMBLER that steadies
  every inspector's numbers, and a SEND-HOME rule that zeroes random entries. A blunt,
  honest warning up front: at only five training loops these add almost nothing to the
  score -- the deep factory lands ~60.5% against the plain factory's ~60.2%. Their payoff
  is in the LONG run, over many more loops, where the plain factory memorises and rots while
  the armoured one keeps improving. We build them by hand anyway, because the mechanism is
  the lesson, not the five-loop number. Then we build the CONFUSION SHEET -- a ten-by-ten
  pile model that names the pair of animals the machine mixes up most -- and finally crack
  open floor 1 to read the SHAPE of its magic papers directly.

  Two things you need from Part 1: (1) an inspector covers a 3x3 patch of 27 numbers and
  writes ONE score, (2) the shrink boss halves the sheet by keeping the loudest of each 2x2.
  Everything else is rebuilt here.

  Pencil and scratch paper out. Both new tools get real arithmetic.


  ## IN HAND

  IN HAND: simple factory. 32x32x3 photo → 32 inspector papers → boss → 64 papers → boss
  → flatten 4096 → Dense(64) → Dense(10). Dial count: 32x28 + 64x289 + 4096x64+64
  + 64x10+10 = 896 + 18,496 + 262,208 + 650 = 282,250. Five loops → ~60% on the exam pile.
  The rest of this post adds three cures -- jiggled copies, a humbler, a send-home -- then
  reads the trained machine.


  ## Training in Handfuls of Sixty-Four (Q5)

  Before any cure, recall how the dials get tuned at all. The study pile holds about 42,000
  photo-cards. The factory does NOT study them one at a time, nor all at once. It GRABS a
  handful of 64, runs all 64 through the whole factory, measures how wrong the 10 chances
  came out (averaged over the 64), and turns EVERY dial a tiny notch toward right. Then it
  drops that handful and grabs the next 64.

    cards per loop      = 42,000
    grab size           = 64
    grabs per loop      = 42,000 / 64 ≈ 657 dial-turns
    five loops          = 657 x 5 ≈ 3,285 dial-turns total

  Why grab 64 and not 1? A single card's wrongness is jumpy -- one weird cat would yank
  every dial. Averaging the wrongness over 64 cards steadies the pull, so each dial-turn
  rides the trend of 64 photos, not the whim of one. This handful of 64 is the same handful
  the humbler will pool over later, so hold the number 64 in your pocket.

  How does the factory know right from wrong? Every card carries its true answer on the
  back. If the back says CAT and the clerks gave cat a chance of 0.61, the wrongness is
  small (0.61 is near 1.0); if they gave cat only 0.05, the wrongness is huge. That
  wrongness is exactly what drives every dial-turn. No back-answer, no learning.
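
  A sketch of one way to put a number on that wrongness: take the negative log of the
  chance the clerks gave the TRUE animal. This is the idea behind the
  `sparse_categorical_crossentropy` loss named in the code section; the helper name
  `wrongness` is made up here for illustration.

```python
import math

def wrongness(chance_for_true_answer):
    """Negative log of the chance the clerks gave the TRUE animal."""
    return -math.log(chance_for_true_answer)

confident = wrongness(0.61)   # back says CAT, clerks gave cat 0.61 -> about 0.49
timid = wrongness(0.05)       # back says CAT, clerks gave cat 0.05 -> about 3.00

print(round(confident, 2), round(timid, 2))
```

  A chance of exactly 1.0 would give wrongness 0; the closer the chance creeps to 0, the
  faster the wrongness grows.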

    >> YOUR TURN
       Study pile of 50,000 cards, grab size 100, three loops. How many dial-turns?

       check your slate: 50,000 / 100 = 500 grabs per loop. 500 x 3 = 1,500 dial-turns.
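
  The grab arithmetic as a small sketch. It assumes a leftover part-handful at the end of
  a loop still counts as one grab (hence the round-up); `dial_turns` is a made-up helper
  name, not part of the lab.

```python
import math

def dial_turns(cards, grab_size, loops):
    """Grabs per loop (rounded up to include a final part-handful) times loops."""
    grabs_per_loop = math.ceil(cards / grab_size)
    return grabs_per_loop * loops

print(dial_turns(42_000, 64, 5))    # 657 grabs per loop x 5 loops = 3285
print(dial_turns(50_000, 100, 3))   # 500 grabs per loop x 3 loops = 1500
```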


  ## Jiggled Copies: Augmentation (Q6)

  IN HAND: study pile ~42,000 cards, grabbed 64 at a time, ~657 dial-turns per loop. The
  plain factory sees the EXACT same 42,000 cards every loop. This section adds the first
  cure -- and it costs not a single new dial.

  The disease: shown the identical cards loop after loop, the factory memorises exact
  dot-positions -- "a cat is brightness 0.8 at spot (14,11)" -- instead of the general
  shape of a cat. It scores high on the study pile and flunks new photos.

  The cure: before a card is studied, make a JIGGLED copy of it -- same animal, moved a
  little. Four jiggles, all label-safe:

    tilt            up to 15 degrees either way
    shove across    up to 10% left or right
    shove up/down   up to 10% up or down
    mirror          flip left-to-right

  Because every grab now serves freshly jiggled cards, the factory never sees the identical
  photo twice. It can no longer memorise frozen dot-positions; it is forced to learn what
  makes a cat a cat across every tilt and shove.

  FLIP-SAFE IS NOT THE SAME AS SYMMETRIC. A mirror flip is allowed only when the flipped
  card is STILL a true example of its label:

    cat   → mirror → still a cat            label TRUE   ✓ safe to flip
    "2"   → mirror → backwards-2            label LYING  ✗ never flip
    "b"   → mirror → "d"                    label WRONG  ✗ never flip

  You do not need a cat to fold onto itself. You need CAT-NESS to survive the mirror. All
  ten CIFAR things are flip-safe (a mirrored truck is still a truck), so the lab flips.
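
  The mirror jiggle is small enough to do by hand in NumPy. A minimal sketch, assuming a
  photo stored as (rows, columns, colours); the array here is random stand-in data, not a
  real CIFAR card, and the lab itself uses the Keras jiggle tool shown in the code section.

```python
import numpy as np

rng = np.random.default_rng(0)
photo = rng.random((32, 32, 3))   # stand-in photo: 32 rows, 32 columns, 3 colours
label = "cat"                     # the back of the card

mirrored = photo[:, ::-1, :]      # reverse the column order = left-to-right flip

# The dot at the far left of row 14 now sits at the far right of row 14.
print(np.array_equal(mirrored[14, 31], photo[14, 0]))

# The back of the card is carried over untouched: a mirrored cat is still a cat.
jiggled_card = (mirrored, label)
```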

  Two rules that never bend: (1) the jiggled cards go to a FRESH, untrained machine -- you
  do not reuse a trained one; (2) the EXAM pile is NEVER jiggled. You grade on honest,
  un-jiggled cats, or the grade is a lie.


  ## Why Scores Drift -- The Problem the Humbler Solves

  After inspector 1 fills his 32x32 score-sheet, the values might all sit between 5 and 45.
  Inspector 2's sheet might run -3 to +3. Inspector 3 might range from 0 to 200.

  These wild scales are not wrong -- the zero-out (ReLU) handles negatives, and the next
  floor's dials can compensate in principle. In practice, one floor screaming values of 200
  while another whispers 0.03 makes the dial-nudging step fragile. The learning step has
  to be calibrated for each inspector's private scale simultaneously. Miss by a fraction on
  one and the whole factory's dial updates blow up.

  The fix: before the shrink boss processes a floor's sheets, MAKE EVERY INSPECTOR'S
  NUMBERS SPEAK THE SAME LANGUAGE. Pull each inspector's pile to middle 0, scatter 1.
  That is the humbler.


  ## The Humbler by Hand (Batch Normalisation)

  The humbler works on ONE inspector's pile at a time. During training, the machine
  processes a HANDFUL of photos together -- say 64. So inspector 1 produces not one
  32x32 sheet but 64 of them stacked: 64 x 32 x 32 = 65,536 numbers.

  The humbler reads all 65,536 and computes two summary numbers:

    SETTLED MIDDLE  = (sum of all 65,536 numbers) / 65,536     the average
    SETTLED SCATTER = root( (sum of (each - middle)^2) / 65,536 )   the spread

  Worked example (made-up numbers):

    65,536 numbers for inspector 1, suppose their sum = 1,638,400
    settled middle = 1,638,400 / 65,536 = 25

    Clerk count for the middle alone:
      65,535 additions (running total) + 1 division = 65,536 arithmetic steps.

    Now the scatter. Suppose the sum of (each - 25)^2 across all 65,536 = 1,638,400:
      settled scatter = root( 1,638,400 / 65,536 ) = root(25) = 5

    Clerk count for the scatter:
      65,536 subtractions + 65,536 squarings + 65,535 additions + 1 division + 1 root
      = 196,609 more steps.

    Total clerk count for settled middle + scatter on inspector 1's pile: ~262,000 strokes.
    A room of 32 clerks clears it in ~8,200 steps each -- done well before lunch.

  Now STANDARDISE every number in the pile:

    standardised = (number - settled middle) / settled scatter

    number 30 → (30 - 25) / 5 = 5 / 5 = 1.0
    number 20 → (20 - 25) / 5 = -5 / 5 = -1.0
    number 25 → (25 - 25) / 5 = 0 / 5 = 0.0

  After standardising, ALL 65,536 numbers have middle = 0, scatter = 1. Every inspector's
  pile now speaks the same scale.
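
  The same steps on a made-up pile, as a sketch. The 64 x 32 x 32 numbers are random
  stand-ins for one inspector's scores, drawn near middle 25 and scatter 5. One thing left
  out on purpose: real batch-norm layers add a tiny constant under the root so they never
  divide by zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in pile: 64 photos x 32 x 32 scores from ONE inspector.
pile = rng.normal(loc=25.0, scale=5.0, size=(64, 32, 32))

middle = pile.mean()    # one average over all 65,536 numbers
scatter = pile.std()    # root of the average squared distance from the middle
standardised = (pile - middle) / scatter

print(pile.size)        # 65536
print(abs(standardised.mean()) < 1e-9, abs(standardised.std() - 1.0) < 1e-9)
```

  The middle and scatter are pooled across the photo axis as well as the two sheet axes,
  which is exactly the "reaches across the handful" behaviour.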

  But stripping the scale entirely might throw away useful information. So two extra dials
  per inspector are added -- a STRETCH and a SHIFT -- tuned by wrongness like all other dials:

    humbled output = stretch x standardised + shift

    if stretch = 1, shift = 0: output equals the standardised number unchanged
    if stretch = 2, shift = 3: output = 2 x standardised + 3  (scale restored, centre moved)

  These two dials are the ONLY two parts of the humbler the answer key touches.

  FOUR numbers total per inspector, and they are NOT all the same kind:

    SETTLED MIDDLE  -- remembered running average; answer key NEVER touches it
    SETTLED SCATTER -- remembered running average; answer key NEVER touches it
    STRETCH         -- tuned by wrongness (a regular dial)
    SHIFT           -- tuned by wrongness (a regular dial)

  The settled middle and scatter are running averages of what the inspector has seen so far.
  THE DIARY, worked by hand -- each new handful of 64 photos updates a plain running average:

    first 64 photos:  this handful's middle = 25  →  diary = 25
    next  64 photos:  this handful's middle =  5  →  diary = (25 + 5) / 2 = 15
    next  64 photos:  this handful's middle =  9  →  diary = (25 + 5 + 9) / 3 = 13

  The diary blends ONLY the middles it has seen. The back-of-card answer -- cat? right?
  wrong? -- NEVER touches it. That is what "remembered, not tuned" means.

  Why the diary exists: at exam time ONE photo arrives alone, with no handful of 64 to
  average over. The humbler cannot compute a fresh middle from a single photo, so it uses
  the frozen diary number (13 above) instead. Every exam photo gets the same humbling, no
  matter which photos happen to sit beside it -- an honest, steady grade.
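
  The diary as a sketch. The first part reproduces the plain running average worked above.
  The second part is a hedge: Keras's BatchNormalization, to my knowledge, keeps an
  exponentially weighted average instead (old diary x momentum + new middle x
  (1 - momentum), momentum 0.99 by default), so the plain average here is the
  hand-friendly version of the same idea. `ema_update` is a made-up helper name.

```python
handful_middles = [25.0, 5.0, 9.0]

# Plain running average, as worked by hand above.
diary = 0.0
for count, middle in enumerate(handful_middles, start=1):
    diary += (middle - diary) / count   # fold the new middle in with weight 1/count
    print(count, diary)                 # 1 25.0, then 2 15.0, then 3 13.0

def ema_update(diary_value, new_middle, momentum=0.99):
    """Exponentially weighted diary: mostly the old value, a little of the new one."""
    return momentum * diary_value + (1.0 - momentum) * new_middle

print(round(ema_update(25.0, 5.0), 2))  # 24.8
```

  Either way, the back of the card never appears in the update.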

  ORDER inside the deep factory (this order is fixed, never shuffled):
    inspector floor → HUMBLER → shrink boss → send-home

  Steady FIRST. Then shrink. Then silence.


  >> YOUR TURN
     Inspector 2: settled middle = 10, settled scatter = 4, stretch = 1, shift = 0.
     A number in the pile reads 18. What does the humbler write in its place?

     check your slate: standardised = (18 - 10) / 4 = 8 / 4 = 2.0.
     humbled output = 1 x 2.0 + 0 = 2.0.
     The humbler replaced 18 with 2.0.
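
  The whole humbler for a single number, as a sketch of the exam-time path where the middle
  and scatter are already settled. `humble` is a made-up helper name.

```python
def humble(number, middle, scatter, stretch=1.0, shift=0.0):
    """Standardise with the settled middle and scatter, then apply stretch and shift."""
    standardised = (number - middle) / scatter
    return stretch * standardised + shift

print(humble(30, 25, 5))                      # 1.0  (worked example)
print(humble(18, 10, 4))                      # 2.0  (YOUR TURN)
print(humble(30, 25, 5, stretch=2, shift=3))  # 5.0  (2 x 1.0 + 3)
```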


  ## Send-Home -- Randomly Silencing Numbers (Dropout)

  IN HAND: humbler standardises inspector 1's 65,536 numbers to middle 0, scatter 1
  (settled middle 25, scatter 5 from above). Stretch = 1, shift = 0. Humbler done.
  Then the shrink boss halves to 256 per inspector. This section silences random entries
  AFTER the boss and before the next inspector floor.

  The problem send-home solves: SECRET TEAMS. Over many loops, inspector 7 might learn
  to score +10 at a spot exactly when inspector 22 scores -10 there, and the clerk floor
  below always uses 7 and 22 together to cancel noise. This secret team works perfectly on
  the study pile and fails on new photos -- it is memorisation, not pattern-finding.

  The cure: before every training step, flip a coin for every number on every sheet.

    heads (75% chance): keep the number
    tails (25% chance): replace it with 0

  One entry, one coin. The 25% is NOT "zero one entry per group of four" -- every entry
  flips its own independent coin. Sometimes two in a row get zeroed; sometimes none do.
  The secret team cannot rely on both members showing up, so it never forms.

  To keep the surviving numbers' expected total unchanged, scale them up:

    surviving entry → surviving entry x (1 / (1 - 0.25)) = surviving entry x 4/3

    Example on four entries [7, -3, 0, 5], coins: heads, tails, heads, heads:
      7  → heads → 7 x 4/3 ≈ 9.33     kept, scaled up
      -3 → tails → 0                  zeroed
      0  → heads → 0 x 4/3 = 0        kept, still zero
      5  → heads → 5 x 4/3 ≈ 6.67    kept, scaled up
      output: [9.33, 0, 0, 6.67]

  Clerk count for 65,536 entries: 65,536 coin flips + roughly 49,152 multiplications
  (about 75% survive) = ~114,688 steps.

  CRITICAL: send-home is OFF at exam time. One sealed photo arrives; every number passes
  through unchanged. No coins, no zeroing, no scaling. The humbler's remembered middle and
  scatter take over for steady scaling; send-home simply steps aside.


  >> YOUR TURN
     Six entries: [4, -2, 8, 0, 3, 6]. Coin results in order: tails, heads, heads, tails,
     heads, tails. Send-home rate 25%, scale factor 4/3. What is the output?

     check your slate:
       4  → tails → 0
      -2  → heads → -2 x 4/3 = -8/3 ≈ -2.67
       8  → heads →  8 x 4/3 = 32/3 ≈ 10.67
       0  → tails → 0
       3  → heads →  3 x 4/3 = 4.0
       6  → tails → 0
     output: [0, -2.67, 10.67, 0, 4.0, 0]
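
  Both send-home examples, plus a check that the 4/3 scaling keeps the average where it
  was. A minimal sketch: `send_home` is a made-up helper that takes the coin results as an
  explicit list so the hand-worked answers can be reproduced.

```python
import numpy as np

def send_home(entries, heads, rate=0.25):
    """heads[i] True = keep entry i and scale it up; False = replace it with 0."""
    entries = np.asarray(entries, dtype=float)
    heads = np.asarray(heads, dtype=bool)
    return np.where(heads, entries / (1.0 - rate), 0.0)

worked = send_home([7, -3, 0, 5], [True, False, True, True])
your_turn = send_home([4, -2, 8, 0, 3, 6], [False, True, True, False, True, False])
print(np.round(worked, 2))
print(np.round(your_turn, 2))

# Real coins: every entry flips independently, heads about 75% of the time.
rng = np.random.default_rng(0)
big_pile = np.ones(65_536)
coins = rng.random(big_pile.size) >= 0.25
print(send_home(big_pile, coins).mean())   # lands close to 1.0, the original average
```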


  ## The Deep Factory -- Full Build (Q7)

  IN HAND: simple factory, 282,250 dials, ~60% at five loops. Two new floors: humbler
  (standardise over the 64-photo handful, then stretch + shift, four numbers kept per
  inspector), send-home (25% coin, off at exam time). This section inserts them into the
  simple factory in the fixed order: floor → HUMBLER → boss → send-home.

    photo 32 x 32 x 3
    → inspector floor 1 (32 inspectors, 3x3x3 papers)   → 32 score-sheets, 32x32
    → HUMBLER (65,536 numbers per inspector, pooled over 64 photos)  → 32x32x32
    → boss (keep loudest of each 2x2)                    → 16x16x32
    → send-home (zero 25% of entries at random)
    → inspector floor 2 (64 inspectors, 3x3x32 papers)  → 64 score-sheets, 16x16
    → HUMBLER                                            → 16x16x64
    → boss                                               → 8x8x64
    → send-home (zero 25%)
    → iron (flatten 8x8x64)                              → 4096
    → clerk floor, Dense(64, relu)                       → 64
    → send-home (zero 50% -- clerks get the stricter coin)
    → clerk floor, Dense(10, softmax)                    → 10 chances

  Dial count for the deep factory. The humbler keeps FOUR numbers per inspector, and the
  machine stores and counts all four -- the two tuned (stretch, shift) AND the two
  remembered (settled middle, settled scatter). So a humbler over 32 inspectors carries
  32 x 4 numbers, not 32 x 2:

    Conv1:    32 x (3x3x3 + 1 nudge) = 32 x 28  =     896
    Humbler1: 32 x 4 numbers each              =     128
    Conv2:    64 x (3x3x32 + 1 nudge) = 64 x 289 = 18,496
    Humbler2: 64 x 4 numbers each              =     256
    Dense64:  4096 x 64 + 64                    = 262,208
    Dense10:    64 x 10 + 10                    =     650
                                          total: 282,634

  The two humblers add 384 numbers (128 + 256). The simple factory had 282,250; the deep
  factory has 282,634 -- barely more, but far steadier in training. (If you only counted
  the two TUNED dials per inspector you would get 282,442; the machine's own count_params
  reports 282,634 because the remembered diary numbers are stored too.) Send-home, boss,
  and iron add ZERO numbers -- they only zero, shrink, and reshape.
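
  The three totals above, recomputed as a sketch. The helper names are made up; the
  formulas are the ones in the table (27 or 288 dials plus one nudge per inspector, and
  inputs x clerks plus one nudge per clerk).

```python
def conv_dials(inspectors, paper_h, paper_w, depth_in):
    """Dials on one inspector floor: one paper plus one nudge per inspector."""
    return inspectors * (paper_h * paper_w * depth_in + 1)

def dense_dials(inputs, clerks):
    """Dials on one clerk floor: one per input per clerk, plus one nudge per clerk."""
    return inputs * clerks + clerks

simple_total = (
    conv_dials(32, 3, 3, 3)        # 896
    + conv_dials(64, 3, 3, 32)     # 18,496
    + dense_dials(8 * 8 * 64, 64)  # 262,208
    + dense_dials(64, 10)          # 650
)

tuned_humbler = (32 + 64) * 2        # stretch + shift per inspector
remembered_humbler = (32 + 64) * 2   # settled middle + settled scatter per inspector

print(simple_total)                                        # 282250
print(simple_total + tuned_humbler)                        # 282442
print(simple_total + tuned_humbler + remembered_humbler)   # 282634
```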

  Note on the batch axis: at training time, 64 photos pass through the factory together.
  Inspector 1 therefore makes 64 sheets at once -- one per photo. The HUMBLER reaches
  ACROSS all 64 photos' worth of sheets to compute the average and scatter. That is the
  only floor that sees across the batch. Every other floor -- boss, send-home, clerks --
  is photo-blind: each photo's sheets never touch another photo's sheets.


  ## The Confusion Sheet -- Which Animals Does the Machine Mix Up? (Q8)

  IN HAND: deep factory trained and exam pile sealed (10,000 test photos, roughly 1,000
  per animal). The machine gives 10 chance-scores per photo. This section builds the pile
  model that names the single most-confused animal pair.

  Lay 10 empty piles on the table, one per TRUE animal. Flip through every test card:

    each card → drop onto the pile named by its BACK (the true animal)
                NEVER by its front (the guess)

  When the flipping is done, every card on the CAT pile truly IS a cat. Some were guessed
  right; some were guessed wrong -- but they all truly are cats.

  Now open one pile and sort it by FRONT (the guess):

    CAT pile (all cards truly cat, say 1,000 of them):
      guessed cat   : 380   ← matched (truth = guess)
      guessed dog   :  95   ← biggest stray
      guessed deer  :  90
      guessed bird  :  85
      guessed horse :  80
      (the other five animals share the remaining 270, none above 95)

  Do that for all 10 piles. Stack the sorted piles together:

                       GUESSED →
              plane auto bird  CAT  deer  DOG  frog horse ship truck
    TRUE
    plane  [ ...                                                     ]
    ...
    CAT    [ ...            380   ...   95  ...                     ]  ← row 3
    ...
    DOG    [ ...             88  ...   390  ...                     ]  ← row 5
    ...

  One pile = one ROW (the true animal).
  Sorted counts within a pile = the COLUMNS (what it was guessed as).
  Where a row meets its own column = the DIAGONAL (matched, hits).
  Everything off the diagonal = a miss.

  All 100 cells sum to 10,000 (every card sits somewhere on the sheet).
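
  The pile model on a toy scale, as a sketch: three animals and eight made-up cards instead
  of ten animals and 10,000. Each card adds 1 to the cell at (its back, its front).

```python
import numpy as np

animals = ["cat", "dog", "deer"]             # toy: 3 animals, numbered 0, 1, 2
truth = np.array([0, 0, 0, 1, 1, 2, 2, 0])   # back of each card (made-up)
guess = np.array([0, 1, 0, 1, 0, 2, 2, 1])   # front of each card (made-up)

sheet = np.zeros((3, 3), dtype=int)
np.add.at(sheet, (truth, guess), 1)          # row = truth, column = guess

print(sheet)
print(sheet.sum(axis=1))   # cards per pile: one count per TRUE animal
print(sheet.sum())         # every card sits somewhere on the sheet
```

  Row sums depend only on the backs of the cards, which is why a row is a pile.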

  FINDING THE BIGGEST MISS:

  Zero the diagonal (hits are not mistakes) and take the largest remaining number.

    In the example: CAT row, DOG column = 95.
    (row 3, col 5) = (true cat, guessed dog).

  Reading a cell: (3, 5) means row 3 (truth = cat), column 5 (guess = dog). Rows and
  columns count from 0: plane = 0, auto = 1, bird = 2, cat = 3, deer = 4, dog = 5.
  Meaning: go to the CAT pile. The biggest STRAY corner -- 95 cards truly cat but
  called dog -- is the largest wrong-corner anywhere on the whole 10x10 sheet.

  Cat-and-dog at 32x32 pixels are furry four-legged blobs at the same scale. The machine's
  biggest confusion is sensible, not random. (32x32 is the DATA's limit, not a factory
  stupidity -- even a human eye struggles at that resolution.)

  THE COPY TRAP:

  The test cell checks that the confusion sheet still sums to 10,000. If you zero the
  diagonal on the REAL sheet to find the biggest miss, the hits vanish, the sum drops
  below 10,000, and the test crashes.

    WRONG: np.fill_diagonal(q8_conf_matrix, 0)   <- mutates the real sheet, sum breaks
    RIGHT: off = q8_conf_matrix.copy()            <- always work on a copy
           np.fill_diagonal(off, 0)
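
  Why the trap bites, shown on a toy 3x3 sheet with made-up counts: `np.fill_diagonal`
  works in place and hands back nothing, so whatever array you pass it is the one that
  loses its hits.

```python
import numpy as np

real_sheet = np.array([[380,  95,  12],
                       [ 88, 390,   9],
                       [ 15,  11, 400]])   # toy sheet, made-up counts
total_before = real_sheet.sum()

off = real_sheet.copy()
result = np.fill_diagonal(off, 0)          # changes `off` in place, returns None

row, col = np.unravel_index(np.argmax(off), off.shape)
print(result)                              # None
print((int(row), int(col)), int(off[row, col]))
print(real_sheet.sum() == total_before)    # the real sheet still holds every card
```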


  ## One Number Per Factory -- The Selection Rule (Q9)

  IN HAND: three factories trained (simple, augmented, deep). Each produced a 10x10
  confusion sheet. This section picks the winner with one number per factory.

  Do NOT pick per animal -- that gives 10 numbers per factory and no single winner.
  Use the ONE number that summarises the whole sheet:

    accuracy = (sum of the diagonal) / (total cards on the sheet)

  At five training loops, the three factories land close together -- all near 60%:

    simple (Q3):     diagonal sum / 10,000 ≈ 0.602
    augmented (Q6):  diagonal sum / 10,000 ≈ 0.605 to 0.610  (jiggles help a little even at 5)
    deep (Q7):       diagonal sum / 10,000 ≈ 0.605           (humbler barely shows at 5 loops)

  Largest wins. The gap is small because five loops is not enough to show the long-run
  payoff. Over fifty or a hundred loops the deep and augmented factories keep climbing; the
  plain one stalls. The comparison is honest: pick whichever scored highest on the sealed
  pile you never touched during training.

  The confusion sheet stays whole -- never altered to make this calculation.
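
  The selection rule as a sketch. The three sheets below are toy 2x2 stand-ins with made-up
  counts, chosen only so the arithmetic is easy to follow; `sheet_accuracy` is a made-up
  helper name.

```python
import numpy as np

def sheet_accuracy(sheet):
    """One number per factory: diagonal sum over total cards."""
    return np.trace(sheet) / sheet.sum()

# Toy sheets (made-up counts), standing in for the real 10x10 ones.
sheets = {
    "simple":    np.array([[60, 40], [40, 60]]),
    "augmented": np.array([[61, 39], [39, 61]]),
    "deep":      np.array([[62, 38], [41, 59]]),
}

scores = {name: float(sheet_accuracy(sheet)) for name, sheet in sheets.items()}
winner = max(scores, key=scores.get)   # largest accuracy wins

print(scores)
print(winner)
```

  `np.trace` only reads the diagonal, so the sheet itself is left whole.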


  ## Reading the Filters -- What Shape Are the Magic Papers? (Q10)

  IN HAND: trained simple factory (model_cnn from Q3). Floor 1 = Conv2D(32, 3x3).
  This section cracks open that floor and reads the shape of its 32 magic papers.

  Every inspector floor stores TWO piles in the machine's memory:

    filters = the DIALS (the magic papers themselves)   shape (3, 3, 3, 32)
    biases  = the NUDGES (one per inspector)             shape (32,)

  The nudges are the "+32" you counted in Q3's dial total. Q10 asks only for the
  dials' shape -- grab both piles but only store the first.

  Reading the shape (3, 3, 3, 32):

    (3, 3, 3, 32)
      |  |  |   |
      |  |  |   32 inspectors (how many magic papers)
      |  |  3   color-sheets of dials  (R, G, B -- one sheet per color)
      |  3       wide  (the paper is 3 dots wide)
      3           tall  (the paper is 3 dots tall)

  The magic paper is 3 wide x 3 tall (the little window that moves over the 32x32
  photo -- NOT 32x32 itself), 3 deep for the three colors, and there are 32 of them.

  Both views of the paper fit the same shape:

    "1 paper, 3 colors deep, 27 dials"  =  "3 color-sheets of dials that ADD to 1 number"

  q10_filter_shape = (3, 3, 3, 32).
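
  To see how one paper of 27 dials becomes one number, here is a sketch on stand-in arrays
  with the same shapes. The values are random, not the trained dials, and the patch is a
  made-up 3x3 corner of a photo.

```python
import numpy as np

rng = np.random.default_rng(0)
filters = rng.normal(size=(3, 3, 3, 32))   # stand-in dials, same shape as floor 1
biases = np.zeros(32)                      # stand-in nudges, one per inspector

paper_0 = filters[:, :, :, 0]              # inspector 0's magic paper
print(paper_0.shape, paper_0.size)         # (3, 3, 3) 27

patch = rng.random((3, 3, 3))              # one 3x3 patch of a photo, 3 colours deep

# 27 multiplications, added up across all three colour-sheets, plus the nudge: ONE score.
score = float(np.sum(patch * paper_0) + biases[0])

# All 32 inspectors looking at the same patch at once.
all_scores = np.tensordot(patch, filters, axes=3) + biases
print(all_scores.shape)                    # (32,)
```

  The last axis picks the inspector; the first three axes are the paper he holds.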


  ## Common Tripwires

  Built from the live lab session. Every confusion actually hit, in the order it bit.

  TRIPWIRE 1  The batch axis -- "1 photo → 1 sheet."
    At training time, 64 photos pass through together. Inspector 1 makes 64 sheets,
    one per photo. 32 inspectors x 64 photos = 2,048 sheets in flight simultaneously.
    The HUMBLER is the only floor that reaches across all 64 photos' sheets to compute
    the average and scatter. Every other floor is photo-blind.

  TRIPWIRE 2  "The humbler has 4 numbers -- aren't they all tuned by wrongness?"
    No. Only stretch and shift are dials tuned by wrongness. Settled middle and settled
    scatter are running averages -- they are REMEMBERED but the answer key never touches
    them. They are used only at exam time when a single photo arrives.

  TRIPWIRE 3  "Send-home zeroes workers, or the whole image."
    Neither. It zeroes NUMBERS on the scratch sheets (individual entries). Inspector 3
    still runs, his paper still moves, the photo is unchanged. Some of his output
    entries get set to zero; the others survive and are scaled up.

  TRIPWIRE 4  "Floor-2 magic paper must be 16x16 -- that is the sheet size after the boss."
    No. The paper is always small (3x3) and moves one step at a time. Its DEPTH
    auto-stretches to match what arrives (32 deep, since 32 inspector sheets fed in).
    A 16x16 paper couldn't move -- one position, no neighborhoods.

  TRIPWIRE 5  "The machine never forgets -- but we throw away the sheets?!"
    Two separate things.
    DIALS: never discarded; nudged every loop; kept forever. That is the learning.
    SHEETS: scratch paper; made for one photo; used; binned. Every photo, every loop.
    Even the 256 kept numbers after the boss get binned once the answer is computed.
    "Learns forever" = dials.  "Forgets instantly" = sheets.

  TRIPWIRE 6  "The boss deletes 75% of each inspector's work -- 768 of 1024 scores."
    Yes, and that is the design. Four near-identical alarm readings in one 2x2 region
    say no more than one. The loudest carries the information. The quiet three are dead
    for that photo on that pass -- the blame-nudging from later floors never flows back
    through them. On the next photo a different one of the four may be the loudest.

  TRIPWIRE 7  "1 in 4 send-home means group the entries into groups of 4 and zero one."
    No. Each entry independently flips its own coin. Sometimes two in a row get zeroed;
    sometimes none do. The independence is what breaks the secret teams.

  TRIPWIRE 8  Zeroing the diagonal to find the biggest miss.
    Always zero it on a COPY, never on the real sheet. The test cell checks
    np.sum(q8_conf_matrix) == 10,000. Zeroing the real diagonal drops the sum and crashes.

  TRIPWIRE 9  "Flip-safe means the image is symmetric."
    Not the same thing. A flip is safe if the FLIPPED CARD is STILL a true example
    of the label -- the label survives the mirror.
    cat → mirror → still a cat  ✓
    "2" → mirror → backwards "2"  ✗  (not a valid 2 anymore)
    You do not need the cat to be symmetric; you need CAT-NESS to survive the flip.

  TRIPWIRE 10  "32x32 is the factory's limit -- a better factory would fix it."
    The ceiling is in the DATA, not the factory. At 32 dots a side, even a human eye
    cannot reliably tell a blurry cat from a blurry dog. The machine's biggest confusion
    (cat↔dog) is the same pair that fools humans at that resolution. Sensible machine,
    sensible mistakes.

  TRIPWIRE 11  "get_weights() returns one pile -- the dials."
    It returns TWO: dials (shape (3,3,3,32)) and nudges (shape (32,)).
    Unpack both; store only the dials' shape for Q10.

  TRIPWIRE 12  "The 32 score-sheets after floor 1 are some kind of colour sheets."
    No. Colour meaning DIES at floor 1. Each of the 32 numbers at a given spot is the
    answer to a DIFFERENT QUESTION about that neighborhood:

      at (5,5):  inspector 1  "redness here?"  7.2
                 inspector 2  "top edge here?" 0.3
                 inspector 3  "corner here?"   5.1
                 ... 32 questions, 32 answers.

    Not 32 colour channels. 32 different measurements of the same region.


  ## The Labels, Last

    handful of 64 photos          batch (batch_size=64)
    dial-turn per handful         gradient update / optimiser step
    jiggled copy                  augmented sample
    jiggles (tilt, shove, mirror) data augmentation
    flip-safe label               label-preserving transform
    jiggle tool                   ImageDataGenerator (Keras)
    humbler / steadier            batch normalisation (BatchNorm, BN)
    settled middle                running mean
    settled scatter               running variance (= scatter squared); scatter = std
    stretch dial                  gamma (γ)
    shift dial                    beta (β)
    send-home (25% or 50%)        dropout (rate 0.25 or 0.5)
    scale factor 4/3              inverted dropout scaling: 1 / (1 - rate)
    confusion sheet               confusion matrix
    pile = one row                true class
    guess = column                predicted class
    diagonal = matched counts     true positives (per class)
    off-diagonal = misses         misclassifications
    one-number accuracy           overall accuracy = trace(C) / sum(C)
    filter dials                  filter weights / kernel weights
    shape (3, 3, 3, 32)           (height, width, in_channels, filters)
    deep factory order            Conv → BN → MaxPool → Dropout
    nudges                        biases


  ## Code, If You Want It

  Nothing above needed a computer; this section is for the day you meet one.

  Train the simple factory in handfuls (Q5):

  ```python
  history_cnn = model_cnn.fit(
      X_train, y_train,
      epochs=5,
      batch_size=64,               # handful of 64 cards per dial-turn
      validation_data=(X_val, y_val),
  )
  q5_val_acc = round(float(history_cnn.history['val_accuracy'][-1]), 3)
  # ['val_accuracy'][-1] = the exam-pile score after the last of the 5 loops; expect ~0.60
  ```

  The augmented factory (Q6) -- same shape as Q3, built fresh, jiggled cards:

  ```python
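  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
  from tensorflow.keras.preprocessing.image import ImageDataGenerator

  datagen = ImageDataGenerator(
      rotation_range=15,           # tilt up to 15 degrees
      width_shift_range=0.1,       # shove left/right up to 10%
      height_shift_range=0.1,      # shove up/down up to 10%
      horizontal_flip=True,        # mirror left-right (all CIFAR things are flip-safe)
  )

  # FRESH machine, not the trained one. The 7 floors follow the simple factory as listed
  # in IN HAND; relu and padding='same' are ASSUMED to match Part 1's Q3 build.
  model_aug = Sequential([
      Conv2D(32, (3,3), activation='relu', padding='same', input_shape=(32,32,3)),
      MaxPooling2D((2,2)),
      Conv2D(64, (3,3), activation='relu', padding='same'),
      MaxPooling2D((2,2)),
      Flatten(),                   # 8x8x64 = 4096 numbers
      Dense(64, activation='relu'),
      Dense(10, activation='softmax'),
  ])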
  model_aug.compile(loss='sparse_categorical_crossentropy',
                    optimizer='adam', metrics=['accuracy'])

  history_aug = model_aug.fit(
      datagen.flow(X_train, y_train, batch_size=64),  # grab-64, each card freshly jiggled
      epochs=5,
      validation_data=(X_val, y_val),    # exam pile: NEVER jiggled
  )
  q6_aug_val_acc = round(float(history_aug.history['val_accuracy'][-1]), 3)
  ```

  The deep factory (Q7):

  ```python
  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import (
      Conv2D, MaxPooling2D, Flatten, Dense,
      BatchNormalization, Dropout
  )

  model_deep = Sequential([
      Conv2D(32, (3,3), activation='relu', padding='same', input_shape=(32,32,3)),
      BatchNormalization(),    # humbler: standardise over the 64-photo batch
      MaxPooling2D((2,2)),     # boss: keep loudest of each 2x2, halve the sheet
      Dropout(0.25),           # send-home: zero 25% of entries at random

      Conv2D(64, (3,3), activation='relu', padding='same'),
      BatchNormalization(),
      MaxPooling2D((2,2)),
      Dropout(0.25),

      Flatten(),               # iron flat: 8x8x64 = 4096 numbers
      Dense(64, activation='relu'),
      Dropout(0.5),            # stricter coin at the clerk level
      Dense(10, activation='softmax'),
  ])

  model_deep.compile(
      loss='sparse_categorical_crossentropy',
      optimizer='adam',
      metrics=['accuracy'],
  )
  ```

  The confusion sheet (Q8):

  ```python
  from sklearn.metrics import confusion_matrix
  import numpy as np

  preds   = model_deep.predict(X_test)       # 10 chances per photo
  y_guess = np.argmax(preds, axis=1)         # loudest chance → one animal per photo

  q8_conf_matrix = confusion_matrix(y_test, y_guess)   # 10x10 pile-sort
                                                        # rows = truth, cols = guess

  off = q8_conf_matrix.copy()               # ALWAYS copy first
  np.fill_diagonal(off, 0)                 # zero matched corners (hits not mistakes)
  i, j = np.unravel_index(np.argmax(off), off.shape)
  q8_most_confused = (int(i), int(j))       # (true animal row, guessed animal col)
  ```

  Reading the filter shape (Q10):

  ```python
  filters, biases = model_cnn.layers[0].get_weights()   # two piles: dials and nudges
  q10_filter_shape = filters.shape                       # (3, 3, 3, 32)
  ```