==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
CHAPTER 9 . MACHINES THAT LOOK AT PICTURES . PART 1
A Magic Paper Slid Over a Photo: How a Picture Network Sees
Posted: 2026-06-12 . Author: Rahul Rai . Tags: cnn, convolution, pooling, computer-vision
============================================================================================
Every machine so far read a flat row of numbers: thirty tumour measurements, or 784 pixels
laid out in one long line. That works until the numbers are a PHOTOGRAPH, where what matters
is not the value of any one pixel but how pixels sit NEXT to each other -- an edge is a
bright row above a dark row, a corner is where two edges meet. Flatten a photo into a line
and you throw away every "next to," because in a flat line the pixel directly above or below
sits a whole row away (28 places in a 28-wide picture), nowhere near its true neighbour. This
chapter builds the machine that keeps the neighbours: it slides a tiny
window over the picture and looks at small patches, one at a time.
Fair warning -- this is the most confusing machine in the book, and it confused me badly the
first time. So we go painfully slowly: one worker, one window, one patch, real numbers, and
every wrong mental picture corrected the moment it bites. If you have not read the earlier
chapters, here is all you need: a network learns numbers called DIALS by rolling downhill,
each worker multiplies its inputs by its dials and adds them up, and the zero-out rule (ReLU)
forces negatives to zero. Everything else, we build here.
## Sheet of Colour Photos, and Goal
A new pile of data, and I say this louder than usual because it trips people: this is NOT the grey clothing of
Chapter 8. These are colour photographs, and colour changes the shape of everything.
Chapter 8: grey clothes, 28 x 28 = 784 numbers, one colour
this post: colour photos, 32 x 32 = 1024 spots, x 3 colours = 3,072 numbers each
Each photo is 32 spots across, 32 spots down. But every spot carries THREE numbers, not one:
how red it is, how green, how blue -- each from 0 to 1 (we divide the raw 0-to-255 by 255).
So one spot at position (5,5) might read red 0.8, green 0.2, blue 0.2, which your eye reads
as "reddish." A photo is really three stacked sheets -- a red sheet, a green sheet, a blue
sheet -- each 32 by 32.
The answer is one of ten bins:
0 airplane 1 automobile 2 bird 3 cat 4 deer
5 dog 6 frog 7 horse 8 ship 9 truck
The goal: read the 3,072 numbers of a new photo and name the bin.
## Folding the Flat String Into a Grid
The photo arrives as a flat string -- one long line of 3,072 numbers. The very first move is
to FOLD it back into its true shape: a 32-by-32 square, three colours deep. In code that is
reshape(-1, 32, 32, 3), and it loses nothing -- 32 x 32 x 3 = 3,072, the same numbers, just
arranged as a square again instead of a line.
flat string (3,072 numbers)              folded grid
[120, 135, 98, ...]     --fold-->        32 across
                                       +------------+
                                       |  picture   |   32 down
                                       |   here     |   x 3 colour sheets
                                       +------------+
Why bother? Because the whole trick of this chapter is a window that moves ACROSS and DOWN
over the picture. A flat line has no across and down -- no neighbourhoods. The square does.
Moving a window over a two-direction square is exactly what gives the machine its name:
Conv-2D, the 2D being the two directions. Cut the pile the usual three ways first -- study
60%, practice 20%, sealed exam 20% -- then fold each photo.
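The cut-then-fold order can be sketched in NumPy. This is only a minimal sketch on a made-up pile of 50 random "photos"; the names pile, study, practice, exam and fold are mine, and I am assuming each flat string lists a spot's red, green and blue together, which is the order reshape(-1, 32, 32, 3) expects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up pile: 50 flat photos, 3,072 raw values (0..255) each.
pile = rng.integers(0, 256, size=(50, 3072))

# Cut three ways after a shuffle: 60% study, 20% practice, 20% sealed exam.
order = rng.permutation(len(pile))
n_study = len(pile) * 60 // 100
n_practice = len(pile) * 20 // 100
study = pile[order[:n_study]]
practice = pile[order[n_study:n_study + n_practice]]
exam = pile[order[n_study + n_practice:]]

def fold(flat):
    """Fold flat strings into 32x32x3 grids and scale 0..255 down to 0..1."""
    return flat.reshape(-1, 32, 32, 3) / 255.0

study_grid = fold(study)
print(study_grid.shape)       # (30, 32, 32, 3)
print(study_grid[0, 5, 5])    # three numbers at one spot: red, green, blue
```

The -1 tells NumPy to work out the number of photos itself, so the same fold works on all three piles.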
## What One Magic Paper Does
Forget the whole machine. One worker, one small window. I will call the window a MAGIC PAPER:
a little 3-by-3 sheet of graph paper carrying nine numbers -- its dials. Lay it on the photo
and it covers a 3-by-3 patch of nine spots. It multiplies each spot under it by the matching
dial, adds the nine products into ONE number, and writes that number down. A big number
means "the shape I am tuned to find is HERE." A near-zero means "not here."
Make this concrete with a magic paper tuned to find a TOP EDGE -- bright above, dark below.
Its nine dials (made up, to show the mechanism):
1 1 1
0 0 0
-1 -1 -1
Now lay it on a patch that really is a top edge: a bright row (10s) over a neutral row (0s)
over a dark row (-10s). These are made-up values for easy sums, not real 0-to-1 pixel numbers:
patch: 10 10 10
0 0 0
-10 -10 -10
multiply-add:
top row: (10 x 1) + (10 x 1) + (10 x 1) = 30
middle row: (0 x 0) + (0 x 0) + (0 x 0) = 0
bottom row: (-10 x -1)+ (-10 x -1)+ (-10 x -1) = 30
total = 30 + 0 + 30 = 60 <- BIG: "edge found here!"
Now lay the SAME magic paper on a flat grey patch, every spot 5:
(5 x 1) x 3 + (5 x 0) x 3 + (5 x -1) x 3 = 15 + 0 - 15 = 0
<- "nothing here"
Same nine dials, two patches: 60 versus 0. THAT is how a magic paper finds a shape -- it
lights up over the pattern it matches and stays dark everywhere else. A "shape to find" is
nothing more mystical than a particular pattern of nine dials. And the machine does not set
those dials by hand: they start random and get tuned downhill, loop after loop, until they
settle on patterns worth looking for.
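The two sums above can be checked in a few lines of NumPy. A minimal sketch: the dials and patches are the made-up ones from this section, and the helper name score is my own.

```python
import numpy as np

# The made-up top-edge paper from the text: nine dials.
paper = np.array([[ 1,  1,  1],
                  [ 0,  0,  0],
                  [-1, -1, -1]])

edge_patch = np.array([[ 10,  10,  10],
                       [  0,   0,   0],
                       [-10, -10, -10]])
grey_patch = np.full((3, 3), 5)

def score(patch, paper):
    """Multiply each spot by its matching dial, then add the nine products."""
    return int(np.sum(patch * paper))

print(score(edge_patch, paper))   # 60
print(score(grey_patch, paper))   # 0
```

The * here is spot-by-spot multiplication, not matrix multiplication, which is exactly the "matching dial" rule.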
>> YOUR TURN
The same top-edge magic paper (dials: row of 1s, row of 0s, row of -1s) lands on a patch
that is dark on top and bright below -- the OPPOSITE of an edge it likes:
patch: -10 -10 -10
0 0 0
10 10 10
What score does it write? (Work the three rows.)
check your slate:
top: (-10 x 1) x 3 = -30 ; middle: 0 ; bottom: (10 x -1) x 3 = -30.
total = -30 + 0 - 30 = -60. A big NEGATIVE -- this magic paper screams "the edge here is
upside-down from the one I hunt." (After the zero-out rule, -60 becomes 0: not my shape.)
## One Inspector, One Scan -- and the Traps
One worker carrying one magic paper is an INSPECTOR. Here is what an inspector actually does,
and here are the three wrong pictures everyone (me included) builds first.
TRAP 1 -- "the magic paper sits on one spot." No. It is 3 by 3; it covers NINE spots at once,
centred on one. The nine comes from the paper's size, not from the spot.
+---+---+---+
| d | d | d |     o = the centre spot the paper sits on
+---+---+---+     d = its eight neighbours
| d | o | d |     nine spots read, ONE score written (placed at o)
+---+---+---+
| d | d | d |
+---+---+---+
TRAP 2 -- "first one magic paper scans, then a second magic paper scans the next patch." No.
ONE inspector owns ONE magic paper the whole time -- the same nine dials. Moving to the next
patch is that same paper SLID over. A second magic paper exists only when a second inspector
is hired. Magic papers used by inspector 1: one, forever.
TRAP 3 -- "the scan chops the photo into separate 9-spot tiles, so it runs 1024/9 times." No.
The paper steps ONE spot at a time, so consecutive patches OVERLAP -- each step shares six of
its nine spots with the last:
centre (1,1): reads columns 0,1,2
centre (1,2): reads columns 1,2,3 <- slid +1, shares columns 1,2
centre (1,3): reads columns 2,3,4 <- shares columns 2,3
So the paper visits every spot as a centre. On a 32-by-32 photo that is 32 x 32 = 1,024
centres, one score each. (A small trick called padding -- a border of zeros around the photo
-- lets even the corner spots sit at a centre, so the output stays a full 32 by 32 rather
than shrinking.) The inspector's whole run:
one photo . one inspector . one magic paper . slid to 1,024 centres
-> 1,024 scores written -> ONE finished 32-by-32 score-sheet -> done
And the dials are FROZEN during a scan. They only get nudged BETWEEN loops, by how wrong the
final ten chances turned out. A scan is pure looking; the learning happens after.
Clerk-step count for one inspector on one photo:
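at each centre: 27 multiplications + 26 additions = 53 arithmetic steps
(27, not 9, because each of the nine spots carries three colour numbers;
the next section unpacks this)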
1,024 centres per photo: 53 x 1,024 = 54,272 steps per inspector per photo
32 inspectors on floor 1: 32 x 54,272 = 1,736,704 steps per photo, floor 1 only
Tireless clerks: done before lunch. Pencil-and-paper worker: about a year per photo.
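One inspector's scan can be sketched as a plain double loop. To keep it short I am assuming a single grey sheet (so nine dials, not 27) and a made-up half-bright, half-dark picture; scan_same is my own helper name, not a library call. The step count at the end simply repeats the arithmetic above.

```python
import numpy as np

def scan_same(sheet, paper):
    """Slide one paper over every centre of a single sheet, one spot at a time.

    A border of zeros is added first, so the output keeps the sheet's size.
    """
    k = paper.shape[0]                 # paper width, 3 here
    padded = np.pad(sheet, k // 2)     # zero border, one spot wide for a 3x3 paper
    rows, cols = sheet.shape
    scores = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = padded[r:r + k, c:c + k]       # overlaps the previous patch
            scores[r, c] = np.sum(patch * paper)   # same frozen dials every time
    return scores

paper = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]])   # top-edge paper

# Made-up grey sheet: bright top half (1s), dark bottom half (0s).
sheet = np.zeros((32, 32))
sheet[:16, :] = 1.0

scores = scan_same(sheet, paper)
print(scores.shape, scores.size)   # (32, 32) 1024

# Clerk-step count from the text (27 products, 26 additions per centre).
steps_per_centre = 27 + 26
steps_per_inspector = steps_per_centre * 32 * 32
steps_floor_1 = 32 * steps_per_inspector
print(steps_per_inspector, steps_floor_1)   # 54272 1736704
```

The score-sheet lights up only along the rows where bright meets dark and stays at zero deep inside either half.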
## Colour: Twenty-Seven Numbers Under the Paper
One more layer of truth, because the photo is three colours deep. When the magic paper covers
nine spots, each spot carries three numbers (red, green, blue). So the paper actually sits
over 9 x 3 = 27 numbers, and it carries 27 dials -- nine for red, nine for green, nine for
blue, all living in the same paper. All 27 multiply-adds collapse into ONE single number per
centre. Never three numbers, never a colour triplet out -- one number.
Watch a magic paper tuned to find REDNESS: all nine red-dials are 1, all green- and blue-dials
are 0. Lay it on a reddish patch, every spot red 0.8, green 0.2, blue 0.2:
nine red numbers: 0.8 x 1, nine times = 7.2
nine green numbers: 0.2 x 0, nine times = 0
nine blue numbers: 0.2 x 0, nine times = 0
add all 27 -> score 7.2 written at the centre
Now the same paper on a sky patch, every spot red 0.1, green 0.2, blue 0.9:
nine red: 0.1 x 1, nine times = 0.9 (green and blue ignored)
add all 27 -> score 0.9 written at the centre
7.2 over the red patch, 0.9 over the sky. Keep red, ignore green and blue, and the score IS a
redness measurement. Nobody chose "3 deep" -- the photo's three colours FORCE the paper to be
three deep. The paper's depth always auto-matches whatever arrives. Remember that line; it is
the key to the second floor.
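The redness sums can be sketched with a three-deep paper in NumPy. The patches are the made-up ones above; flat_patch and score are my own helper names, and I am assuming the colour order red, green, blue along the last axis.

```python
import numpy as np

# One paper is 3 down x 3 across x 3 colours deep = 27 dials.
red_finder = np.zeros((3, 3, 3))
red_finder[:, :, 0] = 1.0          # red-dials 1; green- and blue-dials stay 0

def flat_patch(red, green, blue):
    """A 3x3 patch where every spot carries the same (red, green, blue)."""
    return np.tile([red, green, blue], (3, 3, 1))

reddish = flat_patch(0.8, 0.2, 0.2)
sky = flat_patch(0.1, 0.2, 0.9)

def score(patch, paper):
    """All 27 multiply-adds collapse into one number."""
    return float(np.sum(patch * paper))

print(round(score(reddish, red_finder), 2))   # 7.2
print(round(score(sky, red_finder), 2))       # 0.9
```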
>> YOUR TURN
A magic paper has all nine BLUE-dials set to 1 and all red- and green-dials 0 (a sky-finder).
Lay it on the same sky patch (red 0.1, green 0.2, blue 0.9 at every spot). What score?
check your slate:
blue only: 0.9 x 1, nine times = 8.1. Red and green multiplied by 0 add nothing.
Score 8.1 -- the sky-finder lights up bright over sky, where the redness-finder gave only
0.9. Different dials, different shape hunted, same multiply-add.
## A Floor of Thirty-Two Inspectors
IN HAND: one magic paper (3x3 wide, 3 colours deep = 27 dials + 1 nudge = 28 numbers).
One inspector slides it to 1,024 centres, writing one score at each: 54,272 steps total.
The top-edge paper scored 60 on a matching edge, 0 on flat grey, -60 on an inverted edge.
This section hires 32 inspectors at once, each with his own paper hunting a different shape.
One inspector finds one shape. A floor hires many. Conv2D(32, (3,3), padding='same') hires 32 inspectors,
each with his OWN magic paper -- his own 27 dials -- each hunting a different shape (one
vertical edges, one corners, one red blobs, and so on). Each inspector slides his paper over
the whole photo and fills his own fresh 32-by-32 score-sheet. Thirty-two inspectors, 32
score-sheets, stacked:
one 32 x 32 x 3 photo in
-> 32 inspectors slide their papers
-> OUT: 32 x 32 x 32 (still 32 across and down, but now 32 sheets DEEP)
Something important happens to the depth here. It went in as 3 (the colours) and comes out as
32 (the inspectors). And the meaning changed completely. At one spot, the 32 numbers are no
longer colours -- they are 32 different ANSWERS about that neighbourhood:
at (5,5): inspector 1 "redness here?" -> 7.2
inspector 2 "top edge here?" -> 0.3
inspector 3 "corner here?" -> 5.1
... 32 questions, 32 answers
Colour meaning dies at floor 1. From here on, depth means "how many different shapes we are
tracking," not "how many colours."
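A whole floor can be sketched by giving each inspector his own row in one array of papers. This is a slow, readable loop rather than what Keras does internally; the photo and the dials are random stand-ins (real dials would be learned), and I am assuming a border of zeros so the sheets stay 32 by 32.

```python
import numpy as np

rng = np.random.default_rng(0)
photo = rng.random((32, 32, 3))            # made-up colour photo, values 0..1
papers = rng.normal(size=(32, 3, 3, 3))    # 32 papers, each 3 x 3 x 3 = 27 dials
nudges = np.zeros(32)                      # one nudge per inspector

# Zero border on across and down only; the colour depth is left alone.
padded = np.pad(photo, ((1, 1), (1, 1), (0, 0)))

raw = np.zeros((32, 32, 32))
for r in range(32):
    for c in range(32):
        patch = padded[r:r + 3, c:c + 3, :]    # 27 numbers under the papers
        # Each inspector multiply-adds the same patch with his own 27 dials.
        raw[r, c, :] = np.tensordot(papers, patch, axes=3) + nudges

sheets = np.maximum(raw, 0)    # zero-out rule (ReLU)
print(sheets.shape)            # (32, 32, 32)
print(sheets[5, 5, :3])        # three of the 32 answers about the region at (5,5)
```

The last axis went in as 3 colours and comes out as 32 answers, one per inspector.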
>> YOUR TURN
Conv2D(32, (3,3), padding='same') on a 32x32x3 photo. How many separate 3-by-3 magic papers exist on this
floor, and how many score-sheets come out?
check your slate:
32 papers (one per inspector) and 32 score-sheets. The "(3,3)" is the size of each paper;
the "32" is how many inspectors. Output: 32 x 32 x 32.
## A Shrink Boss That Keeps the Loudest
Thirty-two full 32-by-32 score-sheets is a lot of paper. A boss now shrinks each one.
MaxPooling2D((2,2)) chops a sheet into 2-by-2 blocks and keeps ONLY the loudest number in
each block, tossing the other three:
old 4 x 4 (made up):         new 2 x 2:
  7  2 | 1  0
  3  1 | 0  4       -->         7  4
 ------+------                  8  6
  8  0 | 6  2
  1  5 | 3  1
Across and down both halve: 32 x 32 becomes 16 x 16. Depth is untouched -- the boss shrinks
each of the 32 sheets separately, so 32 sheets stay 32 sheets. The count: 1,024 scores on a
sheet become 256 kept, 768 thrown away -- 75% of the inspector's work binned, per sheet.
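The boss's rule fits in one reshape. A minimal sketch on the made-up 4-by-4 sheet above; shrink_boss is my own name, and it assumes the sheet has an even number of rows and columns.

```python
import numpy as np

def shrink_boss(sheet):
    """Keep the loudest number in each 2x2 block of one sheet (even sizes assumed)."""
    rows, cols = sheet.shape
    blocks = sheet.reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))

old = np.array([[7, 2, 1, 0],
                [3, 1, 0, 4],
                [8, 0, 6, 2],
                [1, 5, 3, 1]])
print(shrink_boss(old))
# [[7 4]
#  [8 6]]
```

On a stack, the same function would be applied to each sheet separately, which is why the depth never changes.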
That sounds reckless. It is deliberate. Four nearly-identical alarms crowded into one tiny
region say no more than one alarm does: "my shape is somewhere around here." Keeping the
loudest and dropping the rest loses the exact spot but keeps the finding. The question quietly
changes from "what is at the exact dot (5,5)?" (1,024 answers) to "what is in the region
around (5,5)?" (256 answers) -- and region-sharpness is all the machine needs, since no two
cats ever sit on exactly the same dots anyway. A quiet block is safe to drop: quiet meant "my
shape is NOT here."
One worry worth killing now: does this mean the machine "forgets"? The DIALS are never
forgotten -- they are the learning, nudged every loop and kept forever. The score-sheets are
just scratch paper, made and binned for every single photo. Throwing scratch paper away is not
forgetting the lesson; it is clearing the desk for the next photo.
>> YOUR TURN
A 16-by-16 score-sheet goes through one more MaxPooling2D((2,2)). What size comes out?
check your slate:
Both directions halve: 16/2 = 8 by 8. Depth unchanged. So 16x16 becomes 8x8.
## A Second Floor That Drills Through All Thirty-Two
IN HAND: floor 1 (32 inspectors, each 3x3x3 = 27 dials + 1 nudge = 28 per paper; total
32 x 28 = 896 dials) + boss produced 32 score-sheets, each 16x16. At each spot the 32
numbers are shape-answers (redness, edge-ness, corner-ness...) -- colour meaning died at
floor 1. The boss tossed 75% of scores; the 896 dials are unchanged.
This section hires 64 new inspectors whose papers drill through all 32 sheets at once.
After floor 1 and its boss: 32 sheets, each 16 by 16, stacked -- inspector 1's on top down to
inspector 32's at the bottom, all lined up so that position (5,5) on every sheet covers the
same region of the original photo.
Floor 2 plays the SAME game, with one change that trips everyone. A floor-2 inspector does NOT
read one sheet at a time. His 3-by-3 magic paper covers nine places on the TOP sheet -- but at
each of those nine places he drills straight DOWN through all 32 stacked sheets, exactly as
floor 1's paper drilled through the three colours. Nine places, 32 levels deep:
his paper at centre (5,5):
nine places on the face (centre + 8 neighbours)
at each place, bore down through 32 levels
-> 9 x 32 = 288 numbers pulled up
-> x his 288 dials -> add -> ONE number, written at (5,5) of his fresh sheet
Same move as floor 1, just deeper. And the paper's depth (32) was never typed in -- it
auto-matched the 32 sheets arriving, just as floor 1's paper auto-matched the 3 colours. The
only thing the code names is the paper's width: Conv2D(64, (3,3)). The 64 is the number of
floor-2 inspectors; the (3,3) is the width; the depth takes care of itself.
Why drill through all 32? Because floor 2 is hunting COMBINATIONS. At one region the 32
numbers say "redness 7.2, edge 0.3, corner 5.1, ..." A floor-2 inspector reading all 32 at
once can fire on "fur-texture AND ear-curve in the same region" -- a thing visible only by
reading several floor-1 answers together. Floor 1 found simple shapes; floor 2 combines them
into bigger ones. Floor 2 hires 64 inspectors, makes 64 fresh 16-by-16 sheets, and the boss
halves again to 8 x 8 x 64.
compare:           floor 1                          floor 2
reads:             photo 32x32, 3 deep              stack 16x16, 32 deep
one place holds:   3 numbers (R,G,B)                32 numbers (one per sheet)
paper:             3x3 wide, 3 deep = 27 dials      3x3 wide, 32 deep = 288 dials
writes:            ONE number per centre            ONE number per centre
The shape is identical; only the depth grows. Memorise the shape, not the figures.
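The drill-through at one centre can be sketched in NumPy. The stack and the dials are random stand-ins, so the score itself means nothing; the point is only the count of numbers pulled up.

```python
import numpy as np

rng = np.random.default_rng(1)
stack = rng.random((16, 16, 32))        # made-up output of floor 1 + boss
paper = rng.normal(size=(3, 3, 32))     # depth 32 matches the stack arriving

r, c = 5, 5
patch = stack[r - 1:r + 2, c - 1:c + 2, :]    # nine places, 32 levels deep
score = float(np.sum(patch * paper))           # 288 products, one number out

print(patch.size, paper.size)   # 288 288
```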
## Ironing the Grid Flat, Then the Old Clerks Finish
IN HAND: two floors + two bosses produced 8x8x64. Floor 1: 896 dials. Floor 2: each of 64
papers is 3x3x32 = 288 dials + 1 nudge = 289; 64 x 289 = 18,496 dials. Together the two
conv floors hold only 896 + 18,496 = 19,392 dials -- a bargain, because the papers slide.
This section irons the deep stack into a flat line so the plain clerk-rooms can finish.
After two floors and two bosses we hold 8 x 8 x 64 -- a small, deep stack of score-sheets.
Now we hand it to the plain sorting clerks from the earlier chapters, who read a flat line.
So we IRON the stack flat: 8 x 8 x 64 = 4,096 numbers in one long row. This is pure
re-shelving -- no dials, no arithmetic, just unfolding the grid back into a line (the exact
reverse of the fold we did at the start).
Then two ordinary clerk-floors finish the job, each clerk wired to every incoming number:
Dense(64, relu): 64 clerks, each reads all 4,096 numbers -> 64 numbers out
Dense(10, softmax): 10 clerks, each reads those 64 -> 10 raw scores
-> softmax -> 10 chances adding to 1 -> biggest is the guess
Softmax is the ten-bin exit from Chapter 8: raise e to each score, divide by the total, and
the ten results are chances that add to 1. The inspector and the clerk share one arithmetic
heart -- multiply-add -- but differ in habit: an inspector slides a tiny paper and cares WHERE
a shape sits; a clerk sits still, reads the whole line at once, and weighs all the evidence
with no regard for where it came from.
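The ironing and the two clerk-floors can be sketched in NumPy. The dials here are small random numbers, not learned ones, so the guess is meaningless; the sketch only shows the sizes and that the ten chances add to 1. Subtracting the largest score inside softmax is a common numerical-safety step of my own choosing and does not change the result.

```python
import numpy as np

rng = np.random.default_rng(2)
stack = rng.random((8, 8, 64))          # made-up output of two floors + two bosses

line = stack.reshape(-1)                # ironing flat: 4,096 numbers, no dials

# Untrained, random dials purely for illustration.
w1, b1 = rng.normal(scale=0.01, size=(4096, 64)), np.zeros(64)
w2, b2 = rng.normal(scale=0.01, size=(64, 10)), np.zeros(10)

hidden = np.maximum(line @ w1 + b1, 0)  # Dense(64, relu)
raw = hidden @ w2 + b2                  # ten raw scores

def softmax(scores):
    """Raise e to each score and divide by the total (shifted by the max for safety)."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

chances = softmax(raw)
print(line.shape, hidden.shape, chances.shape)   # (4096,) (64,) (10,)
print(int(np.argmax(chances)))                   # index of the guessed bin
```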
## Whole Machine, Size Flow, and Where the Dials Hide
Here is the entire factory and the size of the paper at every step:
step                      across x down x deep    dials on this step
---------------------------------------------------------------------------
input grid                32 x 32 x 3             0
Conv2D(32, 3x3, same)     32 x 32 x 32            32 x (3x3x3 + 1)  =     896
MaxPooling2D(2x2)         16 x 16 x 32            0
Conv2D(64, 3x3, same)     16 x 16 x 64            64 x (3x3x32 + 1) =  18,496
MaxPooling2D(2x2)          8 x  8 x 64            0
Flatten                   4,096                   0
Dense(64, relu)           64                      4,096 x 64 + 64   = 262,208
Dense(10, softmax)        10                      64 x 10 + 10      =     650
---------------------------------------------------------------------------
                                                  total             = 282,250
Check the two convolution counts by hand. Floor 1: each of 32 papers has 3x3x3 = 27 dials
plus 1 nudge = 28, and 32 x 28 = 896. Floor 2: each of 64 papers has 3x3x32 = 288 dials plus
1 nudge = 289, and 64 x 289 = 18,496. Add everything: 896 + 18,496 + 262,208 + 650 = 282,250.
Now the punchline, and the real lesson of this whole machine. Of the 282,250 dials, 262,208 --
more than nine in ten -- sit in ONE place: the single Dense floor right after the flatten,
wiring 4,096 numbers to 64 clerks. The magic papers are astonishingly CHEAP by comparison
(896 and 18,496) because ONE small paper is reused at every position on its floor (1,024 on
floor 1, 256 on floor 2) instead of needing fresh dials per position. That reuse -- the same
27 (or 288) dials slid everywhere -- is the
entire genius of a picture network. It is why a machine can look at a million-pixel image
without needing a billion dials: the eye is small and moves, rather than huge and fixed.
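The dial count can be reproduced with two tiny helper functions. A minimal sketch in plain Python; conv_dials and dense_dials are my own names, and they only encode the "dials plus one nudge" rule used in the table.

```python
def conv_dials(papers, width, depth_in):
    """Each paper holds width x width x depth_in dials plus 1 nudge."""
    return papers * (width * width * depth_in + 1)

def dense_dials(inputs, clerks):
    """Each clerk holds one dial per input plus 1 nudge."""
    return inputs * clerks + clerks

floor_1 = conv_dials(32, 3, 3)
floor_2 = conv_dials(64, 3, 32)
dense_1 = dense_dials(8 * 8 * 64, 64)
dense_2 = dense_dials(64, 10)
total = floor_1 + floor_2 + dense_1 + dense_2

print(floor_1, floor_2, dense_1, dense_2, total)   # 896 18496 262208 650 282250
print(round(dense_1 / total, 3))                   # 0.929
```

Notice that the grid's across-and-down size never appears in conv_dials: a paper costs the same however large the sheet it slides over.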
>> YOUR TURN
Suppose floor 1 hired 16 inspectors instead of 32 (still 3x3 papers, still colour photos).
How many dials would floor 1 hold?
check your slate:
Each paper still has 3x3x3 + 1 = 28. With 16 papers: 16 x 28 = 448 dials. Half the
inspectors, half the floor-1 dials -- and the papers stay cheap either way.
## Predicting the Grid Size by Hand
IN HAND: our factory's grid runs 32x32x3 -> 32x32x32 -> 16x16x32 -> 16x16x64 -> 8x8x64,
because every Conv2D carries padding='same' (the border of zeros that lets corner spots
be centres, so across-and-down never shrink at a paper step) and every boss halves.
This section shows the OTHER convention -- a bare paper with no border -- and how to
predict the size after every floor on pencil alone.
Whether a paper shrinks the grid is a CHOICE, named in the order, not a law:
padding='same'  -- pad a border of zeros; across and down STAY the same at a Conv floor
padding='valid' -- Keras's name for "no padding", and its default: no border, the paper
                   only sits where it fully fits, so the grid SHRINKS
With no border, a 3-wide paper cannot centre on the very edge column -- it would hang off.
So it starts one column in and stops one column early, losing one column at each end. The
shrink formula counts exactly that:
new size = old size - paper size + 1
a 3x3 paper on a 32x32 grid: 32 - 3 + 1 = 30 -> the grid becomes 30x30
The boss rule is unchanged either way: MaxPooling2D((2,2)) halves across and down (drop
any half-dot), depth untouched. Trace a no-padding factory all the way down:
after Conv1 (32 papers, no pad) 32 - 3 + 1 = 30 -> (30, 30, 32)
after Pool1 (boss halves) 30 / 2 = 15 -> (15, 15, 32)
after Conv2 (64 papers, no pad) 15 - 3 + 1 = 13 -> (13, 13, 64)
after Pool2 (boss halves) 13 / 2 = 6 -> ( 6, 6, 64) (drop the half)
Two traps worth flagging. First: a Conv floor does NOT shrink the grid by halving -- that
is the boss's job on the next line. With no padding a Conv floor shrinks only by the small
-2 the formula gives (30, not 16). Second: the depth after a Conv floor is the number of
inspectors hired there (32, then 64), never the colour count -- colour meaning died at
floor 1.
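Both conventions can be traced with a few lines of plain Python. A minimal sketch assuming papers that step one spot at a time and a 2x2 boss; conv_size, pool_size and trace are my own helper names.

```python
def conv_size(size, paper=3, padding="valid"):
    """Across/down size after a Conv floor whose paper steps one spot at a time."""
    return size if padding == "same" else size - paper + 1

def pool_size(size, block=2):
    """Across/down size after the shrink boss; any half-dot is dropped."""
    return size // block

def trace(size, padding):
    """Sizes after Conv1, Pool1, Conv2, Pool2 for the two-floor factory."""
    sizes = []
    for _ in range(2):
        size = conv_size(size, padding=padding)
        sizes.append(size)
        size = pool_size(size)
        sizes.append(size)
    return sizes

print(trace(32, "valid"))   # [30, 15, 13, 6]
print(trace(32, "same"))    # [32, 16, 16, 8]
```

Depth is left out on purpose: it is always the number of inspectors on the last Conv floor, whatever the padding.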
>> YOUR TURN
A bare 3x3 paper (no padding) scans a 28x28 grid. Then a 2x2 boss halves it. Give both
output sizes (ignore depth).
check your slate:
Conv: 28 - 3 + 1 = 26, so 26x26. Boss: 26 / 2 = 13, so 13x13.
## Common Tripwires
Real snags from building this, each fixed at the spot it bites.
!! WARN: ONE PAPER OUT IS ONE NUMBER, NOT A COLOUR TRIPLET
An inspector's paper is three deep (27 dials), but all 27 multiply-adds collapse to ONE
number per spot. People expect a red-guess, green-guess, blue-guess triplet. No -- one
number per spot per inspector. A whole floor of 32 inspectors gives 32 numbers per spot,
and those are 32 different shape-answers, not colours.
!! WARN: THE PAPER IS ALWAYS SMALL -- NEVER THE SHEET'S SIZE
On floor 2 the sheets are 16x16, but the paper is still 3x3, not 16x16. A 16x16 paper could
not move -- one position, no neighbourhoods, no scan. Only the paper's DEPTH grows to match
the incoming sheets; its width stays the typed (3,3).
!! WARN: POOLING DOES NOT KEEP COORDINATES
The shrink boss writes a fresh, smaller, FULL sheet -- no holes, no kept dot-addresses.
Each 2x2 block writes its loudest value at the block's position. Which corner it came from
is deliberately forgotten. You are not meant to track where the survivor sat.
!! WARN: COUNT THE PAPER DEPTH FROM WHAT ARRIVES, NOT WHAT YOU TYPE
Conv2D(64, (3,3)) on a 16x16x32 stack makes papers that are 3x3x32 = 288 dials each, not
3x3 = 9. The depth (32) is invisible in the code because it auto-matches the input. Forget
it and your hand dial-count comes out wildly too low.
!! WARN: FLATTEN HAS NO DIALS
Flatten is re-shelving, not arithmetic. It moves zero dials. If your hand count attributes
any dials to the flatten step, you have miscounted -- the dials are all in the Dense floor
that READS the flattened line.
## Standard Names for Everything Above
Plain term used above                   Standard label
--------------------------------------  -------------------------------------------
magic paper                             filter / kernel
inspector                               a single convolutional filter's output channel
a floor of inspectors                   a convolutional layer (Conv2D)
sliding the paper one spot at a time    the convolution / stride 1
border of zeros to reach the edge       padding='same'
score-sheet                             feature map
shrink boss, keep the loudest of 4      max pooling (MaxPooling2D)
ironing the grid flat                   Flatten
a clerk wired to every number           a Dense (fully-connected) neuron
ten scores into ten chances             softmax
depth auto-matches the input            channel dimension
one small paper reused everywhere       weight sharing / parameter sharing
## Code, If You Want It
The whole factory, the same steps spoken in Keras. Nothing here needs a machine to follow
-- the sections above did it all by hand -- but here it is for the day you meet one.
>> NEW TO PYTHON? Each named once:
reshape(-1, 32, 32, 3) -- fold the flat 3072 into a 32x32x3 grid
Conv2D(32, (3,3), padding='same') -- a floor of 32 inspectors, 3x3 papers, edges reachable
MaxPooling2D((2,2)) -- the shrink boss: keep the loudest of each 2x2 block
Flatten() -- iron the deep grid into one flat line
Dense(64, activation='relu') -- 64 plain clerks reading the whole line
Dense(10, activation='softmax') -- ten-chance exit, chances add to 1
count_params() -- the machine's own dial count (should be 282,250)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
model_cnn = Sequential([
    Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),          # 10 bins, chances add to 1
])
model_cnn.compile(
    loss='sparse_categorical_crossentropy',   # answer key is ONE integer 0..9
    optimizer='adam',
    metrics=['accuracy'],
)
print(model_cnn.count_params()) # 282,250 -- matches our hand count
!! WARN: input_shape ONLY ON THE FIRST FLOOR
The first Conv2D must be told the grid size, input_shape=(32, 32, 3). The later floors
work out their own input size from what the floor below hands them -- never type it twice.
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================