==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
CHAPTER 10 . MACHINES THAT READ WORDS . PART 1 OF 2
Words Into a Machine: The Notepad and the Walking Worker
Posted: 2026-06-13 . Author: Rahul Rai . Tags: rnn, embedding, nlp, sequence-models
============================================================================================
A brand-new pile. Chapter 9's photos were ALREADY numbers -- a pixel is a brightness from
0 to 1, ready to eat. This pile is WORDS: movie reviews, each marked 1 (the writer liked
it) or 0 (did not). The goal is the old goal in new clothes -- read a NEW review, guess
1 or 0. One yes/no per review. (The textbook calls this "sentiment": thumbs up or down.)
The wall is right at the door. A factory of clerks and dials can only multiply NUMBERS.
It cannot multiply the word "boring". So the whole front of this post is one job: turn
words into numbers WITHOUT throwing away what the words mean. Then we build the first
text-factory -- three floors -- and walk one review through the middle floor by hand,
one word at a time, until a running memory falls out the end.
Pencil and scratch paper out. Every number here is recomputed where it is needed.
## The New Sheet, and the Wall
The sheet:
text (a movie review, in words)                  answer
-----------------------------------------------  ------
"complete waste of my evening"                   0  (did not like it)
"an absolute masterpiece, loved every minute"    1  (liked it)
... (thousands of reviews)
Each ROW is one review. The answer column is 1 or 0. We want a rule that reads the words
of a new review and guesses the 1 or 0.
The wall, stated plainly against Chapter 9:
Chapter 9 photos were ALREADY numbers (pixels 0..1, ready to eat)
Chapter 10 reviews are WORDS ("boring", "masterpiece") -- NOT numbers
a factory of clerks + dials can only multiply NUMBERS, not the word "boring"
So: number the words first. Three moves -- a numbering, a length-fix, and a split.
## Words Are Not Numbers, So Hand Each One a Number (Q1, Q2a-b)
Walk every review in the pile and COUNT how often each word appears. (This is the oldest
trick in the book -- count every word in a giant pile of text. The most-common word gets
the smallest number.)
"the" appears most -> 1
"movie" next -> 2
"was" -> 3
"boring" -> 4
...
Now a CHOICE (Charter Law 7 -- name it as a choice, not a fact): keep only the 10,000
most-common words. Every rarer word -- a typo, a name, weird slang seen twice in the
whole pile -- is dumped into ONE shared bin we will call <unknown>. The bin gets its own
number, and that number is NOT 0: 0 is reserved for "empty" (next move). The point stands:
reviews in the pile = maybe 50,000 <- ROWS of the sheet
distinct words KEPT = 10,000 <- a knob WE set (call it VOCAB_SIZE)
every rarer word -> one shared <unknown> number
Why toss the rare words? A word seen twice in 50,000 reviews cannot teach the machine
anything reliable -- there is no pattern in two sightings. The 10,000 is not sacred; it
is a knob. Bigger keeps more vocabulary and costs more dials; smaller is leaner.
WATCH THE TRAP: 10,000 is how many distinct WORDS we keep. It is NOT the number of
reviews. Reviews are the rows (maybe 50,000); kept words are the vocabulary (10,000).
Two different counts that both happen to be in the tens of thousands.
After this, the review "boring movie the" becomes [4, 2, 1] -- three numbers, one
per word, each number just a NAME TAG pointing at a word.
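The numbering above can be sketched in a few lines of plain Python. This is a toy under
stated assumptions, not the lab's tokenizer: the helper names `build_vocab` and `encode`,
the six-review pile, and the choice to give <unknown> the number VOCAB_SIZE + 1 are all
made up for illustration (the post only says <unknown> gets its own number).

```python
from collections import Counter

def build_vocab(reviews, vocab_size):
    """Count every word, then hand out numbers 1, 2, 3, ... by falling frequency."""
    counts = Counter(word for review in reviews for word in review.lower().split())
    ranked = counts.most_common(vocab_size)          # keep only the most common words
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(review, vocab, unknown_id):
    """Swap each word for its name-tag number; rare words share one number."""
    return [vocab.get(word, unknown_id) for word in review.lower().split()]

# A made-up pile: "the" x5, "movie" x4, "was" x3, "boring" x2, "zzyzx" x1.
toy_pile = [
    "the movie was boring",
    "the movie was boring",
    "the movie was",
    "the movie",
    "the",
    "zzyzx",
]

VOCAB_SIZE = 4                      # a knob (10,000 in the post)
UNKNOWN_ID = VOCAB_SIZE + 1         # assumption: any number that is not 0 and not a kept word

vocab = build_vocab(toy_pile, VOCAB_SIZE)
print(vocab)                                          # {'the': 1, 'movie': 2, 'was': 3, 'boring': 4}
print(encode("boring movie the", vocab, UNKNOWN_ID))  # [4, 2, 1]
print(encode("boring zzyzx", vocab, UNKNOWN_ID))      # [4, 5]
```

The word "zzyzx" was seen once, missed the cut, and lands in the shared bin.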
## A Fixed Factory Needs One Length, So Pad Every Review (Q2c)
IN HAND: every word now has a number (the=1, boring=4, ...); 10,000 words kept, the rest
pooled as <unknown>. A review is now a row of word-numbers of WHATEVER length it was.
Here is the problem. The factory has a FIXED count of dials. It cannot eat an 8-number
review and a 200-number review with the same dials -- the shapes do not match. So force
every review to exactly the same length. Call it MAX_LEN = 100 numbers (a knob).
review too LONG (200 words) -> chop everything past 100 (keep the first 100)
review too SHORT (8 words) -> fill the end with 0s up to 100
Worked tiny example. "boring film" -> [4, 17] -- two real numbers. Pad to length 6:
[4, 17, 0, 0, 0, 0] <- 2 real word-numbers, then 4 zeros
The 0 means "empty slot, nothing here." That is why 0 was reserved and never handed to a
real word. After this EVERY review is a line of exactly 100 numbers. Now the factory
can eat them, all the same shape.
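The chop-or-fill rule fits in one small function. A minimal sketch, assuming the same
choices as the text (keep the FIRST words, put the zeros at the END); the name
`pad_or_chop` is made up for illustration.

```python
def pad_or_chop(numbers, max_len, pad_id=0):
    """Force a row of word-numbers to exactly max_len entries."""
    kept = numbers[:max_len]                          # too long: keep the first max_len
    return kept + [pad_id] * (max_len - len(kept))    # too short: fill the end with 0s

print(pad_or_chop([4, 17], 6))   # [4, 17, 0, 0, 0, 0]
```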
>> YOUR TURN
MAX_LEN = 5. The review "the movie was boring" numbers to [1, 2, 3, 4].
What is the padded row? And what does "the the the the the the the" (7 words) become?
check your slate:
[1, 2, 3, 4] has 4 numbers, need 5 -> pad one zero -> [1, 2, 3, 4, 0].
"the" x7 = [1,1,1,1,1,1,1], 7 numbers, need 5 -> chop to first 5 -> [1, 1, 1, 1, 1].
## Never Grade on Studied Cards, So Cut Study From Exam (Q2d)
Same honest rule as every chapter: never grade the machine on cards it studied. Cut 80%
to study from, seal 20% as the exam. The exam reviews are not looked at until the very
end. (Earlier chapters split THREE ways -- study, practice, exam. This lab uses the test
pile directly as its exam, so two piles here. The principle is identical: the graded pile
is sealed.)
After all three steps: every review = a line of exactly 100 word-numbers, the pile cut
into a study 80% and a sealed exam 20%. The factory can finally be built.
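The cut itself is only a shuffle and a slice. A hand-rolled sketch for intuition, with a
made-up helper name `split_study_exam` and an arbitrary seed; the code section at the end
of this post uses scikit-learn's `train_test_split` for the real thing.

```python
import random

def split_study_exam(rows, exam_fraction=0.2, seed=0):
    """Shuffle once, then seal the first slice as the exam pile."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)     # fixed seed so the cut is repeatable
    n_exam = round(len(shuffled) * exam_fraction)
    return shuffled[n_exam:], shuffled[:n_exam]

study, exam = split_study_exam(range(50))     # 50 stand-in review ids
print(len(study), len(exam))                  # 40 10
```

No id appears in both piles, which is the whole point of the seal.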
## The First Text-Factory: Three Floors (Q3)
IN HAND: every review is now 100 word-numbers (e.g. [4, 73, 1, 87, 0, 0, ...]), 0 = empty.
We feed that line of 100 numbers into a factory with three floors:
100 word-numbers in
        |
floor 1: NOTEPAD        swap each word-number for a row of meaning-numbers
        |
floor 2: ONE WORKER     read the rows IN ORDER, carry a running memory
        |
floor 3: FINAL CLERK    read the last memory -> one chance, liked (1) / not (0)
Each floor gets its own section. Floor 1 fixes the name-tag problem; floor 2 is the heart
of the chapter; floor 3 is the same S-curve clerk from Chapter 7.
## Floor 1: The Notepad (Embedding)
The word-number is just a NAME TAG. boring=4, masterpiece=73 -- the size of the number
means NOTHING. Feed it raw to a clerk and masterpiece (73) would count 73 times heavier
than "the" (1), which is pure nonsense. A name tag is not a measurement.
So instead of feeding the bare number, jot a NOTE about each word -- a little row of
real numbers that says what the word is like. Start with one number per word, a LEAN:
"boring" appeared in 1000 reviews, 900 of them rated 0 -> note: leans DOWN
"masterpiece" -> note: leans UP
"the" shows up everywhere, no lean -> note: ~middle (useless)
One number per word = a notepad 10,000 rows tall and 1 wide. That single LEAN number is
already an embedding, at width 1.
But one number per word is too thin. Here is the reason, and it is the whole reason the
note is WIDE: the word "not" does not lean up or down on its own -- its whole job is to
FLIP the next word. "not boring" is a compliment. A single lean-number cannot say "I am a
flipper." A word needs several slots:
"boring" note: lean = down, flipper = no, amplifier = no, ...
"not" note: lean = none, flipper = YES, amplifier = no, ...
"very" note: lean = none, flipper = no, amplifier = YES, ...
So make each note 32 numbers wide (call the width EMBEDDING_DIM = 32, a knob). The notepad
is now:
10,000 word-rows x 32 numbers each = 320,000 dials
The word-number just says WHICH ROW to read. Word 4 -> go to row 4 -> copy its 32 numbers.
Two things people get wrong about the notepad:
(1) The note slots are NOT labelled by a human. Nobody writes "slot 3 = flipper." The
notepad starts as 320,000 random junk numbers. The machine FILLS it itself, by the same
dial-turning as every other floor -- nudged by wrongness over many study reviews until
"boring" and "dull" drift to similar notes and "masterpiece" sits far away. The notepad
is WORKSPACE the machine shapes, exactly like Chapter 9's dials -- it is NOT read from
the data.
(2) The notepad is dials, so it counts. 320,000 of them. That single floor already holds
more dials than the rest of the factory combined, as we will see.
One review through floor 1:
[4, 73, 1, 87, 0, ...] -> look up each number's row -> stack 100 notes
-> a SHEET 100 rows tall x 32 wide
A picture (showing 4 of the 100 rows, 5 of the 32 columns):
word-number     note (32 numbers, only 5 shown)
-----------     --------------------------------
 4  boring      [ -0.8   0.1   0.0   0.2  -0.3  ... ]
73  master      [  0.9  -0.1   0.4   0.0   0.7  ... ]
 1  the         [  0.0   0.0  -0.1   0.0   0.1  ... ]
87  film        [  0.2   0.3   0.0  -0.2   0.0  ... ]
 0  EMPTY       [  0.0   0.0   0.0   0.0   0.0  ... ]   (slot 0 stays all-zero)
That stacked sheet -- 100 rows of 32 -- is what floor 2 reads.
>> YOUR TURN
Notepad width 4 (not 32). Three words kept. "good" = row 2 with note
[0.9, 0.0, 0.1, -0.2]. A one-word review numbers to [2]. What sheet does floor 1 produce,
and how tall x wide is it if MAX_LEN = 3?
check your slate: MAX_LEN 3 means pad [2] to [2, 0, 0]. Look up each:
row 2 -> [0.9, 0.0, 0.1, -0.2]
row 0 -> [0.0, 0.0, 0.0, 0.0] (empty)
row 0 -> [0.0, 0.0, 0.0, 0.0] (empty)
The sheet is 3 rows tall x 4 wide.
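Floor 1 is nothing more than "go to that row, copy it." A minimal sketch of the lookup
using the exercise's numbers; only row 0 and row 2 come from the exercise, rows 1 and 3
are made-up filler, and `look_up` is a hypothetical helper name.

```python
def look_up(word_numbers, notepad):
    """Swap each word-number for a copy of its row in the notepad."""
    return [list(notepad[number]) for number in word_numbers]

notepad = [
    [0.0, 0.0, 0.0, 0.0],     # row 0: the EMPTY slot
    [0.3, -0.5, 0.2, 0.1],    # row 1: made-up filler
    [0.9, 0.0, 0.1, -0.2],    # row 2: "good" (from the exercise)
    [-0.4, 0.6, 0.0, 0.3],    # row 3: made-up filler
]

sheet = look_up([2, 0, 0], notepad)        # the padded review
for row in sheet:
    print(row)
print(len(sheet), "rows x", len(sheet[0]), "wide")   # 3 rows x 4 wide
```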
## Floor 2: The Walking Worker (the RNN Cell, by Hand)
IN HAND: floor 1 turned the review into a sheet of 100 notes, each note 32 numbers wide.
Floor 2 must boil that whole sheet down to ONE summary of the review. This is the heart
of the chapter, so we walk it by hand, one word at a time.
WHY A MEMORY AT ALL. "not boring" and "boring not" are made of the SAME two words, yet
they mean opposite things. ORDER carries meaning. To let "not" flip "boring", the machine
must still REMEMBER it saw "not" when it reaches "boring":
read "not" -> memory: "a flip is pending"
read "boring" -> boring leans down, BUT a flip is pending -> push the verdict UP
So floor 2 keeps a MEMORY: a small row of 32 numbers that it rewrites after every word.
Think of two pockets on the worker's apron:
pocket A = the MEMORY (32 numbers; before the first word it is ALL ZEROS)
pocket B = the DIALS (frozen during one read-through; only change between
training loops)
The dials in pocket B are two papers and a nudge, REUSED at every single word:
word-dials = a 32 x 32 paper (for the current word's 32 note-numbers)
memory-dials = a 32 x 32 paper (for the old memory's 32 numbers)
nudge = a row of 32 numbers
The word-dials have nothing to do with any particular word -- they are general, the same
for "boring" as for "the". Now the walk.
WORD 1 -- "nolan" (first word, so the old memory is all zeros):
1. see "nolan" -> read its row in the notepad -> pull its 32 numbers
2. WORD-part: nolan's 32 x word-dials (32x32) -> 32 numbers
(row 1 of the paper, times nolan's 32, added up -> new number 1;
row 2 times the same 32 -> number 2; ... 32 rows -> 32 numbers)
3. MEMORY-part: old memory (all zeros) x memory-dials -> all zeros
(anything times zero is zero -- the first word has nothing behind it)
4. ADD them: word-part (32) + memory-part (zeros) -> 32 numbers
5. + NUDGE: add the nudge (32 numbers) -> 32 numbers
6. SQUASH: push each through tanh (crush to between -1 and +1) -> 32 numbers
That result is the NEW memory. Put it in pocket A, replacing the zeros.
(tanh is the squash from Chapter 7's family -- it crushes any number into the band -1 to
+1: a big positive -> near +1, a big negative -> near -1, zero -> 0. One crush, at the end.)
WORD 2 -- "ended" (now the old memory is NOT zeros):
1. see "ended" -> notepad -> its 32 numbers
2. WORD-part: ended's 32 x the SAME word-dials -> 32
3. MEMORY-part: pocket A (nolan's leftover memory) x the SAME memory-dials -> 32
(this time NOT zeros, so the past actually contributes)
4. add + nudge + tanh -> the newer memory
5. put it back in pocket A
Same dials as word 1 -- reused. That reuse -- one dial-set looping back over word after
word -- is the whole idea, and it is why the textbook calls this RECURRENT (recurring:
the same dials come round again every word). The one recipe to memorise:
new memory = tanh( word's 32 x word-dials   +   old memory x memory-dials   +   nudge )
                   \_ this word folded in _/    \_ the past carried forward _/
Walk all 100 words this way. After the last word, pocket A holds a 32-number SUMMARY of
the whole review, with order baked in.
ONE WORKER, NOT ONE-PER-WORD. The single most common wrong picture:
NOT 100 workers, one per word
YES 1 worker, who walks word -> word -> word 100 times, SAME dials every word
This is exactly Chapter 9's reuse in a new suit: there, one inspector slid his ONE magic
paper to every spot on the photo; here, one worker slides his ONE dial-set across all 100
words. Because the dials are reused, a 100-word review and a 10-word review need the SAME
dials -- one worker, one dial-set, just more steps for the longer review.
How this differs from Chapter 9's inspector:
inspector (Ch.9): each spot is independent -- no memory between spots
worker (RNN): carries memory FORWARD -- word 2 uses word 1's leftover memory
The dial count for this floor, by hand:
word-dials      32 x 32  =  1,024
memory-dials    32 x 32  =  1,024
nudge                32  =     32
total                    =  2,080
(Keras reports a SimpleRNN(32) sitting on a 32-wide note as exactly 2,080 -- it counts
(32 + 32) x 32 + 32. Same arithmetic, same answer.)
>> YOUR TURN
Memory width 2 (not 32). Old memory = [0, 0] (first word). The word's 2 numbers are
[1, 2]. word-dials = [[1, 0], [0, 1]] (the do-nothing paper), memory-dials anything,
nudge = [0, 0]. tanh(1) ~ 0.76, tanh(2) ~ 0.96. What is the new memory?
check your slate:
WORD-part: row1 . [1,2] = 1x1 + 0x2 = 1 ; row2 . [1,2] = 0x1 + 1x2 = 2 -> [1, 2]
MEMORY-part: old memory is [0,0], so anything x it = [0, 0]
add + nudge: [1, 2] + [0, 0] + [0, 0] = [1, 2]
squash: [tanh(1), tanh(2)] ~ [0.76, 0.96]. New memory ~ [0.76, 0.96].
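The one recipe, written out so the slate can be checked by machine. A sketch in plain
Python at width 2: the helper names (`mat_vec`, `rnn_step`, `walk`), the memory-dials
values and the second word's note are made up for illustration; the first word is the
exercise above.

```python
import math

def mat_vec(paper, row):
    """Each row of the paper, times the incoming numbers, added up -> one new number."""
    return [sum(d * x for d, x in zip(paper_row, row)) for paper_row in paper]

def rnn_step(note, old_memory, word_dials, memory_dials, nudge):
    """new memory = tanh(word-part + memory-part + nudge)."""
    word_part = mat_vec(word_dials, note)
    memory_part = mat_vec(memory_dials, old_memory)
    return [math.tanh(w + m + b) for w, m, b in zip(word_part, memory_part, nudge)]

def walk(notes, word_dials, memory_dials, nudge):
    """ONE worker, the SAME dials at every word; the memory starts as all zeros."""
    memory = [0.0] * len(nudge)
    for note in notes:
        memory = rnn_step(note, memory, word_dials, memory_dials, nudge)
    return memory

word_dials = [[1.0, 0.0], [0.0, 1.0]]       # the do-nothing paper from the exercise
memory_dials = [[0.5, -0.3], [0.2, 0.7]]    # "anything": made-up values
nudge = [0.0, 0.0]

first = rnn_step([1.0, 2.0], [0.0, 0.0], word_dials, memory_dials, nudge)
print([round(v, 2) for v in first])          # [0.76, 0.96]

# Two words in a row: the second step reuses the same three dial objects.
summary = walk([[1.0, 2.0], [0.5, -1.0]], word_dials, memory_dials, nudge)
print(summary)
```

`walk` takes a sheet of any height, which is the code version of "a long review and a
short review need the SAME dials."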
## Why the Loop Needs the tanh Squash
IN HAND: the worker rewrites its 32-number memory once per word, 100 times, each time
ending with tanh. Why end with a squash at all? Here is the problem, made visible.
A squash is a FIXED curve that bends ONE number into a tidy range, alone -- no dials,
nothing learned, no looking at neighbours. Three squashes appear across this book:
relu     (Chapter 7/9 clerks):      negative -> 0, keep positive    range 0 to infinity
tanh     (this worker's memory):    crush                           range -1 to +1
sigmoid  (the final clerk):         crush                           range 0 to 1
tanh by the numbers: 0 -> 0 ; 1 -> 0.76 ; 2 -> 0.96 ; 5 -> ~1.0 ; -5 -> ~-1.0
Now the problem. Take the memory and multiply it by the memory-dials 100 times in a row
with NO squash between. Suppose one slot's dial run multiplies by 1.5 each word:
1.5 x 1.5 x ... (100 times) = 1.5^100 ~ 4 x 10^17, an 18-digit number -- it EXPLODES.
And if the dial run multiplies by 0.5 each word instead:
0.5^100 ~ 8 x 10^-31, essentially zero -- the memory DIES.
A 100-step loop with no crusher either blows up to nonsense or collapses to nothing. The
tanh squash pins every memory number between -1 and +1 at EVERY word, so after 100 words
the memory is still a tidy bounded row. The squash is what keeps the long loop polite --
it is not decoration; without it the worker is unusable. (It caps the blow-up only. A
memory that keeps shrinking still fades, and that forgetting is Part 2's problem.)
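The blow-up and the die-off are easy to watch on a single memory slot. A toy sketch, not
the real worker: one number, one dial, no word coming in, and a made-up helper `run_loop`.

```python
import math

def run_loop(start, dial, steps, squash):
    """Multiply one memory number by the same dial at every step, optionally tanh-ing it."""
    value = start
    for _ in range(steps):
        value = dial * value
        if squash:
            value = math.tanh(value)
    return value

print(run_loop(1.0, 1.5, 100, squash=False))   # roughly 4.07e+17: explodes
print(run_loop(1.0, 0.5, 100, squash=False))   # roughly 7.89e-31: dies
print(run_loop(1.0, 1.5, 100, squash=True))    # stays inside -1..+1
print(run_loop(1.0, 0.5, 100, squash=True))    # still shrinks toward 0
```

The third line is the rescue: same dial of 1.5, same 100 steps, but the number never
leaves the band. The fourth line shows the limit of the squash: it caps growth, it does
not stop a shrinking memory from fading.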
ONE THING THE SQUASH IS NOT: a humbler. Chapter 9's humbler (batch norm) also keeps
numbers in check, but it PEEKS at the whole handful of 64 to re-centre, it has learned
dials (stretch, shift), and it keeps a diary. The tanh squash peeks at no one, learns
nothing, keeps no diary -- it bends each number through the same fixed curve, alone, every
time. Humbler = learned, crowd-aware re-centring. Squash = dumb, fixed, solo crusher.
## Floor 3: The Final Clerk (Dense + S-curve)
IN HAND: floor 2 walked all 100 words and left a 32-number summary in pocket A.
One clerk reads those 32 numbers, multiplies each by its dial, adds them and a nudge into
one running total, then squashes that total through the S-curve into a chance between 0
and 1 (the same S-curve derived from odds in Chapter 7):
big positive total -> near 1 (liked)
zero total -> 0.5 (on the fence)
big negative total -> near 0 (did not like)
Dials here: 32 (one per memory-number) + 1 nudge = 33.
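The final clerk is one weighted sum and one S-curve. A sketch at width 2 instead of 32;
the memory, the dial values and the name `final_clerk` are made up for illustration.

```python
import math

def sigmoid(total):
    """The S-curve: any running total -> a chance between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-total))

def final_clerk(memory, dials, nudge):
    """Multiply each memory number by its dial, add the nudge, squash to a chance."""
    total = sum(m * d for m, d in zip(memory, dials)) + nudge
    return sigmoid(total)

print(sigmoid(0.0))      # 0.5  (on the fence)
print(sigmoid(10.0))     # near 1 (liked)
print(sigmoid(-10.0))    # near 0 (did not like)

chance = final_clerk([0.76, 0.96], dials=[1.0, -1.0], nudge=0.0)
print(chance)            # a little under 0.5, because the total is -0.2
```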
The whole factory's dial count, recomputed:
floor 1  notepad                   10,000 x 32  =  320,000
floor 2  walking worker (above)                 =    2,080
floor 3  final clerk                    32 + 1  =       33
total                                           =  322,113
Just as in Chapter 9's factory, one floor hogs the dials -- there it was the Dense floor,
here it is the notepad (320,000 of 322,113, about 99%). The famous "recurrent" worker is
a rounding error in the dial budget. The cost is not in the dials; it is in the WALK --
100 steps per review, in order, one after another, which is the bottleneck Part 2's
cousins inherit and the Transformer (a later chapter) finally breaks.
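The dial budget is three small formulas, so it is worth recomputing whenever a knob
moves. A sketch with made-up function names; the arithmetic is exactly the by-hand count
above.

```python
def notepad_dials(vocab_size, note_width):
    return vocab_size * note_width                      # one row of notes per kept word

def worker_dials(note_width, memory_width):
    # word-dials + memory-dials + nudge
    return note_width * memory_width + memory_width * memory_width + memory_width

def clerk_dials(memory_width):
    return memory_width + 1                             # one dial per memory number + nudge

floor1 = notepad_dials(10_000, 32)
floor2 = worker_dials(32, 32)
floor3 = clerk_dials(32)
total = floor1 + floor2 + floor3

print(floor1, floor2, floor3, total)    # 320000 2080 33 322113
print(round(floor1 / total, 3))         # 0.993
```

Halving VOCAB_SIZE to 5,000 would cut the total almost in half; halving the worker's
width would barely register.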
## Common Tripwires
Built from the live lab session -- every confusion actually hit.
TRIPWIRE 1 "10,000 is the number of reviews."
No. 10,000 is the number of distinct WORDS kept (the vocabulary, a knob). The reviews
are the ROWS -- maybe 50,000. Two different counts.
TRIPWIRE 2 "The word-number is a measurement -- bigger means more."
No. It is a NAME TAG. boring=4, masterpiece=73 carry no size-meaning; 73 is not "bigger"
than 4 in any useful sense. That is exactly why floor 1 swaps the tag for a learned note.
TRIPWIRE 3 "The notepad comes from the data."
No. The notepad starts as random junk and the machine FILLS it itself by dial-turning,
pulling similar words close. It is workspace, like every other dial -- not a lookup
table read off the reviews.
TRIPWIRE 4 "Floor 2 is 100 workers, one per word."
No. ONE worker, walking 100 steps, reusing the SAME dial-set every step. That reuse is
what 'recurrent' means, and it is why a long review and a short review share one dial-set.
TRIPWIRE 5 "Why carry a memory? Just average the word-notes."
Averaging throws away ORDER, and order is the whole point. "not boring" and "boring not"
average to the identical thing; only a memory that remembers "not" when it reaches
"boring" can tell them apart.
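TRIPWIRE 6 "Padding with zeros adds fake words the machine learns from."
The 0 slot is reserved as EMPTY, and in the picture above its notepad row is all-zero, so
a padded slot hands the worker a row of zeros. Padding exists only to make every review
the same shape so the fixed dials can eat them. One caution for the code: Keras keeps
row 0 out of play only if the Embedding is built with mask_zero=True, which makes the
worker skip the padded steps. Without it, row 0 is a learned row like any other and the
worker still walks the padded slots.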
TRIPWIRE 7 "Embedding width 1 (just a lean) would do."
One number cannot encode "I am a flipper" (not) or "I am an amplifier" (very) separately
from a lean. Width 32 gives the word room to be several things at once. Width is a knob.
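Tripwire 5 can be checked with two words and a width-2 note. A toy sketch: the notes for
"not" and "boring" and every dial value are made up, and the helpers (`mat_vec`, `walk`,
`average`) are hypothetical names. It only shows that averaging cannot see order while
the walking memory can; it does not show that these particular dials read the flip
correctly.

```python
import math

def mat_vec(paper, row):
    return [sum(d * x for d, x in zip(paper_row, row)) for paper_row in paper]

def walk(notes, word_dials, memory_dials, nudge):
    """One worker, same dials every word, memory starts at zeros."""
    memory = [0.0] * len(nudge)
    for note in notes:
        word_part = mat_vec(word_dials, note)
        memory_part = mat_vec(memory_dials, memory)
        memory = [math.tanh(w + m + b) for w, m, b in zip(word_part, memory_part, nudge)]
    return memory

def average(notes):
    """Column-by-column mean of the notes: order plays no part."""
    return [sum(column) / len(notes) for column in zip(*notes)]

NOT = [0.0, 1.0]        # made-up note: no lean, flipper slot on
BORING = [-0.8, 0.0]    # made-up note: leans down, not a flipper

word_dials = [[1.0, 0.0], [0.0, 1.0]]
memory_dials = [[0.5, -1.0], [0.0, 0.5]]
nudge = [0.0, 0.0]

print(average([NOT, BORING]) == average([BORING, NOT]))        # True
print(walk([NOT, BORING], word_dials, memory_dials, nudge)
      == walk([BORING, NOT], word_dials, memory_dials, nudge))  # False
```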
## The Labels, Last
pile of words, 1/0 per row        sentiment classification (binary)
number each word by frequency     tokenisation by frequency rank
10,000 kept words                 vocabulary size (VOCAB_SIZE / num_words)
shared rare-word bin              <unknown> / out-of-vocabulary (OOV) token
same-length fix                   padding / truncating (pad_sequences)
empty slot = 0                    pad token (id 0), mask
MAX_LEN = 100                     sequence length
notepad                           embedding layer
one note (32 numbers)             word vector / embedding vector
note width 32                     embedding dimension (EMBEDDING_DIM)
walking worker, one memory        SimpleRNN (recurrent layer)
pocket A (memory)                 hidden state h_t
word-dials / memory-dials         input weights W / recurrent weights U
reuse same dials each word        weight sharing across time steps
crush to -1..+1                   tanh activation
final clerk + S-curve             Dense(1) + sigmoid
wrongness for yes/no              binary cross-entropy
downhill-roller                   Adam optimiser
## Code, If You Want It
Nothing above needed a computer; this section is for the day you meet one.
Open the pile and check the balance (Q1):
```python
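# load_data and DATA_PATH are not defined in this post; they are assumed to come from
# the lab's starter code. Any loader returning a DataFrame with these columns will do.
df = load_data(DATA_PATH)                            # CSV with columns 'text' and 'label'
q1_shape = df.shape                                  # (how many reviews, 2)
q1_dist = df['label'].value_counts().sort_index()    # how many 0s, how many 1s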
```
Number the words, cap the vocabulary, pad to one length, split (Q2):
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

VOCAB_SIZE = 10000     # keep the 10,000 most common words (a knob)
MAX_LEN = 100          # force every review to 100 numbers (a knob)
RANDOM_STATE = 42      # assumption: any fixed seed; use the lab's value if it sets one

tok = Tokenizer(num_words=VOCAB_SIZE, oov_token="<unk>")
tok.fit_on_texts(df['text'])                 # count words, rank by frequency
# Keras gives the oov_token index 1, so here the most common word is 2, not 1 as in the prose.
seqs = tok.texts_to_sequences(df['text'])    # words -> name-tag numbers
X = pad_sequences(seqs, maxlen=MAX_LEN,
                  padding='post', truncating='post')   # 0 = empty slot
y = df['label'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE)
```
Build the three-floor text-factory (Q3):
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, SimpleRNN, Dense

EMBEDDING_DIM = 32

model_rnn = Sequential()
model_rnn.add(Input(shape=(MAX_LEN,)))                  # 100 word-numbers in
model_rnn.add(Embedding(input_dim=VOCAB_SIZE,
                        output_dim=EMBEDDING_DIM))      # floor 1: the notepad
model_rnn.add(SimpleRNN(32))                            # floor 2: walking worker
model_rnn.add(Dense(1, activation="sigmoid"))           # floor 3: final yes/no
model_rnn.compile(optimizer="adam",
                  loss="binary_crossentropy", metrics=["accuracy"])
```
- `binary_crossentropy` = wrongness for a yes/no guesser: -log(the chance given to the
true answer). Truth 1, said 0.9 -> tiny; said 0.1 -> huge. (Chapter 7 derives it.)
- `adam` = the downhill-roller from Chapter 7, Part 2.
- `model_rnn.summary()` prints 322,113 total dials -- 320,000 in the notepad alone.
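The wrongness in the first bullet, for a single review, by machine. A sketch of the bare
formula only, with a made-up function name; the Keras loss applies the same formula,
averaged over a batch of reviews.

```python
import math

def wrongness(truth, chance_of_1):
    """-log(the chance the machine gave to the TRUE answer)."""
    chance_given_to_truth = chance_of_1 if truth == 1 else 1.0 - chance_of_1
    return -math.log(chance_given_to_truth)

print(round(wrongness(1, 0.9), 3))   # 0.105  truth 1, said 0.9 -> tiny
print(round(wrongness(1, 0.1), 3))   # 2.303  truth 1, said 0.1 -> huge
print(round(wrongness(0, 0.1), 3))   # 0.105  truth 0, said 0.1 -> tiny again
```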
Training this factory, watching it forget far-back words, and the two-memory fix that
rescues it (the LSTM) are Part 2.
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================