Words Into a Machine: The Notepad and the Walking Worker (RNNs by Hand, Part 1)

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================

  CHAPTER 10 . MACHINES THAT READ WORDS . PART 1 OF 2
  Words Into a Machine: The Notepad and the Walking Worker
  Posted: 2026-06-13 . Author: Rahul Rai . Tags: rnn, embedding, nlp, sequence-models
  ============================================================================================

  PATH . post 27 of 28
    <- prev:  Chapter 9, Part 2: The Deep Factory
       next:  Chapter 10, Part 2: The Two-Memory Worker ->

  A brand-new pile. Chapter 9's photos were ALREADY numbers -- a pixel is a brightness from
  0 to 1, ready to eat. This pile is WORDS: movie reviews, each marked 1 (the writer liked
  it) or 0 (did not). The goal is the old goal in new clothes -- read a NEW review, guess
  1 or 0. One yes/no per review. (The textbook calls this "sentiment": thumbs up or down.)

  The wall is right at the door. A factory of clerks and dials can only multiply NUMBERS.
  It cannot multiply the word "boring". So the whole front of this post is one job: turn
  words into numbers WITHOUT throwing away what the words mean. Then we build the first
  text-factory -- three floors -- and walk one review through the middle floor by hand,
  one word at a time, until a running memory falls out the end.

  Pencil and scratch paper out. Every number here is recomputed where it is needed.


  ## The New Sheet, and the Wall

  The sheet:

    text (a movie review, in words)                        answer
    ---------------------------------------------------    ------
    "complete waste of my evening"                           0   (did not like it)
    "an absolute masterpiece, loved every minute"            1   (liked it)
    ...  (thousands of reviews)

  Each ROW is one review. The answer column is 1 or 0. We want a rule that reads the words
  of a new review and guesses the 1 or 0.

  The wall, stated plainly against Chapter 9:

    Chapter 9 photos were ALREADY numbers (pixels 0..1, ready to eat)
    Chapter 10 reviews are WORDS ("boring", "masterpiece") -- NOT numbers
    a factory of clerks + dials can only multiply NUMBERS, not the word "boring"

  So: number the words first. Three moves -- a numbering, a length-fix, and a split.


  ## Words Are Not Numbers, So Hand Each One a Number (Q1, Q2a-b)

  Walk every review in the pile and COUNT how often each word appears. (This is the oldest
  trick in the book -- count every word in a giant pile of text. The most-common word gets
  the smallest number.)

    "the"   appears most           -> 1
    "movie" next                   -> 2
    "was"                          -> 3
    "boring"                       -> 4
    ...

  Now a CHOICE (Charter Law 7 -- name it as a choice, not a fact): keep only the 10,000
  most-common words. Every rarer word -- a typo, a name, weird slang seen twice in the
  whole pile -- is dumped into ONE shared bin we will call <unknown>. That bin cannot be
  number 0, because 0 is reserved for "empty" (next move), so <unknown> gets its own number:

    reviews in the pile        = maybe 50,000     <- ROWS of the sheet
    distinct words KEPT        = 10,000           <- a knob WE set (call it VOCAB_SIZE)
    every rarer word           -> one shared <unknown> number

  Why toss the rare words? A word seen twice in 50,000 reviews cannot teach the machine
  anything reliable -- there is no pattern in two sightings. The 10,000 is not sacred; it
  is a knob. Bigger keeps more vocabulary and costs more dials; smaller is leaner.

  WATCH THE TRAP: 10,000 is how many distinct WORDS we keep. It is NOT the number of
  reviews. Reviews are the rows (maybe 50,000); kept words are the vocabulary (10,000).
  Two different counts that both happen to be in the tens of thousands.

  After this, the review "boring movie the" becomes [4, 2, 1] -- three numbers, one
  per word, each number just a NAME TAG pointing at a word.
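
  The numbering above can be sketched in a few lines of plain Python. This is a toy under
  stated assumptions, not the lab's tokeniser: the three-review pile, the helper names
  build_vocab and encode, the alphabetical tie-break, and the choice to park <unknown> just
  past the kept words are all made up for illustration.

```python
from collections import Counter

PAD_ID = 0  # reserved for "empty slot", never handed to a real word


def build_vocab(reviews, vocab_size):
    """Rank words by count; the most common word gets id 1."""
    counts = Counter(word for review in reviews for word in review.split())
    # sort by count (high first); ties broken alphabetically so the result is repeatable
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked[:vocab_size], start=1)}


def encode(review, vocab, unk_id):
    """Swap each word for its name-tag number; rare words share unk_id."""
    return [vocab.get(word, unk_id) for word in review.split()]


pile = [
    "the movie was boring",
    "the movie was the worst",
    "the movie the cast the plot",
]
vocab = build_vocab(pile, vocab_size=4)   # VOCAB_SIZE shrunk to 4 for the toy
UNK_ID = len(vocab) + 1                   # assumption: <unknown> sits just past the kept words

print(vocab)                                          # {'the': 1, 'movie': 2, 'was': 3, 'boring': 4}
print(encode("boring movie the", vocab, UNK_ID))      # [4, 2, 1]
print(encode("the plot was boring", vocab, UNK_ID))   # [1, 5, 3, 4]
```

  With only 4 words kept, "plot" falls out of the vocabulary and lands in the shared bin
  (5 in this toy), while "boring movie the" still comes out as [4, 2, 1].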


  ## A Fixed Factory Needs One Length, So Pad Every Review (Q2c)

  IN HAND: every word now has a number (the=1, boring=4, ...); 10,000 words kept, the rest
  pooled as <unknown>. A review is now a row of word-numbers of WHATEVER length it was.

  Here is the problem. The factory has a FIXED count of dials. It cannot eat an 8-number
  review and a 200-number review with the same dials -- the shapes do not match. So force
  every review to exactly the same length. Call it MAX_LEN = 100 numbers (a knob).

    review too LONG (200 words)  -> chop everything past 100      (keep the first 100)
    review too SHORT (8 words)   -> fill the end with 0s up to 100

  Worked tiny example. "boring film" -> [4, 87] -- two real numbers. Pad to length 6:

    [4, 87, 0, 0, 0, 0]      <- 2 real word-numbers, then 4 zeros

  The 0 means "empty slot, nothing here." That is why 0 was reserved and never handed to a
  real word. After this EVERY review is a line of exactly 100 numbers. Now the factory
  can eat them, all the same shape.

    >> YOUR TURN
       MAX_LEN = 5. The review "the movie was boring" numbers to [1, 2, 3, 4].
       What is the padded row? And what does "the the the the the the the" (7 words) become?

       check your slate:
         [1, 2, 3, 4] has 4 numbers, need 5 -> pad one zero -> [1, 2, 3, 4, 0].
         "the" x7 = [1,1,1,1,1,1,1], 7 numbers, need 5 -> chop to first 5 -> [1, 1, 1, 1, 1].

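  The chop-or-fill rule is short enough to write out. A minimal sketch, assuming plain
  Python lists; pad_or_chop is a hypothetical helper name, and it chops and pads at the
  END, matching the by-hand rule above.

```python
def pad_or_chop(ids, max_len, pad_id=0):
    """Keep the first max_len numbers, then fill the end with pad_id."""
    kept = ids[:max_len]
    return kept + [pad_id] * (max_len - len(kept))


print(pad_or_chop([1, 2, 3, 4], max_len=5))   # [1, 2, 3, 4, 0]
print(pad_or_chop([1] * 7, max_len=5))        # [1, 1, 1, 1, 1]
```

  Both YOUR TURN answers fall out of the same two lines: slicing handles the long review,
  and the multiplied list of zeros handles the short one.
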

  ## Never Grade on Studied Cards, So Cut Study From Exam (Q2d)

  Same honest rule as every chapter: never grade the machine on cards it studied. Cut 80%
  to study from, seal 20% as the exam. The exam reviews are not looked at until the very
  end. (Earlier chapters split THREE ways -- study, practice, exam. This lab uses the test
  pile directly as its exam, so two piles here. The principle is identical: the graded pile
  is sealed.)

  After all three steps: every review = a line of exactly 100 word-numbers, the pile cut
  into a study 80% and a sealed exam 20%. The factory can finally be built.
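
  The cut itself can be sketched without any library. This is an illustration only: the
  helper name split_study_exam, the seed, and the ten toy rows are assumptions, and the
  lab's own split may shuffle differently.

```python
import random


def split_study_exam(rows, exam_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then seal the last exam_fraction as the exam."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)     # fixed seed -> the same cut every run
    n_exam = int(round(len(shuffled) * exam_fraction))
    cut = len(shuffled) - n_exam
    return shuffled[:cut], shuffled[cut:]


rows = [([i] * 3, i % 2) for i in range(10)]  # ten toy (word-numbers, answer) pairs
study, exam = split_study_exam(rows)
print(len(study), len(exam))                  # 8 2
```

  Shuffling before cutting matters when the pile is sorted (say, all the 1s first);
  otherwise the exam could end up with only one kind of answer.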


  ## The First Text-Factory: Three Floors (Q3)

  IN HAND: every review is now 100 word-numbers (e.g. [4, 73, 1, 87, 0, 0, ...]), 0 = empty.
  We feed that line of 100 numbers into a factory with three floors:

    100 word-numbers in
      |
    floor 1: NOTEPAD      swap each word-number for a row of meaning-numbers
      |
    floor 2: ONE WORKER   read the rows IN ORDER, carry a running memory
      |
    floor 3: FINAL CLERK  read the last memory -> one chance, liked (1) / not (0)

  Each floor gets its own section. Floor 1 fixes the name-tag problem; floor 2 is the heart
  of the chapter; floor 3 is the same S-curve clerk from Chapter 7.


  ## Floor 1: The Notepad (Embedding)

  The word-number is just a NAME TAG. boring=4, masterpiece=73 -- the size of the number
  means NOTHING. Feed it raw to a clerk and masterpiece (73) would count 73 times heavier
  than "the" (1), which is pure nonsense. A name tag is not a measurement.

  So instead of feeding the bare number, jot a NOTE about each word -- a little row of
  real numbers that says what the word is like. Start with one number per word, a LEAN:

    "boring"      appeared in 1000 reviews, 900 of them rated 0  ->  note: leans DOWN
    "masterpiece"                                                ->  note: leans UP
    "the"         shows up everywhere, no lean                   ->  note: ~middle (useless)

  One number per word = a notepad 10,000 rows tall and 1 wide. That single LEAN number is
  already an embedding, at width 1.

  But one number per word is too thin. Here is the reason, and it is the whole reason the
  note is WIDE: the word "not" does not lean up or down on its own -- its whole job is to
  FLIP the next word. "not boring" is a compliment. A single lean-number cannot say "I am a
  flipper." A word needs several slots:

    "boring"  note:  lean = down,  flipper = no,   amplifier = no,   ...
    "not"     note:  lean = none,  flipper = YES,  amplifier = no,   ...
    "very"    note:  lean = none,  flipper = no,   amplifier = YES,  ...

  So make each note 32 numbers wide (call the width EMBEDDING_DIM = 32, a knob). The notepad
  is now:

    10,000 word-rows  x  32 numbers each  =  320,000 dials

  The word-number just says WHICH ROW to read. Word 4 -> go to row 4 -> copy its 32 numbers.

  Two things people get wrong about the notepad:

  (1) The note slots are NOT labelled by a human. Nobody writes "slot 3 = flipper." The
  notepad starts as 320,000 random junk numbers. The machine FILLS it itself, by the same
  dial-turning as every other floor -- nudged by wrongness over many study reviews until
  "boring" and "dull" drift to similar notes and "masterpiece" sits far away. The notepad
  is WORKSPACE the machine shapes, exactly like Chapter 9's dials -- it is NOT read from
  the data.

  (2) The notepad is dials, so it counts. 320,000 of them. That single floor already holds
  more dials than the rest of the factory combined, as we will see.

  One review through floor 1:

    [4, 73, 1, 87, 0, ...]  ->  look up each number's row  ->  stack 100 notes
                            ->  a SHEET 100 rows tall x 32 wide

  A picture (showing 4 of the 100 rows, 5 of the 32 columns):

    word-number   note (32 numbers, only 5 shown)
    ---------     --------------------------------
       4  boring  [ -0.8   0.1   0.0   0.2  -0.3  ... ]
      73  master  [  0.9  -0.1   0.4   0.0   0.7  ... ]
       1  the     [  0.0   0.0  -0.1   0.0   0.1  ... ]
      87  film    [  0.2   0.3   0.0  -0.2   0.0  ... ]
       0  EMPTY   [  0.0   0.0   0.0   0.0   0.0  ... ]    (by hand, slot 0 is kept all-zero)

  That stacked sheet -- 100 rows of 32 -- is what floor 2 reads.

    >> YOUR TURN
       Notepad width 4 (not 32). Three words kept. "good" = row 2 with note
       [0.9, 0.0, 0.1, -0.2]. The one-word review "good" numbers to [2]. What sheet does
       floor 1 produce, and how tall x wide is it if MAX_LEN = 3?

       check your slate: MAX_LEN 3 means pad [2] to [2, 0, 0]. Look up each:
         row 2  -> [0.9, 0.0, 0.1, -0.2]
         row 0  -> [0.0, 0.0, 0.0, 0.0]   (empty)
         row 0  -> [0.0, 0.0, 0.0, 0.0]   (empty)
       The sheet is 3 rows tall x 4 wide.
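
  The lookup in that exercise is a single indexing step in NumPy. A minimal sketch: only
  the "good" row comes from the exercise; the notes in rows 1 and 3 are made-up filler,
  and the all-zero row 0 follows the by-hand convention.

```python
import numpy as np

# Rows 0..3: row 0 is the empty slot, rows 1-3 are the three kept words.
notepad = np.array([
    [0.0, 0.0, 0.0, 0.0],     # 0 = EMPTY (kept all-zero in the by-hand picture)
    [0.3, 0.1, 0.0, 0.2],     # 1 = made-up note
    [0.9, 0.0, 0.1, -0.2],    # 2 = "good" (the note from the exercise)
    [-0.7, 0.2, 0.0, 0.1],    # 3 = made-up note
])

padded_review = np.array([2, 0, 0])   # the exercise's review, padded to MAX_LEN = 3
sheet = notepad[padded_review]        # one notepad row copied out per word-number

print(sheet.shape)   # (3, 4)
print(sheet)
```

  No multiplication happens on this floor: the word-number is used purely as a row index,
  which is why its size carries no meaning.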


  ## Floor 2: The Walking Worker (the RNN Cell, by Hand)

  IN HAND: floor 1 turned the review into a sheet of 100 notes, each note 32 numbers wide.
  Floor 2 must boil that whole sheet down to ONE summary of the review. This is the heart
  of the chapter, so we walk it by hand, one word at a time.

  WHY A MEMORY AT ALL. "not boring" and "boring not" are made of the SAME two words, yet
  they mean opposite things. ORDER carries meaning. To let "not" flip "boring", the machine
  must still REMEMBER it saw "not" when it reaches "boring":

    read "not"     ->  memory: "a flip is pending"
    read "boring"  ->  boring leans down, BUT a flip is pending  ->  push the verdict UP

  So floor 2 keeps a MEMORY: a small row of 32 numbers that it rewrites after every word.
  Think of two pockets on the worker's apron:

    pocket A = the MEMORY        (32 numbers; before the first word it is ALL ZEROS)
    pocket B = the DIALS         (frozen during one read-through; only change between
                                  training loops)

  The dials in pocket B are two papers and a nudge, REUSED at every single word:

    word-dials    = a 32 x 32 paper   (for the current word's 32 note-numbers)
    memory-dials  = a 32 x 32 paper   (for the old memory's 32 numbers)
    nudge         = a row of 32 numbers

  The word-dials have nothing to do with any particular word -- they are general, the same
  for "boring" as for "the". Now the walk.

  WORD 1 -- "nolan" (first word, so the old memory is all zeros):

    1. see "nolan" -> read its row in the notepad -> pull its 32 numbers
    2. WORD-part:    nolan's 32  x  word-dials (32x32)        -> 32 numbers
                     (row 1 of the paper, times nolan's 32, added up -> new number 1;
                      row 2 times the same 32 -> number 2; ... 32 rows -> 32 numbers)
    3. MEMORY-part:  old memory (all zeros)  x  memory-dials  -> all zeros
                     (anything times zero is zero -- the first word has nothing behind it)
    4. ADD them:     word-part (32)  +  memory-part (zeros)   -> 32 numbers
    5. + NUDGE:      add the nudge (32 numbers)               -> 32 numbers
    6. SQUASH:       push each through tanh (crush to between -1 and +1) -> 32 numbers

  That result is the NEW memory. Put it in pocket A, replacing the zeros.

  (tanh is the squash from Chapter 7's family -- it crushes any number into the band -1 to
  +1: a big positive -> near +1, a big negative -> near -1, zero -> 0. One crush, at the end.)

  WORD 2 -- "ended" (now the old memory is NOT zeros):

    1. see "ended" -> notepad -> its 32 numbers
    2. WORD-part:    ended's 32  x  the SAME word-dials       -> 32
    3. MEMORY-part:  pocket A (nolan's leftover memory)  x  the SAME memory-dials  -> 32
                     (this time NOT zeros, so the past actually contributes)
    4. add + nudge + tanh                                     -> the newer memory
    5. put it back in pocket A

  Same dials as word 1 -- reused. That reuse -- one dial-set looping back over word after
  word -- is the whole idea, and it is why the textbook calls this RECURRENT (recurring:
  the same dials come round again every word). The one recipe to memorise:

    new memory = tanh(  word's 32  x  word-dials   +   old memory  x  memory-dials   +  nudge )
                        \__ this word folded in __/     \__ the past carried forward _/

  Walk all 100 words this way. After the last word, pocket A holds a 32-number SUMMARY of
  the whole review, with order baked in.

  ONE WORKER, NOT ONE-PER-WORD. The single most common wrong picture:

    NOT 100 workers, one per word
    YES  1 worker, who walks word -> word -> word 100 times, SAME dials every word

  This is exactly Chapter 9's reuse in a new suit: there, one inspector slid his ONE magic
  paper to every spot on the photo; here, one worker slides his ONE dial-set across all 100
  words. Because the dials are reused, a 100-word review and a 10-word review need the SAME
  dials -- one worker, one dial-set, just more steps for the longer review.

  How this differs from Chapter 9's inspector:

    inspector (Ch.9):  each spot is independent -- no memory between spots
    worker  (RNN):     carries memory FORWARD -- word 2 uses word 1's leftover memory

  The dial count for this floor, by hand:

    word-dials    32 x 32  = 1,024
    memory-dials  32 x 32  = 1,024
    nudge         32       =    32
                    total  = 2,080

  (Keras reports a SimpleRNN(32) sitting on a 32-wide note as exactly 2,080 -- it counts
  (32 + 32) x 32 + 32. Same arithmetic, same answer.)

    >> YOUR TURN
       Memory width 2 (not 32). Old memory = [0, 0] (first word). The word's 2 numbers are
       [1, 2]. word-dials = [[1, 0], [0, 1]] (the do-nothing paper), memory-dials anything,
       nudge = [0, 0]. tanh(1) ~ 0.76, tanh(2) ~ 0.96. What is the new memory?

       check your slate:
         WORD-part: row1 . [1,2] = 1x1 + 0x2 = 1 ; row2 . [1,2] = 0x1 + 1x2 = 2 -> [1, 2]
         MEMORY-part: old memory is [0,0], so anything x it = [0, 0]
         add + nudge: [1, 2] + [0, 0] + [0, 0] = [1, 2]
         squash: [tanh(1), tanh(2)] ~ [0.76, 0.96].  New memory ~ [0.76, 0.96].
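
  The one recipe, written as a NumPy sketch that reproduces the slate above. The function
  names rnn_step and walk and the memory-dials values are assumptions chosen for
  illustration. The papers are laid out as in the by-hand walk (each ROW of a paper times
  the incoming numbers); Keras stores its papers the other way round, which changes the
  bookkeeping but not the idea.

```python
import numpy as np


def rnn_step(word_note, old_memory, word_dials, memory_dials, nudge):
    """One step of the worker: fold the word in, carry the past, nudge, squash."""
    return np.tanh(word_dials @ word_note + memory_dials @ old_memory + nudge)


def walk(sheet, word_dials, memory_dials, nudge):
    """Walk the sheet row by row with the SAME dials; return the last memory."""
    memory = np.zeros(nudge.shape[0])            # pocket A starts as all zeros
    for word_note in sheet:
        memory = rnn_step(word_note, memory, word_dials, memory_dials, nudge)
    return memory


word_dials = np.array([[1.0, 0.0], [0.0, 1.0]])      # the do-nothing paper
memory_dials = np.array([[0.5, -0.3], [0.2, 0.4]])   # "anything" -- made-up values
nudge = np.zeros(2)

first = rnn_step(np.array([1.0, 2.0]), np.zeros(2), word_dials, memory_dials, nudge)
print(np.round(first, 2))                            # [0.76 0.96]

# A two-word sheet: the second step now uses the first step's leftover memory.
two_words = np.array([[1.0, 2.0], [0.5, -1.0]])
print(walk(two_words, word_dials, memory_dials, nudge))
```

  Note that walk takes a sheet of ANY height: the loop just runs more or fewer times,
  and the three dial arrays never change shape.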


  ## Why the Loop Needs the tanh Squash

  IN HAND: the worker rewrites its 32-number memory once per word, 100 times, each time
  ending with tanh. Why end with a squash at all? Here is the problem, made visible.

  A squash is a FIXED curve that bends ONE number into a tidy range, alone -- no dials,
  nothing learned, no looking at neighbours. Three squashes appear across this book:

    relu     (Chapter 7/9 clerks):   negative -> 0, keep positive    range  0 to infinity
    tanh     (this worker's memory): crush                           range -1 to +1
    sigmoid  (the final clerk):      crush                           range  0 to 1

    tanh by the numbers:  0 -> 0 ;  1 -> 0.76 ;  2 -> 0.96 ;  5 -> ~1.0 ;  -5 -> ~-1.0

  Now the problem. Take the memory and multiply it by the memory-dials 100 times in a row
  with NO squash between. Suppose one slot's dial run multiplies by 1.5 each word:

    1.5 x 1.5 x ... (100 times)  =  1.5^100  ~  4 x 10^17, an 18-digit number -- it EXPLODES.

  And if the dial run multiplies by 0.5 each word instead:

    0.5^100  ~  8 x 10^-31, essentially zero -- the memory DIES.

  A 100-step loop with no crusher either blows up to nonsense or collapses to nothing. The
  tanh squash pins every memory number between -1 and +1 at EVERY word, so after 100 words
  the memory is still a tidy bounded row. The squash is what keeps the long loop polite --
  it is not decoration; without it the worker is unusable. (It caps the blow-up side only.
  It does nothing to stop far-back words from fading; that forgetting is Part 2's problem.)
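
  The arithmetic can be checked in a few lines. This sketch shrinks the worker to ONE
  memory slot and ONE dial (an assumption made to keep the numbers readable; the real
  worker has 32 slots and full papers), and the helper name loop is hypothetical.

```python
import math

print(1.5 ** 100)   # about 4.07e17
print(0.5 ** 100)   # about 7.9e-31


def loop(dial, steps=100, start=1.0, squash=None):
    """Multiply one memory slot by the same dial again and again, optionally squashing."""
    m = start
    for _ in range(steps):
        m = dial * m
        if squash is not None:
            m = squash(m)
    return m


exploded = loop(1.5)                    # no squash: same blow-up as 1.5 ** 100
bounded = loop(1.5, squash=math.tanh)   # with tanh: stays inside -1..+1
faded = loop(0.5, squash=math.tanh)     # with tanh: still shrinks toward 0
print(exploded, bounded, faded)
```

  The squash tames the growing case completely, but the shrinking case still shrinks:
  tanh never makes a number bigger, so it cannot revive a memory that is dying.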

  ONE THING THE SQUASH IS NOT: a humbler. Chapter 9's humbler (batch norm) also keeps
  numbers in check, but it PEEKS at the whole handful of 64 to re-centre, it has learned
  dials (stretch, shift), and it keeps a diary. The tanh squash peeks at no one, learns
  nothing, keeps no diary -- it bends each number through the same fixed curve, alone, every
  time. Humbler = learned, crowd-aware re-centring. Squash = dumb, fixed, solo crusher.


  ## Floor 3: The Final Clerk (Dense + S-curve)

  IN HAND: floor 2 walked all 100 words and left a 32-number summary in pocket A.

  One clerk reads those 32 numbers, multiplies each by its dial, adds them and a nudge into
  one running total, then squashes that total through the S-curve into a chance between 0
  and 1 (the same S-curve derived from odds in Chapter 7):

    big positive total  ->  near 1   (liked)
    zero total          ->  0.5      (on the fence)
    big negative total  ->  near 0   (did not like)

  Dials here: 32 (one per memory-number) + 1 nudge = 33.
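
  The clerk's whole job fits in two lines. A minimal sketch with a 3-number memory
  instead of 32; the function name final_clerk and every dial value are made up for
  illustration.

```python
import math


def final_clerk(memory, dials, nudge):
    """Weighted sum of the memory numbers plus a nudge, squashed by the S-curve."""
    total = sum(m * d for m, d in zip(memory, dials)) + nudge
    return 1.0 / (1.0 + math.exp(-total))


fence = final_clerk([0.0, 0.0, 0.0], [1.0, -2.0, 0.5], nudge=0.0)     # total 0    -> 0.5
liked = final_clerk([0.9, -0.8, 0.7], [3.0, -3.0, 3.0], nudge=0.0)    # total 7.2  -> near 1
hated = final_clerk([0.9, -0.8, 0.7], [-3.0, 3.0, -3.0], nudge=0.0)   # total -7.2 -> near 0
print(fence, liked, hated)
```

  Because the memory numbers are already pinned between -1 and +1, how confident the
  clerk can get depends on how large its own dials have grown.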

  The whole factory's dial count, recomputed:

    floor 1  notepad         10,000 x 32  = 320,000
    floor 2  walking worker  (above)       =   2,080
    floor 3  final clerk     32 + 1         =      33
                                     total  = 322,113

  Just as in Chapter 9's factory, one floor hogs the dials -- there it was the Dense floor,
  here it is the notepad (320,000 of 322,113, about 99%). The famous "recurrent" worker is
  a rounding error in the dial budget. The cost is not in the dials; it is in the WALK --
  100 steps per review, in order, one after another, which is the bottleneck Part 2's
  cousins inherit and the Transformer (a later chapter) finally breaks.
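
  The dial budget as a small calculator, so the knobs can be changed and recounted. The
  three helper names are hypothetical; the formulas are the ones worked by hand above.

```python
VOCAB_SIZE = 10_000
EMBEDDING_DIM = 32
MEMORY_WIDTH = 32


def notepad_dials(vocab_size, note_width):
    """One row of note_width numbers per kept word-number."""
    return vocab_size * note_width


def worker_dials(note_width, memory_width):
    """word-dials + memory-dials + nudge."""
    return note_width * memory_width + memory_width * memory_width + memory_width


def clerk_dials(memory_width):
    """One dial per memory number, plus one nudge."""
    return memory_width + 1


floors = {
    "notepad": notepad_dials(VOCAB_SIZE, EMBEDDING_DIM),
    "walking worker": worker_dials(EMBEDDING_DIM, MEMORY_WIDTH),
    "final clerk": clerk_dials(MEMORY_WIDTH),
}
total = sum(floors.values())
for name, dials in floors.items():
    print(f"{name:15s} {dials:>8,d}  ({dials / total:.1%})")
print(f"{'total':15s} {total:>8,d}")   # 322,113
```

  Doubling VOCAB_SIZE doubles the notepad and barely moves anything else, which is the
  "one floor hogs the dials" point in numbers.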


  ## Common Tripwires

  Built from the live lab session -- every confusion below actually came up.

  TRIPWIRE 1  "10,000 is the number of reviews."
    No. 10,000 is the number of distinct WORDS kept (the vocabulary, a knob). The reviews
    are the ROWS -- maybe 50,000. Two different counts.

  TRIPWIRE 2  "The word-number is a measurement -- bigger means more."
    No. It is a NAME TAG. boring=4, masterpiece=73 carry no size-meaning; 73 is not "bigger"
    than 4 in any useful sense. That is exactly why floor 1 swaps the tag for a learned note.

  TRIPWIRE 3  "The notepad comes from the data."
    No. The notepad starts as random junk and the machine FILLS it itself by dial-turning,
    pulling similar words close. It is workspace, like every other dial -- not a lookup
    table read off the reviews.

  TRIPWIRE 4  "Floor 2 is 100 workers, one per word."
    No. ONE worker, walking 100 steps, reusing the SAME dial-set every step. That reuse is
    what 'recurrent' means, and it is why a long review and a short review share one dial-set.

  TRIPWIRE 5  "Why carry a memory? Just average the word-notes."
    Averaging throws away ORDER, and order is the whole point. "not boring" and "boring not"
    average to the identical thing; only a memory that remembers "not" when it reaches
    "boring" can tell them apart.

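  TRIPWIRE 6  "Padding with zeros adds fake words the machine learns from."
    The 0 slot is reserved as EMPTY and never stands for a real word. In the by-hand picture
    its notepad row is all-zero, so a padded slot adds nothing to the WORD-part. Two caveats:
    the worker still takes a step at every padded slot (old memory x memory-dials + nudge),
    and in the Keras code below row 0 is an ordinary learned row unless you ask the notepad
    to mask it (Embedding(..., mask_zero=True)), which makes the worker skip padded steps.
    Padding exists only to give every review the same shape so the fixed dials can eat them.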

  TRIPWIRE 7  "Embedding width 1 (just a lean) would do."
    One number cannot encode "I am a flipper" (not) or "I am an amplifier" (very) separately
    from a lean. Width 32 gives the word room to be several things at once. Width is a knob.
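
  Tripwire 5 can be checked by hand-sized experiment. In this sketch everything is made
  up for illustration: the notes are 2 wide, and the notes and dials are hand-picked so
  the arithmetic stays followable. It compares averaging against the walking worker on
  the same two words in both orders.

```python
import numpy as np

note_not = np.array([0.0, 1.0])       # made-up 2-wide notes
note_boring = np.array([-1.0, 0.0])

word_dials = np.eye(2)                              # the do-nothing paper
memory_dials = np.array([[0.0, 1.0], [0.0, 0.0]])   # old slot 2 feeds new slot 1
nudge = np.zeros(2)


def walk(sheet):
    """Walk the sheet in order with the same dials; return the last memory."""
    memory = np.zeros(2)
    for note in sheet:
        memory = np.tanh(word_dials @ note + memory_dials @ memory + nudge)
    return memory


not_boring = np.stack([note_not, note_boring])
boring_not = np.stack([note_boring, note_not])

same_average = np.allclose(not_boring.mean(axis=0), boring_not.mean(axis=0))
same_memory = np.allclose(walk(not_boring), walk(boring_not))
print(same_average)   # True  -- averaging cannot see order
print(same_memory)    # False -- the walked memory can
print(walk(not_boring), walk(boring_not))
```

  With these dials, "not" leaves a mark in slot 2 that softens the "boring" that follows;
  read the other way round, there is nothing pending when "boring" arrives.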


  ## The Labels, Last

    pile of words, 1/0 per row       sentiment classification (binary)
    number each word by frequency    tokenisation by frequency rank
    10,000 kept words                vocabulary size (VOCAB_SIZE / num_words)
    shared rare-word bin             <unknown> / out-of-vocabulary (OOV) token
    same-length fix                  padding / truncating (pad_sequences)
    empty slot = 0                   pad token (id 0), mask
    MAX_LEN = 100                    sequence length
    notepad                          embedding layer
    one note (32 numbers)            word vector / embedding vector
    note width 32                    embedding dimension (EMBEDDING_DIM)
    walking worker, one memory       SimpleRNN (recurrent layer)
    pocket A (memory)                hidden state h_t
    word-dials / memory-dials        input weights W / recurrent weights U
    reuse same dials each word       weight sharing across time steps
    crush to -1..+1                  tanh activation
    final clerk + S-curve            Dense(1) + sigmoid
    wrongness for yes/no             binary cross-entropy
    downhill-roller                  Adam optimiser


  ## Code, If You Want It

  Nothing above needed a computer; this section is for the day you meet one.

  Open the pile and check the balance (Q1):

  ```python
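  # load_data and DATA_PATH come from the lab scaffold (not shown here);
  # pd.read_csv(DATA_PATH) is a plain-pandas stand-in if you lack the helper.
  df = load_data(DATA_PATH)                 # CSV with columns 'text' and 'label'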
  q1_shape = df.shape                        # (how many reviews, 2)
  q1_dist  = df['label'].value_counts().sort_index()   # how many 0s, how many 1s
  ```

  Number the words, cap the vocabulary, pad to one length, split (Q2):

  ```python
  from tensorflow.keras.preprocessing.text import Tokenizer
  from tensorflow.keras.preprocessing.sequence import pad_sequences
  from sklearn.model_selection import train_test_split

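  VOCAB_SIZE   = 10000        # keep the 10,000 most common words (a knob)
  MAX_LEN      = 100          # force every review to 100 numbers (a knob)
  RANDOM_STATE = 42           # any fixed seed makes the split repeatable (value assumed)

  # Keras gives the oov_token id 1, so here the most common word gets id 2 (not 1 as in
  # the by-hand table), and num_words keeps only ids below VOCAB_SIZE.
  tok = Tokenizer(num_words=VOCAB_SIZE, oov_token="<unk>")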
  tok.fit_on_texts(df['text'])                    # count words, rank by frequency
  seqs = tok.texts_to_sequences(df['text'])       # words -> name-tag numbers
  X = pad_sequences(seqs, maxlen=MAX_LEN,
                    padding='post', truncating='post')   # 0 = empty slot
  y = df['label'].values

  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=RANDOM_STATE)
  ```

  Build the three-floor text-factory (Q3):

  ```python
  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import Input, Embedding, SimpleRNN, Dense

  EMBEDDING_DIM = 32

  model_rnn = Sequential()
  model_rnn.add(Input(shape=(MAX_LEN,)))                            # 100 word-numbers in
  model_rnn.add(Embedding(input_dim=VOCAB_SIZE,
                          output_dim=EMBEDDING_DIM))                # floor 1: the notepad
  model_rnn.add(SimpleRNN(32))                                      # floor 2: walking worker
  model_rnn.add(Dense(1, activation="sigmoid"))                     # floor 3: final yes/no
  model_rnn.compile(optimizer="adam",
                    loss="binary_crossentropy", metrics=["accuracy"])
  ```

  - `binary_crossentropy` = wrongness for a yes/no guesser: -log(the chance given to the
    true answer), natural log. Truth 1, said 0.9 -> 0.105 (tiny); said 0.1 -> 2.303
    (about 22 times worse). (Chapter 7 derives it.)
  - `adam` = the downhill-roller from Chapter 7, Part 2.
  - `model_rnn.summary()` prints 322,113 total dials -- 320,000 in the notepad alone.
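
  The wrongness for a single review can be computed by hand or with this sketch. The
  function below is an illustration written for this post, not the Keras function
  itself (which also averages over a whole handful of reviews).

```python
import math


def binary_crossentropy(truth, chance_of_1):
    """-log of the chance the machine gave to the true answer (truth is 0 or 1)."""
    chance_given_to_truth = chance_of_1 if truth == 1 else 1.0 - chance_of_1
    return -math.log(chance_given_to_truth)


print(round(binary_crossentropy(1, 0.9), 3))   # 0.105  confident and right
print(round(binary_crossentropy(1, 0.1), 3))   # 2.303  confident and wrong
print(round(binary_crossentropy(0, 0.1), 3))   # 0.105  truth 0, said "10% liked" -> right
```

  The penalty is lopsided on purpose: being confidently wrong costs far more than being
  confidently right earns.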

  Training this factory, watching it forget far-back words, and the two-memory fix that
  rescues it (the LSTM) are Part 2.

  --> Continue: Chapter 10, Part 2: The Two-Memory Worker


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 10 -- Machines That Read Words):
    Part 1 (this post) -- The Notepad and the Walking Worker
    Part 2 -- The Two-Memory Worker
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================