The Two-Memory Worker: How an LSTM Remembers Far-Back Words (RNNs by Hand, Part 2)

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------

  CHAPTER 10 . MACHINES THAT READ WORDS . PART 2 OF 2
  Posted: 2026-06-13 . Author: Rahul Rai . Tags: lstm, bilstm, rnn, sequence-models
  ============================================================================================

  Part 1 built a text-factory: a NOTEPAD turns each word into a 32-number note; ONE walking
  worker reads the 100 notes IN ORDER, rewriting a 32-number memory after every word; a
  final clerk reads the last memory for thumbs up or down. We can now TRAIN it (Q4) -- and
  then watch it fail in a specific, fixable way. Over a long review the plain worker FORGETS
  the opening words. This post derives the cure by hand: a worker who carries TWO memories
  and lets three little voters decide what to keep, what to admit, and what to speak. That
  worker is the LSTM. Then one more idea -- read the review BOTH directions (the BiLSTM) --
  and an honest, head-to-head comparison of all three machines.

  One blunt warning, the same one Chapter 9 earned: the exact wiring of the LSTM below was
  ENGINEERED by trial to stop the forgetting, NOT derived from a single clean principle.
  Many wirings work about equally; a simpler cousin (the GRU) is nearly as good. I will flag
  this again where it matters. We build it by hand anyway, because the mechanism is the
  lesson.

  Pencil out. Pocket A, pocket B, and a row of clerks.


  ## Train the Plain Worker (Q4)

  IN HAND: the Part 1 factory -- notepad (320,000 dials), one walking worker (2,080), a
  final clerk (33), 322,113 dials total. Study pile of reviews, each 100 word-numbers, exam
  sealed. To TRAIN is to turn every dial a little toward "less wrong", over and over.

  The recipe is Chapter 9's, in words:

    5 read-throughs of the study reviews (5 loops / epochs)
    64 reviews per grab (a handful / batch)
    for each grab: the one worker walks each review's 100 notes, building a memory;
                   the final clerk guesses liked/not; the average wrongness over the 64
                   turns EVERY dial a tiny notch toward right
    then read the exam score (val_accuracy) after the last loop

  Cost, by hand, on a study pile of (say) 40,000 reviews:

    grabs per loop  = 40,000 / 64  = 625 dial-turns
    five loops      = 625 x 5       = 3,125 dial-turns total

  Each dial-turn walks 64 reviews x 100 words = 6,400 worker-steps just in floor 2, every
  one reusing the same dials. The worker is cheap in DIALS but expensive in STEPS, and the
  steps must happen IN ORDER -- word 2 needs word 1's memory. Hold that fact; it is the
  whole reason for the chapter's bottleneck and the eventual Transformer.
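
  The same count, as a sketch you can rerun with other pile sizes. The function names are
  my own, made up for this post, and rounding a short last grab UP to a full dial-turn is
  an assumption (with 40,000 and 64 it divides evenly, so it does not matter here).

```python
import math

def dial_turns(num_reviews: int, batch_size: int, epochs: int) -> int:
    """Total dial-turns: one per grab, for every loop through the study pile."""
    grabs_per_loop = math.ceil(num_reviews / batch_size)  # a short last grab still counts
    return grabs_per_loop * epochs

def worker_steps_per_turn(batch_size: int, words_per_review: int) -> int:
    """Worker-steps inside one dial-turn: every review in the grab, every word."""
    return batch_size * words_per_review

print(dial_turns(40_000, 64, 5))        # 3125
print(worker_steps_per_turn(64, 100))   # 6400
```

  Few dial-turns, many steps per turn: that is the cheap-in-dials, expensive-in-steps shape.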

    >> YOUR TURN
       Study pile of 32,000 reviews, grab size 64, 5 loops. How many dial-turns?

       check your slate: 32,000 / 64 = 500 grabs per loop. 500 x 5 = 2,500 dial-turns.


  ## Why the Plain Worker Forgets (the Fade)

  IN HAND: the plain worker's one recipe, every word:

    new memory = tanh( word x word-dials  +  old memory x memory-dials  +  nudge )

  Look at what happens to the OLD memory: every single word it is matrix-multiplied AND
  crushed through tanh. Crush, multiply, crush, multiply -- 100 times over a 100-word review.

  Word 1 was "not". Its mark on the memory is mangled and re-crushed on word 2, again on
  word 3, ... and by word 90 that mark has been squeezed toward nothing. Not erased in one
  blow -- WORN DOWN over ninety crushings, like a pencil mark rubbed ninety times. So at
  word 90 the worker has effectively forgotten the "not" at word 1, and a long review's
  early flip is lost.

    RNN memory:  tanh( W . old  + ... )   -> mangled + crushed each word  -> FADES
    we want:     keep x old + add x new   -> scaled + added each word     -> SURVIVES

  The cure, in one line before we build it: stop crushing the long memory as it carries.
  Keep a memory that is only SCALED by a fraction and ADDED to -- never crushed on the way
  through -- so word 1 can ride all the way to word 90.
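
  A toy to feel the difference, under loud assumptions: ONE slot instead of 32, no new
  words arriving, and two dial values (0.9 and 0.99) that I picked by hand. It is a picture
  of the wearing-down, not the full story of what happens during training.

```python
import math

WORDS = 90          # how far the mark has to travel
MEMORY_DIAL = 0.9   # hypothetical memory-dial for the plain worker
KEEP_FRAC = 0.99    # hypothetical keep-fraction for the scaled-only memory

plain = 1.0   # the mark "not" leaves on a one-slot memory at word 1
kept = 1.0    # the same mark, stored in a memory that is only scaled

for _ in range(WORDS - 1):                   # words 2..90, nothing new coming in
    plain = math.tanh(MEMORY_DIAL * plain)   # multiply, then crush
    kept = KEEP_FRAC * kept                  # scale only, no crush

print(f"plain worker at word {WORDS}:  {plain:.6f}")
print(f"scaled memory at word {WORDS}: {kept:.6f}")
```

  The crushed mark ends up a hair above zero; the scaled mark still holds a large share of
  what it started with.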

  (Who found this: Hochreiter and Schmidhuber, 1997; Hochreiter's 1991 thesis is where the
  fade was first spotted. The keep-or-forget voter was added by Gers, Schmidhuber and
  Cummins in 2000.)


  ## The Two Memories

  The fix carries TWO memories instead of one. Two pant pockets, A and B, both starting all
  zeros:

    memory-A = the LONG KEEP.  Used as-is: only SCALED by a fraction and ADDED to. NEVER
               fed through a squash on the carry. Because it is never crushed, it CAN grow
               past 1 (it might read 5.0) and far-back marks survive in it.

    memory-B = the SPOKEN, recent memory. THIS is what gets fed into the little machines
               next word. It IS crushed -- it is a show-fraction times tanh(A).

  Keep the two straight: A is the savings account (long, untouched on the carry); B is the
  cash in hand (what gets shown around and fed back in). The plain worker had only one
  pocket, and crushing it every word is what made it forget.


  ## The Four Little Machines (all read the SAME two things)

  IN HAND: this word's note (call it the word32), plus memory-B from last word. Memory-A is
  NOT read by any machine -- hold that thought.

  Four little machines run side by side. EVERY one of them reads the same two raw things --
  the word32 and memory-B -- and nothing else (not A, not each other, not the new memory).
  Each has its OWN two dial-papers (one for the word32, one for memory-B) and its own nudge.
  All four are computed in parallel, then combined.

    FRESH VALUE:  tanh(    word32 x Wc  +  memoryB x Uc  +  nudge_c )  -> -1..+1
    KEEP-VOTER:   sigmoid( word32 x Wk  +  memoryB x Uk  +  nudge_k )  ->  0..1
    ADMIT-VOTER:  sigmoid( word32 x Wa  +  memoryB x Ua  +  nudge_a )  ->  0..1
    SHOW-VOTER:   sigmoid( word32 x Ws  +  memoryB x Us  +  nudge_s )  ->  0..1

  Two squashes in play, both from earlier chapters:
    tanh    crushes to -1..+1  (a VALUE, can be positive or negative)
    sigmoid crushes to  0..1   (a FRACTION -- "how much", a dimmer knob from 0% to 100%)

  ONE MACHINE -- TWO PAPERS INSIDE. Pick any formula above, say KEEP-VOTER:
    sigmoid( word32 x Wk  +  memoryB x Uk  +  nudge_k )
  word32 goes through paper Wk (a 32x32 matrix of dials); memory-B goes through paper Uk (a
  separate 32x32 matrix). Their two outputs ADD, the nudge joins, THEN the one squash. This
  is NOT a "word-machine" whose output feeds a "memory-machine" -- both papers live INSIDE the
  same one machine, and the combine happens BEFORE the squash. All four machines follow this
  exact shape, each with its own W, its own U, its own nudge.
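
  The four formulas as a NumPy sketch. Every number here is random and made up (the seed,
  the 0.1 scale, the zero nudges are my choices, not trained dials); the point is only the
  SHAPE of the computation: two papers inside one machine, add first, squash once.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
UNITS = 32

word32 = rng.normal(size=UNITS)              # this word's note (made-up numbers)
memory_b = rng.uniform(-1, 1, size=UNITS)    # memory-B from last word (made-up numbers)

# Each machine owns two dial-papers (W for the word, U for memory-B) and a nudge.
dials = {
    name: {
        "W": rng.normal(scale=0.1, size=(UNITS, UNITS)),
        "U": rng.normal(scale=0.1, size=(UNITS, UNITS)),
        "nudge": np.zeros(UNITS),
    }
    for name in ("fresh", "keep", "admit", "show")
}

def machine(name, squash):
    d = dials[name]
    # Both papers live inside the one machine: add first, squash once, last.
    return squash(word32 @ d["W"] + memory_b @ d["U"] + d["nudge"])

fresh = machine("fresh", np.tanh)   # a value, -1..+1
keep = machine("keep", sigmoid)     # a fraction, 0..1
admit = machine("admit", sigmoid)   # a fraction, 0..1
show = machine("show", sigmoid)     # a fraction, 0..1

print(fresh.shape, keep.shape, admit.shape, show.shape)
```

  Memory-A appears nowhere in this block, which is the thought the section asked you to hold.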

  Here is the quiet punchline: the FRESH VALUE machine -- tanh of (word + old recent + nudge)
  -- IS the plain worker's recipe from Part 1, unchanged. The LSTM does not throw the old
  worker away. It keeps it as ONE of four parts (the fresh candidate value), and wraps three
  0..1 voters around it. LSTM = the old RNN cell + three dimmer knobs.


  ## READ Is Not MULTIPLY (the knot)

  This is the single hardest knot in the whole chapter, so it gets its own table. A voter is
  BORN FROM one thing and APPLIED TO another -- and they are different things.

                  READS (to compute its 0..1)        APPLIED TO (what it scales)
    -------------------------------------------------------------------------------
    keep-voter    word32 + memory-B                  memory-A   (the long keep)
    admit-voter   word32 + memory-B                  fresh value
    show-voter    word32 + memory-B                  tanh(new memory-A)
    fresh value   word32 + memory-B                  (it IS the value)

  Born-from is the SAME for all four (word32 + memory-B). Applied-to is DIFFERENT for each.
  The trap is to think a voter reads what it scales -- it does not. The keep-voter never
  looks at memory-A; it looks at today (word + recent), decides "keep 90%", and THEN does
  0.9 x memory-A.

  WHY HAVE MEMORY-A IF NO VOTER READS IT? Because the votes are APPLIED to it. A is the
  treasure -- the long savings. The voters, deciding from today's word and the recent
  memory, reach over and SCALE the treasure (keep 90% of it) and ADD to it (admit some
  fresh value). Without A, nothing carries long; A is the thing the votes act ON, even
  though they are decided from B.
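
  The knot, as a sketch with 3 slots and random, made-up dials (the names Wk, Uk, nudge_k
  follow the KEEP-VOTER formula; the two vaults are invented for illustration). The vote is
  computed WITHOUT memory-A in sight, then applied to whatever A happens to hold.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
UNITS = 3                                    # 3 slots, to keep the printout short

word = rng.normal(size=UNITS)                # today's note (made-up)
memory_b = rng.uniform(-1, 1, size=UNITS)    # recent memory (made-up)
Wk = rng.normal(size=(UNITS, UNITS))         # keep-voter's word-paper
Uk = rng.normal(size=(UNITS, UNITS))         # keep-voter's memory-paper
nudge_k = np.zeros(UNITS)

def keep_vote(word, memory_b):
    # READS: only the word and memory-B. Memory-A is not even an argument.
    return sigmoid(word @ Wk + memory_b @ Uk + nudge_k)

vote = keep_vote(word, memory_b)

# APPLIED TO: memory-A. The same vote scales two different vaults.
vault_1 = np.array([5.0, -2.0, 0.3])
vault_2 = np.array([0.1, 0.1, 0.1])
print("vote:          ", vote)
print("vote x vault_1:", vote * vault_1)
print("vote x vault_2:", vote * vault_2)
```

  Same vote both times; only the thing it scales has changed.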


  ## Loudness Is Not Worth

  The most natural wrong assumption, worth its own section: a LOUDER fresh value (say +0.9,
  near the top of the tanh band) must matter MORE than a quiet one. It does not. The fresh
  value's SIZE and its WORTH are two separate things, decided by two separate squashes:

    fresh value (tanh)   -> the LOUDNESS: how big, which sign, range -1..+1
    admit-voter (sigmoid) -> the WORTH:   how much to let in, range 0..1 (a dimmer knob)

  What lands on the long memory is loudness TIMES worth, slot by slot. Watch two slots:

    LOUD but worthless:  fresh +0.9  x  admit-frac 0.10  =  +0.09   (let in almost nothing)
    QUIET but needed:    fresh +0.2  x  admit-frac 0.95  =  +0.19   (let in nearly all)

  The quiet-but-needed slot contributes MORE than the loud-but-worthless one -- +0.19
  against +0.09 -- even though its raw value is far smaller. The admit-voter looked at
  today's word and the recent memory, judged the loud value irrelevant right now, and choked
  it down to a whisper before it ever reached memory-A. Loud does not buy a seat; the voter
  sells the seat. This is exactly why the fresh value alone (the plain RNN) is not enough --
  it has loudness but no separate judge of worth.
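
  The two slots above, as a two-line check. The numbers are the ones from the table; only
  the variable names are mine.

```python
import numpy as np

fresh = np.array([0.9, 0.2])     # loudness: slot 1 is loud, slot 2 is quiet
admit = np.array([0.10, 0.95])   # worth: slot 1 is choked, slot 2 is waved in

landed = admit * fresh           # pair-by-pair, no summing across slots
print(np.round(landed, 2))

# True when the quiet slot lands more on memory-A than the loud one.
print(landed[1] > landed[0])
```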


  ## The Combine, with Real Numbers

  IN HAND: four machines have produced, for this word: a fresh value, a keep-fraction, an
  admit-fraction, a show-fraction (each a row of 32 numbers). Memory-A and memory-B from
  last word are on the desk. Combine them into the new A and new B.

    new memory-A = keep-frac x memory-A   +   admit-frac x fresh value
                   \_ keep some old long _/    \_ let in some new _/

    new memory-B = show-frac x tanh(new memory-A)

  Every multiply here is PAIR-BY-PAIR (element-wise) -- slot 1 with slot 1, slot 2 with slot
  2, no adding across slots. All 32 slots stay 32 slots. (The only place that ADDS across
  slots is inside each machine's dial-paper, where a row of products is summed to one number.
  The voter-multiplies in the combine do NOT sum.)

  One number first -- the hardest part, proved for 1 slot before all 3:

    old A       = 5.0   (the long keep; never crushed, so it grew big)
    FRESH       = 0.6   (new content, tanh-bounded to -1..+1)
    keep-frac   = 0.9   (vote: hold 90% of A)
    admit-frac  = 0.2   (vote: let in 20% of FRESH)
    show-frac   = 0.7   (vote: speak 70% of the tamed A)

    new A = 0.9 x 5.0  +  0.2 x 0.6  =  4.5 + 0.12  =  4.62
    A went 5.0 -> 4.62. Held almost all of itself (4.5). Let in a touch (0.12).

    tanh(4.62) ~ 1.0   (A grew large; tanh tames it back into -1..+1)
    new B = show x tanh(new A) = 0.7 x 1.0 = 0.70

    A = the silent vault (4.62, uncrushed, can grow big). B = what you say out loud about it
    (0.70). A survives because it was only scaled-and-added, never crushed on the carry.

  All 3 slots, same rule applied element-wise (a real review uses 32):

    memory-A   = [ 5.0 , -2.0 , 0.3 ]     <- note the 5.0: never crushed, so it grew big
    keep-frac  = [ 0.9 ,  0.5 , 1.0 ]
    fresh val  = [ -0.8,  0.6 , 0.2 ]
    admit-frac = [ 0.3 ,  0.0 , 0.9 ]

    new A, slot by slot:
      0.9 x  5.0  +  0.3 x (-0.8) = 4.50 - 0.24 = 4.26
      0.5 x (-2.0) + 0.0 x  0.6   = -1.00 + 0   = -1.00
      1.0 x  0.3   + 0.9 x  0.2   =  0.30 + 0.18 = 0.48
    new memory-A = [ 4.26 , -1.00 , 0.48 ]

  Read slot 2: keep-frac 0.5 halved the old -2.0 to -1.0, and admit-frac 0.0 let in NONE of
  the fresh 0.6. The voter said "this slot: keep half, admit nothing." Now speak it out:

    show-frac  = [ 0.7 , 0.2 , 1.0 ]
    tanh(new A) = [ tanh(4.26), tanh(-1.00), tanh(0.48) ] ~ [ 1.00 , -0.76 , 0.45 ]
    new memory-B:
      0.7 x  1.00 = 0.70
      0.2 x (-0.76) = -0.15
      1.0 x  0.45 = 0.45
    new memory-B = [ 0.70 , -0.15 , 0.45 ]
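
  The same 3-slot combine as a NumPy sketch, for checking the pencil work. The numbers are
  the ones above; the variable names are mine. Note that `*` in NumPy is the pair-by-pair
  multiply, which is what the combine needs (the dial-papers would use `@`).

```python
import numpy as np

memory_a = np.array([5.0, -2.0, 0.3])
keep = np.array([0.9, 0.5, 1.0])
fresh = np.array([-0.8, 0.6, 0.2])
admit = np.array([0.3, 0.0, 0.9])
show = np.array([0.7, 0.2, 1.0])

# Element-wise: slot 1 with slot 1, slot 2 with slot 2, nothing summed across slots.
new_a = keep * memory_a + admit * fresh
new_b = show * np.tanh(new_a)

print(np.round(new_a, 2))
print(np.round(new_b, 2))
```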

  Put new A in pocket A, new B in pocket B, move to the next word, SAME dials. Notice
  memory-A's slot 1 is 4.26 -- well past 1, because it was only scaled and added, never
  crushed. THAT is the escape from the fade: a far-back mark can sit in A at 5.0 and, as
  long as the keep-voter keeps voting near 1, still be there 90 words later.

  Show's dual role: every new B lands in two places.
    1. NEXT WORD -- all four machines read it alongside the next word's note.
    2. LAST WORD -- on word 100, B is the memory the final clerk reads for liked/not.
  B is the LSTM's public face, built anew each word, always read by what comes next.
  (A is never handed to the clerk and never read by any machine -- only used in the combine.)

  And the B at the end is the WHOLE review, not the first word. Walk "nolan ended": B after
  word 1 has seen only "nolan"; B after word 2 has seen "nolan" AND "ended" folded in. The
  final B -- the one the clerk reads -- is the last one, so it carries every word, not just
  the opener. Each B swallows the one before it.
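
  A sketch of the whole walk, to see that the final B really carries word 1. Assumptions,
  all mine: 4 slots instead of 32, random untrained dials, and two random notes standing in
  for "nolan ended". The helper names `make_dials` and `walk` are invented for this post.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_dials(rng, units):
    """Made-up dials: a word-paper W, a memory-paper U, and a nudge, per machine."""
    return {name: (rng.normal(scale=0.5, size=(units, units)),
                   rng.normal(scale=0.5, size=(units, units)),
                   np.zeros(units))
            for name in ("fresh", "keep", "admit", "show")}

def walk(notes, dials):
    """Walk the notes in order with the SAME dials; return the final memory-B."""
    units = notes.shape[1]
    a = np.zeros(units)   # pocket A starts all zeros
    b = np.zeros(units)   # pocket B starts all zeros
    for note in notes:
        pre = {n: note @ W + b @ U + nudge for n, (W, U, nudge) in dials.items()}
        fresh = np.tanh(pre["fresh"])
        keep, admit, show = (sigmoid(pre[n]) for n in ("keep", "admit", "show"))
        a = keep * a + admit * fresh      # scale and add, never crushed on the carry
        b = show * np.tanh(a)             # the spoken memory, fed to the next word
    return b

rng = np.random.default_rng(0)
dials = make_dials(rng, units=4)
review = rng.normal(size=(2, 4))          # two made-up notes

changed = review.copy()
changed[0] = rng.normal(size=4)           # swap only the FIRST word's note

print(walk(review, dials))
print(walk(changed, dials))
```

  Only word 1 changed, yet the final B changes with it: the last B swallowed the first.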

  The dial count for an LSTM worker, by hand: it is the plain worker FOUR TIMES OVER (four
  machines, each with a word-paper, a memory-paper, a nudge):

    one machine:  32x32 + 32x32 + 32  = 2,080
    four machines: 4 x 2,080          = 8,320

  (Keras reports LSTM(32) on a 32-wide note as exactly 8,320. Four little RNN cells.)
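
  The count as a small function, in case the note width or the memory width changes. The
  function name is mine; the formula is the one just worked by hand.

```python
def lstm_dials(note_width: int, units: int) -> int:
    """Dials in one LSTM worker: four machines, each with W, U, and a nudge."""
    one_machine = note_width * units + units * units + units
    return 4 * one_machine

print(lstm_dials(32, 32))   # 8320
```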

    >> YOUR TURN  (one slot, by hand)
       memory-A slot = 4.0, keep-frac = 0.5, fresh value = 0.8, admit-frac = 1.0.
       (The fresh value is a tanh output, so it must sit inside -1..+1.)
       What is the new memory-A slot? Then with show-frac = 1.0 and tanh(new A) ~ 0.99,
       what is the new memory-B slot?

       check your slate:
         new A = 0.5 x 4.0 + 1.0 x 0.8 = 2.0 + 0.8 = 2.8
         tanh(2.8) ~ 0.9926, call it ~0.99
         new B = 1.0 x 0.99 = 0.99
       Kept half the old long memory (2.0) and admitted all the fresh value (0.8).


  ## Build and Train the LSTM (Q5, Q6) -- One Word Changed

  IN HAND: Part 1's factory was Input -> Embedding(notepad) -> SimpleRNN(32) -> Dense(1).
  The LSTM factory is the SAME factory with ONE word swapped: SimpleRNN becomes LSTM.
  Nothing else changes. The notepad is identical; the final clerk is identical.

    floor 2 was:  one worker, one memory, crush-and-rewrite        (SimpleRNN)
    floor 2 now:  one worker, TWO memories, three voters           (LSTM)

  Train it exactly as the plain worker -- 5 loops, grabs of 64, read the exam score. Expect
  the LSTM's exam score to be USUALLY at least the plain worker's, often a little better: it
  can hold far-back clues ("not ... at all") that the plain worker had let fade. That is an
  expectation, not a guarantee; on one short run the plain worker can still edge ahead. The
  whole point of the extra wiring is that one improvement.


  ## Read It Both Ways: the BiLSTM (Q7)

  IN HAND: one LSTM worker walks the review front-to-back, building a memory in which word N
  has heard words 1..N -- everything BEFORE it, nothing after.

  But some clues sit AFTER a word:

    "not boring at all"  ->  the "at all" comes LATER and confirms the flip

  A front-to-back worker reaching "boring" has seen "not" (good) but not yet "at all". The
  fix is one new idea: run TWO LSTM workers.

    worker -> :  walks front-to-back, builds its own final memory
    worker <- :  walks back-to-front, builds its own final memory
    glue both final memories together, hand the doubled memory to the final clerk

  Now the clerk's verdict can lean on clues from BOTH sides: one final memory was built
  reading toward the end of the review, the other reading toward the start. Two separate
  workers, each with its own full dial-set; the glue just lays one final memory beside the
  other.

  Dial count, by hand: two LSTM workers = 2 x 8,320 = 16,640. The glued memory is now 64
  wide (32 from each worker), so the final clerk grows to 64 + 1 = 65 dials. (The notepad is
  unchanged at 320,000.)
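
  The glue and the count, as a sketch. The two "final memories" here are stand-in rows of
  zeros and ones (not real worker output), used only to show the widths; the whole-factory
  total on the last line is my own addition of the three figures above.

```python
import numpy as np

UNITS = 32
forward_final = np.zeros(UNITS)    # stand-in for the front-to-back worker's last memory-B
backward_final = np.ones(UNITS)    # stand-in for the back-to-front worker's last memory-B

glued = np.concatenate([forward_final, backward_final])  # laid side by side, not added
print(glued.shape)                 # (64,)

one_lstm = 4 * (32 * 32 + 32 * 32 + 32)   # 8,320 dials in one LSTM worker
floor_2 = 2 * one_lstm                    # two full dial-sets
clerk = glued.size + 1                    # one dial per glued slot, plus a nudge
notepad = 320_000
print(floor_2, clerk, notepad + floor_2 + clerk)   # 16640 65 336705
```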

  This is the only genuinely new idea in Q7, and it is cheap: run the same worker the other
  way and glue. It is NOT a third kind of machine -- it is two LSTMs facing opposite
  directions.


  ## All Three, Head to Head, Honestly (Q10 preview)

  IN HAND: three factories now exist, all sharing the same notepad and final-clerk shape,
  differing only in floor 2:

    plain worker   (SimpleRNN)        one memory, crush-and-rewrite
    LSTM           (LSTM)             two memories, three voters
    BiLSTM         (Bidirectional)    two LSTMs, both directions, glued

  Grade them on the ONE number that matters -- the share right on the SEALED exam pile
  (val_accuracy), never on the study pile. The honest expectation, not a promise:

    plain worker:   a baseline; forgets far-back, so it caps out lower on long reviews
    LSTM:           usually >= the plain worker; the far-back memory earns its keep
    BiLSTM:         usually the best of the three; both-sides clues help on tricky flips

  The gaps can be small on short reviews (100 words is not very long) and on only 5 loops --
  the same honesty Chapter 9 needed. The ranking is the lesson, not the exact decimals. Pick
  whichever scored highest on the pile you never touched while training.
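
  Picking the winner is one line. A sketch, with a loud warning: the scores below are
  PLACEHOLDERS I invented so the function has something to chew on. They are deliberately
  unrealistic and are NOT results from any run; swap in your own sealed-pile numbers.

```python
def pick_best(val_accuracy: dict) -> str:
    """Return the factory with the highest share right on the sealed pile."""
    return max(val_accuracy, key=val_accuracy.get)

# Placeholder scores, invented only to exercise the function. NOT measured results.
scores = {"plain worker": 0.1, "LSTM": 0.2, "BiLSTM": 0.3}

print(pick_best(scores))
```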

  How the Transformer differs from all three, in one line each:

    LSTM / BiLSTM   WALK word by word. Carry pockets A and B. 4 machines each word. Votes
                    decide what survives in A and what is spoken as B. A word's past lives in
                    A; its future is unknown to a plain LSTM.

    TRANSFORMER     NO walk. NO pocket. ALL words at once. Each word makes three tags (want /
                    have / give); each word checks ALL other words' "have" against its own
                    "want", to decide how much of each "give" to blend in. No sequence, no
                    memory chain, no votes. That is a later chapter.


  ## Honest Note: Engineered, Not Derived

  Worth saying plainly, because it is true and rarely said. The plain worker's fade is a
  real, derivable problem -- you can watch the memory get crushed 90 times. But the EXACT
  LSTM wiring -- two memories, exactly these three voters, this particular combine -- was
  ENGINEERED by trial until it stopped the fade. It was not deduced from one clean principle.

    - Many variants work about equally well.
    - The GRU folds the voters down to two and merges the memories; nearly as good, fewer dials.
    - This precise structure is not sacred -- it is one solution that happened to work.

  So: the PROBLEM (the fade) is a theorem you can prove with a pencil. The SOLUTION'S exact
  shape is a CHOICE among many that work (Charter Law 7). Hold the mechanism -- scale-and-add
  a long memory, gate it with dimmer knobs -- and treat the precise count of voters as a
  historical accident, not a law.


  ## Common Tripwires

  Built from the live derivation -- every confusion actually hit, logged so it stays fixed.

  TRIPWIRE 1  "More is better -- a loud fresh value should count a lot."
    No. A loud value (a big tanh near +1 or -1) can still be deemed worthless: the admit-voter
    can set its fraction near 0 and let almost none of it in. LOUDNESS is the value; WORTH is
    the voter, judged separately.

  TRIPWIRE 2  "The old memory can't be 5.0 -- tanh keeps it inside -1..+1."
    True for the PLAIN worker (crushed every word). FALSE for the LSTM's memory-A, which is
    never crushed on the carry -- only scaled and added -- so it CAN read 5.0. That is the
    whole escape from the fade.

  TRIPWIRE 3  "A voter reads the thing it multiplies."
    No. All four machines READ the same two things (word32 + memory-B). They are APPLIED to
    DIFFERENT targets (keep -> A, admit -> fresh value, show -> tanh(new A)). Born-from same,
    applied-to different.

  TRIPWIRE 4  "If no voter reads memory-A, why keep A?"
    Because the votes are APPLIED to A -- they scale it and add to it. A is the long-term
    treasure the votes act on, even though the voters decide their fractions from B and the
    word. Drop A and nothing carries far.

  TRIPWIRE 5  "Two memories means two copies of the same memory."
    No -- two DIFFERENT memories. A is the long keep (never crushed on the carry); B is the
    spoken recent (show-frac x tanh(A)). Different roles, different numbers.

  TRIPWIRE 6  "The voter-multiply sums across slots, like the dial-papers do."
    No. Two different multiplies. The dial-paper inside a machine SUMS a row of products into
    one number (a dot product). The combine's voter-multiplies are PAIR-BY-PAIR and do NOT
    sum -- all 32 slots survive as 32 slots.

  TRIPWIRE 7  "Add the nudge after the squash."
    No. Inside each machine: word-part + memory-part + nudge FIRST, THEN the one squash
    (tanh or sigmoid) at the end. One crush, last.

  TRIPWIRE 8  "The fresh value is some new invention."
    It is the plain Part 1 worker's recipe, reused unchanged as one of the four machines. The
    LSTM = that old cell + three voters.

  TRIPWIRE 9  "A GPU runs the words in parallel like it runs the matrix multiply."
    No. The dial-paper multiply and the separate reviews run in parallel, but the WORDS must
    be walked IN ORDER -- each word needs the previous word's memory. Breaking that ordered
    chain is exactly what the Transformer (a later chapter) does.

  TRIPWIRE 10  "The LSTM is derived from first principles, so the wiring is forced."
    No. The fade is derivable; the exact wiring was engineered by trial. Variants (e.g. the
    GRU) work about as well. The mechanism is the lesson, not the precise voter count.

  TRIPWIRE 11  "The four machines are two machines -- a word-machine that feeds a B-machine."
    No -- ONE machine, two papers inside. In KEEP-VOTER: sigmoid(word32 x Wk + memoryB x Uk
    + nudge_k), word32 and memoryB are both inputs that ADD inside the same one machine before
    the squash. There is no cascade. Each of the four machines has this one-combine structure,
    with its own W, its own U, its own nudge.

  TRIPWIRE 12  "Memory-B is a matrix."
    No. B is a VECTOR -- 32 numbers, same shape as the word32 embedding. The DIALS are the
    matrices: 8 papers (Wc/Uc, Wk/Uk, Wa/Ua, Ws/Us), each 32x32. A and B are vectors; W and
    U are 32x32 matrices; nudges are 32-wide vectors (one per machine).

  TRIPWIRE 13  "keep = 1 minus admit -- what you don't keep, you admit."
    No. There is NO subtraction here. keep and admit are TWO separate machines, each with its
    own dials, each landing its own 0..1 fraction. keep 0.9 and admit 0.2 can BOTH be high, or
    both low -- the slot can hold 90% of the old AND let in 20% of the new at once. (A cousin,
    the GRU, DOES tie them with a 1-minus; this LSTM does not.)

  TRIPWIRE 14  "The final memory-B is the first word's, or one word's."
    No. Each B folds in the one before it: B after "nolan ended" has BOTH words baked in, not
    just "nolan". The final B the clerk reads is the whole-review summary -- the last B, which
    swallowed every earlier B. It is not tied to any single word.


  ## The Labels, Last

    plain walking worker             SimpleRNN (vanilla recurrent layer)
    the fade / worn-down memory      vanishing gradient over time steps
    long keep, never crushed         cell state (c_t)
    spoken recent memory             hidden state (h_t)
    fresh value (tanh)               candidate / cell input g_t (= the RNN cell)
    keep-voter                       forget gate (f_t)
    admit-voter                      input gate (i_t)
    show-voter                       output gate (o_t)
    dimmer knob 0..1                 sigmoid gate
    scale-and-add the long memory    c_t = f_t * c_{t-1} + i_t * g_t
    speak the long memory            h_t = o_t * tanh(c_t)
    pair-by-pair multiply            element-wise (Hadamard) product
    two-memory worker                LSTM (Long Short-Term Memory)
    simpler cousin, two voters       GRU (Gated Recurrent Unit)
    read both directions, glue       Bidirectional wrapper
    glued forward+back memory        concatenated hidden states (2 x units)
    share right on sealed pile       validation accuracy
    A, B, word-note (vectors)        shape (units,) -- 32 numbers each
    W, U (matrices / papers)         W: (input_dim, units); U: (units, units) -- 32x32 each
    nudge per machine                bias vector, shape (units,)


  ## Code, If You Want It

  Nothing above needed a computer; this section is for the day you meet one.

  Train the plain worker (Q4) -- a routine fit:

  ```python
  # model_rnn, X_train, y_train, X_test, y_test are assumed to carry over from Part 1.
  history_rnn = model_rnn.fit(
      X_train, y_train,
      epochs=5,                          # 5 read-throughs of the study pile
      batch_size=64,                     # a handful of 64 reviews per dial-turn
      validation_data=(X_test, y_test),  # this lab grades on the test pile
  )
  q4_val_acc = round(float(history_rnn.history['val_accuracy'][-1]), 3)  # last loop
  ```

  Build the LSTM (Q5) -- it is the Q3 factory with ONE word changed:

  ```python
  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

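  # MAX_LEN, VOCAB_SIZE, EMBEDDING_DIM are assumed to carry over from Part 1
  # (100 words per review and 32-number notes, per the text above).
  model_lstm = Sequential()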
  model_lstm.add(Input(shape=(MAX_LEN,)))
  model_lstm.add(Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM))
  model_lstm.add(LSTM(32))                       # the ONLY change: LSTM, not SimpleRNN
  model_lstm.add(Dense(1, activation="sigmoid"))
  model_lstm.compile(optimizer="adam",
                     loss="binary_crossentropy", metrics=["accuracy"])
  # model_lstm.summary() shows floor 2 = 8,320 dials -- four little RNN cells.
  ```

  Train the LSTM (Q6) -- same fit, expect val_acc usually >= the RNN's:

  ```python
  history_lstm = model_lstm.fit(
      X_train, y_train,
      epochs=5, batch_size=64,
      validation_data=(X_test, y_test),
  )
  q6_val_acc = round(float(history_lstm.history['val_accuracy'][-1]), 3)
  ```

  The BiLSTM (Q7) -- wrap the LSTM to read both ways:

  ```python
  from tensorflow.keras.layers import Bidirectional

  model_bilstm = Sequential()
  model_bilstm.add(Input(shape=(MAX_LEN,)))
  model_bilstm.add(Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM))
  model_bilstm.add(Bidirectional(LSTM(32)))      # two LSTMs, opposite directions, glued
  model_bilstm.add(Dense(1, activation="sigmoid"))
  model_bilstm.compile(optimizer="adam",
                       loss="binary_crossentropy", metrics=["accuracy"])

  history_bilstm = model_bilstm.fit(X_train, y_train, epochs=5, batch_size=64,
                                    validation_data=(X_test, y_test))
  q7_val_acc = round(float(history_bilstm.history['val_accuracy'][-1]), 3)
  # floor 2 = 16,640 dials (2 x 8,320); the glued memory is 64 wide.
  ```

  Compare all three val-accuracies and pick the best (Q10) -- one number per factory, read
  off the sealed pile. Still ahead, in a later chapter: the Transformer (attention), which
  looks at ALL words at once and kills the ordered word-by-word walk entirely.


----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================