==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------
CHAPTER 10 . MACHINES THAT READ WORDS . PART 2 OF 2
The Two-Memory Worker: How an LSTM Remembers Far-Back Words
Posted: 2026-06-13 . Author: Rahul Rai . Tags: lstm, bilstm, rnn, sequence-models
============================================================================================
Part 1 built a text-factory: a NOTEPAD turns each word into a 32-number note; ONE walking
worker reads the 100 notes IN ORDER, rewriting a 32-number memory after every word; a
final clerk reads the last memory for thumbs up or down. We can now TRAIN it (Q4) -- and
then watch it fail in a specific, fixable way. Over a long review the plain worker FORGETS
the opening words. This post derives the cure by hand: a worker who carries TWO memories
and lets three little voters decide what to keep, what to admit, and what to speak. That
worker is the LSTM. Then one more idea -- read the review BOTH directions (the BiLSTM) --
and an honest, head-to-head comparison of all four machines.
One blunt warning, the same one Chapter 9 earned: the exact wiring of the LSTM below was
ENGINEERED by trial to stop the forgetting, NOT derived from a single clean principle.
Many wirings work about equally; a simpler cousin (the GRU) is nearly as good. I will flag
this again where it matters. We build it by hand anyway, because the mechanism is the
lesson.
Pencil out. Pocket A, pocket B, and a row of clerks.
## Train the Plain Worker (Q4)
IN HAND: the Part 1 factory -- notepad (320,000 dials), one walking worker (2,080), a
final clerk (33), 322,113 dials total. Study pile of reviews, each 100 word-numbers, exam
sealed. To TRAIN is to turn every dial a little toward "less wrong", over and over.
The recipe is Chapter 9's, in words:
5 read-throughs of the study reviews (5 loops / epochs)
64 reviews per grab (a handful / batch)
for each grab: the one worker walks each review's 100 notes, building a memory;
the final clerk guesses liked/not; the average wrongness over the 64
turns EVERY dial a tiny notch toward right
then read the exam score (val_accuracy) after the last loop
Cost, by hand, on a study pile of (say) 40,000 reviews:
grabs per loop = 40,000 / 64 = 625 dial-turns
five loops = 625 x 5 = 3,125 dial-turns total
Each dial-turn walks 64 reviews x 100 words = 6,400 worker-steps just in floor 2, every
one reusing the same dials. The worker is cheap in DIALS but expensive in STEPS, and the
steps must happen IN ORDER -- word 2 needs word 1's memory. Hold that fact; it is the
whole reason for the chapter's bottleneck and the eventual Transformer.
>> YOUR TURN
Study pile of 32,000 reviews, grab size 64, 5 loops. How many dial-turns?
check your slate: 32,000 / 64 = 500 grabs per loop. 500 x 5 = 2,500 dial-turns.
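If you would rather check the count with a machine, here is a minimal sketch. The function name `dial_turns` is mine, not the lab's, and rounding UP for a partial last grab is an assumption (it matches how Keras counts steps by default; the two by-hand piles above happen to divide evenly).

```python
import math

def dial_turns(num_reviews, grab_size, loops):
    """Count dial-turns: one per grab, repeated for every loop."""
    # A leftover partial grab still triggers one dial-turn, hence the ceil.
    grabs_per_loop = math.ceil(num_reviews / grab_size)
    return grabs_per_loop * loops

print(dial_turns(40_000, 64, 5))  # 3125, the worked example
print(dial_turns(32_000, 64, 5))  # 2500, the YOUR TURN pile
```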
## Why the Plain Worker Forgets (the Fade)
IN HAND: the plain worker's one recipe, every word:
new memory = tanh( word x word-dials + old memory x memory-dials + nudge )
Look at what happens to the OLD memory: every single word it is matrix-multiplied AND
crushed through tanh. Crush, multiply, crush, multiply -- 100 times over a 100-word review.
Word 1 was "not". Its mark on the memory is mangled and re-crushed on word 2, again on
word 3, ... and by word 90 that mark has been squeezed toward nothing. Not erased in one
blow -- WORN DOWN over ninety crushings, like a pencil mark rubbed ninety times. So at
word 90 the worker has effectively forgotten the "not" at word 1, and a long review's
early flip is lost.
RNN memory: tanh( W . old + ... ) -> mangled + crushed each word -> FADES
we want: keep x old + add x new -> scaled + added each word -> SURVIVES
The cure, in one line before we build it: stop crushing the long memory as it carries.
Keep a memory that is only SCALED by a fraction and ADDED to -- never crushed on the way
through -- so word 1 can ride all the way to word 90.
(Who found this: Hochreiter and Schmidhuber, 1997; Hochreiter's 1991 thesis is where the
fade was first spotted. The keep-or-forget voter was added by Gers, Schmidhuber and
Cummins in 2000.)
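The two carry rules (crush-and-multiply versus scale-only) can be felt on ONE slot. This is a toy sketch, not the real worker: the memory-dial 0.5 and the keep-fraction 0.99 are numbers I picked for illustration, the incoming words are left at zero so only the carried mark is watched, and a real worker has 32 slots and a full dial-paper.

```python
import math

def plain_carry(mark, memory_dial, steps):
    """Carry one number the plain-worker way: multiply, then crush with tanh."""
    for _ in range(steps):
        mark = math.tanh(memory_dial * mark)
    return mark

def long_keep_carry(mark, keep_frac, steps):
    """Carry one number the memory-A way: only scale by a fraction, no crush."""
    for _ in range(steps):
        mark = keep_frac * mark
    return mark

faded = plain_carry(0.9, memory_dial=0.5, steps=90)
kept = long_keep_carry(0.9, keep_frac=0.99, steps=90)
print(f"plain worker after 90 words: {faded:.2e}")
print(f"long keep after 90 words:    {kept:.3f}")
```

With these toy numbers the crushed mark is indistinguishable from zero, while the scaled-only mark still holds roughly 40% of itself. With a memory-dial above 1 the crushed number saturates instead of shrinking; this sketch shows only the shrinking case.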
## The Two Memories
The fix carries TWO memories instead of one. Two pant pockets, A and B, both starting all
zeros:
memory-A = the LONG KEEP. Used as-is: only SCALED by a fraction and ADDED to. NEVER
fed through a squash on the carry. Because it is never crushed, it CAN grow
past 1 (it might read 5.0) and far-back marks survive in it.
memory-B = the SPOKEN, recent memory. THIS is what gets fed into the little machines
next word. It IS crushed -- it is a show-fraction times tanh(A).
Keep the two straight: A is the savings account (long, untouched on the carry); B is the
cash in hand (what gets shown around and fed back in). The plain worker had only one
pocket, and crushing it every word is what made it forget.
## The Four Little Machines (all read the SAME two things)
IN HAND: this word's note (call it the word32), plus memory-B from last word. Memory-A is
NOT read by any machine -- hold that thought.
Four little machines run side by side. EVERY one of them reads the same two raw things --
the word32 and memory-B -- and nothing else (not A, not each other, not the new memory).
Each has its OWN two dial-papers (one for the word32, one for memory-B) and its own nudge.
All four are computed in parallel, then combined.
FRESH VALUE: tanh( word32 x Wc + memoryB x Uc + nudge_c ) -> -1..+1
KEEP-VOTER: sigmoid( word32 x Wk + memoryB x Uk + nudge_k ) -> 0..1
ADMIT-VOTER: sigmoid( word32 x Wa + memoryB x Ua + nudge_a ) -> 0..1
SHOW-VOTER: sigmoid( word32 x Ws + memoryB x Us + nudge_s ) -> 0..1
Two squashes in play, both from earlier chapters:
tanh crushes to -1..+1 (a VALUE, can be positive or negative)
sigmoid crushes to 0..1 (a FRACTION -- "how much", a dimmer knob from 0% to 100%)
ONE MACHINE -- TWO PAPERS INSIDE. Pick any formula above, say KEEP-VOTER:
sigmoid( word32 x Wk + memoryB x Uk + nudge_k )
word32 goes through paper Wk (a 32x32 matrix of dials); memory-B goes through paper Uk (a
separate 32x32 matrix). Their two outputs ADD, the nudge joins, THEN the one squash. This
is NOT a "word-machine" whose output feeds a "memory-machine" -- both papers live INSIDE the
same one machine, and the combine happens BEFORE the squash. All four machines follow this
exact shape, each with its own W, its own U, its own nudge.
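For readers who like to see the shape run, here is one machine as a small sketch in plain Python. It uses 2 slots instead of 32, and every dial value below is made up for illustration; the helper names `paper` and `machine` are mine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def paper(vec, matrix):
    """Dial-paper multiply: vec (length n) times matrix (n rows, m columns)."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

def machine(word, memory_b, W, U, nudge, squash):
    """One little machine: two papers ADD, the nudge joins, then ONE squash."""
    from_word = paper(word, W)
    from_memory = paper(memory_b, U)
    return [squash(a + b + n) for a, b, n in zip(from_word, from_memory, nudge)]

# Made-up inputs and dials, 2 slots wide.
word = [1.0, -1.0]
memory_b = [0.5, 0.0]
Wk = [[0.2, -0.4], [0.1, 0.3]]
Uk = [[1.0, 0.0], [0.0, 1.0]]
nudge_k = [0.0, 0.1]

keep_frac = machine(word, memory_b, Wk, Uk, nudge_k, sigmoid)
print(keep_frac)  # two fractions, each between 0 and 1
```

Swap `sigmoid` for `math.tanh` (and the K dials for the C dials) and the same function is the FRESH VALUE machine.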
Here is the quiet punchline: the FRESH VALUE machine -- tanh of (word + old recent + nudge)
-- IS the plain worker's recipe from Part 1, unchanged. The LSTM does not throw the old
worker away. It keeps it as ONE of four parts (the fresh candidate value), and wraps three
0..1 voters around it. LSTM = the old RNN cell + three dimmer knobs.
## READ Is Not MULTIPLY (the knot)
This is the single hardest knot in the whole chapter, so it gets its own table. A voter is
BORN FROM one thing and APPLIED TO another -- and they are different things.
              READS (to compute its 0..1)   APPLIED TO (what it scales)
-------------------------------------------------------------------------------
keep-voter    word32 + memory-B             memory-A (the long keep)
admit-voter   word32 + memory-B             fresh value
show-voter    word32 + memory-B             tanh(new memory-A)
fresh value   word32 + memory-B             (it IS the value)
Born-from is the SAME for all four (word32 + memory-B). Applied-to is DIFFERENT for each.
The trap is to think a voter reads what it scales -- it does not. The keep-voter never
looks at memory-A; it looks at today (word + recent), decides "keep 90%", and THEN does
0.9 x memory-A.
WHY HAVE MEMORY-A IF NO VOTER READS IT? Because the votes are APPLIED to it. A is the
treasure -- the long savings. The voters, deciding from today's word and the recent
memory, reach over and SCALE the treasure (keep 90% of it) and ADD to it (admit some
fresh value). Without A, nothing carries long; A is the thing the votes act ON, even
though they are decided from B.
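The born-from / applied-to split can be checked directly: change memory-A and the keep vote should not move, yet the new memory-A should. A hedged sketch, 2 slots wide, with made-up dials; to keep it short all four machines share one dial-set here, which a real LSTM does NOT do (each machine has its own). The function names are mine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def machine(word, memory_b, dials, squash):
    """squash(word x W + memory_b x U + nudge). Memory-A is not even a parameter."""
    W, U, nudge = dials
    out = []
    for j in range(len(nudge)):
        total = nudge[j]
        total += sum(x * W[i][j] for i, x in enumerate(word))
        total += sum(b * U[i][j] for i, b in enumerate(memory_b))
        out.append(squash(total))
    return out

def lstm_step(word, memory_a, memory_b, dials):
    # BORN FROM: every machine reads only (word, memory_b).
    fresh = machine(word, memory_b, dials["fresh"], math.tanh)
    keep = machine(word, memory_b, dials["keep"], sigmoid)
    admit = machine(word, memory_b, dials["admit"], sigmoid)
    show = machine(word, memory_b, dials["show"], sigmoid)
    # APPLIED TO: keep -> old A, admit -> fresh, show -> tanh(new A), slot by slot.
    new_a = [k * a + ad * f for k, a, ad, f in zip(keep, memory_a, admit, fresh)]
    new_b = [s * math.tanh(a) for s, a in zip(show, new_a)]
    return new_a, new_b, keep  # keep is returned only so it can be inspected

# Made-up dials (W, U, nudge), shared by all four machines for brevity.
toy = ([[0.1, -0.2], [0.3, 0.4]], [[0.5, 0.0], [0.0, -0.5]], [0.0, 0.1])
dials = {name: toy for name in ("fresh", "keep", "admit", "show")}
word, memory_b = [1.0, -1.0], [0.2, 0.4]

a1, _, keep1 = lstm_step(word, [0.3, -0.1], memory_b, dials)
a2, _, keep2 = lstm_step(word, [5.0, -2.0], memory_b, dials)
print(keep1 == keep2)  # True: the vote never looked at memory-A
print(a1 == a2)        # False: but the vote was applied to memory-A
```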
## Loudness Is Not Worth
The most natural wrong assumption, worth its own section: a LOUDER fresh value (say +0.9,
near the top of the tanh band) must matter MORE than a quiet one. It does not. The fresh
value's SIZE and its WORTH are two separate things, decided by two separate squashes:
fresh value (tanh) -> the LOUDNESS: how big, which sign, range -1..+1
admit-voter (sigmoid) -> the WORTH: how much to let in, range 0..1 (a dimmer knob)
What lands on the long memory is loudness TIMES worth, slot by slot. Watch two slots:
LOUD but worthless: fresh +0.9 x admit-frac 0.10 = +0.09 (let in almost nothing)
QUIET but needed: fresh +0.2 x admit-frac 0.95 = +0.19 (let in nearly all)
The quiet-but-needed slot contributes MORE than the loud-but-worthless one -- +0.19
against +0.09 -- even though its raw value is far smaller. The admit-voter looked at
today's word and the recent memory, judged the loud value irrelevant right now, and choked
it down to a whisper before it ever reached memory-A. Loud does not buy a seat; the voter
sells the seat. This is exactly why the fresh value alone (the plain RNN) is not enough --
it has loudness but no separate judge of worth.
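The same two slots in code, for anyone who wants to poke at the numbers. Nothing new here beyond the element-wise multiply; the variable names are mine.

```python
fresh_value = [0.9, 0.2]    # loudness, out of tanh
admit_frac = [0.10, 0.95]   # worth, out of sigmoid

# Slot by slot: what actually reaches memory-A is loudness TIMES worth.
admitted = [f * a for f, a in zip(fresh_value, admit_frac)]
print([round(x, 2) for x in admitted])  # [0.09, 0.19]
```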
## The Combine, with Real Numbers
IN HAND: four machines have produced, for this word: a fresh value, a keep-fraction, an
admit-fraction, a show-fraction (each a row of 32 numbers). Memory-A and memory-B from
last word are on the desk. Combine them into the new A and new B.
new memory-A =  keep-frac x memory-A    +    admit-frac x fresh value
                \_ keep some old long _/     \_ let in some new _/
new memory-B =  show-frac x tanh(new memory-A)
Every multiply here is PAIR-BY-PAIR (element-wise) -- slot 1 with slot 1, slot 2 with slot
2, no adding across slots. All 32 slots stay 32 slots. (The only place that ADDS across
slots is inside each machine's dial-paper, where a row of products is summed to one number.
The voter-multiplies in the combine do NOT sum.)
One number first -- the hardest part, proved for 1 slot before all 3:
old A = 5.0 (the long keep; never crushed, so it grew big)
FRESH = 0.6 (new content, tanh-bounded to -1..+1)
keep-frac = 0.9 (vote: hold 90% of A)
admit-frac = 0.2 (vote: let in 20% of FRESH)
show-frac = 0.7 (vote: speak 70% of the tamed A)
new A = 0.9 x 5.0 + 0.2 x 0.6 = 4.5 + 0.12 = 4.62
A went 5.0 -> 4.62. Held almost all of itself (4.5). Let in a touch (0.12).
tanh(4.62) ~ 1.0 (A grew large; tanh tames it back into -1..+1)
new B = show x tanh(new A) = 0.7 x 1.0 = 0.70
A = the silent vault (4.62, uncrushed, can grow big). B = what you say out loud about it
(0.70). A survives because it was only scaled-and-added, never crushed on the carry.
All 3 slots, same rule applied element-wise (a real review uses 32):
memory-A = [ 5.0 , -2.0 , 0.3 ] <- note the 5.0: never crushed, so it grew big
keep-frac = [ 0.9 , 0.5 , 1.0 ]
fresh val = [ -0.8, 0.6 , 0.2 ]
admit-frac = [ 0.3 , 0.0 , 0.9 ]
new A, slot by slot:
0.9 x 5.0 + 0.3 x (-0.8) = 4.50 - 0.24 = 4.26
0.5 x (-2.0) + 0.0 x 0.6 = -1.00 + 0 = -1.00
1.0 x 0.3 + 0.9 x 0.2 = 0.30 + 0.18 = 0.48
new memory-A = [ 4.26 , -1.00 , 0.48 ]
Read slot 2: keep-frac 0.5 halved the old -2.0 to -1.0, and admit-frac 0.0 let in NONE of
the fresh 0.6. The voter said "this slot: keep half, admit nothing." Now speak it out:
show-frac = [ 0.7 , 0.2 , 1.0 ]
tanh(new A) = [ tanh(4.26), tanh(-1.00), tanh(0.48) ] ~ [ 1.00 , -0.76 , 0.45 ]
new memory-B:
0.7 x 1.00 = 0.70
0.2 x (-0.76) = -0.15
1.0 x 0.45 = 0.45
new memory-B = [ 0.70 , -0.15 , 0.45 ]
Put new A in pocket A, new B in pocket B, move to the next word, SAME dials. Notice
memory-A's slot 1 is 4.26 -- well past 1, because it was only scaled and added, never crushed.
THAT is the escape from the fade: a far-back mark can sit in A at 5.0 and still be there
90 words later.
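The 3-slot walk above, redone by machine as a check on the pencil work. A minimal sketch: the function name `combine` is mine, and the four voter rows are simply typed in (in a real step they come out of the four machines).

```python
import math

def combine(memory_a, fresh, keep_frac, admit_frac, show_frac):
    """Element-wise LSTM combine: returns (new memory-A, new memory-B)."""
    new_a = [k * a + ad * f
             for k, a, ad, f in zip(keep_frac, memory_a, admit_frac, fresh)]
    new_b = [s * math.tanh(a) for s, a in zip(show_frac, new_a)]
    return new_a, new_b

new_a, new_b = combine(
    memory_a=[5.0, -2.0, 0.3],
    fresh=[-0.8, 0.6, 0.2],
    keep_frac=[0.9, 0.5, 1.0],
    admit_frac=[0.3, 0.0, 0.9],
    show_frac=[0.7, 0.2, 1.0],
)
print([round(v, 2) for v in new_a])  # [4.26, -1.0, 0.48]
print([round(v, 2) for v in new_b])  # [0.7, -0.15, 0.45]
```

No sum appears anywhere in `combine`: three slots go in, three slots come out.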
Show's dual role: every new B lands in two places.
1. NEXT WORD -- all four machines read it alongside the next word's note.
2. LAST WORD -- on word 100, B is the memory the final clerk reads for liked/not.
B is the LSTM's public face, built anew each word, always read by what comes next.
(A is never handed to the clerk and never read by any machine -- only used in the combine.)
And the B at the end is the WHOLE review, not the first word. Walk "nolan ended": B after
word 1 has seen only "nolan"; B after word 2 has seen "nolan" AND "ended" folded in. The
final B -- the one the clerk reads -- is the last one, so it carries every word, not just
the opener. Each B swallows the one before it.
The dial count for an LSTM worker, by hand: it is the plain worker FOUR TIMES OVER (four
machines, each with a word-paper, a memory-paper, a nudge):
one machine: 32x32 + 32x32 + 32 = 2,080
four machines: 4 x 2,080 = 8,320
(Keras reports LSTM(32) on a 32-wide note as exactly 8,320. Four little RNN cells.)
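The dial count as two small functions, so other widths can be tried. A sketch only; the function names are mine, not Keras's.

```python
def simple_rnn_dials(note_width, memory_width):
    """One machine: word-paper + memory-paper + nudge."""
    return note_width * memory_width + memory_width * memory_width + memory_width

def lstm_dials(note_width, memory_width):
    """Four machines of the same shape."""
    return 4 * simple_rnn_dials(note_width, memory_width)

print(simple_rnn_dials(32, 32))  # 2080
print(lstm_dials(32, 32))        # 8320
```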
>> YOUR TURN (one slot, by hand)
memory-A slot = 4.0, keep-frac = 0.5, fresh value = 0.8, admit-frac = 1.0.
What is the new memory-A slot? Then with show-frac = 1.0 and tanh(new A) ~ 0.99
(new A is well past 1, where tanh is nearly flat), what is the new memory-B slot?
check your slate:
new A = 0.5 x 4.0 + 1.0 x 0.8 = 2.0 + 0.8 = 2.8
tanh(2.8) ~ 0.9926, call it ~0.99
new B = 1.0 x 0.99 = 0.99
Kept half the old long memory (2.0) and admitted all the fresh value (0.8). The fresh
value itself stays inside -1..+1 because it comes out of a tanh.
## Build and Train the LSTM (Q5, Q6) -- One Word Changed
IN HAND: Part 1's factory was Input -> Embedding(notepad) -> SimpleRNN(32) -> Dense(1).
The LSTM factory is the SAME factory with ONE word swapped: SimpleRNN becomes LSTM.
Nothing else changes. The notepad is identical; the final clerk is identical.
floor 2 was: one worker, one memory, crush-and-rewrite (SimpleRNN)
floor 2 now: one worker, TWO memories, three voters (LSTM)
Train it exactly as the plain worker -- 5 loops, grabs of 64, read the exam score. Expect
the LSTM's exam score to USUALLY match or beat the plain worker's, often by a little: it
remembers far-back clues ("not ... at all") that the plain worker had let fade. That one
improvement is the whole point of the extra wiring.
## Read It Both Ways: the BiLSTM (Q7)
IN HAND: one LSTM worker walks the review front-to-back, building a memory in which word N
has heard words 1..N -- everything BEFORE it, nothing after.
But some clues sit AFTER a word:
"not boring at all" -> the "at all" comes LATER and confirms the flip
A front-to-back worker reaching "boring" has seen "not" (good) but not yet "at all". The
fix is one new idea: run TWO LSTM workers.
worker -> : walks front-to-back, builds its own final memory
worker <- : walks back-to-front, builds its own final memory
glue both final memories together, hand the doubled memory to the final clerk
Now a word's verdict can lean on clues from BOTH sides -- the words before it AND the words
after it. Two separate workers, each with its own full dial-set; the glue just lays one
final memory beside the other.
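The walk-both-ways-and-glue idea, stripped to its skeleton. This is a toy, not an LSTM: `walk` is a stand-in worker with no dials at all (I invented its update rule just to have something order-sensitive), and in a real BiLSTM each direction has its own full dial-set.

```python
import math

def walk(notes, width):
    """Toy stand-in for one worker: fold the notes, in order, into a memory."""
    memory = [0.0] * width
    for note in notes:
        memory = [math.tanh(0.5 * m + n) for m, n in zip(memory, note)]
    return memory

def read_both_ways(notes, width):
    forward = walk(notes, width)                   # finishes on the last word
    backward = walk(list(reversed(notes)), width)  # finishes on the first word
    return forward + backward                      # glue = lay one beside the other

notes = [[1.0, 0.0], [0.0, -1.0], [0.5, 0.5]]  # three made-up 2-number notes
glued = read_both_ways(notes, width=2)
print(len(glued))  # 4: twice the width of one worker's memory
```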
Dial count, by hand: two LSTM workers = 2 x 8,320 = 16,640. The glued memory is now 64
wide (32 from each worker), so the final clerk grows to 64 + 1 = 65 dials. (The notepad is
unchanged at 320,000.)
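Adding the three floors up for the whole BiLSTM factory, as a quick check. A sketch: the names are mine, and the grand total is just the sum of the three numbers already stated in the text.

```python
def lstm_dials(note_width, memory_width):
    """Four machines, each with a word-paper, a memory-paper and a nudge."""
    return 4 * (note_width * memory_width + memory_width * memory_width + memory_width)

NOTE_WIDTH = 32
MEMORY_WIDTH = 32
NOTEPAD_DIALS = 320_000  # from Part 1, unchanged

floor_2 = 2 * lstm_dials(NOTE_WIDTH, MEMORY_WIDTH)  # two workers, separate dials
glued_width = 2 * MEMORY_WIDTH                      # 32 from each worker
final_clerk = glued_width + 1                       # one dial per slot + one nudge

print(floor_2, final_clerk, NOTEPAD_DIALS + floor_2 + final_clerk)  # 16640 65 336705
```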
This is the only genuinely new idea in Q7, and it is cheap: run the same worker the other
way and glue. It is NOT a third kind of machine -- it is two LSTMs facing opposite
directions.
## All Four, Head to Head, Honestly (Q10 preview)
IN HAND: four factories now exist, all sharing the same notepad and final-clerk shape,
differing only in floor 2:
plain worker (SimpleRNN) one memory, crush-and-rewrite
LSTM (LSTM) two memories, three voters
BiLSTM (Bidirectional) two LSTMs, both directions, glued
Grade them on the ONE number that matters -- the share right on the SEALED exam pile
(val_accuracy), never on the study pile. The honest expectation, not a promise:
plain worker: a baseline; forgets far-back, so it caps out lower on long reviews
LSTM: usually >= the plain worker; the far-back memory earns its keep
BiLSTM: usually the best of the three; both-sides clues help on tricky flips
The gaps can be small on short reviews (100 words is not very long) and on only 5 loops --
the same honesty Chapter 9 needed. The ranking is the lesson, not the exact decimals. Pick
whichever scored highest on the pile you never touched while training.
How the Transformer differs from all three, in one line each:
LSTM / BiLSTM   WALK word by word. Carry pockets A and B. 4 machines each word. Votes
                decide what survives in A and what is spoken as B. A word's past lives in
                A; its future is unknown to a plain LSTM.
TRANSFORMER     NO walk. NO pocket. ALL words at once. Each word makes three tags (want /
                have / give); each word checks ALL other words' "have" against its own
                "want", to decide how much of each "give" to blend in. No sequence, no
                memory chain, no votes. That is a later chapter.
## Honest Note: Engineered, Not Derived
Worth saying plainly, because it is true and rarely said. The plain worker's fade is a
real, derivable problem -- you can watch the memory get crushed 90 times. But the EXACT
LSTM wiring -- two memories, exactly these three voters, this particular combine -- was
ENGINEERED by trial until it stopped the fade. It was not deduced from one clean principle.
- Many variants work about equally well.
- The GRU folds the voters down to two and merges the memories; nearly as good, fewer dials.
- This precise structure is not sacred -- it is one solution that happened to work.
So: the PROBLEM (the fade) is a theorem you can prove with a pencil. The SOLUTION'S exact
shape is a CHOICE among many that work (Charter Law 7). Hold the mechanism -- scale-and-add
a long memory, gate it with dimmer knobs -- and treat the precise count of voters as a
historical accident, not a law.
## Common Tripwires
Built from the live derivation -- every confusion actually hit, logged so it stays fixed.
TRIPWIRE 1 "More is better -- a loud fresh value should count a lot."
No. A loud value (a big tanh near +1 or -1) can still be deemed worthless: the admit-voter
can set its fraction near 0 and let almost none of it in. LOUDNESS is the value; WORTH is
the voter, judged separately.
TRIPWIRE 2 "The old memory can't be 5.0 -- tanh keeps it inside -1..+1."
True for the PLAIN worker (crushed every word). FALSE for the LSTM's memory-A, which is
never crushed on the carry -- only scaled and added -- so it CAN read 5.0. That is the
whole escape from the fade.
TRIPWIRE 3 "A voter reads the thing it multiplies."
No. All four machines READ the same two things (word32 + memory-B). They are APPLIED to
DIFFERENT targets (keep -> A, admit -> fresh value, show -> tanh(new A)). Born-from same,
applied-to different.
TRIPWIRE 4 "If no voter reads memory-A, why keep A?"
Because the votes are APPLIED to A -- they scale it and add to it. A is the long-term
treasure the votes act on, even though the voters decide their fractions from B and the
word. Drop A and nothing carries far.
TRIPWIRE 5 "Two memories means two copies of the same memory."
No -- two DIFFERENT memories. A is the long keep (never crushed on the carry); B is the
spoken recent (show-frac x tanh(A)). Different roles, different numbers.
TRIPWIRE 6 "The voter-multiply sums across slots, like the dial-papers do."
No. Two different multiplies. The dial-paper inside a machine SUMS a row of products into
one number (a dot product). The combine's voter-multiplies are PAIR-BY-PAIR and do NOT
sum -- all 32 slots survive as 32 slots.
TRIPWIRE 7 "Add the nudge after the squash."
No. Inside each machine: word-part + memory-part + nudge FIRST, THEN the one squash
(tanh or sigmoid) at the end. One crush, last.
TRIPWIRE 8 "The fresh value is some new invention."
It is the plain Part 1 worker's recipe, reused unchanged as one of the four machines. The
LSTM = that old cell + three voters.
TRIPWIRE 9 "A GPU runs the words in parallel like it runs the matrix multiply."
No. The dial-paper multiply and the separate reviews run in parallel, but the WORDS must
be walked IN ORDER -- each word needs the previous word's memory. Breaking that ordered
chain is exactly what the Transformer (a later chapter) does.
TRIPWIRE 10 "The LSTM is derived from first principles, so the wiring is forced."
No. The fade is derivable; the exact wiring was engineered by trial. Variants (e.g. the
GRU) work about as well. The mechanism is the lesson, not the precise voter count.
TRIPWIRE 11 "The four machines are two machines -- a word-machine that feeds a B-machine."
No -- ONE machine, two papers inside. In KEEP-VOTER: sigmoid(word32 x Wk + memoryB x Uk
+ nudge_k), word32 and memoryB are both inputs that ADD inside the same one machine before
the squash. There is no cascade. Each of the four machines has this one-combine structure,
with its own W, its own U, its own nudge.
TRIPWIRE 12 "Memory-B is a matrix."
No. B is a VECTOR -- 32 numbers, same shape as the word32 embedding. The DIALS are the
matrices: 8 papers (Wc/Uc, Wk/Uk, Wa/Ua, Ws/Us), each 32x32. A and B are vectors; W and
U are 32x32 matrices; nudges are 32-wide vectors (one per machine).
TRIPWIRE 13 "keep = 1 minus admit -- what you don't keep, you admit."
No. There is NO subtraction here. keep and admit are TWO separate machines, each with its
own dials, each landing its own 0..1 fraction. keep 0.9 and admit 0.2 can BOTH be high, or
both low -- the slot can hold 90% of the old AND let in 20% of the new at once. (A cousin,
the GRU, DOES tie them with a 1-minus; this LSTM does not.)
TRIPWIRE 14 "The final memory-B is the first word's, or one word's."
No. Each B folds in the one before it: B after "nolan ended" has BOTH words baked in, not
just "nolan". The final B the clerk reads is the whole-review summary -- the last B, which
swallowed every earlier B. It is not tied to any single word.
## The Labels, Last
plain walking worker             SimpleRNN (vanilla recurrent layer)
the fade / worn-down memory      vanishing gradient over time steps
long keep, never crushed         cell state (c_t)
spoken recent memory             hidden state (h_t)
fresh value (tanh)               candidate / cell input g_t (= the RNN cell)
keep-voter                       forget gate (f_t)
admit-voter                      input gate (i_t)
show-voter                       output gate (o_t)
dimmer knob 0..1                 sigmoid gate
scale-and-add the long memory    c_t = f_t * c_{t-1} + i_t * g_t
speak the long memory            h_t = o_t * tanh(c_t)
pair-by-pair multiply            element-wise (Hadamard) product
two-memory worker                LSTM (Long Short-Term Memory)
simpler cousin, two voters       GRU (Gated Recurrent Unit)
read both directions, glue       Bidirectional wrapper
glued forward+back memory        concatenated hidden states (2 x units)
share right on sealed pile       validation accuracy
A, B, word-note (vectors)        shape (units,) -- 32 numbers each
W, U (matrices / papers)         W: shape (input_dim, units); U: shape (units, units) -- 32x32 each here
nudge per machine                bias vector, shape (units,)
## Code, If You Want It
Nothing above needed a computer; this section is for the day you meet one.
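Train the plain worker (Q4) -- a routine fit. This assumes model_rnn and the X_train /
y_train / X_test / y_test arrays already exist from Part 1: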
```python
history_rnn = model_rnn.fit(
X_train, y_train,
epochs=5, # 5 read-throughs of the study pile
batch_size=64, # a handful of 64 reviews per dial-turn
validation_data=(X_test, y_test), # this lab grades on the test pile
)
q4_val_acc = round(float(history_rnn.history['val_accuracy'][-1]), 3) # last loop
```
Build the LSTM (Q5) -- it is the Q3 factory with ONE word changed:
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
model_lstm = Sequential()
model_lstm.add(Input(shape=(MAX_LEN,)))
model_lstm.add(Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM))
model_lstm.add(LSTM(32)) # the ONLY change: LSTM, not SimpleRNN
model_lstm.add(Dense(1, activation="sigmoid"))
model_lstm.compile(optimizer="adam",
loss="binary_crossentropy", metrics=["accuracy"])
# model_lstm.summary() shows floor 2 = 8,320 dials -- four little RNN cells.
```
Train the LSTM (Q6) -- same fit, expect val_acc usually >= the RNN's:
```python
history_lstm = model_lstm.fit(
X_train, y_train,
epochs=5, batch_size=64,
validation_data=(X_test, y_test),
)
q6_val_acc = round(float(history_lstm.history['val_accuracy'][-1]), 3)
```
The BiLSTM (Q7) -- wrap the LSTM to read both ways:
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dense
model_bilstm = Sequential()
model_bilstm.add(Input(shape=(MAX_LEN,)))
model_bilstm.add(Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM))
model_bilstm.add(Bidirectional(LSTM(32))) # two LSTMs, opposite directions, glued
model_bilstm.add(Dense(1, activation="sigmoid"))
model_bilstm.compile(optimizer="adam",
loss="binary_crossentropy", metrics=["accuracy"])
history_bilstm = model_bilstm.fit(X_train, y_train, epochs=5, batch_size=64,
validation_data=(X_test, y_test))
q7_val_acc = round(float(history_bilstm.history['val_accuracy'][-1]), 3)
# floor 2 = 16,640 dials (2 x 8,320); the glued memory is 64 wide.
```
Compare all four val-accuracies and pick the best (Q10) -- one number per factory, read
off the sealed pile. Still ahead, in a later chapter: the Transformer (attention), which
looks at ALL words at once and kills the ordered word-by-word walk entirely.
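Picking the winner is one line of Python once the scores exist. A small sketch: `pick_best` is a helper name I made up, and it assumes you have gathered each factory's final val-accuracy into a dict yourself.

```python
def pick_best(val_accuracies):
    """Return (name, score) for the factory with the highest sealed-pile score."""
    name = max(val_accuracies, key=val_accuracies.get)
    return name, val_accuracies[name]

# Usage, once the fits above have run (add any other factory's score the same way):
# pick_best({"SimpleRNN": q4_val_acc, "LSTM": q6_val_acc, "BiLSTM": q7_val_acc})
```

On a tie, `max` returns the first of the tied names in the dict's order.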
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 10 -- Machines That Read Words):
Part 1 -- Words Into a Machine .
Part 2 (this post)
<- Back to all posts
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================