==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------
APPENDIX . LSTM FROM PENCIL
Built From Nothing, One Number at a Time
Posted: 2026-06-14 . Author: Rahul Rai . Tags: lstm, rnn, sequence-models, by-hand
============================================================================================
PATH . APPENDIX -- LSTM From Pencil (standalone; needs no other page)
Goal: read a movie review -- plain words -- and stamp it liked (1) or not (0). Kit: a
pencil, graph paper, a few coat pockets. No computer, at any point.
How to read this page. Nothing is named before a number builds it. Every time a term first
shows up, a calculation right above it earned it. At each ">> YOUR TURN", cover everything
below that line, do it on your own paper, THEN look. If your answer fights a "WRONG TURN"
box, sit with that clash before reading on -- a clash you feel once you never forget.
We build a clerk, then words as numbers, then watch order force a memory, then a one-memory
reader and watch it FORGET, then a two-memory reader that does not. That last reader carries
a name only at its very end: LSTM.
## One Axiom: What a Clerk Does
A clerk does one job. Take some numbers in. Multiply each by a frozen dial. Add results.
Add one more fixed number, a nudge. Emit a single number out. That is all a clerk can do.
in dial add nudge out
a --------->[ x p ]\
>------( + )-----[ + b ]----> one number
c --------->[ x q ]/
Out = a*p + c*q + b. Dials p, q and nudge b stay frozen while reading; they change only
between training runs, which this page never does. Everything past here is clerks, nothing
more.
>> YOUR TURN
Dials p = 2, q = -1, nudge b = 3. Inputs a = 4, c = 5. Out?
check: 2*4 + (-1)*5 + 3 = 8 - 5 + 3 = 6.
## Words Are Not Numbers, So Hand Each Word a Row
A clerk multiplies numbers. Word "boring" is not a number. So before any clerk runs, swap
each word for a fixed row of numbers, copied off a word-book -- one page per word:
"not" --look up--> page 4017
[ 0.0 , 0.9 , -0.2 ] <- a row of numbers
Why a ROW and not a single number? Test word "not". On its own it leans neither liked nor
not -- its whole job is to FLIP whatever word comes next. A single number must pick one duty:
it can say "lean +0.3" OR "I am a flipper", never both. Two duties need two slots. So a word
gets several slots. This page uses 3 slots to fit a line; a real desk uses 32. Width is a
pick, not a law.
one number per word: can hold "lean", OR "flipper" -- not both (too thin)
a row of slots: slot1 = lean, slot2 = flipper, slot3 = ... (room for both)
## One Clerk Scores One Word, But Loses Order
Hand a clerk a word-row, let it emit one score. Fine for one word. Now feed a whole review
by adding up each word's row, slot by slot, and scoring that sum. Watch two reviews:
"not good" -> row(not) + row(good)
"good not" -> row(good) + row(not)
>> YOUR TURN
Addition does not care about order: x + y equals y + x. Do both sums land on one
same row?
check: yes -- identical. row(not)+row(good) = row(good)+row(not).
WRONG TURN "Fine, identical -- both reviews mean one same thing."
Clash: "not good" is a complaint; "good not" is noise a person would not even write. Same
numbers, opposite sense. A plain sum binned order, and order carried meaning.
So a plain sum cannot be right. Something must remember word 1 while reading word 2.
That memory is forced on us. Build it next.
## A Carried Memory, Rewritten Each Word
Keep a pocket holding a row of numbers -- a memory. Start it at all zeros. Read words left
to right. After each word, rewrite that memory from two things only: this word's row, and
memory as it stood before this word.
word_t (row)
|
v
m_old +--> [ blend this word with m_old ] --> m_new
^ |
| |
+------------------- carry ---------------------+
One pocket, looping. Word 2 sees word 1 through m_old; word 3 sees both; and so on. Order
now survives, because m_old at word 2 already holds a trace of word 1.
How does a clerk "blend"? Each output slot is one clerk reading every input slot. So 3
output slots = 3 clerks side by side. Stack each clerk's row of dials, one row per clerk,
into a 3-by-3 block. That stacked block earns a short name: a grid. Same for memory:
grid G (3 rows, 3 cols) a row r
[ g11 g12 g13 ] [ r1 ]
[ g21 g22 g23 ] applied to [ r2 ] gives 3 numbers
[ g31 g32 g33 ] [ r3 ]
out slot 1 = g11*r1 + g12*r2 + g13*r3 <- row 1 of G, multiplied into r, summed
out slot 2 = g21*r1 + g22*r2 + g23*r3
out slot 3 = g31*r1 + g32*r2 + g33*r3
That summing of paired products -- row of G against r -- is a dot. It is one clerk per grid
row. A grid cannot multiply a single number; it multiplies a ROW into a new ROW.
>> YOUR TURN
G = [ [1,0,0], [0,1,0], [0,0,1] ] (ones down a slant, zero elsewhere). r = [5, 7, 2].
Apply G to r.
check:
slot1 = 1*5 + 0*7 + 0*2 = 5
slot2 = 0*5 + 1*7 + 0*2 = 7
slot3 = 0*5 + 0*7 + 1*2 = 2 -> [5, 7, 2], copied straight through.
(That slanted grid is a do-nothing grid. A real grid carries other numbers and mixes
slots.)
## Why a Squash Sits at Each Word
Blend gives: word through grid Gw, memory through grid Gm, add a nudge row:
raw = Gw applied to word + Gm applied to m_old + nudge
Feed raw straight back as m_new and loop 100 words? Watch one slot whose memory-side dial
run multiplies by 1.5 every word, nudge aside:
1.5 * 1.5 * ... (100 times) = 1.5^100 ≈ 4 × 10^17 -- an 18-digit number.
And a slot multiplying by 0.5 each word:
0.5^100 = so small it is zero on any page.
>> YOUR TURN
Over just 10 words, a slot multiplies by 2 each word starting from 1. Value after 10?
check: 2^10 = 1024. Ten more words: past a million. It runs away.
A 100-word loop with no brake either blows up or dies. So clamp every word. Push each raw
number through a fixed bend that pins it into a band from -1 to +1:
tanh, by its numbers only (no rules to memorise, just read off):
tanh(0) = 0
tanh(1) ~ 0.76
tanh(2) ~ 0.96
tanh(5) ~ 1.0 (already pinned near top)
tanh(-5) ~ -1.0
big positive -> near +1 ; big negative -> near -1 ; zero -> 0 ; smooth between.
A bend, fixed, no dials, applied last:
m_new = tanh( Gw applied to word + Gm applied to m_old + nudge )
This is a full one-memory reader. Walk "not good": word 1 sets m from zeros; word 2 blends
"good" with that m. Order is kept; numbers stay in band. A final clerk reads last m and
stamps liked or not. Done -- for short reviews.
## This Reader Forgets (a Result You Can Prove)
Track one slot of a mark left by word 1. Say word 1 pushes 0.2 into it. Each later word,
that 0.2 gets multiplied by a memory-side dial (say ~0.5) and bent by tanh, which squeezes
small numbers toward 0:
after word 2: ~0.5 * 0.2 = 0.10, then bent smaller
after word 3: ~0.5 * that = 0.05, smaller
...
after word 30: 0.2 * 0.5^29 = a number with nine zeros after its point -- gone.
word 1 mark: 0.2 0.10 0.05 0.02 0.01 ... ~0 (worn flat)
| | | | | |
w1 w2 w3 w4 w5 w30
>> YOUR TURN
Roughly how many halvings drive 0.2 below 0.001?
check: 0.2 -> 0.1 -> 0.05 -> 0.025 -> 0.0125 -> 0.006 -> 0.003 -> 0.0016 -> 0.0008.
About 8. Far inside a 90-word review. Word 1 is long dead by word 90.
So a long review that flips on its opening word ("not ... worth ... at all") loses that
flip. This is not a dial you can tune; it follows from crushing one memory every word. Call
it a proven failure, and fix its cause.
## Carry Without Crushing: a Second Memory
We want a word-1 mark to survive 90 words. Multiply-and-crush kills it -- just proved. What
single operation lets a number ride many words unharmed? Scale by about 1 and add a little.
Nothing crushed.
So introduce a second pocket, memory A, that is NEVER bent on carry. Each word it is only
scaled by a fraction and added to:
A_new = keep * A_old + admit * fresh
keep and admit are fractions from 0 to 1, one per slot. fresh is new content for this word.
A_old --[ x keep ]--\
( + ) ---> A_new
fresh --[ x admit ]-/
>> YOUR TURN
Rule (slot-by-slot): A_new = keep * A_old + admit * fresh.
A_old = 5.0, keep = 1.0, admit = 0.0, fresh = anything. A_new? After 90 more such words?
check: A_new = 1.0*5.0 + 0.0*fresh = 5.0. Held whole. Ninety words on, still 5.0.
Nothing crushed it. Compare the one-memory crush above, where 5.0 would have worn to nothing.
WRONG TURN "A is bounded by tanh, so it cannot exceed 1 -- 5.0 is impossible."
Clash: tanh bent a memory back when we clamped every word. Memory A is built NOT to be bent
on carry -- that was its whole reason. No bend, no -1..+1 cap. A may read 5.0, even 50.
Letting it grow is exactly how a far-back mark stays alive.
## A Store Too Big to Read, So Tame a Readout
Memory A now may sit at 50. Feed 50 to a clerk and its bend saturates flat -- useless. So do
not hand A out raw. Build a tamed readout B: bend A back into band, then reveal only a
fraction of it.
B_new = show * tanh(A_new)
show is a fraction 0 to 1, one per slot. tanh(50) ~ 1.0 tames size; show picks how much of
that to speak.
A_new --[ tanh ]--[ x show ]--> B_new
Two pockets now, each with a clear duty:
A = silent store. Never read by a clerk. Only scaled-and-added. May grow past 1.
B = spoken readout. What clerks read next word. A bent, partial view of A.
>> YOUR TURN
A_new = 4.62, show = 0.7, tanh(4.62) ~ 1.0. B_new?
check: 0.7 * 1.0 = 0.70. A holds 4.62 in silence; B says 0.70 out loud about it.
## Where keep, admit, show, fresh Come From
Carry and readout forced four rows, no more, no fewer, each by a need already shown:
keep -- how much old store to hold (forced by carry)
admit -- how much new content to let in (forced by carry)
show -- how much of store to speak (forced by readout)
fresh -- new content itself (forced by carry)
Three fractions plus one content row. Where do these four come from? Each is made by its
own clerk-grid, and every one reads exactly two things -- this word's row, and B from last
word:
shared inputs to all four: word-row and B_old
fresh carries content, which may be plus or minus, so bend it with tanh into -1..+1.
keep, admit, show must each be a fraction -- "how much", 0 to 1. So each needs a maker that
takes any number in, big or small, plus or minus, and hands back something between 0 and 1,
smooth across that band. That maker carries a name -- sigmoid -- and these numbers:
sigmoid(0) = 0.5
sigmoid(2) ~ 0.88
sigmoid(-2) ~ 0.12
sigmoid(5) ~ 1.0
sigmoid(-5) ~ 0.0
big positive -> near 1 ; big negative -> near 0 ; zero -> half. A dimmer knob.
Each of four machines: word through its own word-grid, B through its own B-grid, add nudge,
one bend:
fresh = tanh( Gwf.word + Gbf.B_old + nf ) -> content, -1..+1
keep = sigmoid( Gwk.word + Gbk.B_old + nk ) -> fraction, 0..1
admit = sigmoid( Gwa.word + Gba.B_old + na ) -> fraction, 0..1
show = sigmoid( Gws.word + Gbs.B_old + ns ) -> fraction, 0..1
Read one machine slowly:
word --->[ Gwk ]--\
( + )---[ + nk ]---[ sigmoid ]---> keep
B_old -->[ Gbk ]--/
Two grids sit INSIDE one machine -- one for word, one for B -- adding before a single bend.
This is not a word-machine feeding a memory-machine; it is one machine with two inputs.
WRONG TURN "Bend it twice -- tanh then sigmoid -- to be safe."
Clash: tanh makes a signed value; sigmoid makes a fraction. A slot is one kind or other,
never both. fresh wants signed content (tanh). keep/admit/show want a fraction (sigmoid).
One bend each, picked by duty.
## Combine, in Real Numbers
Per slot, four machines hand you: fresh, keep, admit, show. Memory A and B from last word
sit in pockets. Two lines finish it:
A_new = keep * A_old + admit * fresh
B_new = show * tanh(A_new)
A_old --[ x keep ]--\
( + ) --> A_new --[ tanh ]--[ x show ]--> B_new
fresh --[ x admit ]-/ |
v
B_new feeds next word's four machines,
and on last word, feeds final stamp.
One slot, real numbers:
A_old = 5.0 fresh = 0.6 keep = 0.9 admit = 0.2 show = 0.7
A_new = 0.9*5.0 + 0.2*0.6 = 4.5 + 0.12 = 4.62
B_new = 0.7 * tanh(4.62) = 0.7 * ~1.0 = 0.70
A went 5.0 -> 4.62: scaled by 0.9, nudged up 0.12, never bent. A mark can sit high and ride
on. To speak it: bend (4.62 -> ~1.0), then reveal 0.7 of it -> 0.70.
Each multiply is slot-by-slot -- slot 1 with slot 1 -- no summing across slots. Summing lives
only inside a grid row (the grid, built earlier). Three slots at once, each carrying its own
keep/admit/show:
A_old = [ 5.0 , -2.0 , 0.3 ] fresh = [ 0.6 , 0.8 , -0.1 ]
keep = [ 0.9 , 0.5 , 1.0 ] admit = [ 0.2 , 0.0 , 0.9 ]
show = [ 0.7 , 0.2 , 1.0 ]
A_new: 0.9*5.0 + 0.2*0.6 = 4.62
0.5*-2.0 + 0.0*0.8 = -1.00 <- slot 2: kept half, admitted nothing
1.0*0.3 + 0.9*-0.1 = 0.21
A_new = [ 4.62 , -1.00 , 0.21 ]
tanh(A_new) ~ [ 1.00 , -0.76 , 0.21 ]
B_new: 0.7*1.00 = 0.70 , 0.2*-0.76 = -0.15 , 1.0*0.21 = 0.21
B_new = [ 0.70 , -0.15 , 0.21 ]
>> YOUR TURN
Rules (slot-by-slot): A_new = keep * A_old + admit * fresh, then B_new = show * tanh(A_new).
One slot: A_old = 4.0, keep = 0.5, fresh = 2.0, admit = 1.0, show = 1.0,
tanh(A_new) ~ 1.0. Find A_new, then B_new.
check:
A_new = 0.5*4.0 + 1.0*2.0 = 2.0 + 2.0 = 4.0
B_new = 1.0 * 1.0 = 1.0
Kept half a long store, admitted all new content.
## One Word, Start to Finish
Put it in one block. First word, so A and B are zeros:
first look up word-row. pull A (zeros), pull B (zeros).
then run four machines on (word-row + B):
fresh = tanh( Gwf.word + Gbf.B + nf )
keep = sigmoid( Gwk.word + Gbk.B + nk )
admit = sigmoid( Gwa.word + Gba.B + na )
show = sigmoid( Gws.word + Gbs.B + ns )
then A_new = keep * A + admit * fresh (A touched here, first time)
then B_new = show * tanh(A_new)
then put A_new, B_new back. move to next word. SAME dials. repeat.
last final clerk reads last B -> liked (1) or not (0).
B after word 1 holds word 1 only. B after word 2 folds in words 1 and 2. Last B holds all
words -- a whole-review readout, not one word's. Each B swallows its forerunner.
This reader earns its name now, at its end, having been built piece by piece: a Long Short-
Term Memory reader -- LSTM. "Long" is store A, riding far without a bend. "Short" is readout
B, rebuilt every word. Nothing in that name was needed to build it; a number built each part.
## A Word on What Was a Pick, Not a Proof
Forgetting is a result, provable with a pencil, as you did. A COUNT of three fractions named
keep/admit/show, and this exact wiring, is a pick that happened to read well. A leaner cousin
ties keep and admit into one and merges A with B; it reads about as well on fewer dials. So
treat a count of three gates as a desk choice, not a law.
But scale-and-add itself is NOT a mere pick -- it has a reason, and one more layer shows it:
DEEPER, optional (skip with no loss if calculus is new to you). Earlier a far mark shrank
because each word multiplied it by a number below 1, and many such factors below 1 drag a
running product to 0. In the two-memory carry, A_new = keep * A_old + admit * fresh; set keep
= 1 and a far mark rides on with a carry factor of exactly 1, never below it. A chain of 1s
stays 1, so a link from word 90 back to word 1 neither dies nor blows up. That unit carry
factor, not luck, is why an added, uncrushed store cures that decay -- and why every gated
reader of this kind keeps an additive store at its core.
## Traps, Each as a Clash to Sit With
Stated as a wrong line, then a clash, then a fix. Read a wrong line, find a clash yourself,
then check.
WRONG "A grid times one number."
clash: a grid has 3 rows; one number gives one product per row -- but a row needs a
dot of paired numbers. fix: grid times a ROW; each grid-row dotted into it -> a row.
WRONG "Combine sums across slots."
clash: that would collapse 3 slots to 1, losing per-slot keep/admit. fix: slot-by-
slot multiply, no cross-slot sum. Summing lives only inside a grid row.
WRONG "A fraction reads what it scales."
clash: keep scales A, yet A is never read by a machine. fix: all four read word + B;
each is applied elsewhere -- keep to A, admit to fresh, show to tanh(A_new).
WRONG "Two memories means two copies."
clash: a copy of A would also sit at 4.62; B sits at 0.70. fix: different rows,
different duty -- A silent store, B spoken readout.
WRONG "Drop A; no machine reads it anyway."
clash: B is built from A (show * tanh(A_new)); drop A and B has no source, and
nothing rides far. fix: A is acted ON (keep*A, +admit*fresh); A feeds B.
WRONG "B is a grid."
clash: a grid is dials (3 by 3); B changes every word and is a row of 3. fix: A and B
are rows; grids (Gw, Gb) are dials.
WRONG "keep = 1 minus admit."
clash: no line on this page subtracts; keep and admit came from two separate
machines. fix: both are free fractions; both may sit high at once.
WRONG "A loud fresh must count most."
clash: fresh 0.9 with admit 0.0 adds 0.0; fresh 0.2 with admit 0.95 adds 0.19. fix:
size is fresh; worth is admit, judged apart.
WRONG "Final B is word 1's readout."
clash: B at word 2 already folded in word 2. fix: last B holds every word -- a whole-
review readout.
## Two Lines to Carry Away
One memory, bent every word: a mark from word 1 wears to nothing by word 90. Proven, not
tuned.
Two memories: a silent store A, only scaled-and-added so a far mark rides far; a spoken
readout B = show * tanh(A); four machines (fresh + keep/admit/show) reading word + B; same
dials every word; last B stamps liked or not. That reader is an LSTM.
----------------------------------------------------------------------------------------------
SEE ALSO (Chapter 10 -- Machines That Read Words):
Part 1 -- Words Into a Machine: The Notepad and the Walking Worker .
Part 2 -- The Two-Memory Worker: How an LSTM Remembers Far-Back Words
<- Back to all posts
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
home . source on GitHub
==============================================================================================