LSTM From Pencil: Built From Nothing, One Number at a Time

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  APPENDIX . LSTM FROM PENCIL
  Built From Nothing, One Number at a Time
  Posted: 2026-06-14 . Author: Rahul Rai . Tags: lstm, rnn, sequence-models, by-hand
  ============================================================================================

  PATH . APPENDIX -- LSTM From Pencil  (standalone; needs no other page)

  Goal: read a movie review -- plain words -- and stamp it liked (1) or not (0). Kit: a
  pencil, graph paper, a few coat pockets. No computer, at any point.

  How to read this page. Nothing is named before a number builds it. Every time a term first
  shows up, a calculation right above it earned it. At each ">> YOUR TURN", cover everything
  below that line, do it on your own paper, THEN look. If your answer fights a "WRONG TURN"
  box, sit with that clash before reading on -- a clash you feel once you never forget.

  We build a clerk, then words as numbers, then watch order force a memory, then a one-memory
  reader and watch it FORGET, then a two-memory reader that does not. That last reader carries
  a name only at its very end: LSTM.


  ## One Axiom: What a Clerk Does

  A clerk does one job. Take some numbers in. Multiply each by a frozen dial. Add results.
  Add one more fixed number, a nudge. Emit a single number out. That is all a clerk can do.

      in            dial           add      nudge        out
      a  --------->[ x p ]\
                           >------( + )-----[ + b ]----> one number
      c  --------->[ x q ]/

  Out = a*p + c*q + b. Dials p, q and nudge b stay frozen while reading; they change only
  between training runs, which this page never does. Everything past here is clerks, nothing
  more.

    >> YOUR TURN
       Dials p = 2, q = -1, nudge b = 3. Inputs a = 4, c = 5. Out?

       check: 2*4 + (-1)*5 + 3 = 8 - 5 + 3 = 6.


  ## Words Are Not Numbers, So Hand Each Word a Row

  A clerk multiplies numbers. Word "boring" is not a number. So before any clerk runs, swap
  each word for a fixed row of numbers, copied off a word-book -- one page per word:

      "not"  --look up-->  page 4017
                           [ 0.0 , 0.9 , -0.2 ]      <- a row of numbers

  Why a ROW and not a single number? Test word "not". On its own it leans neither liked nor
  not -- its whole job is to FLIP whatever word comes next. A single number must pick one duty:
  it can say "lean +0.3" OR "I am a flipper", never both. Two duties need two slots. So a word
  gets several slots. This page uses 3 slots to fit a line; a real desk uses 32. Width is a
  pick, not a law.

      one number per word:   can hold "lean", OR "flipper"   -- not both   (too thin)
      a row of slots:        slot1 = lean, slot2 = flipper, slot3 = ...    (room for both)


  ## One Clerk Scores One Word, But Loses Order

  Hand a clerk a word-row, let it emit one score. Fine for one word. Now feed a whole review
  by adding up each word's row, slot by slot, and scoring that sum. Watch two reviews:

      "not good"   ->  row(not) + row(good)
      "good not"   ->  row(good) + row(not)

    >> YOUR TURN
       Addition does not care about order: x + y equals y + x. Do both sums land on one
       same row?

       check: yes -- identical. row(not)+row(good) = row(good)+row(not).

  WRONG TURN  "Fine, identical -- both reviews mean one same thing."
    Clash: "not good" is a complaint; "good not" is noise a person would not even write. Same
    numbers, opposite sense. A plain sum binned order, and order carried meaning.
    So a plain sum cannot be right. Something must remember word 1 while reading word 2.

  That memory is forced on us. Build it next.


  ## A Carried Memory, Rewritten Each Word

  Keep a pocket holding a row of numbers -- a memory. Start it at all zeros. Read words left
  to right. After each word, rewrite that memory from two things only: this word's row, and
  memory as it stood before this word.

      word_t (row)
         |
         v
   m_old +--> [  blend this word with m_old  ] --> m_new
     ^                                               |
     |                                               |
     +------------------- carry ---------------------+

  One pocket, looping. Word 2 sees word 1 through m_old; word 3 sees both; and so on. Order
  now survives, because m_old at word 2 already holds a trace of word 1.

  How does a clerk "blend"? Each output slot is one clerk reading every input slot. So 3
  output slots = 3 clerks side by side. Stack each clerk's row of dials, one row per clerk,
  into a 3-by-3 block. That stacked block earns a short name: a grid. Same for memory:

      grid G (3 rows, 3 cols)        a row r
        [ g11 g12 g13 ]                [ r1 ]
        [ g21 g22 g23 ]   applied to   [ r2 ]   gives 3 numbers
        [ g31 g32 g33 ]                [ r3 ]

      out slot 1 = g11*r1 + g12*r2 + g13*r3      <- row 1 of G, multiplied into r, summed
      out slot 2 = g21*r1 + g22*r2 + g23*r3
      out slot 3 = g31*r1 + g32*r2 + g33*r3

  That summing of paired products -- row of G against r -- is a dot. It is one clerk per grid
  row. A grid cannot multiply a single number; it multiplies a ROW into a new ROW.

    >> YOUR TURN
       G = [ [1,0,0], [0,1,0], [0,0,1] ] (ones down a slant, zero elsewhere). r = [5, 7, 2].
       Apply G to r.

       check:
         slot1 = 1*5 + 0*7 + 0*2 = 5
         slot2 = 0*5 + 1*7 + 0*2 = 7
         slot3 = 0*5 + 0*7 + 1*2 = 2     ->  [5, 7, 2], copied straight through.
       (That slanted grid is a do-nothing grid. A real grid carries other numbers and mixes
       slots.)


  ## Why a Squash Sits at Each Word

  Blend gives: word through grid Gw, memory through grid Gm, add a nudge row:

      raw = Gw applied to word  +  Gm applied to m_old  +  nudge

  Feed raw straight back as m_new and loop 100 words? Watch one slot whose memory-side dial
  run multiplies by 1.5 every word, nudge aside:

      1.5 * 1.5 * ... (100 times) = 1.5^100 ≈ 4 × 10^17 -- an 18-digit number.

  And a slot multiplying by 0.5 each word:

      0.5^100 = so small it is zero on any page.

    >> YOUR TURN
       Over just 10 words, a slot multiplies by 2 each word starting from 1. Value after 10?

       check: 2^10 = 1024. Ten more words: past a million. It runs away.

  A 100-word loop with no brake either blows up or dies. So clamp every word. Push each raw
  number through a fixed bend that pins it into a band from -1 to +1:

      tanh, by its numbers only (no rules to memorise, just read off):
        tanh(0) = 0
        tanh(1) ~ 0.76
        tanh(2) ~ 0.96
        tanh(5) ~ 1.0       (already pinned near top)
        tanh(-5) ~ -1.0
      big positive -> near +1 ;  big negative -> near -1 ;  zero -> 0 ;  smooth between.

  A bend, fixed, no dials, applied last:

      m_new = tanh(  Gw applied to word  +  Gm applied to m_old  +  nudge  )

  This is a full one-memory reader. Walk "not good": word 1 sets m from zeros; word 2 blends
  "good" with that m. Order is kept; numbers stay in band. A final clerk reads last m and
  stamps liked or not. Done -- for short reviews.


  ## This Reader Forgets (a Result You Can Prove)

  Track one slot of a mark left by word 1. Say word 1 pushes 0.2 into it. Each later word,
  that 0.2 gets multiplied by a memory-side dial (say ~0.5) and bent by tanh, which squeezes
  small numbers toward 0:

      after word 2:  ~0.5 * 0.2  = 0.10, then bent smaller
      after word 3:  ~0.5 * that = 0.05, smaller
      ...
      after word 30:  0.2 * 0.5^29  =  a number with nine zeros after its point -- gone.

      word 1 mark:  0.2  0.10  0.05  0.02  0.01  ...  ~0   (worn flat)
                     |     |     |     |     |          |
                    w1    w2    w3    w4    w5         w30

    >> YOUR TURN
       Roughly how many halvings drive 0.2 below 0.001?

       check: 0.2 -> 0.1 -> 0.05 -> 0.025 -> 0.0125 -> 0.006 -> 0.003 -> 0.0016 -> 0.0008.
       About 8. Far inside a 90-word review. Word 1 is long dead by word 90.

  So a long review that flips on its opening word ("not ... worth ... at all") loses that
  flip. This is not a dial you can tune; it follows from crushing one memory every word. Call
  it a proven failure, and fix its cause.


  ## Carry Without Crushing: a Second Memory

  We want a word-1 mark to survive 90 words. Multiply-and-crush kills it -- just proved. What
  single operation lets a number ride many words unharmed? Scale by about 1 and add a little.
  Nothing crushed.

  So introduce a second pocket, memory A, that is NEVER bent on carry. Each word it is only
  scaled by a fraction and added to:

      A_new = keep * A_old  +  admit * fresh

  keep and admit are fractions from 0 to 1, one per slot. fresh is new content for this word.

      A_old --[ x keep ]--\
                           ( + ) ---> A_new
      fresh --[ x admit ]-/

    >> YOUR TURN
       Rule (slot-by-slot): A_new = keep * A_old + admit * fresh.
       A_old = 5.0, keep = 1.0, admit = 0.0, fresh = anything. A_new? After 90 more such words?

       check: A_new = 1.0*5.0 + 0.0*fresh = 5.0. Held whole. Ninety words on, still 5.0.
       Nothing crushed it. Compare the one-memory crush above, where 5.0 would have worn to nothing.

  WRONG TURN  "A is bounded by tanh, so it cannot exceed 1 -- 5.0 is impossible."
    Clash: tanh bent a memory back when we clamped every word. Memory A is built NOT to be bent
    on carry -- that was its whole reason. No bend, no -1..+1 cap. A may read 5.0, even 50.
    Letting it grow is exactly how a far-back mark stays alive.


  ## A Store Too Big to Read, So Tame a Readout

  Memory A now may sit at 50. Feed 50 to a clerk and its bend saturates flat -- useless. So do
  not hand A out raw. Build a tamed readout B: bend A back into band, then reveal only a
  fraction of it.

      B_new = show * tanh(A_new)

  show is a fraction 0 to 1, one per slot. tanh(50) ~ 1.0 tames size; show picks how much of
  that to speak.

      A_new --[ tanh ]--[ x show ]--> B_new

  Two pockets now, each with a clear duty:

      A = silent store. Never read by a clerk. Only scaled-and-added. May grow past 1.
      B = spoken readout. What clerks read next word. A bent, partial view of A.

    >> YOUR TURN
       A_new = 4.62, show = 0.7, tanh(4.62) ~ 1.0. B_new?

       check: 0.7 * 1.0 = 0.70. A holds 4.62 in silence; B says 0.70 out loud about it.


  ## Where keep, admit, show, fresh Come From

  Carry and readout forced four rows, no more, no fewer, each by a need already shown:

      keep  -- how much old store to hold       (forced by carry)
      admit -- how much new content to let in    (forced by carry)
      show  -- how much of store to speak        (forced by readout)
      fresh -- new content itself                (forced by carry)

  Three fractions plus one content row. Where do these four come from? Each is made by its
  own clerk-grid, and every one reads exactly two things -- this word's row, and B from last
  word:

      shared inputs to all four:   word-row   and   B_old

  fresh carries content, which may be plus or minus, so bend it with tanh into -1..+1.
  keep, admit, show must each be a fraction -- "how much", 0 to 1. So each needs a maker that
  takes any number in, big or small, plus or minus, and hands back something between 0 and 1,
  smooth across that band. That maker carries a name -- sigmoid -- and these numbers:

        sigmoid(0) = 0.5
        sigmoid(2) ~ 0.88
        sigmoid(-2) ~ 0.12
        sigmoid(5) ~ 1.0
        sigmoid(-5) ~ 0.0
      big positive -> near 1 ;  big negative -> near 0 ;  zero -> half.  A dimmer knob.

  Each of four machines: word through its own word-grid, B through its own B-grid, add nudge,
  one bend:

      fresh = tanh(    Gwf.word + Gbf.B_old + nf )    -> content, -1..+1
      keep  = sigmoid( Gwk.word + Gbk.B_old + nk )    -> fraction, 0..1
      admit = sigmoid( Gwa.word + Gba.B_old + na )    -> fraction, 0..1
      show  = sigmoid( Gws.word + Gbs.B_old + ns )    -> fraction, 0..1

  Read one machine slowly:

      word --->[ Gwk ]--\
                         ( + )---[ + nk ]---[ sigmoid ]---> keep
      B_old -->[ Gbk ]--/

  Two grids sit INSIDE one machine -- one for word, one for B -- adding before a single bend.
  This is not a word-machine feeding a memory-machine; it is one machine with two inputs.

  WRONG TURN  "Bend it twice -- tanh then sigmoid -- to be safe."
    Clash: tanh makes a signed value; sigmoid makes a fraction. A slot is one kind or other,
    never both. fresh wants signed content (tanh). keep/admit/show want a fraction (sigmoid).
    One bend each, picked by duty.


  ## Combine, in Real Numbers

  Per slot, four machines hand you: fresh, keep, admit, show. Memory A and B from last word
  sit in pockets. Two lines finish it:

      A_new = keep * A_old  +  admit * fresh
      B_new = show * tanh(A_new)

      A_old --[ x keep ]--\
                           ( + ) --> A_new --[ tanh ]--[ x show ]--> B_new
      fresh --[ x admit ]-/                                  |
                                                             v
                                       B_new feeds next word's four machines,
                                       and on last word, feeds final stamp.

  One slot, real numbers:

      A_old = 5.0   fresh = 0.6   keep = 0.9   admit = 0.2   show = 0.7

      A_new = 0.9*5.0 + 0.2*0.6 = 4.5 + 0.12 = 4.62
      B_new = 0.7 * tanh(4.62) = 0.7 * ~1.0 = 0.70

  A went 5.0 -> 4.62: scaled by 0.9, nudged up 0.12, never bent. A mark can sit high and ride
  on. To speak it: bend (4.62 -> ~1.0), then reveal 0.7 of it -> 0.70.

  Each multiply is slot-by-slot -- slot 1 with slot 1 -- no summing across slots. Summing lives
  only inside a grid row (the grid, built earlier). Three slots at once, each carrying its own
  keep/admit/show:

      A_old = [ 5.0 , -2.0 , 0.3 ]    fresh = [ 0.6 , 0.8 , -0.1 ]
      keep  = [ 0.9 ,  0.5 , 1.0 ]    admit = [ 0.2 , 0.0 ,  0.9 ]
      show  = [ 0.7 ,  0.2 , 1.0 ]

      A_new:  0.9*5.0  + 0.2*0.6  = 4.62
              0.5*-2.0 + 0.0*0.8  = -1.00      <- slot 2: kept half, admitted nothing
              1.0*0.3  + 0.9*-0.1 = 0.21
      A_new = [ 4.62 , -1.00 , 0.21 ]

      tanh(A_new) ~ [ 1.00 , -0.76 , 0.21 ]
      B_new:  0.7*1.00 = 0.70 ,  0.2*-0.76 = -0.15 ,  1.0*0.21 = 0.21
      B_new = [ 0.70 , -0.15 , 0.21 ]

    >> YOUR TURN
       Rules (slot-by-slot): A_new = keep * A_old + admit * fresh, then B_new = show * tanh(A_new).
       One slot: A_old = 4.0, keep = 0.5, fresh = 2.0, admit = 1.0, show = 1.0,
       tanh(A_new) ~ 1.0. Find A_new, then B_new.

       check:
         A_new = 0.5*4.0 + 1.0*2.0 = 2.0 + 2.0 = 4.0
         B_new = 1.0 * 1.0 = 1.0
       Kept half a long store, admitted all new content.


  ## One Word, Start to Finish

  Put it in one block. First word, so A and B are zeros:

      first   look up word-row.  pull A (zeros), pull B (zeros).
      then    run four machines on (word-row + B):
                fresh = tanh(    Gwf.word + Gbf.B + nf )
                keep  = sigmoid( Gwk.word + Gbk.B + nk )
                admit = sigmoid( Gwa.word + Gba.B + na )
                show  = sigmoid( Gws.word + Gbs.B + ns )
      then    A_new = keep * A + admit * fresh      (A touched here, first time)
      then    B_new = show * tanh(A_new)
      then    put A_new, B_new back. move to next word. SAME dials. repeat.
      last    final clerk reads last B -> liked (1) or not (0).

  B after word 1 holds word 1 only. B after word 2 folds in words 1 and 2. Last B holds all
  words -- a whole-review readout, not one word's. Each B swallows its forerunner.

  This reader earns its name now, at its end, having been built piece by piece: a Long Short-
  Term Memory reader -- LSTM. "Long" is store A, riding far without a bend. "Short" is readout
  B, rebuilt every word. Nothing in that name was needed to build it; a number built each part.


  ## A Word on What Was a Pick, Not a Proof

  Forgetting is a result, provable with a pencil, as you did. A COUNT of three fractions named
  keep/admit/show, and this exact wiring, is a pick that happened to read well. A leaner cousin
  ties keep and admit into one and merges A with B; it reads about as well on fewer dials. So
  treat a count of three gates as a desk choice, not a law.

  But scale-and-add itself is NOT a mere pick -- it has a reason, and one more layer shows it:

  DEEPER, optional (skip with no loss if calculus is new to you). Earlier a far mark shrank
  because each word multiplied it by a number below 1, and many such factors below 1 drag a
  running product to 0. In the two-memory carry, A_new = keep * A_old + admit * fresh; set keep
  = 1 and a far mark rides on with a carry factor of exactly 1, never below it. A chain of 1s
  stays 1, so a link from word 90 back to word 1 neither dies nor blows up. That unit carry
  factor, not luck, is why an added, uncrushed store cures that decay -- and why every gated
  reader of this kind keeps an additive store at its core.


  ## Traps, Each as a Clash to Sit With

  Stated as a wrong line, then a clash, then a fix. Read a wrong line, find a clash yourself,
  then check.

      WRONG  "A grid times one number."
        clash: a grid has 3 rows; one number gives one product per row -- but a row needs a
        dot of paired numbers. fix: grid times a ROW; each grid-row dotted into it -> a row.

      WRONG  "Combine sums across slots."
        clash: that would collapse 3 slots to 1, losing per-slot keep/admit. fix: slot-by-
        slot multiply, no cross-slot sum. Summing lives only inside a grid row.

      WRONG  "A fraction reads what it scales."
        clash: keep scales A, yet A is never read by a machine. fix: all four read word + B;
        each is applied elsewhere -- keep to A, admit to fresh, show to tanh(A_new).

      WRONG  "Two memories means two copies."
        clash: a copy of A would also sit at 4.62; B sits at 0.70. fix: different rows,
        different duty -- A silent store, B spoken readout.

      WRONG  "Drop A; no machine reads it anyway."
        clash: B is built from A (show * tanh(A_new)); drop A and B has no source, and
        nothing rides far. fix: A is acted ON (keep*A, +admit*fresh); A feeds B.

      WRONG  "B is a grid."
        clash: a grid is dials (3 by 3); B changes every word and is a row of 3. fix: A and B
        are rows; grids (Gw, Gb) are dials.

      WRONG  "keep = 1 minus admit."
        clash: no line on this page subtracts; keep and admit came from two separate
        machines. fix: both are free fractions; both may sit high at once.

      WRONG  "A loud fresh must count most."
        clash: fresh 0.9 with admit 0.0 adds 0.0; fresh 0.2 with admit 0.95 adds 0.19. fix:
        size is fresh; worth is admit, judged apart.

      WRONG  "Final B is word 1's readout."
        clash: B at word 2 already folded in word 2. fix: last B holds every word -- a whole-
        review readout.


  ## Two Lines to Carry Away

  One memory, bent every word: a mark from word 1 wears to nothing by word 90. Proven, not
  tuned.

  Two memories: a silent store A, only scaled-and-added so a far mark rides far; a spoken
  readout B = show * tanh(A); four machines (fresh + keep/admit/show) reading word + B; same
  dials every word; last B stamps liked or not. That reader is an LSTM.


----------------------------------------------------------------------------------------------
  SEE ALSO (Chapter 10 -- Machines That Read Words):
    Part 1 -- Words Into a Machine: The Notepad and the Walking Worker .
    Part 2 -- The Two-Memory Worker: How an LSTM Remembers Far-Back Words

  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
  home . source on GitHub
==============================================================================================