The Look-Across Machine: Attention and the Transformer by Pencil (Chapter 10, Part 3)

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  CHAPTER 10 . MACHINES THAT READ WORDS . PART 3 OF 3
  The Look-Across Machine: Attention and the Transformer by Pencil
  Posted: 2026-06-15 . Author: Rahul Rai . Tags: transformer, attention, nlp, sequence-models
  ============================================================================================

  PATH . post 29 of 31
    <- prev:  Chapter 10, Part 2: The Two-Memory Worker
       next:  Chapter 11, Part 1: The Vision Transformer ->

  Parts 1 and 2 built two walking machines. The plain RNN read 100 word-notes in order,
  rewriting one 32-number memory at each word. By word 90, word 1's signal had been tanh-crushed
  89 times and was nearly zero. The LSTM fixed the fade by carrying a second uncrushed memory
  on a belt, three voters -- keep, admit, show -- deciding what passes. Both cure the fade.
  Neither cures the walk: word 2 must wait for word 1, word 3 for word 2. On a GPU those 100
  serial steps are a bottleneck. And "not" (word 1) linking to "good" (word 80) still rides 79
  rewrites, even on the best belt.

  This post kills the walk. No memory, no chain. Lay all 100 words out at once. Let every word
  look DIRECTLY at every other word -- near or far, same cost, all at the same time. That is
  the core of the Transformer. "not" connects to "good" in one step, no chain, no fade. And
  because there is no chain, every word computes at the same time -- parallel -- fast on a GPU.
  That parallelism is precisely why this architecture took over: it scales.

  Worked review throughout: "nolan ended" (2 words, to keep pencil arithmetic short). Tag width
  4 here, 32 in the real lab -- every move is identical. Pencil and scratch paper ready.


  ## Words Are Still Not Numbers -- Each Gets a Stick

  A clerk multiplies numbers. The word "nolan" is not a number, so before anything runs the
  words get numbered. Walk every review, count every word, hand each a number by how often it
  shows up. Most common word = 1. This counted list is the DICTIONARY.

      THE DICTIONARY (everything seen, ranked by count):
        the=1   movie=2   was=3   boring=4  ...  nolan=73  ...  ended=88  ...

  Keep only the top 10,000 entries (VOCAB_SIZE = a number we pick). Every rarer word shares one
  "unknown" slot. Pad or chop every review to exactly 100 slots (MAX_LEN = 100):

      "nolan ended"  →  [73, 88, 0, 0, ... , 0]    98 zeros appended     → 100 numbers
      500-word rant  →  [first 100 numbers]          400 words chopped

  But 73 is only a name-tag. A clerk that multiplied it would count "nolan" (73) as 73 times
  heavier than "the" (1) -- false. So swap each bare number for a LIST of numbers, its STICK.

  Two shelves: the full dictionary holds every word ever seen (~80,000). The embedding table
  is a smaller cabinet -- the top 10,000 words each get their own locker of numbers; every
  rarer word shares one unknown locker.

      FULL DICTIONARY:                    EMBEDDING TABLE (10,000 lockers):
      ┌──────────────────────────┐          ┌────────────────────────────┐
      │ the=1  movie=2  was=3   │          │ locker 1  → [32 numbers]  │
      │ boring=4  ... nolan=73  │  ──────► │ locker 4  → [32 numbers]  boring
      │ ... [~80,000 words] ... │  top     │ locker 73 → [32 numbers]  nolan
      └──────────────────────────┘  10,000 │ <unknown> → [32 numbers]  ← rare words pile here
                                           └────────────────────────────┘

  In this pencil run (width 4):

      word 73 (nolan) →  [2, 1, 1, 0]    ← nolan's stick
      word 88 (ended) →  [0, 1, 2, 1]    ← ended's stick

  10,000 × 4 = 40,000 entries in the table (320,000 at real width 32). These start as junk and
  get tuned by training: "boring" and "dull" drift close, "nolan" and "the" drift far apart.

  After this: every word-slot has a stick. "nolan ended" → a SHEET 2 tall × 4 wide.

    >> YOUR TURN
       A review says "nolan qxzbr ended". "qxzbr" ranked 400,000th, far outside the top 10,000.
       What number does it get in the padded review row?

       check: the one shared "unknown" number.


  ## The Walk Still Fades, So Look Across Instead

  The LSTM walked. "not" (word 1) reaches "good" (word 80) only by riding the memory belt
  through 79 rewrites. The transformer removes the walk entirely: lay all 100 words out at once
  and let every word look straight at every other word, near or far, same cost.

      WALK:        not → mem → ... 79 rewrites ... → good      fades; serial
      LOOK-ACROSS: not  ←────────── direct ──────────→ good    no fade; all at once

  "not" and "good" connect in one step. And because there is no chain, every word does its
  looking at the same time -- parallel on a GPU. No waiting, no walking, no fade. Three moves
  build that look into a real number.


  ## Three Tags Let Each Word Ask, Answer, and Give

  For two words to look at each other, each needs something to ask WITH, something to get
  matched AGAINST, and something to hand over IF picked. So each word makes three new sticks
  from its embedding, each through its own square grid of dials:

      WANT (query) -- "what am I looking for?"        ← does the looking
      HAVE (key)   -- "what do I offer?"              ← gets looked at
      GIVE (value) -- "what I hand over if picked"

  Why three: asking is a different job from being matched, and what you advertise (HAVE) should
  not be the same thing as what you actually hand over (GIVE). Three jobs, three sticks.

  How one tag is made. A grid row is a list of dials. Row "times" a stick: multiply matching
  numbers, add → one output number. All rows → the new stick. Here is nolan's WANT worked out:

      nolan's stick  = [2, 1, 1, 0]

      WANT-grid rows (a 4×4 grid of dials, made-up here):
        row 1 = [1, 0, 0, 0]  →  1·2 + 0·1 + 0·1 + 0·0 = 2
        row 2 = [0, 0, 0, 0]  →  0
        row 3 = [0, 0, 1, 0]  →  0·2 + 0·1 + 1·1 + 0·0 = 1
        row 4 = [0, 0, 0, 0]  →  0

      → nolan.WANT = [2, 0, 1, 0]

  The same three grids run over every word in the review -- one reusable set of grids applied
  to all. Below are the HAVE and GIVE values used throughout this pencil run (made-up so the
  arithmetic stays short):

      nolan.HAVE = [1, 0, 0, 0]    nolan.GIVE = [2, 0, 0, 1]
      ended.HAVE = [3, 0, 2, 0]    ended.GIVE = [0, 3, 1, 0]

    >> YOUR TURN
       A WANT-grid row is [0, 1, 0, 0]. nolan's stick is [2, 1, 1, 0]. Row times stick = ?

       check: 0·2 + 1·1 + 0·1 + 0·0 = 1.


  ## nolan's WANT Scores Every Word, Including Itself

  How much does nolan care about a word? Dot nolan's WANT against that word's HAVE: multiply
  matching numbers, add → one number. Both sticks are the same width (4), so they line up;
  read one flat as a row, the other as a column, multiply pairs, add.

      nolan.WANT = [2, 0, 1, 0]

      vs nolan.HAVE = [1, 0, 0, 0]:   2·1 + 0·0 + 1·0 + 0·0  =  2
      vs ended.HAVE = [3, 0, 2, 0]:   2·3 + 0·0 + 1·2 + 0·0  =  6 + 2  =  8

      nolan's match row:  [2, 8]

  nolan scores EVERY word including itself -- the match row has n entries (here 2), not n−1.
  Bigger number = nolan's WANT matches that word's HAVE better.

    >> YOUR TURN
       nolan.WANT = [2, 0, 1, 0]. Some word has HAVE = [1, 0, 1, 0]. Match = ?

       check: 2·1 + 0·0 + 1·1 + 0·0 = 3.

  WRONG TURN  "Two sticks of the same width can't be dotted -- they'd need different shapes."
  ─────────────────────────────────────────────────────────────────────────────────────────
  Lay one flat as a row: [2, 0, 1, 0]. Stand the other as a column: [1, 0, 0, 0]^T.
  Multiply matching pairs (2·1, 0·0, 1·0, 0·0) and add → 2. One number out. Two same-width
  sticks can always be dotted. That is the definition of the dot product.
  ─────────────────────────────────────────────────────────────────────────────────────────


  ## Wider Tags Build Bigger Match Numbers, So Divide by √(tag width)

  At width 4, the dot product adds 4 products. At width 32 it adds 32. More terms → larger
  raw numbers, by roughly √(width) -- a scale artifact, not extra signal. Divide each match
  by √(tag width) to cancel it.

  Tag width = 4, so √4 = 2:

      2 / 2 = 1
      8 / 2 = 4

      nolan's scaled match row: [1, 4]

  Real lab: tag width = 32, divisor = √32 ≈ 5.66.

  WRONG TURN  "Divide by √(tag width) -- isn't that standardising? Subtract the mean, divide
              by the scatter?"
  ─────────────────────────────────────────────────────────────────────────────────────────
  That was Chapter 1's move (putting columns on the same ruler). Here there is no mean, no
  scatter from data. The divisor is simply √(how many numbers the tag has) -- a fixed constant
  computed once. √4 = 2 always. √32 ≈ 5.66 always. Nothing computed from the reviews.
  ─────────────────────────────────────────────────────────────────────────────────────────

    >> YOUR TURN
       A match of 12, tag width 4. Scaled = ?

       check: 12 / √4 = 12 / 2 = 6.


  ## Scaled Matches Are Not Fractions Yet -- Softmax Turns Them Into Portions

  nolan has scaled scores [1, 4] -- just numbers, not "how much to listen". The machine needs
  portions adding to 1. Tool: softmax. Raise e (≈ 2.718) to each number's power, divide each
  by the total of the raised values:

      scaled:   1    ,    4
      raise e:  e^1 = 2.718   e^4 = 54.60
      total:    2.718 + 54.60 = 57.32
      nolan's portion:  2.718 / 57.32 = 0.047
      ended's portion: 54.60 / 57.32 = 0.953

      check: 0.047 + 0.953 = 1.000  ✓

  Bigger match → bigger portion. nolan listens 0.953 to "ended", 0.047 to itself.

  WRONG TURN  "Softmax adds the match numbers -- total = 2 + 8 = 10."
  ─────────────────────────────────────────────────────────────────────────────────────────
  Softmax does not add the original match numbers. It raises e to each (4 becomes e^4 = 54.6,
  not 4), then divides each by the total of those raised values. Each number becomes a separate
  portion; none are added together. The result is a row of portions summing to 1, not one sum.
  ─────────────────────────────────────────────────────────────────────────────────────────

    >> YOUR TURN
       Two words, both with scaled match 0. What two portions does softmax give?

       check: e^0 = 1 for each. Total = 2. Both portions = 0.5. Equal listening.


  ## The Portions Weight the GIVE Sticks -- nolan's New Stick Falls Out

  nolan listens 0.047 to itself, 0.953 to "ended". Those portions weight the GIVE sticks:
  multiply each word's GIVE by its portion, then ADD number-by-number → one new stick:

      nolan.GIVE = [2, 0, 0, 1]
      ended.GIVE = [0, 3, 1, 0]

      0.047 × [2, 0, 0, 1] = [0.094, 0,     0,     0.047]
      0.953 × [0, 3, 1, 0] = [0,     2.859, 0.953, 0    ]
      ADD number by number:  [0.094, 2.859, 0.953, 0.047]  ← nolan_new

  nolan's new stick is mostly "ended"'s give, because it listened 0.953 to ended. Every word
  builds its own new stick the same way -- all at once, in parallel. Result: a new SHEET of
  sticks, one per word-slot, each richer than the original embedding.

    >> YOUR TURN
       Portions (0.5, 0.5). GIVE sticks: [4, 0] and [0, 4]. Result = ?

       check: 0.5·[4,0] + 0.5·[0,4] = [2,0]+[0,2] = [2,2].

  WRONG TURN  "We had 2 portions but nolan ended with a 4-wide stick. Shouldn't the result
              have 2 numbers -- one per word?"
  ─────────────────────────────────────────────────────────────────────────────────────────
  The 2 portions are WEIGHTS applied to 2 GIVE sticks, each 4 wide. Portion × stick keeps the
  stick 4 wide: 0.953 × [0,3,1,0] = [0, 2.859, 0.953, 0] -- still 4 wide. Adding two 4-wide
  sticks gives a 4-wide result. "How many words" collapses by ADDING; the stick width survives.
  (Real lab: 100 portions weighting 100 GIVE sticks of width 32 → ADD → one 32-stick.)
  ─────────────────────────────────────────────────────────────────────────────────────────

  Those six moves -- WANT/HAVE/GIVE tags, match, scale, softmax, weighted sum -- are the entire
  attention mechanism. In compact notation:

      Attention = softmax( Q · Kᵀ / √dₖ ) · V

        Q · Kᵀ  = match grid (every WANT against every HAVE)
        / √dₖ   = divide each match by √(tag width)
        softmax  = rows into portions
        · V      = weighted sum of GIVE sticks

  Every piece of that formula is something you computed by hand above.


  ## One Set of Tags Catches One Kind of Link, So Run Two Sets

  One WANT/HAVE/GIVE grid set learns one kind of cross-word relationship (say, which words
  negate each other). A second set of grids -- entirely different dials -- can learn a different
  kind (say, which words name the same thing). Run both sets in parallel, each producing its own
  new stick for nolan, then glue the two sticks together:

      head 1: its own WANT/HAVE/GIVE grids → full match→scale→softmax→weighted-sum → stick A
      head 2: DIFFERENT WANT/HAVE/GIVE grids → same moves → stick B
      glue A and B: [A ... | B ...]  → nolan's richer new stick

  num_heads=2 in code. Each head runs its own full set of grids at key_dim width (here we set
  key_dim = 32, so each head works at 32), then a final grid projects the glued result back to
  32. So the sheet stays 100 × 32, now richer. (The original paper instead carves one width-32
  into two 16-wide heads -- d_model / heads. Keras lets you set key_dim freely; this lab sets
  it to the full 32 rather than carving.)

  Analogy: the BiLSTM glued two walkers facing opposite directions. Multi-head glues two
  viewpoints on word relationships.


  ## No Walk Means No Order -- Flagged

  Looking at all words at once, the machine sees them as a SET, with no sense of sequence.
  {nolan, ended} looks identical to {ended, nolan} before any training adjusts for position.
  Order is gone.

  The full transformer fixes this with positional encoding: add a small position-stamp to each
  word's stick before the attention step (word 1 gets one stamp, word 2 gets another, so they
  differ even before any grids run). The Q8 code below skips explicit positional encoding --
  flagged here so you know the real machine has it. The simplified version still scores well on
  IMDB because "liked or not" depends more on which words appear than exactly where. Position
  matters more for tasks like translation or generation. (Chapter 11, Part 1 builds positional
  encoding by hand, two ways -- a fixed sine/cosine wave and a learned seat-table.)


  ## 100 Sticks Need One Summary, So Average Them All

  After the look-across pass every word-slot has a new stick carrying information from the words
  it matched. The final tick needs ONE number, not 100 sticks.

  The LSTM took its LAST memory -- the one that walked through all 100 words. The transformer
  has no last memory; every slot is equal. So average all 100 new sticks number-by-number →
  one stick:

      word 1 stick = [0.0, 2.8, 0.9, 0.0]
      word 2 stick = [0.2, 1.0, 0.5, 0.4]
      average      = [0.1, 1.9, 0.7, 0.2]    (add, divide by 2)

  The 100 words collapse into one 32-wide summary stick. That is the review summary.

    >> YOUR TURN
       Three sticks: [6, 0, 3], [0, 3, 0], [0, 0, 3]. Average = ?

       check: (6+0+0)/3 = 2, (0+3+0)/3 = 1, (3+0+3)/3 = 2 → [2, 1, 2].


  ## A Small Head Reads the Summary and Writes the Tick

  The summary stick (32 numbers) goes through a short plain-worker pipeline identical to earlier
  labs:

      summary (32 numbers)
        → zero a random 1-in-10 during practice (forces independence)     [Dropout 0.1]
        → 20 workers, each: multiply-add + nudge, then relu               [Dense 20, relu]
             relu: keep positives, zero negatives
        → zero a random 1-in-10 during practice again                     [Dropout 0.1]
        → 1 worker: multiply-add + nudge, then sigmoid                    [Dense 1, sigmoid]
             sigmoid: crush any number to 0..1
             0.9 = 90% liked · 0.1 = probably not

  Dropout fires only during practice. At exam time nothing is zeroed -- every worker is present.
  sigmoid (not softmax) because the answer is ONE yes/no, not a choice among multiple classes.


  ## The Full Machine in Code (Q8)

  Each code line is one pencil move:

      from tensorflow.keras.layers import GlobalAveragePooling1D

      inputs = Input(shape=(MAX_LEN,))
      # a review = 100 word-numbers

      x = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM,
                    input_length=MAX_LEN)(inputs)
      # swap each number for its 32-stick → sheet 100 × 32

      x = MultiHeadAttention(num_heads=2, key_dim=EMBEDDING_DIM)(x, x)
      # 2 heads: WANT · HAVE → ÷√(tag width) → softmax → weighted sum of GIVE
      # (x, x): same sheet passed twice = self-attention
      # output: sheet 100 × 32 (same shape, each stick now richer)

      x = GlobalAveragePooling1D()(x)
      # average all 100 word-sticks → one 32-stick (the review summary)

      x = Dropout(0.1)(x)
      x = Dense(20, activation='relu')(x)
      x = Dropout(0.1)(x)
      outputs = Dense(1, activation='sigmoid')(x)
      # summary → 20 relu workers → 1 sigmoid worker → tick (0..1)

      model_transformer = Model(inputs, outputs)
      # wire the input node to the output node → one named machine

  ★ Layer(settings)(x) -- two sets of parentheses. First: build the layer (its dials and
    shape). Second: run it on x. Build, then run. (Python __call__ -- like a C++ functor's
    operator().)

  ★ (x, x) -- pass x twice because in self-attention every word is both the ASKER and the
    ANSWERER. WANT, HAVE, and GIVE all come from the same word-sticks -- the sentence looking
    at itself.

  Run the attention math yourself -- no numpy, no loop, every move on its own line, exactly
  as on the pencil paper. Two words, so two matches, two shares, four slots: all written out.

      import math

      # nolan's WANT, and each word's HAVE and GIVE (hard-coded from the pencil)
      nolan_want = [2, 0, 1, 0]
      nolan_have = [1, 0, 0, 0];   ended_have = [3, 0, 2, 0]
      nolan_give = [2, 0, 0, 1];   ended_give = [0, 3, 1, 0]

      # MATCH = dot product (multiply slot by slot, add) -- both matches written out
      match_nolan = 2*1 + 0*0 + 1*0 + 0*0          # 2
      match_ended = 2*3 + 0*0 + 1*2 + 0*0          # 8

      # SCALE: divide each by sqrt(tag width) = sqrt(4) = 2
      scaled_nolan = match_nolan / 2               # 1.0
      scaled_ended = match_ended / 2               # 4.0

      # SOFTMAX: raise e to each, divide each by the total
      e_nolan = math.exp(scaled_nolan)             # 2.718
      e_ended = math.exp(scaled_ended)             # 54.60
      total   = e_nolan + e_ended                  # 57.32
      share_nolan = round(e_nolan / total, 3)      # 0.047
      share_ended = round(e_ended / total, 3)      # 0.953
      print(share_nolan, share_ended)              # 0.047 0.953

      # WEIGHTED SUM of the GIVE sticks, one slot at a time (4 slots, written out)
      slot1 = share_nolan*2 + share_ended*0        # 0.094
      slot2 = share_nolan*0 + share_ended*3        # 2.859
      slot3 = share_nolan*0 + share_ended*1        # 0.953
      slot4 = share_nolan*1 + share_ended*0        # 0.047
      print([slot1, slot2, slot3, slot4])          # [0.094, 2.859, 0.953, 0.047]

  Every number that prints matches the pencil paper above. The real lab runs 100 words of
  width 32, so the toolbox above (MultiHeadAttention) loops the identical match-scale-softmax-
  weight over every word pair for you -- the four-slot block here is one cell of that loop.


  ## Five Passes, Then the Four-Way Contest (Q9, Q10)

  Train on the study pile -- 5 passes, 64-review handfuls, score the sealed exam after each
  pass (Q9):

      model_transformer.compile(
          loss='binary_crossentropy',
          optimizer='adam',
          metrics=['accuracy']
      )
      history_transformer = model_transformer.fit(
          X_train, y_train,
          epochs=5,
          batch_size=64,
          validation_data=(X_test, y_test)
      )
      q9_val_acc = round(float(history_transformer.history['val_accuracy'][-1]), 3)

  binary cross-entropy by hand: truth = liked(1), machine said 0.96 → −log(0.96) = 0.04
  (small miss). Said 0.10 → −log(0.10) = 2.30 (large miss). Adam rolls dials downhill.

  Then line up all four sealed scores and name the best (Q10):

      q10_results = {
          'SimpleRNN':   q4_val_acc,
          'LSTM':        q6_val_acc,
          'BiLSTM':      q7_val_acc,
          'Transformer': q9_val_acc,
      }
      q10_best_model = max(q10_results, key=q10_results.get)

  Key names must match letter-for-letter -- the grader checks them. What to expect on IMDB:
  walkers score roughly 0.82 -- 0.87; the transformer often reaches 0.87 -- 0.90. The diff is
  modest on this task. The real advantage is on longer text and larger piles, where the serial
  walk bottleneck and the fade over long distances hurt most.

  The four machines in one comparison:

      SimpleRNN  : one worker, walks word-by-word, one memory that FADES.
      LSTM       : same walk, two memories -- belt keeps far-back words ALIVE.
      BiLSTM     : two LSTM walkers (forward and backward), reports stapled.
      Transformer: no walk, no memory -- every word straight at every other.
      ──────────────────────────────────────────────────────────────────────────
                       speed        long-range link    sense of order
      walkers          serial       through the chain  built into the walk
      Transformer      parallel     direct, no fade    must add position marks


  ## Traps, Each as a Clash

  WRONG TURN  "WANT, HAVE, GIVE -- which one actually scores the match?"
  ─────────────────────────────────────────────────────────────────────────────────────────
  WANT does the looking: nolan's WANT is dotted against every word's HAVE.
  HAVE is what gets matched against. GIVE is what is taken into the weighted sum -- it never
  appears in a dot product. Only WANT dots against HAVE; GIVE is downstream.
  ─────────────────────────────────────────────────────────────────────────────────────────

  WRONG TURN  "Multi-head means the output doubles in width?"
  ─────────────────────────────────────────────────────────────────────────────────────────
  No -- a final grid always projects the glued result back to the input width (32). With this
  lab's key_dim=32, the two heads run at 32 each, glue to 64 inside, then that last grid brings
  it back to 32. Output stays 100 × 32. (The original paper sizes each head at 32/2 = 16 so the
  glue is already 32; same ending, different internal width. Either way the output is 32.)
  ─────────────────────────────────────────────────────────────────────────────────────────

  WRONG TURN  "n−1 matches -- a word doesn't score itself."
  ─────────────────────────────────────────────────────────────────────────────────────────
  A word DOES score itself. nolan's WANT dots nolan's own HAVE and gets 2. The match row has
  n entries (here 2), not n−1. Self-attention is the name; the "self" is literal.
  ─────────────────────────────────────────────────────────────────────────────────────────


  ## One Breath

  Number the words (dictionary, top 10,000 kept). Swap each for a stick (embedding table,
  10,000 sticks × 32 numbers each). Each word makes three tags from three reused grids:
  WANT (what it seeks), HAVE (what it offers), GIVE (what it hands over). nolan's WANT dots
  every word's HAVE → n match numbers → divide by √(tag width) → softmax → portions adding
  to 1. Those portions weight every word's GIVE stick, added number-by-number → nolan's new
  stick, richer than before. Every word does this at the same time. Two heads catch two kinds
  of link. Average all 100 new sticks → one summary stick. A short plain-worker head reads
  the summary and writes the tick.

  No walk. No memory. Every word straight at every other, all at once.


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 10 -- Machines That Read Words):
    Part 1 -- Words Into a Machine .
    Part 2 -- The Two-Memory Worker .
    Part 3 (this post)

  See also: Appendix D: Transformer From Pencil

  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
  home . source on GitHub
==============================================================================================