==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------
CHAPTER 10 . MACHINES THAT READ WORDS . PART 3 OF 3
The Look-Across Machine: Attention and the Transformer by Pencil
Posted: 2026-06-15 . Author: Rahul Rai . Tags: transformer, attention, nlp, sequence-models
============================================================================================
PATH . post 29 of 31
<- prev: Chapter 10, Part 2: The Two-Memory Worker
next: Chapter 11, Part 1: The Vision Transformer ->
Parts 1 and 2 built two walking machines. The plain RNN read 100 word-notes in order,
rewriting one 32-number memory at each word. By word 90, word 1's signal had been tanh-crushed
89 times and was nearly zero. The LSTM fixed the fade by carrying a second uncrushed memory
on a belt, three voters -- keep, admit, show -- deciding what passes. Both cure the fade.
Neither cures the walk: word 2 must wait for word 1, word 3 for word 2. On a GPU those 100
serial steps are a bottleneck. And "not" (word 1) linking to "good" (word 80) still rides 79
rewrites, even on the best belt.
This post kills the walk. No memory, no chain. Lay all 100 words out at once. Let every word
look DIRECTLY at every other word -- near or far, same cost, all at the same time. That is
the core of the Transformer. "not" connects to "good" in one step, no chain, no fade. And
because there is no chain, every word computes at the same time -- parallel -- fast on a GPU.
That parallelism is precisely why this architecture took over: it scales.
Worked review throughout: "nolan ended" (2 words, to keep pencil arithmetic short). Tag width
4 here, 32 in the real lab -- every move is identical. Pencil and scratch paper ready.
## Words Are Still Not Numbers -- Each Gets a Stick
A clerk multiplies numbers. The word "nolan" is not a number, so before anything runs the
words get numbered. Walk every review, count every word, hand each a number by how often it
shows up. Most common word = 1. This counted list is the DICTIONARY.
THE DICTIONARY (everything seen, ranked by count):
the=1 movie=2 was=3 boring=4 ... nolan=73 ... ended=88 ...
Keep only the top 10,000 entries (VOCAB_SIZE = a number we pick). Every rarer word shares one
"unknown" slot. Pad or chop every review to exactly 100 slots (MAX_LEN = 100):
"nolan ended" → [73, 88, 0, 0, ... , 0] 98 zeros appended → 100 numbers
500-word rant → [first 100 numbers] 400 words chopped
But 73 is only a name-tag. A clerk that multiplied it would count "nolan" (73) as 73 times
heavier than "the" (1) -- false. So swap each bare number for a LIST of numbers, its STICK.
Two shelves: the full dictionary holds every word ever seen (~80,000). The embedding table
is a smaller cabinet -- the top 10,000 words each get their own locker of numbers; every
rarer word shares one unknown locker.
FULL DICTIONARY: EMBEDDING TABLE (10,000 lockers):
┌──────────────────────────┐ ┌────────────────────────────┐
│ the=1 movie=2 was=3 │ │ locker 1 → [32 numbers] │
│ boring=4 ... nolan=73 │ ──────► │ locker 4 → [32 numbers] boring
│ ... [~80,000 words] ... │ top │ locker 73 → [32 numbers] nolan
└──────────────────────────┘ 10,000 │ <unknown> → [32 numbers] ← rare words pile here
└────────────────────────────┘
In this pencil run (width 4):
word 73 (nolan) → [2, 1, 1, 0] ← nolan's stick
word 88 (ended) → [0, 1, 2, 1] ← ended's stick
10,000 × 4 = 40,000 entries in the table (320,000 at real width 32). These start as junk and
get tuned by training: "boring" and "dull" drift close, "nolan" and "the" drift far apart.
After this: every word-slot has a stick. "nolan ended" → a SHEET 2 tall × 4 wide.
>> YOUR TURN
A review says "nolan qxzbr ended". "qxzbr" ranked 400,000th, far outside the top 10,000.
What number does it get in the padded review row?
check: the one shared "unknown" number.
## The Walk Still Fades, So Look Across Instead
The LSTM walked. "not" (word 1) reaches "good" (word 80) only by riding the memory belt
through 79 rewrites. The transformer removes the walk entirely: lay all 100 words out at once
and let every word look straight at every other word, near or far, same cost.
WALK: not → mem → ... 79 rewrites ... → good fades; serial
LOOK-ACROSS: not ←────────── direct ──────────→ good no fade; all at once
"not" and "good" connect in one step. And because there is no chain, every word does its
looking at the same time -- parallel on a GPU. No waiting, no walking, no fade. Three moves
build that look into a real number.
## Three Tags Let Each Word Ask, Answer, and Give
For two words to look at each other, each needs something to ask WITH, something to get
matched AGAINST, and something to hand over IF picked. So each word makes three new sticks
from its embedding, each through its own square grid of dials:
WANT (query) -- "what am I looking for?" ← does the looking
HAVE (key) -- "what do I offer?" ← gets looked at
GIVE (value) -- "what I hand over if picked"
Why three: asking is a different job from being matched, and what you advertise (HAVE) should
not be the same thing as what you actually hand over (GIVE). Three jobs, three sticks.
How one tag is made. A grid row is a list of dials. Row "times" a stick: multiply matching
numbers, add → one output number. All rows → the new stick. Here is nolan's WANT worked out:
nolan's stick = [2, 1, 1, 0]
WANT-grid rows (a 4×4 grid of dials, made-up here):
row 1 = [1, 0, 0, 0] → 1·2 + 0·1 + 0·1 + 0·0 = 2
row 2 = [0, 0, 0, 0] → 0
row 3 = [0, 0, 1, 0] → 0·2 + 0·1 + 1·1 + 0·0 = 1
row 4 = [0, 0, 0, 0] → 0
→ nolan.WANT = [2, 0, 1, 0]
The same three grids run over every word in the review -- one reusable set of grids applied
to all. Below are the HAVE and GIVE values used throughout this pencil run (made-up so the
arithmetic stays short):
nolan.HAVE = [1, 0, 0, 0] nolan.GIVE = [2, 0, 0, 1]
ended.HAVE = [3, 0, 2, 0] ended.GIVE = [0, 3, 1, 0]
>> YOUR TURN
A WANT-grid row is [0, 1, 0, 0]. nolan's stick is [2, 1, 1, 0]. Row times stick = ?
check: 0·2 + 1·1 + 0·1 + 0·0 = 1.
## nolan's WANT Scores Every Word, Including Itself
How much does nolan care about a word? Dot nolan's WANT against that word's HAVE: multiply
matching numbers, add → one number. Both sticks are the same width (4), so they line up;
read one flat as a row, the other as a column, multiply pairs, add.
nolan.WANT = [2, 0, 1, 0]
vs nolan.HAVE = [1, 0, 0, 0]: 2·1 + 0·0 + 1·0 + 0·0 = 2
vs ended.HAVE = [3, 0, 2, 0]: 2·3 + 0·0 + 1·2 + 0·0 = 6 + 2 = 8
nolan's match row: [2, 8]
nolan scores EVERY word including itself -- the match row has n entries (here 2), not n−1.
Bigger number = nolan's WANT matches that word's HAVE better.
>> YOUR TURN
nolan.WANT = [2, 0, 1, 0]. Some word has HAVE = [1, 0, 1, 0]. Match = ?
check: 2·1 + 0·0 + 1·1 + 0·0 = 3.
WRONG TURN "Two sticks of the same width can't be dotted -- they'd need different shapes."
─────────────────────────────────────────────────────────────────────────────────────────
Lay one flat as a row: [2, 0, 1, 0]. Stand the other as a column: [1, 0, 0, 0]^T.
Multiply matching pairs (2·1, 0·0, 1·0, 0·0) and add → 2. One number out. Two same-width
sticks can always be dotted. That is the definition of the dot product.
─────────────────────────────────────────────────────────────────────────────────────────
## Wider Tags Build Bigger Match Numbers, So Divide by √(tag width)
At width 4, the dot product adds 4 products. At width 32 it adds 32. More terms → larger
raw numbers, by roughly √(width) -- a scale artifact, not extra signal. Divide each match
by √(tag width) to cancel it.
Tag width = 4, so √4 = 2:
2 / 2 = 1
8 / 2 = 4
nolan's scaled match row: [1, 4]
Real lab: tag width = 32, divisor = √32 ≈ 5.66.
WRONG TURN "Divide by √(tag width) -- isn't that standardising? Subtract the mean, divide
by the scatter?"
─────────────────────────────────────────────────────────────────────────────────────────
That was Chapter 1's move (putting columns on the same ruler). Here there is no mean, no
scatter from data. The divisor is simply √(how many numbers the tag has) -- a fixed constant
computed once. √4 = 2 always. √32 ≈ 5.66 always. Nothing computed from the reviews.
─────────────────────────────────────────────────────────────────────────────────────────
>> YOUR TURN
A match of 12, tag width 4. Scaled = ?
check: 12 / √4 = 12 / 2 = 6.
## Scaled Matches Are Not Fractions Yet -- Softmax Turns Them Into Portions
nolan has scaled scores [1, 4] -- just numbers, not "how much to listen". The machine needs
portions adding to 1. Tool: softmax. Raise e (≈ 2.718) to each number's power, divide each
by the total of the raised values:
scaled: 1 , 4
raise e: e^1 = 2.718 e^4 = 54.60
total: 2.718 + 54.60 = 57.32
nolan's portion: 2.718 / 57.32 = 0.047
ended's portion: 54.60 / 57.32 = 0.953
check: 0.047 + 0.953 = 1.000 ✓
Bigger match → bigger portion. nolan listens 0.953 to "ended", 0.047 to itself.
WRONG TURN "Softmax adds the match numbers -- total = 2 + 8 = 10."
─────────────────────────────────────────────────────────────────────────────────────────
Softmax does not add the original match numbers. It raises e to each (4 becomes e^4 = 54.6,
not 4), then divides each by the total of those raised values. Each number becomes a separate
portion; none are added together. The result is a row of portions summing to 1, not one sum.
─────────────────────────────────────────────────────────────────────────────────────────
>> YOUR TURN
Two words, both with scaled match 0. What two portions does softmax give?
check: e^0 = 1 for each. Total = 2. Both portions = 0.5. Equal listening.
## The Portions Weight the GIVE Sticks -- nolan's New Stick Falls Out
nolan listens 0.047 to itself, 0.953 to "ended". Those portions weight the GIVE sticks:
multiply each word's GIVE by its portion, then ADD number-by-number → one new stick:
nolan.GIVE = [2, 0, 0, 1]
ended.GIVE = [0, 3, 1, 0]
0.047 × [2, 0, 0, 1] = [0.094, 0, 0, 0.047]
0.953 × [0, 3, 1, 0] = [0, 2.859, 0.953, 0 ]
ADD number by number: [0.094, 2.859, 0.953, 0.047] ← nolan_new
nolan's new stick is mostly "ended"'s give, because it listened 0.953 to ended. Every word
builds its own new stick the same way -- all at once, in parallel. Result: a new SHEET of
sticks, one per word-slot, each richer than the original embedding.
>> YOUR TURN
Portions (0.5, 0.5). GIVE sticks: [4, 0] and [0, 4]. Result = ?
check: 0.5·[4,0] + 0.5·[0,4] = [2,0]+[0,2] = [2,2].
WRONG TURN "We had 2 portions but nolan ended with a 4-wide stick. Shouldn't the result
have 2 numbers -- one per word?"
─────────────────────────────────────────────────────────────────────────────────────────
The 2 portions are WEIGHTS applied to 2 GIVE sticks, each 4 wide. Portion × stick keeps the
stick 4 wide: 0.953 × [0,3,1,0] = [0, 2.859, 0.953, 0] -- still 4 wide. Adding two 4-wide
sticks gives a 4-wide result. "How many words" collapses by ADDING; the stick width survives.
(Real lab: 100 portions weighting 100 GIVE sticks of width 32 → ADD → one 32-stick.)
─────────────────────────────────────────────────────────────────────────────────────────
Those six moves -- WANT/HAVE/GIVE tags, match, scale, softmax, weighted sum -- are the entire
attention mechanism. In compact notation:
Attention = softmax( Q · Kᵀ / √dₖ ) · V
Q · Kᵀ = match grid (every WANT against every HAVE)
/ √dₖ = divide each match by √(tag width)
softmax = rows into portions
· V = weighted sum of GIVE sticks
Every piece of that formula is something you computed by hand above.
## One Set of Tags Catches One Kind of Link, So Run Two Sets
One WANT/HAVE/GIVE grid set learns one kind of cross-word relationship (say, which words
negate each other). A second set of grids -- entirely different dials -- can learn a different
kind (say, which words name the same thing). Run both sets in parallel, each producing its own
new stick for nolan, then glue the two sticks together:
head 1: its own WANT/HAVE/GIVE grids → full match→scale→softmax→weighted-sum → stick A
head 2: DIFFERENT WANT/HAVE/GIVE grids → same moves → stick B
glue A and B: [A ... | B ...] → nolan's richer new stick
num_heads=2 in code. Each head runs its own full set of grids at key_dim width (here we set
key_dim = 32, so each head works at 32), then a final grid projects the glued result back to
32. So the sheet stays 100 × 32, now richer. (The original paper instead carves one width-32
into two 16-wide heads -- d_model / heads. Keras lets you set key_dim freely; this lab sets
it to the full 32 rather than carving.)
Analogy: the BiLSTM glued two walkers facing opposite directions. Multi-head glues two
viewpoints on word relationships.
## No Walk Means No Order -- Flagged
Looking at all words at once, the machine sees them as a SET, with no sense of sequence.
{nolan, ended} looks identical to {ended, nolan} before any training adjusts for position.
Order is gone.
The full transformer fixes this with positional encoding: add a small position-stamp to each
word's stick before the attention step (word 1 gets one stamp, word 2 gets another, so they
differ even before any grids run). The Q8 code below skips explicit positional encoding --
flagged here so you know the real machine has it. The simplified version still scores well on
IMDB because "liked or not" depends more on which words appear than exactly where. Position
matters more for tasks like translation or generation. (Chapter 11, Part 1 builds positional
encoding by hand, two ways -- a fixed sine/cosine wave and a learned seat-table.)
## 100 Sticks Need One Summary, So Average Them All
After the look-across pass every word-slot has a new stick carrying information from the words
it matched. The final tick needs ONE number, not 100 sticks.
The LSTM took its LAST memory -- the one that walked through all 100 words. The transformer
has no last memory; every slot is equal. So average all 100 new sticks number-by-number →
one stick:
word 1 stick = [0.0, 2.8, 0.9, 0.0]
word 2 stick = [0.2, 1.0, 0.5, 0.4]
average = [0.1, 1.9, 0.7, 0.2] (add, divide by 2)
The 100 words collapse into one 32-wide summary stick. That is the review summary.
>> YOUR TURN
Three sticks: [6, 0, 3], [0, 3, 0], [0, 0, 3]. Average = ?
check: (6+0+0)/3 = 2, (0+3+0)/3 = 1, (3+0+3)/3 = 2 → [2, 1, 2].
## A Small Head Reads the Summary and Writes the Tick
The summary stick (32 numbers) goes through a short plain-worker pipeline identical to earlier
labs:
summary (32 numbers)
→ zero a random 1-in-10 during practice (forces independence) [Dropout 0.1]
→ 20 workers, each: multiply-add + nudge, then relu [Dense 20, relu]
relu: keep positives, zero negatives
→ zero a random 1-in-10 during practice again [Dropout 0.1]
→ 1 worker: multiply-add + nudge, then sigmoid [Dense 1, sigmoid]
sigmoid: crush any number to 0..1
0.9 = 90% liked · 0.1 = probably not
Dropout fires only during practice. At exam time nothing is zeroed -- every worker is present.
sigmoid (not softmax) because the answer is ONE yes/no, not a choice among multiple classes.
## The Full Machine in Code (Q8)
Each code line is one pencil move:
from tensorflow.keras.layers import GlobalAveragePooling1D
inputs = Input(shape=(MAX_LEN,))
# a review = 100 word-numbers
x = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM,
input_length=MAX_LEN)(inputs)
# swap each number for its 32-stick → sheet 100 × 32
x = MultiHeadAttention(num_heads=2, key_dim=EMBEDDING_DIM)(x, x)
# 2 heads: WANT · HAVE → ÷√(tag width) → softmax → weighted sum of GIVE
# (x, x): same sheet passed twice = self-attention
# output: sheet 100 × 32 (same shape, each stick now richer)
x = GlobalAveragePooling1D()(x)
# average all 100 word-sticks → one 32-stick (the review summary)
x = Dropout(0.1)(x)
x = Dense(20, activation='relu')(x)
x = Dropout(0.1)(x)
outputs = Dense(1, activation='sigmoid')(x)
# summary → 20 relu workers → 1 sigmoid worker → tick (0..1)
model_transformer = Model(inputs, outputs)
# wire the input node to the output node → one named machine
★ Layer(settings)(x) -- two sets of parentheses. First: build the layer (its dials and
shape). Second: run it on x. Build, then run. (Python __call__ -- like a C++ functor's
operator().)
★ (x, x) -- pass x twice because in self-attention every word is both the ASKER and the
ANSWERER. WANT, HAVE, and GIVE all come from the same word-sticks -- the sentence looking
at itself.
Run the attention math yourself -- no numpy, no loop, every move on its own line, exactly
as on the pencil paper. Two words, so two matches, two shares, four slots: all written out.
import math
# nolan's WANT, and each word's HAVE and GIVE (hard-coded from the pencil)
nolan_want = [2, 0, 1, 0]
nolan_have = [1, 0, 0, 0]; ended_have = [3, 0, 2, 0]
nolan_give = [2, 0, 0, 1]; ended_give = [0, 3, 1, 0]
# MATCH = dot product (multiply slot by slot, add) -- both matches written out
match_nolan = 2*1 + 0*0 + 1*0 + 0*0 # 2
match_ended = 2*3 + 0*0 + 1*2 + 0*0 # 8
# SCALE: divide each by sqrt(tag width) = sqrt(4) = 2
scaled_nolan = match_nolan / 2 # 1.0
scaled_ended = match_ended / 2 # 4.0
# SOFTMAX: raise e to each, divide each by the total
e_nolan = math.exp(scaled_nolan) # 2.718
e_ended = math.exp(scaled_ended) # 54.60
total = e_nolan + e_ended # 57.32
share_nolan = round(e_nolan / total, 3) # 0.047
share_ended = round(e_ended / total, 3) # 0.953
print(share_nolan, share_ended) # 0.047 0.953
# WEIGHTED SUM of the GIVE sticks, one slot at a time (4 slots, written out)
slot1 = share_nolan*2 + share_ended*0 # 0.094
slot2 = share_nolan*0 + share_ended*3 # 2.859
slot3 = share_nolan*0 + share_ended*1 # 0.953
slot4 = share_nolan*1 + share_ended*0 # 0.047
print([slot1, slot2, slot3, slot4]) # [0.094, 2.859, 0.953, 0.047]
Every number that prints matches the pencil paper above. The real lab runs 100 words of
width 32, so the toolbox above (MultiHeadAttention) loops the identical match-scale-softmax-
weight over every word pair for you -- the four-slot block here is one cell of that loop.
## Five Passes, Then the Four-Way Contest (Q9, Q10)
Train on the study pile -- 5 passes, 64-review handfuls, score the sealed exam after each
pass (Q9):
model_transformer.compile(
loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
history_transformer = model_transformer.fit(
X_train, y_train,
epochs=5,
batch_size=64,
validation_data=(X_test, y_test)
)
q9_val_acc = round(float(history_transformer.history['val_accuracy'][-1]), 3)
binary cross-entropy by hand: truth = liked(1), machine said 0.96 → −log(0.96) = 0.04
(small miss). Said 0.10 → −log(0.10) = 2.30 (large miss). Adam rolls dials downhill.
Then line up all four sealed scores and name the best (Q10):
q10_results = {
'SimpleRNN': q4_val_acc,
'LSTM': q6_val_acc,
'BiLSTM': q7_val_acc,
'Transformer': q9_val_acc,
}
q10_best_model = max(q10_results, key=q10_results.get)
Key names must match letter-for-letter -- the grader checks them. What to expect on IMDB:
walkers score roughly 0.82 -- 0.87; the transformer often reaches 0.87 -- 0.90. The diff is
modest on this task. The real advantage is on longer text and larger piles, where the serial
walk bottleneck and the fade over long distances hurt most.
The four machines in one comparison:
SimpleRNN : one worker, walks word-by-word, one memory that FADES.
LSTM : same walk, two memories -- belt keeps far-back words ALIVE.
BiLSTM : two LSTM walkers (forward and backward), reports stapled.
Transformer: no walk, no memory -- every word straight at every other.
──────────────────────────────────────────────────────────────────────────
speed long-range link sense of order
walkers serial through the chain built into the walk
Transformer parallel direct, no fade must add position marks
## Traps, Each as a Clash
WRONG TURN "WANT, HAVE, GIVE -- which one actually scores the match?"
─────────────────────────────────────────────────────────────────────────────────────────
WANT does the looking: nolan's WANT is dotted against every word's HAVE.
HAVE is what gets matched against. GIVE is what is taken into the weighted sum -- it never
appears in a dot product. Only WANT dots against HAVE; GIVE is downstream.
─────────────────────────────────────────────────────────────────────────────────────────
WRONG TURN "Multi-head means the output doubles in width?"
─────────────────────────────────────────────────────────────────────────────────────────
No -- a final grid always projects the glued result back to the input width (32). With this
lab's key_dim=32, the two heads run at 32 each, glue to 64 inside, then that last grid brings
it back to 32. Output stays 100 × 32. (The original paper sizes each head at 32/2 = 16 so the
glue is already 32; same ending, different internal width. Either way the output is 32.)
─────────────────────────────────────────────────────────────────────────────────────────
WRONG TURN "n−1 matches -- a word doesn't score itself."
─────────────────────────────────────────────────────────────────────────────────────────
A word DOES score itself. nolan's WANT dots nolan's own HAVE and gets 2. The match row has
n entries (here 2), not n−1. Self-attention is the name; the "self" is literal.
─────────────────────────────────────────────────────────────────────────────────────────
## One Breath
Number the words (dictionary, top 10,000 kept). Swap each for a stick (embedding table,
10,000 sticks × 32 numbers each). Each word makes three tags from three reused grids:
WANT (what it seeks), HAVE (what it offers), GIVE (what it hands over). nolan's WANT dots
every word's HAVE → n match numbers → divide by √(tag width) → softmax → portions adding
to 1. Those portions weight every word's GIVE stick, added number-by-number → nolan's new
stick, richer than before. Every word does this at the same time. Two heads catch two kinds
of link. Average all 100 new sticks → one summary stick. A short plain-worker head reads
the summary and writes the tick.
No walk. No memory. Every word straight at every other, all at once.
----------------------------------------------------------------------------------------------
IN THIS CHAPTER (Chapter 10 -- Machines That Read Words):
Part 1 -- Words Into a Machine .
Part 2 -- The Two-Memory Worker .
Part 3 (this post)
See also: Appendix D: Transformer From Pencil
<- Back to all posts
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
home . source on GitHub
==============================================================================================