==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------
APPENDIX D . TRANSFORMER FROM PENCIL
Attention From Scratch, One Number at a Time
Posted: 2026-06-15 . Author: Rahul Rai . Tags: transformer, attention, by-hand, kata
============================================================================================
Standalone appendix: no other page needed.
Goal: read a movie review -- plain words -- and stamp it liked (1) or not (0). Kit:
a pencil, scratch paper. No code, no computer.
This is a KATA. Work it on a blank sheet, cover each CHECK line until you get your own
answer, then compare. Repeat the whole thing when you want it to feel automatic. Every
number is recomputed here -- nothing carried in from memory.
Worked review: "nolan ended" (2 words). Tag width 4 throughout, except the two-head section,
where each head's tags are width 2 (real width 32 -- identical moves). Nothing is named before
a number earns the name.
## A Machine Only Multiplies Numbers, So Words Must Become Numbers First
Walk every review in the pile. Count every word. Hand each word a number by how often it
appears -- most common word gets number 1. This is the DICTIONARY.
DICTIONARY (ranked by count):
the=1 movie=2 was=3 boring=4 ... nolan=73 ... ended=88 ...
Keep only the top 10,000. Every rarer word shares one "unknown" number -- its OWN reserved
tag, separate from the padding 0. (Padding marks an empty slot; unknown marks a real but rare
word. They must never read as the same number, or the machine cannot tell empty from rare.)
Pad or chop every review to exactly 100 word-slots, padding with 0:
"nolan ended" → [73, 88, 0, 0, ... , 0] 98 padding zeros → 100 numbers
500-word rant → [first 100] 400 chopped
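Optional machine check, for after the pencil work: a minimal Python sketch of the look-up, chop, and pad rule. The tiny dictionary holds only the words listed above, and UNKNOWN = 10001 is an assumption (the text only requires a reserved number that is not the padding 0).

```python
# Tiny slice of the DICTIONARY from the text (the real one keeps 10,000 words).
DICTIONARY = {"the": 1, "movie": 2, "was": 3, "boring": 4, "nolan": 73, "ended": 88}
PAD = 0            # marks an empty slot
UNKNOWN = 10001    # assumption: any reserved number that is not PAD and not a word rank
SLOTS = 100        # every review becomes exactly this many numbers


def encode(review):
    """Turn a review into exactly SLOTS numbers: look up, chop, then pad."""
    numbers = [DICTIONARY.get(word, UNKNOWN) for word in review.lower().split()]
    numbers = numbers[:SLOTS]                          # chop anything past slot 100
    return numbers + [PAD] * (SLOTS - len(numbers))    # pad short reviews with 0


row = encode("nolan ended")
print(row[:4], "...", len(row), "numbers,", row.count(PAD), "padding zeros")
```

A rare word such as "qxzbr" comes back as UNKNOWN, never as 0, so empty and rare stay distinguishable.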
73 is a name-tag, not a meaning. A machine that multiplied it raw would count "nolan" (73)
as 73 times heavier than "the" (1) -- false. So swap each bare number for a LIST of numbers,
its STICK. The embedding table holds one stick per kept word:
embedding table (width 4 here):
nolan (73) → [2, 1, 1, 0]
ended (88) → [0, 1, 2, 1]
10,000 words × 4 numbers = 40,000 entries total. These start as junk and get tuned by
training until "boring" and "dull" drift close, "nolan" and "the" drift far.
>> YOUR TURN
The word "qxzbr" ranked 400,000th -- outside the top 10,000. What number does it get
in the padded review row?
check: the shared "unknown" tag -- a reserved number of its own, never the padding 0.
After this: "nolan ended" → SHEET 2 tall × 4 wide. Every word-slot has a stick.
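The same swap as a sketch in Python, for checking by machine once the sheet is drawn by hand. The table here is an assumed two-row slice holding only the two sticks given above; a real table has one row per kept word.

```python
# Assumption: a two-row slice of the embedding table, keyed by word number.
EMBEDDING = {
    73: [2, 1, 1, 0],   # nolan
    88: [0, 1, 2, 1],   # ended
}

review = [73, 88]                          # "nolan ended", padding left off
sheet = [EMBEDDING[n] for n in review]     # one stick per word-slot

print(len(sheet), "tall x", len(sheet[0]), "wide")
```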
## Walking Through 79 Rewrites Still Fades, So Look Across Instead
One older way to read a sentence WALKS it.
A WALK means: keep one pocket of numbers called a memory.
Word 1 writes that memory.
Word 2 reads that memory, then rewrites it.
This repeats, one word at a time, for all 100 words.
To link "not" (word 1) to "good" (word 80), the signal must ride 79 rewrites.
Each rewrite multiplies the signal by some number and crushes it a little.
Therefore far-back words fade by the time the walk reaches the end.
The transformer removes the walk. Lay all words out at once. Let each word look STRAIGHT
at every other word, near or far, same cost, all at the same time:
WALK: not → mem → 79 rewrites → good fades; one word at a time
LOOK-ACROSS: not ←──── direct ────→ good no fade; all at once
No chain means no fade, and no chain means every word computes in parallel.
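To put a rough number on the fade, here is a small Python sketch. The shrink factor 0.9 per rewrite is an assumption chosen for illustration; the text only says each rewrite crushes the signal "a little".

```python
SHRINK = 0.9      # assumption: each rewrite keeps 90% of the signal
REWRITES = 79     # word 1 to word 80, as in the text

signal = 1.0
for _ in range(REWRITES):
    signal *= SHRINK      # one rewrite crushes the signal a little

print(f"signal left after {REWRITES} rewrites: {signal:.6f}")
```

Even a gentle 10% loss per step leaves well under a thousandth of the signal after 79 steps. The look-across has no chain, so nothing compounds.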
## Each Word Makes Three Tags to Ask, Answer, and Give
For two words to look at each other, each needs three roles:
WANT (query) -- "what am I looking for?" ← does the looking
HAVE (key) -- "what do I offer?" ← gets looked at
GIVE (value) -- "what I hand over if picked"
Each role gets its own grid of dials (a square sheet, 4×4 here). Each grid is REUSED
on every word in the review -- one grid set for all words.
How one tag is made. A grid row is a list of dials. Row times a stick: multiply matching
numbers, add → one number. All rows together → one new stick.
nolan's WANT computed by hand (made-up WANT-grid rows):
nolan's stick = [2, 1, 1, 0]
WANT-grid row 1 = [1, 0, 0, 0] → 1·2 + 0·1 + 0·1 + 0·0 = 2
WANT-grid row 2 = [0, 0, 0, 0] → 0
WANT-grid row 3 = [0, 0, 1, 0] → 0·2 + 0·1 + 1·1 + 0·0 = 1
WANT-grid row 4 = [0, 0, 0, 0] → 0
nolan.WANT = [2, 0, 1, 0]
Same process (different grids) produces nolan.HAVE, nolan.GIVE, ended.WANT, ended.HAVE,
ended.GIVE. Values used throughout this kata (made-up, chosen so arithmetic stays short):
nolan.HAVE = [1, 0, 0, 0] nolan.GIVE = [2, 0, 0, 1]
ended.HAVE = [3, 0, 2, 0] ended.GIVE = [0, 3, 1, 0]
>> YOUR TURN
WANT-grid row [0, 1, 0, 0] times nolan's stick [2, 1, 1, 0] = ?
check: 0·2 + 1·1 + 0·1 + 0·0 = 1.
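For a machine check of the row-times-stick move, a minimal Python sketch follows. The helper name grid_times_stick is made up here; the grid rows and the stick are the ones worked by hand above.

```python
def grid_times_stick(grid, stick):
    """Each grid row times the stick: multiply matching numbers, add."""
    return [sum(dial * number for dial, number in zip(row, stick)) for row in grid]


nolan_stick = [2, 1, 1, 0]
want_grid = [            # made-up rows, copied from the worked example
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

nolan_want = grid_times_stick(want_grid, nolan_stick)
print(nolan_want)   # [2, 0, 1, 0]
```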
## nolan's WANT Dots Every Word's HAVE
Dot product: lay one stick flat as a row, the other as a column, multiply matching pairs, add.
nolan.WANT = [2, 0, 1, 0]
vs nolan.HAVE = [1, 0, 0, 0]: 2·1 + 0·0 + 1·0 + 0·0 = 2
vs ended.HAVE = [3, 0, 2, 0]: 2·3 + 0·0 + 1·2 + 0·0 = 8
nolan's match row: [2, 8]
A word dots EVERY word, including itself. Match row has n entries (here 2), not n−1.
>> YOUR TURN
nolan.WANT = [2, 0, 1, 0]. Some word's HAVE = [1, 0, 1, 0]. Match = ?
check: 2·1 + 0·0 + 1·1 + 0·0 = 3.
WRONG TURN "Two sticks of width 4 can't be dotted -- wrong shapes."
─────────────────────────────────────────────────────────────────────────────────────────
Lay one flat: [2,0,1,0]. Stand the other as column. Multiply pairs, add → one number.
Two same-width sticks always dot. That is the dot product.
─────────────────────────────────────────────────────────────────────────────────────────
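The match row can be double-checked with a few lines of Python. This is a sketch only: the function name dot and the dictionary of HAVE tags are illustrative choices, and the numbers are the made-up ones from this kata.

```python
def dot(a, b):
    """Multiply matching pairs, add. Only same-width sticks can be dotted."""
    if len(a) != len(b):
        raise ValueError("sticks must have the same width")
    return sum(x * y for x, y in zip(a, b))


nolan_want = [2, 0, 1, 0]
haves = {"nolan": [1, 0, 0, 0], "ended": [3, 0, 2, 0]}

# One match per word, nolan itself included: n entries, not n-1.
match_row = [dot(nolan_want, have) for have in haves.values()]
print(match_row)   # [2, 8]
```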
## Wider Tags Bloat the Match Numbers, So Divide by √(tag width)
Width 4 → 4 products added. Width 32 → 32 products. More terms → the typical size of the sum
grows roughly like √(width). That growth is a scale artifact of the width, not a stronger match.
Remove it: divide each match by √(tag width).
Tag width = 4, √4 = 2:
2 / 2 = 1
8 / 2 = 4
nolan's scaled row: [1, 4]
>> YOUR TURN
Match of 10, tag width 4. Scaled = ?
check: 10 / 2 = 5.
WRONG TURN "Divide by √(tag width) -- same as subtracting the mean and dividing by scatter?"
─────────────────────────────────────────────────────────────────────────────────────────
No. Subtracting a mean and dividing by scatter reads your data: it finds the middle of the
actual numbers and how far they spread, then re-centres them. That is a different move.
Here there is no mean, no scatter, no looking at the numbers at all.
You divide only by √(how many numbers the tag holds) -- a fixed constant set by the width.
Tag width is 4, so √4 = 2, always, no matter what the numbers are.
─────────────────────────────────────────────────────────────────────────────────────────
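A short Python sketch of the same division, to make the "fixed constant" point concrete: the divisor is computed from TAG_WIDTH alone and never looks at the match numbers. The variable names are illustrative.

```python
import math

TAG_WIDTH = 4
match_row = [2, 8]      # nolan's matches from the worked example

divisor = math.sqrt(TAG_WIDTH)                 # fixed by the width, never by the data
scaled_row = [m / divisor for m in match_row]
print(scaled_row)   # [1.0, 4.0]
```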
## Scores Are Not Fractions Yet -- Softmax Makes Them
Scaled scores [1, 4] are raw numbers. Turn them into portions adding to 1. Recipe: raise e
(≈ 2.718) to each score's power, divide each by the total of the raised values.
e^1 = 2.718 e^4 = 54.60 total = 57.32
nolan's portion: 2.718 / 57.32 = 0.047
ended's portion: 54.60 / 57.32 = 0.953
check: 0.047 + 0.953 = 1.000 ✓
nolan listens 0.953 to "ended", 0.047 to itself.
>> YOUR TURN
Two words, both scaled score 0. Softmax = ?
check: e^0 = 1 both. Total = 2. Each portion = 0.5.
WRONG TURN "Softmax adds the match numbers: 2 + 8 = 10."
─────────────────────────────────────────────────────────────────────────────────────────
Softmax raises e to each number FIRST (4 → e^4 = 54.6, not 4), then divides by the total
of the raised values. Nothing is added raw. Each number becomes a separate portion.
─────────────────────────────────────────────────────────────────────────────────────────
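Softmax by machine, as a hedged sketch in Python. One step here is not in the pencil recipe: subtracting the largest score before raising e. It is a common guard against overflow on big scores and leaves the portions unchanged.

```python
import math


def softmax(scores):
    """Raise e to each score, then divide each by the total of the raised values."""
    top = max(scores)                                  # subtracting the max avoids overflow
    raised = [math.exp(s - top) for s in scores]       # and does not change the portions
    total = sum(raised)
    return [r / total for r in raised]


portions = softmax([1, 4])
print([round(p, 3) for p in portions])   # [0.047, 0.953]
```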
## The Portions Weight the GIVE Sticks -- nolan's New Stick Falls Out
Multiply each word's GIVE stick by its portion, then ADD number-by-number:
nolan.GIVE = [2, 0, 0, 1]
ended.GIVE = [0, 3, 1, 0]
0.047 × [2, 0, 0, 1] = [0.094, 0, 0, 0.047]
0.953 × [0, 3, 1, 0] = [0, 2.859, 0.953, 0 ]
ADD: [0.094, 2.859, 0.953, 0.047] ← nolan_new
nolan's new stick is mostly "ended"'s give (0.953 weight). Every word builds its own new
stick at the same time.
>> YOUR TURN
Portions (0.5, 0.5). GIVE sticks: [4, 0] and [0, 4]. Result = ?
check: 0.5·[4,0] + 0.5·[0,4] = [2,0]+[0,2] = [2,2].
WRONG TURN "2 portions but a 4-wide result -- shouldn't it be 2 numbers wide?"
─────────────────────────────────────────────────────────────────────────────────────────
Portion × stick keeps the stick width: 0.953 × [0,3,1,0] is [0,2.859,0.953,0] -- still 4
wide. Adding two 4-wide sticks → 4-wide result. "How many words" collapses by ADDING;
stick width survives. (100 words × 32-wide GIVE sticks → ADD → one 32-wide stick.)
─────────────────────────────────────────────────────────────────────────────────────────
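The weighted sum can be checked the same way. This Python sketch uses the rounded portions 0.047 and 0.953 exactly as on paper; the helper name weighted_sum is made up for illustration.

```python
def weighted_sum(portions, sticks):
    """Scale each stick by its portion, then add number-by-number."""
    width = len(sticks[0])
    return [sum(p * stick[i] for p, stick in zip(portions, sticks)) for i in range(width)]


portions = [0.047, 0.953]                 # rounded portions, as on paper
gives = [[2, 0, 0, 1], [0, 3, 1, 0]]      # nolan.GIVE, ended.GIVE

nolan_new = weighted_sum(portions, gives)
print([round(x, 3) for x in nolan_new])   # [0.094, 2.859, 0.953, 0.047]
```

Two portions go in and a 4-wide stick comes out: the word count collapses in the sum, the stick width does not.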
The whole run as one formula:
Attention = softmax( Q · Kᵀ / √dₖ ) · V
Every symbol here is a pencil move done above.
Q is the stack of WANT tags (what each word looks for).
K is the stack of HAVE tags (what each word offers); Kᵀ stands the HAVE tags up as columns to dot.
Q · Kᵀ is every WANT dotted with every HAVE -- the match numbers.
√dₖ is the square root of the tag width, the fixed divisor that removes the scale artifact.
softmax turns the scaled matches into portions that add to 1.
V is the stack of GIVE tags (what each word hands over); multiplying by V is the weighted sum.
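The formula run end to end for one row of Q, as a minimal Python sketch. Only nolan's WANT is used because the text does not give ended.WANT; the function name attention_row is an illustrative choice.

```python
import math


def attention_row(want, haves, gives):
    """softmax(want . haves^T / sqrt(width)) . gives, for a single WANT tag."""
    divisor = math.sqrt(len(want))
    scores = [sum(w * h for w, h in zip(want, have)) / divisor for have in haves]
    raised = [math.exp(s) for s in scores]
    portions = [r / sum(raised) for r in raised]
    width = len(gives[0])
    return [sum(p * give[i] for p, give in zip(portions, gives)) for i in range(width)]


K = [[1, 0, 0, 0], [3, 0, 2, 0]]   # HAVE tags: nolan, ended
V = [[2, 0, 0, 1], [0, 3, 1, 0]]   # GIVE tags: nolan, ended
nolan_want = [2, 0, 1, 0]          # one row of Q

print([round(x, 3) for x in attention_row(nolan_want, K, V)])
```

Because the machine keeps the portions unrounded, its answer differs from the pencil result in the third decimal (about 0.095 and 2.858 where the paper has 0.094 and 2.859). The gap comes from rounding the portions on paper, not from a different method.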
## Two Heads Catch Two Kinds of Link
One set of WANT/HAVE/GIVE grids learns one kind of cross-word relationship. A second set of
grids -- different dials -- learns another kind. To run two heads WITHOUT bloating the stick,
make each head's tags narrower: instead of grids that read width 4 and write width 4, each
head's grids read the full width-4 stick but write a width-2 tag (4→2, not 4→4). Each head
does the full six moves (make tags, dot, scale, softmax, weight, add) in its own width-2 world
and hands back a width-2 stick:
head 1: own WANT/HAVE/GIVE (each 4→2) → six moves → stick A (width 2)
head 2: DIFFERENT grids (each 4→2) → six moves → stick B (width 2)
glue: [A | B] → width-4 stick -- back to the width we started with
Two width-2 sticks glue to width 4, not 8.
This is the original width-4 split two ways, then sewn back.
But glued halves still sit side by side, untouched by each other.
So one last grid (4×4 here) reads the glued width-4 stick, row times stick as before.
Each output number draws on all four glued numbers, so the two heads' findings blend instead
of staying in separate halves.
That mixing grid is the output projection inside MultiHeadAttention.
Running two head-sets like this is what num_heads=2 means.
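A sketch of the glue-then-mix step in Python. Every number below is hypothetical: the text gives no values for the two heads' outputs or for the mixing grid, so these are chosen only to keep the arithmetic easy to follow.

```python
def grid_times_stick(grid, stick):
    """Each grid row times the stick: multiply matching numbers, add."""
    return [sum(d * n for d, n in zip(row, stick)) for row in grid]


stick_a = [1, 2]    # hypothetical output of head 1 (width 2)
stick_b = [3, 4]    # hypothetical output of head 2 (width 2)

glued = stick_a + stick_b          # [A | B]: list concatenation, width 4 again

mix_grid = [                       # hypothetical output-projection dials
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
mixed = grid_times_stick(mix_grid, glued)
print(glued, "->", mixed)   # [1, 2, 3, 4] -> [4, 6, 1, 4]
```

The first two mixed numbers each combine one number from head 1 with one from head 2, which is the blending the mixing grid exists for.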
## No Walk Means No Order -- Flagged
Looking at all words at once, the machine sees a SET.
A SET has no order, so {nolan, ended} and {ended, nolan} look identical to it.
The full transformer fixes this by adding a position-stamp to each word's stick before
attention runs.
A position-stamp is a fixed row of numbers unique to slot 1, slot 2, and so on.
Adding it lets a word's stick carry where the word sat.
This short worked example skips that stamp on purpose, to keep the arithmetic clean.
Flagged so you know it is missing here.
Two more honest simplifications, flagged the same way.
First, the 98 padding slots.
Padding slots are the empty slots filled with 0 when a review is shorter than 100 words.
In the full model a PADDING-MASK sets every padding slot's score to minus-infinity before softmax.
Softmax raises e to each score, and e^(−inf)=0, so every padding slot gets zero portion.
Therefore padding slots pull no attention and add nothing to any word's weighted sum.
This worked example uses the 2-word review "nolan ended" with no padding in play.
So the mask never bites in the arithmetic here.
On a real 100-slot row it does the heavy lifting of keeping empty slots out of every sum.
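How the mask plays out, as a minimal Python sketch. Assumptions: three padding slots stand in for the 98, their raw scores are set to 0.0 for illustration, and the row holds at least one real word (otherwise the total would be zero).

```python
import math

NEG_INF = float("-inf")


def masked_softmax(scores, is_padding):
    """Softmax after setting every padding slot's score to minus-infinity."""
    masked = [NEG_INF if pad else s for s, pad in zip(scores, is_padding)]
    raised = [math.exp(s) for s in masked]     # math.exp(-inf) is exactly 0.0
    total = sum(raised)
    return [r / total for r in raised]


# Two real words plus three padding slots (a short stand-in for the 98).
scores = [1.0, 4.0, 0.0, 0.0, 0.0]
is_padding = [False, False, True, True, True]

portions = masked_softmax(scores, is_padding)
print([round(p, 3) for p in portions])   # [0.047, 0.953, 0.0, 0.0, 0.0]
```

Without the mask, each padding slot would contribute e^0 = 1 to the total and take a portion away from the real words.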
Second, real models wrap attention in two extra moves left out here.
One is a keep-the-original wire: add the stick that went IN back onto the stick that came OUT,
so the original is never lost (this is called a skip connection).
The other is a per-stick humbler: take one stick's own numbers, shift their middle to 0 and
their spread to 1 (this is called LayerNorm).
Both are left out on purpose -- this is one layer, one short review.
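For readers who want to see the two skipped moves anyway, here is a hedged Python sketch. Assumptions: the humbler runs after the wire (add, then normalise), the learned scale and shift that real LayerNorm carries are left off, and eps is a small constant that guards against dividing by zero.

```python
import math


def skip_connection(stick_in, stick_out):
    """Keep-the-original wire: add the input stick back, number-by-number."""
    return [a + b for a, b in zip(stick_in, stick_out)]


def layer_norm(stick, eps=1e-5):
    """Shift one stick's middle to 0 and its spread to 1 (no learned scale or shift)."""
    mean = sum(stick) / len(stick)
    variance = sum((x - mean) ** 2 for x in stick) / len(stick)
    return [(x - mean) / math.sqrt(variance + eps) for x in stick]


nolan_in = [2, 1, 1, 0]                        # stick that went in
nolan_out = [0.094, 2.859, 0.953, 0.047]       # stick that came out
normed = layer_norm(skip_connection(nolan_in, nolan_out))
print([round(x, 3) for x in normed])
```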
## 100 Sticks Need One Summary -- Average Them All
After attention, every word-slot has a new stick.
The final yes/no tick needs ONE stick, not 100.
So average the real word-sticks number-by-number.
Padding slots (the empty 0-filled slots) are left out of this average, since they hold no word.
word 1 = [0.0, 2.8, 0.9, 0.0]
word 2 = [0.2, 1.0, 0.5, 0.4]
average = [0.1, 1.9, 0.7, 0.2] (add, divide by 2)
>> YOUR TURN
Sticks [4, 2] and [0, 0]. Average = ?
check: (4+0)/2=2, (2+0)/2=1 → [2, 1].
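The padding-aware average as a Python sketch. The third stick is hypothetical junk placed in a padding slot to show that it is skipped; the function name and the choice to detect padding by word number 0 are illustrative assumptions.

```python
def average_real_sticks(sticks, word_numbers, pad=0):
    """Average sticks number-by-number, skipping slots whose word number is the padding 0."""
    real = [s for s, n in zip(sticks, word_numbers) if n != pad]
    width = len(real[0])
    return [sum(s[i] for s in real) / len(real) for i in range(width)]


word_numbers = [73, 88, 0]              # "nolan ended" plus one padding slot
sticks = [
    [0.0, 2.8, 0.9, 0.0],               # word 1
    [0.2, 1.0, 0.5, 0.4],               # word 2
    [9.0, 9.0, 9.0, 9.0],               # padding slot: hypothetical junk, must be ignored
]

summary = average_real_sticks(sticks, word_numbers)
print([round(x, 2) for x in summary])   # [0.1, 1.9, 0.7, 0.2]
```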
## A Small Head Reads the Summary and Writes the Tick
summary stick
→ zero a random 1-in-10 (practice only) [Dropout 0.1]
→ 20 workers, multiply-add + nudge, relu [Dense 20, relu]
relu: keep positives, zero negatives
→ zero a random 1-in-10 again [Dropout 0.1]
→ 1 worker, multiply-add + nudge, sigmoid [Dense 1, sigmoid]
sigmoid: crush any number to 0..1
0.9 = about 90% sure the review is liked . 0.1 = about 10% sure, so probably not liked
Dropout fires during practice only.
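The head can be sketched in Python too. Everything numeric here is hypothetical: 2 hidden workers stand in for the 20, and the dials and nudges are invented so the arithmetic is easy to follow. Dropout is left out because it fires during practice only, and this sketch only reads a finished summary.

```python
import math


def relu(x):
    return max(0.0, x)                  # keep positives, zero negatives


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # crush any number to 0..1


def worker(stick, dials, nudge):
    """One worker: multiply-add the stick against its dials, then add the nudge."""
    return sum(s * d for s, d in zip(stick, dials)) + nudge


summary = [0.1, 1.9, 0.7, 0.2]
# Hypothetical dials: 2 hidden workers here instead of 20, to keep it readable.
hidden = [relu(worker(summary, [1, 0, 0, 0], -1.0)),     # 0.1 - 1.0 < 0 -> relu gives 0.0
          relu(worker(summary, [0, 1, 1, 0], 0.0))]      # 1.9 + 0.7 = 2.6
tick = sigmoid(worker(hidden, [0.5, 0.5], 0.0))           # sigmoid(1.3)
print(round(tick, 3))
```

The final number lands between 0 and 1 whatever the workers produce, which is what lets it be read as "how sure".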
## Traps, Each as a Clash
WRONG TURN "Which tag scores? WANT or HAVE?"
─────────────────────────────────────────────────────────────────────────────────────────
WANT does the scoring: nolan's WANT dots every word's HAVE.
HAVE is what gets looked at. GIVE is what is taken into the weighted sum. The dot is WANT
against HAVE -- both sit in it, WANT doing the looking; GIVE is the only tag that never enters
a dot product.
─────────────────────────────────────────────────────────────────────────────────────────
WRONG TURN "A word can't match itself -- it has no useful information about itself."
─────────────────────────────────────────────────────────────────────────────────────────
A word DOES match itself: nolan.WANT · nolan.HAVE = 2, which gives it a portion 0.047 in the
weighted sum. In longer reviews the portion a word gives itself is often small (other words
are more relevant), but the match is always computed. n entries, not n−1.
─────────────────────────────────────────────────────────────────────────────────────────
## One Recitation (Carry This Away)
Number words. Swap each for a stick (embedding table). Each word makes WANT / HAVE / GIVE
from three reused grids. nolan's WANT dots every HAVE → n matches → ÷ √(tag width) →
softmax → portions summing to 1 → weight every GIVE stick, add number-by-number → nolan's
new stick [0.094, 2.859, 0.953, 0.047]. Every word does this at once. Two heads, two
viewpoints, glued. Average the real words' new sticks (padding left out) → one summary.
Dropout → 20 relu workers →
Dropout → 1 sigmoid → the tick.
No walk. No memory. Every word straight at every other.
----------------------------------------------------------------------------------------------
See also: Chapter 10, Part 3: The Look-Across Machine
Companion to: Chapter 10, Part 1 . Chapter 10, Part 2
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers
==============================================================================================