==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------
APPENDIX E . VISION TRANSFORMER FROM PENCIL
Strips, Seat-Stamps, and Masks by Hand
Posted: 2026-06-15 . Author: Rahul Rai . Tags: vision-transformer, layernorm, masking, kata
============================================================================================
A KATA. Work each number on a blank sheet, cover the CHECK line until you have your own
answer, then compare. Repeat the whole page until your hand knows it. Five pencil-able cores
of the vision transformer: cut a photo into strips, stamp the seat, look across, tame one
strip, and block the future. Tiny sizes so the arithmetic fits; the moves are identical at
full size. Nothing carried in from memory.
## A Photo Is a Grid of Numbers, Cut Into Square Strips
Take a tiny 4 x 4 photo (real ones are larger, e.g. 28 x 28), one greyness number per dot:
1 2 | 3 4
5 6 | 7 8
-------+-------
9 10 | 11 12
13 14 | 15 16
Cut into 2 x 2 squares → 2 across, 2 down → 4 strips, each 4 dots. The four squares are:
top-left : 1, 2, 5, 6 top-right : 3, 4, 7, 8
bot-left : 9, 10, 13, 14 bot-right : 11, 12, 15, 16
In memory the photo is one flat line: 1,2,3,4,5,6,...,16 (row after row). A square's 4 dots
are NOT next to each other in that line -- a square spans two photo-rows. Three moves carve
them out: rename (fold), re-order (swap), glue.
FOLD read the line as (down-band, in-square-down, across-band, in-square-across)
SWAP bring the two band labels together, the two in-square labels together
GLUE write each square's 4 dots as one row → 4 rows of 4
The core rule: reshape KEEPS the dots and RENAMES them (each dot just gets a longer address;
nothing moves). transpose KEEPS the names and REORDERS them (dots get new neighbours). Neither
can do the other's job -- which is why both are needed, and the FOLD must come first (you
cannot re-order labels that do not yet exist).
>> YOUR TURN
Skip the SWAP -- read the flat line straight, four dots per strip:
[1,2,3,4], [5,6,7,8], [9,10,11,12]... what went wrong with strip 1?
check: [1,2,3,4] is the whole TOP ROW of the photo, not a 2 x 2 square. Without the swap,
a strip is a full-width slice, not a square. (On a real photo whose top rows are
background, the first strips print all-black -- the bug that flags a missing swap.)
With the swap, strip 1 is [1, 2, 5, 6] -- a real 2 x 2 corner. Stack the four:
[ 1, 2, 5, 6]
[ 3, 4, 7, 8]
[ 9, 10, 13, 14]
[11, 12, 15, 16]
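The fold, swap, glue moves above can be sketched in numpy. This is a minimal sketch, assuming the photo is a 2-D array whose sides divide evenly by the square size; the name cut_into_strips is made up for this page.

```python
import numpy as np

def cut_into_strips(photo, p):
    """Cut an (H, W) photo into p x p squares, one square per row.

    Hypothetical helper for illustration; assumes H and W are multiples of p.
    """
    h, w = photo.shape
    # FOLD: rename each dot to (down-band, in-square-down, across-band, in-square-across)
    folded = photo.reshape(h // p, p, w // p, p)
    # SWAP: band labels together, in-square labels together
    swapped = folded.transpose(0, 2, 1, 3)
    # GLUE: each square's p*p dots become one row
    return swapped.reshape((h // p) * (w // p), p * p)

photo = np.arange(1, 17).reshape(4, 4)   # the 4 x 4 photo above
strips = cut_into_strips(photo, 2)
print(strips)

# Skipping the SWAP: reading the flat line straight, four dots per strip.
no_swap = photo.reshape(-1, 4)
print(no_swap[0])   # the whole top row, not a square
```

The transpose is the only line that changes which dots are neighbours; the two reshapes only rename addresses.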
>> YOUR TURN -- trace one dot
Dot "7" sits at row 1, col 2 (zero-indexed) in the 4 x 4 photo above.
FOLD splits "1 down" into (floor, desk-row) and "2 across" into (wing, desk),
grouping by patch size 2. What are the four labels?
After SWAP the order becomes (floor, wing, desk-row, desk). After GLUE?
check: row 1 = 0×2 + 1 → floor 0, desk-row 1. col 2 = 1×2 + 0 → wing 1, desk 0.
SWAP: (floor=0, wing=1, desk-row=1, desk=0) -- band labels first, in-square labels last.
GLUE: chip = 0×2 + 1 = 1. spot = 1×2 + 0 = 2.
Chip 1 = top-right square [3, 4, 7, 8]. Spot 2 = the third dot = 7. ✓
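The same trace can be checked with integer division. A small sketch, assuming square patches of size p; the function name trace_dot and its argument names are invented for illustration.

```python
def trace_dot(row, col, p, squares_across):
    """Follow one dot through FOLD, SWAP and GLUE (hypothetical helper)."""
    # FOLD: split each coordinate into (band, position inside the square)
    floor, desk_row = divmod(row, p)
    wing, desk = divmod(col, p)
    # SWAP: band labels first, in-square labels last
    labels = (floor, wing, desk_row, desk)
    # GLUE: merge the two band labels into a chip number,
    # and the two in-square labels into a spot inside that chip
    chip = floor * squares_across + wing
    spot = desk_row * p + desk
    return labels, chip, spot

labels, chip, spot = trace_dot(row=1, col=2, p=2, squares_across=2)
print(labels, chip, spot)   # (0, 1, 1, 0) 1 2
```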
## Cutting Threw Away WHERE -- So Stamp the Seat On (Sine/Cosine)
Attention looks at the four strips as a SET, with no sense of which sat where. Stamp a unique
seat-row onto each. The fixed recipe, for seat "pos" and slot pairs i = 0, 1, ...:
stamp[pos][2i] = sin( pos / 10000^(2i/width) )
stamp[pos][2i+1] = cos( pos / 10000^(2i/width) )
Worked at width 4 (slot pairs i = 0 and i = 1; 10000^(0/4)=1, 10000^(2/4)=100):
seat 0: sin(0)=0, cos(0)=1 | sin(0)=0, cos(0)=1 → [0, 1, 0, 1 ]
seat 1: sin(1)=0.841, cos(1)=0.540 | sin(0.01)=0.010, cos(0.01)=1.000 → [0.841, 0.540, 0.010, 1.000]
seat 2: sin(2)=0.909, cos(2)=-0.416 | sin(0.02)=0.020, cos(0.02)=1.000 → [0.909, -0.416, 0.020, 1.000]
Near seats (0 and 1) look close in the slow slots but differ in the fast slots; far-apart
seats drift apart in the slow slots too. Between them, every seat gets a unique stamp.
Add the stamp onto the strip, slot by slot.
This sine/cosine recipe is one way to make seat-stamps; it computes them with a fixed formula.
Another way is to LEARN a seat-table: one tuned row of numbers per seat, adjusted during training.
Both reach the same end -- a strip that carries where it sat.
>> YOUR TURN
Seat 0's stamp is [0,1,0,1]. A strip reads [2,2,2,2]. After adding the stamp?
check: [2+0, 2+1, 2+0, 2+1] = [2, 3, 2, 3].
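The sine/cosine recipe above, sketched in numpy so the three worked seats can be checked. It assumes an even width; the name seat_stamps and the base argument are made up here (base defaults to the 10000 in the formula).

```python
import numpy as np

def seat_stamps(n_seats, width, base=10000.0):
    """Fixed sine/cosine seat-stamps (hypothetical helper; width must be even)."""
    pos = np.arange(n_seats)[:, None]           # column of seat numbers, shape (n_seats, 1)
    two_i = np.arange(0, width, 2)[None, :]     # 0, 2, 4, ... one entry per slot pair
    angle = pos / base ** (two_i / width)       # pos / 10000^(2i/width)
    stamps = np.zeros((n_seats, width))
    stamps[:, 0::2] = np.sin(angle)             # even slots get the sine
    stamps[:, 1::2] = np.cos(angle)             # odd slots get the cosine
    return stamps

stamps = seat_stamps(3, 4)
print(stamps.round(3))

strip = np.array([2.0, 2.0, 2.0, 2.0])
stamped = strip + stamps[0]                     # add the stamp slot by slot
print(stamped)                                  # [2. 3. 2. 3.]
```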
## Every Strip Looks at Every Strip (the Grid)
Each strip makes three tags from three reused grids (the same three weight grids serve every strip).
WANT (called the query) is what a strip looks for.
HAVE (called the key) is what a strip offers to be looked at.
GIVE (called the value) is what a strip hands over if it gets picked.
Two strips, tag width 4, so the scale divisor is √(tag width) = √4 = 2:
WANT HAVE GIVE
s1: [2,0,1,0] s1: [1,0,0,0] s1: [2,0,0,1]
s2: [0,0,2,0] s2: [3,0,2,0] s2: [0,3,1,0]
MATCH grid -- every WANT dotted with every HAVE:
· s1.HAVE · s2.HAVE
s1.WANT 2 8 (2·1=2 ; 2·3 + 1·2 = 8)
s2.WANT 0 4 (0 ; 2·2 = 4)
SCALE each by ÷2, then SOFTMAX each ROW into portions adding to 1.
SOFTMAX recipe: raise e (≈ 2.718) to each number, then divide each by the total of those raised values.
s1: [1, 4] → e^1=2.72, e^4=54.60, total 57.32 → [0.047, 0.953]
s2: [0, 2] → e^0=1.00, e^2=7.39, total 8.39 → [0.119, 0.881]
WEIGHTED SUM of the GIVE tags, per row:
s1_new = 0.047·[2,0,0,1] + 0.953·[0,3,1,0] = [0.094, 2.859, 0.953, 0.047]
s2_new = 0.119·[2,0,0,1] + 0.881·[0,3,1,0] = [0.238, 2.643, 0.881, 0.119]
>> YOUR TURN
s2's match row is [0, 4]. Scale by ÷2, then softmax. (You should get s2's portions above.)
check: ÷2 → [0, 2]. e^0=1, e^2=7.39, total 8.39 → [1/8.39, 7.39/8.39] = [0.119, 0.881].
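The softmax recipe as a small function, to check both rows at once. One step here is not in the pencil recipe: subtracting each row's largest score before raising e. That is an addition for safety on large scores; it cancels in the divide, so the portions are unchanged. The name softmax_rows is invented for this sketch.

```python
import numpy as np

def softmax_rows(scores):
    """Turn each row of scores into portions that add to 1 (illustrative sketch)."""
    scores = np.asarray(scores, dtype=float)
    # Not part of the pencil recipe: shift each row so its largest score is 0.
    # e^(x - m) / sum e^(x - m) equals e^x / sum e^x, but cannot overflow.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    raised = np.exp(shifted)
    return raised / raised.sum(axis=-1, keepdims=True)

portions = softmax_rows([[1.0, 4.0],    # s1 after the divide by 2
                         [0.0, 2.0]])   # s2 after the divide by 2
print(portions.round(3))
```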
The same three moves in numpy -- the whole grid at once:
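import numpy as np

d_k = query.shape[-1]                     # tag width
scores = query @ key.T / np.sqrt(d_k)     # MATCH grid, scaled
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax, row by row
output = weights @ value                  # weighted sum of GIVE rows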
query is (n, d_k), key is (n, d_k), so key.T is (d_k, n) and query @ key.T is (n, n) --
the full MATCH grid in one multiply. sum(axis=-1) adds across each ROW (one strip's match
numbers); keepdims=True keeps it shaped as a column so the divide lands slot by slot.
weights @ value is (n, n) @ (n, d_v) = (n, d_v): every strip's weighted sum of GIVE rows,
the whole board in one shot.
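To check the pencil work, the two-strip example from this section can be run end to end. A sketch only: the arrays are the WANT, HAVE and GIVE tags from the table above, typed in by hand.

```python
import numpy as np

query = np.array([[2.0, 0.0, 1.0, 0.0],    # s1 WANT
                  [0.0, 0.0, 2.0, 0.0]])   # s2 WANT
key = np.array([[1.0, 0.0, 0.0, 0.0],      # s1 HAVE
                [3.0, 0.0, 2.0, 0.0]])     # s2 HAVE
value = np.array([[2.0, 0.0, 0.0, 1.0],    # s1 GIVE
                  [0.0, 3.0, 1.0, 0.0]])   # s2 GIVE

d_k = query.shape[-1]                      # tag width, 4
match = query @ key.T                      # MATCH grid, before scaling
scores = match / np.sqrt(d_k)              # divide by sqrt(4) = 2
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ value

print(match)
print(weights.round(3))
print(output.round(3))
```

The printed output can differ from the pencil rows in the last digit, because the pencil version multiplies portions that were already rounded to three places.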
## Tame One Strip, Alone (LayerNorm)
Take ONE strip's numbers and humble them -- middle to 0, spread to 1 -- reading only this
strip, no other strip, no batch:
strip = [50, 60, 40, 30]
middle = (50+60+40+30)/4 = 45
diffs: 5, 15, -5, -15
squares: 25, 225, 25, 225
avg square: 500/4 = 125
spread = √125 ≈ 11.18
(number - 45) / 11.18 → [0.45, 1.34, -0.45, -1.34]
This move is called LayerNorm: it humbles using one row's own numbers and nothing else.
A different move, BatchNorm, takes the middle and spread across the whole batch of photos,
so it reads many photos at once, while LayerNorm reads only this one strip.
Two learned dials -- gamma (stretch) and beta (shift) -- let the machine scale the humbled
numbers back if that helps. In numpy:
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)     # each row's middle
    var = x.var(axis=-1, keepdims=True)     # average squared diff
    x_n = (x - mu) / np.sqrt(var + eps)     # humble: middle 0, spread 1
    return gamma * x_n + beta               # stretch and shift
axis=-1 = "across this row's own numbers" (the last direction in the array -- not across rows,
not across the batch). keepdims=True keeps the middle shaped as a column so the subtract lines
up slot by slot instead of broadcasting wrong.
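A short sketch of what keepdims changes, using two made-up strips. With 2 strips of 4 numbers the flat version fails loudly; note that with a square array (as many strips as slots) the shapes would happen to match and numpy would subtract the wrong middles without complaint.

```python
import numpy as np

x = np.array([[50.0, 60.0, 40.0, 30.0],
              [ 2.0,  4.0,  6.0,  8.0]])     # 2 strips, 4 numbers each

mu_col = x.mean(axis=-1, keepdims=True)      # shape (2, 1): one middle per row, as a column
mu_flat = x.mean(axis=-1)                    # shape (2,): the same middles, flattened

centred = x - mu_col                         # (2, 4) - (2, 1): each row minus its own middle

try:
    x - mu_flat                              # (2, 4) - (2,): shapes do not line up
    flat_ok = True
except ValueError:
    flat_ok = False

print(mu_col.shape, mu_flat.shape, flat_ok)
print(centred)
```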
>> YOUR TURN
Strip [2, 4, 6]. Middle? Then the diffs from the middle?
check: middle (2+4+6)/3 = 4. diffs: -2, 0, +2.
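The worked strip [50, 60, 40, 30] can be checked in numpy, and the direction of the statistics contrasted. This is a sketch under assumptions: gamma and beta are left at 1 and 0, the second strip is made up, and the "batch-style" lines are a simplified stand-in that only shows which direction the middle is taken in, not a full BatchNorm.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)      # each row's middle
    var = x.var(axis=-1, keepdims=True)      # each row's average squared diff
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

strips = np.array([[50.0, 60.0, 40.0, 30.0],   # the worked strip
                   [ 2.0,  4.0,  6.0,  8.0]])  # a second, made-up strip
tamed = layer_norm(strips)
print(tamed.round(2))

# Contrast: statistics taken DOWN the rows (axis=0), one middle per slot.
# Each number is now compared with the other rows, not with its own row.
col_mu = strips.mean(axis=0, keepdims=True)
col_var = strips.var(axis=0, keepdims=True)
batch_style = (strips - col_mu) / np.sqrt(col_var + 1e-6)
print(batch_style.round(2))
```

Row 0 of tamed does not change if row 1 is replaced by anything else; every entry of batch_style does.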
## Block the Future (the Causal Mask)
A token here is one item in the sequence -- one strip, or one word -- in the order it sits.
Lay the match scores in a grid: row i is the token doing the looking, column j is the token looked at.
To stop token i from seeing a later token, set every cell with j > i to minus infinity, BEFORE softmax.
Softmax raises e (≈ 2.718) to each score, and e^(-inf) = 0.
Therefore every blocked cell gets 0 portion -- a future token contributes nothing.
Three tokens, equal raw scores:
block j > i (-inf) portions after softmax
j1 j2 j3 j1 j2 j3
i1 . -inf -inf i1 1.0 0 0
i2 . . -inf i2 0.5 0.5 0
i3 . . . i3 0.33 0.33 0.33
No block at all = bidirectional (every cell allowed). The triangle of -inf = causal (past
only). In the attention grid, that one switch is what separates a reader (encoder: BERT, ViT)
from a writer (decoder: GPT).
>> YOUR TURN
Causal mask, token 2 of 3, equal raw scores. Its three portions?
check: token 2 sees tokens 1 and 2, not 3 → [0.5, 0.5, 0].
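The three-token grid above, sketched in numpy. It assumes a square score grid with row i looking at column j; the name causal_portions is invented for this page, and the max-subtraction inside the softmax is an added safety step, not part of the pencil recipe.

```python
import numpy as np

def causal_portions(scores):
    """Block j > i with -inf, then softmax each row (illustrative sketch)."""
    n = scores.shape[-1]
    # True strictly above the diagonal: the cells where column j is later than row i
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Every row keeps its diagonal cell, so each row's max is finite
    raised = np.exp(masked - masked.max(axis=-1, keepdims=True))   # e^(-inf) = 0
    return raised / raised.sum(axis=-1, keepdims=True)

portions = causal_portions(np.zeros((3, 3)))   # three tokens, equal raw scores
print(portions.round(2))
```

Leaving out the np.where line gives the bidirectional case: every row becomes [0.33, 0.33, 0.33].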
## One Recitation (Carry This Away)
Cut the photo into square strips (fold, swap, glue -- the swap keeps each square whole).
Stamp each strip's seat on (sine/cosine wave, or a learned table). Every strip's WANT dots
every strip's HAVE → ÷√width → softmax → portions → weighted sum of GIVE → each strip's new
numbers. Tame one strip alone (LayerNorm: middle 0, spread 1, this row only). And one switch --
leave the score grid open (bidirectional, a reader) or set the future to -inf (causal, a
writer). Strips that see everything, tamed one at a time, with or without a view of the future.
----------------------------------------------------------------------------------------------
See also: Chapter 11, Part 1: The Vision Transformer; Chapter 11, Part 2: Encoder or Decoder
Companion to: Appendix D: Transformer From Pencil
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================