==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
CHAPTER 7 . BUILDING A NEURAL NETWORK FROM SCRATCH . PART 2 OF 2
Rolling Downhill by Hand: How a Neural Network Learns
Posted: 2026-06-11 . Author: Rahul Rai . Tags: backpropagation, gradient-descent, adam, dropout
============================================================================================
Part 1 built a building full of clerks and walked one patient through it: thirty
measurements in, a probability out. At the end the building guessed 0.622 (62.2% malignant)
for a patient whose true answer was 1, and we measured the wrongness: a loss of about 0.475.
But the building never improved. Its dials were random and stayed random. This post fixes
that. Here we make the dials LEARN -- and we do it the honest way, by computing the actual
slopes by hand, with the chain rule, on a tiny network you can hold in your head.
If you have not read Part 1, the one thing you need from it is this: a clerk takes its
inputs, multiplies each by a dial, adds them with a nudge to make a raw score Z, and either
bends Z at zero (the zero-out rule, between rooms) or squashes it into a probability with
the S-curve p = 1 / (1 + e^(-Z)) (at the exit). The loss when the true answer is 1 is
-ln(p). That is the whole forward machine. Now we run it backward.
## Why Brute Force Will Not Do
Our full building has more dials than you might guess. Let me count them.
Room 1: 30 measurements x 16 clerks + 16 nudges = 480 + 16 = 496
Room 2: 16 inputs x 8 clerks + 8 nudges = 128 + 8 = 136
Room 3: 8 inputs x 1 clerk + 1 nudge = 8 + 1 = 9
Total: 496 + 136 + 9 = 641 dials and nudges
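If you would rather not trust my arithmetic, the count above can be tallied in a few lines of Python. This is only a sketch: the function name count_dials is my own, made up for illustration, and the room sizes are the ones listed above.

```python
# Room sizes from the count above: 30 measurements -> 16 clerks -> 8 clerks -> 1 clerk.
layer_sizes = [30, 16, 8, 1]

def count_dials(sizes):
    """Count dials (inputs x clerks) plus one nudge per clerk, room by room."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out
    return total

n_dials = count_dials(layer_sizes)
print(n_dials)  # 641
```

The same function works for any stack of rooms, so it is a quick sanity check whenever you change the architecture.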
The dumb way to improve them: take one dial, nudge it up a hair, run all 341 study patients
through, see if the loss dropped. Then nudge it down a hair, run all 341 again. Keep
whichever direction helped. Then move to the next dial.
2 directions x 641 dials x 341 patients = 437,162 full forward passes
...just to adjust the dials ONCE. And you must adjust them thousands of times. This is the
"done by Christmas" plan, and it is hopeless.
There is a far better way. It computes the slope of the loss for ALL 641 dials at once, in
a single backward sweep that costs about the same as one forward pass. It is called
backpropagation. And contrary to its fearsome reputation, on a small network it is just
the chain rule from calculus, applied a few times. Let me show you on a network so small
it fits in a sentence.
## A Network You Can Hold in Your Head
One measurement. One hidden clerk. One output clerk. That is the entire network.
x ---> [hidden clerk] ---> a1 ---> [output clerk] ---> p ---> loss
The numbers (I am choosing small round ones so every step is checkable):
Input: x = 0.5
Hidden clerk: w1 = 1.0, nudge b1 = 0.3
Output clerk: w2 = 1.5, nudge b2 = 0.2
True answer: y = 1
--- FORWARD PASS (from Part 1, so you can see where we start) ---
Hidden raw score: z1 = (x × w1) + b1 = (0.5 × 1.0) + 0.3 = 0.5 + 0.3 = 0.8
Zero-out rule: a1 = max(0, 0.8) = 0.8 (0.8 is positive, kept)
Output raw score: z2 = (a1 × w2) + b2 = (0.8 × 1.5) + 0.2 = 1.2 + 0.2 = 1.4
S-curve: p = 1 / (1 + e^(-1.4))
e^(-1.4) ≈ 0.247. p = 1 / 1.247 ≈ 0.802
Loss (true=1): L = -ln(0.802) ≈ 0.221
So the building currently guesses 0.802 and carries a loss of 0.221. We want to nudge the
four dials (w1, b1, w2, b2) so that next time the loss is smaller. To know which way to
nudge each one, we need its SLOPE: if I increase this dial a little, does the loss go up
or down, and how steeply?
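The forward pass above is short enough to write out as code. A minimal sketch, assuming nothing beyond the numbers already given; the function names forward and loss_when_true_is_1 are mine, not anything from Part 1.

```python
import math

def forward(x, w1, b1, w2, b2):
    """One measurement through one hidden clerk and one output clerk."""
    z1 = x * w1 + b1                     # hidden raw score
    a1 = max(0.0, z1)                    # zero-out rule
    z2 = a1 * w2 + b2                    # output raw score
    p = 1.0 / (1.0 + math.exp(-z2))      # S-curve
    return z1, a1, z2, p

def loss_when_true_is_1(p):
    """The -ln(p) wrongness ruler for a patient whose true answer is 1."""
    return -math.log(p)

z1, a1, z2, p = forward(x=0.5, w1=1.0, b1=0.3, w2=1.5, b2=0.2)
L = loss_when_true_is_1(p)
print(round(z1, 3), round(a1, 3), round(z2, 3), round(p, 3))  # 0.8 0.8 1.4 0.802
print(L)  # about 0.220 at full precision; the hand value 0.221 rounds p to 0.802 first
```

The third-decimal difference in the loss is rounding only, and it does not change any of the slopes that follow.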
## Chain Rule, Said Plainly
IN HAND: tiny network with numbers locked in -- x=0.5, w1=1.0, b1=0.3, w2=1.5, b2=0.2,
y=1. Forward pass gave z1=0.8, a1=0.8, z2=1.4, p=0.802, L=0.221.
This section finds the slope of L with respect to w2.
The loss does not depend on w2 directly. It depends on w2 through a chain:
w2 changes z2 (because z2 = a1 x w2 + b2)
z2 changes p (because p = S-curve of z2)
p changes L (because L = -ln(p))
The chain rule says: to get the slope of L with respect to w2, multiply the slopes along
the chain.
slope of L w.r.t. w2 = (slope of L w.r.t. p)
x (slope of p w.r.t. z2)
x (slope of z2 w.r.t. w2)
Let me compute each link. Two require calculus derivatives -- facts above the floor of this
chapter (the floor is: add, subtract, multiply, divide, squares, roots). I flag each as an
IOU and give a wiggle-check so you can verify the claim without the calculus.
IOU -- slope of -ln(p) w.r.t. p is -1/p:
(Follows from d/dx[ln x] = 1/x; proof belongs in a calculus chapter. Debt open.)
Why it makes sense: at p = 0.9 the slope is -1.1 (gentle push -- nearly right); at p = 0.1
the slope is -10 (hard yank -- deeply wrong). The wrongness bites hardest when you are
confident and wrong.
Wiggle check: -ln(0.792) ≈ 0.233, -ln(0.812) ≈ 0.208.
Rate = (0.208 - 0.233) / (0.812 - 0.792) = -0.025 / 0.020 = -1.25.
Formula at p = 0.802: -1/0.802 = -1.247. Agree to within rounding. ✓
IOU -- slope of S-curve p = 1/(1+e^{-Z}) w.r.t. Z is p x (1-p):
(Follows from the quotient rule applied to 1/(1+e^{-Z}); proof belongs in a calculus
chapter. Debt open.)
Why it makes sense: at p = 0.5, slope = 0.25 -- the steepest the S-curve ever gets, right
at the fence. At p = 0.99, slope = 0.01 -- nearly flat; a fully decided machine barely
moves when you nudge Z. Always between 0 and 0.25.
Wiggle check: at Z = 1.3, p ≈ 0.786; at Z = 1.5, p ≈ 0.818.
Rate = (0.818 - 0.786) / (1.5 - 1.3) = 0.032 / 0.2 = 0.16.
Formula at Z = 1.4: 0.802 x 0.198 = 0.159. Agree. ✓
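Both wiggle checks can be handed to a machine. The sketch below measures each slope by stepping a little to either side of the point, the same spans used above, and compares it with the IOU formula. The helper name wiggle_slope is hypothetical, chosen for this post.

```python
import math

def neg_log(p):
    return -math.log(p)

def s_curve(z):
    return 1.0 / (1.0 + math.exp(-z))

def wiggle_slope(f, at, h):
    """Measured slope: rise over run between at - h and at + h."""
    return (f(at + h) - f(at - h)) / (2 * h)

p = s_curve(1.4)

# IOU 1: slope of -ln(p) should be -1/p (wiggle from 0.792 to 0.812)
log_measured = wiggle_slope(neg_log, p, 0.01)
log_formula = -1.0 / p

# IOU 2: slope of the S-curve should be p x (1 - p) (wiggle from 1.3 to 1.5)
s_measured = wiggle_slope(s_curve, 1.4, 0.1)
s_formula = p * (1.0 - p)

print(log_measured, log_formula)  # both about -1.247
print(s_measured, s_formula)      # both about 0.159
```

Shrinking h tightens the agreement, which is a reasonable sign that the formulas are the true slopes and not a coincidence of these numbers.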
Link 1 -- slope of L w.r.t. p:
L = -ln(p), so the slope is -1/p = -1/0.802 ≈ -1.247
Link 2 -- slope of p w.r.t. z2:
the S-curve's slope is p x (1 - p) = 0.802 x (1 - 0.802) = 0.802 x 0.198 ≈ 0.159
Link 3 -- slope of z2 w.r.t. w2:
z2 = a1 x w2 + b2. Increasing w2 by 1 increases z2 by a1. So the slope is a1 = 0.8
Multiply the chain:
slope of L w.r.t. w2 = (-1.247) x (0.159) x (0.8)
Do it in two steps:
(-1.247) x (0.159) ≈ -0.198
(-0.198) x (0.8) ≈ -0.158
The slope of the loss with respect to w2 is about -0.158.
A small miracle hides in those first two links. Watch:
(slope of L w.r.t. p) x (slope of p w.r.t. z2)
= (-1/p) x (p x (1 - p))
= -(1 - p)
= p - 1
= p - y (since y = 1 here)
The two ugly links collapse into p - y -- guess minus truth. This is not a coincidence of
these numbers; it is exactly why the S-curve and the -ln loss are used together. The error
that flows backward out of the output clerk is simply (guess - truth) = 0.802 - 1 = -0.198.
Clean enough to do in your head.
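To see the collapse in numbers and not just symbols, here is a small sketch that multiplies the three links the long way and then compares the first two against guess minus truth. The variable names are mine; the values are the ones locked in above.

```python
import math

a1, w2, b2, y = 0.8, 1.5, 0.2, 1.0

z2 = a1 * w2 + b2
p = 1.0 / (1.0 + math.exp(-z2))

link1 = -1.0 / p          # slope of L w.r.t. p
link2 = p * (1.0 - p)     # slope of p w.r.t. z2
link3 = a1                # slope of z2 w.r.t. w2

slope_w2 = link1 * link2 * link3
error_z2 = p - y          # the shortcut: guess minus truth

print(slope_w2)              # about -0.158
print(link1 * link2, error_z2)  # both about -0.198
```

In practice only the shortcut is computed; the two separate links are shown here purely to confirm that they multiply to the same thing.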
One slope needs almost no chain: b2, the nudge on the output clerk. The formula is
z2 = a1 x w2 + b2, so raising b2 by 1 raises z2 by exactly 1 -- no dial, no input, just a
direct lift. Past the error at z2, the chain has one link only:
slope of L w.r.t. b2 = (error at z2) x (slope of z2 w.r.t. b2)
= (-0.198) x 1 = -0.198
With a step size of 0.1 (the learning rate, explained in the next section), new b2 =
0.2 - (0.1 x -0.198) = 0.2 + 0.020 = 0.220. Rule for every nudge: its slope equals the error
at the clerk it belongs to. No input factor to multiply in.
## Reading the Slope, and Taking a Step
The slope of L w.r.t. w2 is -0.158. NEGATIVE. What does that mean in plain words?
A negative slope means: increasing w2 DECREASES the loss.
So we should increase w2. By how much? Multiply the slope by a small step size (call it
0.1 -- the learning rate, more on it below) and subtract:
new w2 = w2 - (step x slope) = 1.5 - (0.1 x -0.158) = 1.5 + 0.0158 = 1.5158
We nudged w2 up, exactly as the negative slope advised. Note the pattern: we always
subtract step x slope. When the slope is negative, subtracting a negative ADDS -- the dial
goes up. When the slope is positive, the dial goes down. The minus sign does the steering
automatically. This single rule -- dial = dial - step x slope -- is gradient descent.
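The update rule is one line of code. A minimal sketch, with a hypothetical helper name gd_step and the step size of 0.1 used above:

```python
def gd_step(dial, slope, step=0.1):
    """Gradient descent: move the dial against its slope."""
    return dial - step * slope

new_w2 = gd_step(1.5, -0.158)
print(new_w2)  # about 1.5158: negative slope, so the dial goes up

# A positive slope sends the dial the other way.
print(gd_step(1.5, 0.158))  # about 1.4842
```

The minus sign inside gd_step is the whole steering mechanism; nothing else in the code needs to know which way is downhill.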
--- Does the slope tell the truth? Check it by brute force ---
We claimed increasing w2 lowers the loss with slope about -0.158. Let me verify the lazy
way: actually nudge w2 from 1.5 to 1.6 and recompute the loss from scratch.
w2 = 1.6: z2 = 0.8 x 1.6 + 0.2 = 1.48
p = 1 / (1 + e^(-1.48)) ; e^(-1.48) ≈ 0.228 ; p ≈ 1/1.228 ≈ 0.815
L = -ln(0.815) ≈ 0.205
The loss fell from 0.221 to 0.205 when w2 rose by 0.1. The measured slope is:
(0.205 - 0.221) / (1.6 - 1.5) = -0.016 / 0.1 = -0.16
Our chain-rule slope was -0.158. The brute-force slope is -0.16. They agree. The chain rule
got the same answer as actually wiggling the dial -- but it got it for all dials at once,
without 437,162 forward passes. THAT is backpropagation's whole reason to exist.
--- Your turn: verify b2 by wiggling ---
We derived that b2's slope is -0.198 (the error at z2). Verify this the lazy way: set b2 to
0.3 (raised by 0.1), recompute z2 and the loss, and check that the measured slope is close
to -0.198. (z2 = a1 x w2 + b2; a1 = 0.8, w2 = 1.5 stay fixed; only b2 changes.)
...
b2 = 0.3: z2 = 0.8 x 1.5 + 0.3 = 1.2 + 0.3 = 1.5
p = 1/(1 + e^{-1.5}) ; e^{-1.5} ≈ 0.223 ; p ≈ 1/1.223 ≈ 0.818
L = -ln(0.818) ≈ 0.201
Measured slope = (0.201 - 0.221) / (0.3 - 0.2) = -0.020 / 0.100 = -0.20.
Our chain-rule value: -0.198. Agree. ✓
A nudge changes z2 exactly one-for-one, so its slope IS the error -- nothing else dilutes it.
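The two brute-force checks above follow one recipe: bump a single dial, recompute the loss, divide the change by the bump. Here is that recipe as a sketch. The names loss, base and measured_slope are my own, and the loss is the true-answer-is-1 case only.

```python
import math

def loss(x, w1, b1, w2, b2):
    """Forward pass plus -ln(p), for a patient whose true answer is 1."""
    a1 = max(0.0, x * w1 + b1)
    p = 1.0 / (1.0 + math.exp(-(a1 * w2 + b2)))
    return -math.log(p)

base = dict(x=0.5, w1=1.0, b1=0.3, w2=1.5, b2=0.2)

def measured_slope(name, bump=0.1):
    """Wiggle one dial by `bump` and measure how the loss responds."""
    bumped = dict(base)
    bumped[name] += bump
    return (loss(**bumped) - loss(**base)) / bump

print(measured_slope("w2"))             # roughly -0.15 with the coarse 0.1 bump
print(measured_slope("b2"))             # roughly -0.19 with the coarse 0.1 bump
print(measured_slope("w2", bump=1e-4))  # about -0.158, matching the chain rule
print(measured_slope("b2", bump=1e-4))  # about -0.198, matching the chain rule
```

A bump of 0.1 is coarse, so the measured value drifts a little from the chain-rule slope; a smaller bump closes the gap. Each call costs two forward passes per dial, which is exactly the bill backpropagation avoids.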
## Sending the Error One Room Further Back
IN HAND: error born at the output = p - y = -0.198. Slopes already found:
w2 slope = -0.158, new w2 = 1.516. b2 slope = -0.198, new b2 = 0.220.
This section sends that same error further left to find w1 and b1.
We have the slope for w2 and b2 (the output clerk). But how does the HIDDEN clerk's dial w1
learn? It sits one room back. The loss depends on w1 through a longer chain:
w1 changes z1 -> a1 (through the zero-out rule) -> z2 -> p -> L
The chain rule still works; we just multiply more links. And here is the trick that makes
it cheap: we already computed the error arriving at z2. We reuse it and keep going backward.
Here is the whole network drawn twice -- the forward pass on top (numbers flowing right to
a guess) and the backward pass below (the error flowing LEFT, getting multiplied at each
arrow). This single picture is the entire algorithm:
FORWARD (compute the guess) ------------------------------------------------>
x=0.5             z1=0.8            a1=0.8            z2=1.4            p=0.802           L=0.221
o -----xw1------> o -----ReLU-----> o -----xw2------> o ----Scurve----> o ----(-ln)-----> o
  (=0.5x1.0+0.3)    (max(0, 0.8))     (=0.8x1.5+0.2)                      (true y=1)
<------------------------------------------------ BACKWARD (send error left)
dL/dw1            err@z1            err@a1            err@z2
=-0.148 <--x0.5-- -0.297 <-gate x1- -0.297 <-xw2=1.5- -0.198 = (p - y)
   ^                 ^                 ^                 ^
   |                 |                 |                 |
multiply by       multiply by       multiply by       the error is born
the input x       the ReLU gate     the dial w2       here: guess - truth
(=0.5)            (1 open / 0 shut) it crosses        = 0.802 - 1
Read the bottom row right to left. The error is BORN at the output as p - y = -0.198. It
travels left, and at every arrow it is multiplied by exactly one thing: the dial it crosses
(w2), the gate it passes through (1 if the clerk was open, 0 if dead), or -- when it finally
lands on a dial -- that dial's own input. Each landing point is a slope. Now the same thing
in arithmetic. Start from the error at z2, which is p - y = -0.198, and send it back:
Step A -- through the output dial w2 to reach a1:
z2 = a1 x w2 + b2, so increasing a1 by 1 increases z2 by w2 = 1.5.
error at a1 = (error at z2) x w2 = -0.198 x 1.5 ≈ -0.297
Step B -- through the zero-out gate to reach z1:
a1 = max(0, z1). For z1 = 0.8 (positive), the gate is OPEN: its slope is 1.
(If z1 had been negative, the gate would be SHUT, slope 0, and NO error passes back --
a dead clerk learns nothing. This is the dead-clerk problem from Part 1, seen from
the back.)
error at z1 = (error at a1) x 1 = -0.297
Step C -- through the dial w1 to reach w1's slope:
z1 = (x × w1) + b1, so increasing w1 by 1 increases z1 by x = 0.5.
slope of L w.r.t. w1 = (error at z1) × x = -0.297 × 0.5 ≈ -0.148
slope of L w.r.t. b1 = (error at z1) × 1 = -0.297
So w1's slope is about -0.148: negative, so increasing w1 lowers the loss, so we nudge w1
up: new w1 = 1.0 - (0.1 x -0.148) = 1.0148. The same finite-difference check confirms it --
nudging w1 to 1.1 drops the loss to about 0.206, a measured slope of -0.15 against our
chain-rule -0.148. Agreement again, and this time the error had to travel through two rooms
to get there.
That is the entire algorithm. The error at the output (p - y) is computed once, then passed
backward room by room: multiply by the dial it crosses, multiply by the gate it passes
through (1 if the zero-out clerk was open, 0 if shut), and wherever it lands on a dial,
multiply by that dial's input to get the slope. One backward sweep, every slope, done.
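Here is that backward sweep for the tiny network, written as a sketch. The function name forward_backward and the dictionary keys are hypothetical; the arithmetic is Steps A to C above plus the two output-clerk slopes from the earlier section.

```python
import math

def forward_backward(x, y, w1, b1, w2, b2):
    """One forward pass, then one backward sweep giving all four slopes."""
    # Forward
    z1 = x * w1 + b1
    a1 = max(0.0, z1)
    z2 = a1 * w2 + b2
    p = 1.0 / (1.0 + math.exp(-z2))

    # Backward: the error is born at the output as guess minus truth
    err_z2 = p - y
    grad_w2 = err_z2 * a1           # lands on w2: multiply by its input a1
    grad_b2 = err_z2                # a nudge's slope is the error itself

    err_a1 = err_z2 * w2            # Step A: cross the dial w2
    gate = 1.0 if z1 > 0 else 0.0   # Step B: open gate passes, shut gate blocks
    err_z1 = err_a1 * gate
    grad_w1 = err_z1 * x            # Step C: lands on w1: multiply by its input x
    grad_b1 = err_z1

    return {"w1": grad_w1, "b1": grad_b1, "w2": grad_w2, "b2": grad_b2}

slopes = forward_backward(x=0.5, y=1.0, w1=1.0, b1=0.3, w2=1.5, b2=0.2)
print(slopes)  # w1 about -0.148, b1 about -0.297, w2 about -0.158, b2 about -0.198
```

Notice that err_z2 is computed once and reused by every line beneath it. Setting w1 = -1.0 makes z1 negative, shuts the gate, and zeroes both hidden slopes, which is the dead-clerk case.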
--- Your turn ---
Suppose the output error (p - y) had come out as -0.40 instead of -0.198, with everything
else the same (w2 = 1.5, the hidden gate open at slope 1, input x = 0.5). What is the slope
of the loss with respect to w1?
...
error at a1 = -0.40 x 1.5 = -0.60
error at z1 = -0.60 x 1 = -0.60 (gate open)
slope w.r.t. w1 = -0.60 x 0.5 = -0.30
A bigger output error pushes a bigger slope back to w1 -- so w1 takes a bigger step. The
network corrects fastest exactly where it was most wrong.
## How Big a Step? (Learning Rate)
We used a step size of 0.1 above. That number is the learning rate, and it is its own small
art.
Step too big: the dial overshoots the bottom of the wrongness hill and lands further up
the other side. Next step it overshoots back. The loss oscillates or even explodes.
Step too small: the dial creeps. It will get there eventually, but you may run out of
patience (and compute budget) first.
I once set a fixed step that was slightly too large and spent two hours wondering why my
building was WORSE after training than before. The dials were bouncing around the valley
floor, never settling in it. Halving the step fixed it instantly. The lesson stuck: when
training diverges, suspect the step size first.
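Overshoot is easiest to see on a toy. The sketch below is NOT our network: it assumes a made-up wrongness hill, loss = w squared, whose slope is 2w and whose bottom sits at w = 0. The function name descend and the three step sizes are mine, picked only to show the three behaviours.

```python
def descend(step, start=1.0, n_steps=20):
    """Run plain gradient descent on the toy hill loss = w**2."""
    w = start
    for _ in range(n_steps):
        slope = 2.0 * w          # slope of w**2
        w = w - step * slope
    return w

too_big = descend(step=1.1)      # each step lands further up the other side
too_small = descend(step=0.01)   # creeps: still far from 0 after 20 steps
about_right = descend(step=0.3)  # settles at the bottom

print(abs(too_big), abs(too_small), abs(about_right))
```

On this toy, any step above 1.0 makes the dial grow with every update, which is the "worse after training than before" symptom. The safe range on a real network is different and has to be found by trying.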
## A Manager Who Sizes the Steps (Adam)
A fixed step is crude: early on you want big strides, near the bottom you want tiny ones,
and different dials want different sizes. Rather than tune one number forever, the standard
practice is to hire a manager that sizes each dial's step automatically.
The popular one is called Adam, and it keeps TWO running averages for each dial, not one.
Average 1 -- the recent DIRECTION of the slope (its running mean). If a dial's slope has
pointed the same way for several passes, this average is large, and the dial keeps its
momentum -- it strides confidently in that direction instead of restarting from a
standstill each pass.
Average 2 -- the recent SIZE of the slope, regardless of sign (a running mean of the
slope SQUARED). Adam divides each step by the square root of this. So a dial whose
slopes have been large or jittery gets its step shrunk; a dial whose slopes have been
small and steady gets its step left long.
Put together: step direction comes from Average 1 (momentum), and step LENGTH is scaled
DOWN for dials with big or noisy slopes using Average 2. (Adam also applies a small early-
pass correction to both averages, since they start at zero and need a few passes to warm
up; that bookkeeping is not essential to the picture.) The effect is that each of the 641
dials gets its own self-sizing step. Adam is a choice, not a law -- plain gradient descent
with a hand-tuned step also works -- but Adam saves you the tuning, so I use it.
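For the curious, here is a sketch of one Adam update for a single dial, following the two-averages description above. The function name adam_step is mine, and the default values (step 0.001, decay rates 0.9 and 0.999, a tiny eps) are commonly quoted defaults that I am assuming here; your toolbox may differ slightly.

```python
import math

def adam_step(dial, slope, m, v, t, step=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for one dial. t is the update count, starting at 1."""
    m = beta1 * m + (1 - beta1) * slope        # Average 1: recent direction
    v = beta2 * v + (1 - beta2) * slope ** 2   # Average 2: recent squared size
    m_hat = m / (1 - beta1 ** t)               # early-pass warm-up correction
    v_hat = v / (1 - beta2 ** t)
    dial = dial - step * m_hat / (math.sqrt(v_hat) + eps)
    return dial, m, v

# First update of w2 with our chain-rule slope of -0.158
w2, m, v = adam_step(1.5, -0.158, m=0.0, v=0.0, t=1)
print(w2)  # about 1.501

# A slope 100 times larger takes the same-sized first step
w2_big, _, _ = adam_step(1.5, -15.8, m=0.0, v=0.0, t=1)
print(w2_big)  # about 1.501
```

The second call is the point of Average 2: dividing by the slope's recent size means a dial with huge slopes does not take huge steps.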
## Lazy Clerks and Coffee Breaks (Dropout)
Run the study loop for thirty or forty passes and a subtler failure appears. Among Room 1's
16 clerks, one happens to start with good dials and contribute a lot. The others discover
they can lower their own loss simply by amplifying whatever that one clerk says, instead of
learning anything themselves.
In mechanism terms (not just metaphor): several clerks' dials co-adapt so that their
outputs become near-copies of one strong clerk's output, scaled. The network leans on that
one feature detector and stops developing independent ones. It fits the 341 study patients
in fine detail -- including their noise -- so study loss keeps falling while practice loss
stalls and then climbs. That gap IS overfitting, and you watch it open in real time by
plotting study loss and practice loss together each pass.
The fix is almost rude in its simplicity: on every study step, randomly silence clerks --
each one independently, with a 20% chance -- by forcing their output to zero for that step.
With 16 clerks, 16 x 0.20 = 3.2, so about 3 clerks sit out on average (which ones, and
exactly how many, changes with every draw).
Because any clerk might be silenced on any pass, no clerk can rely on another being present.
Each must keep its own dials useful. The network is forced to spread the work across all 16
detectors instead of piling onto one. Silencing happens only during study; at practice and
exam time every clerk reports for duty (Keras handles this switch for you).
The 20% is a choice. On this dataset I tried 10%, 20%, and 30% and they landed in the same
neighbourhood -- this is a knob not worth agonising over. How many clerks sit out if 16
clerks face a 25% rate? 16 x 0.25 = 4. Four out, twelve working.
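A coffee break is only a random mask. Below is a minimal NumPy sketch, not the toolbox's actual code; the function name dropout is mine. One detail the prose skips: the clerks who stay are scaled up by 1 / (1 - rate) so the room's total output keeps the same average size, which is the usual "inverted dropout" convention and is an addition of mine here.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Silence each clerk independently with probability `rate` during study."""
    if not training:
        return activations                       # exam time: everyone reports
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)     # scale survivors up

rng = np.random.default_rng(0)
room1 = np.ones(16)                              # 16 clerks, each saying 1.0

study = dropout(room1, rate=0.2, rng=rng)
exam = dropout(room1, rate=0.2, rng=rng, training=False)

print(int((study == 0).sum()), "clerks silenced this step")
print(exam)  # unchanged: all 16 clerks at 1.0
```

Run the study line a few times and the number silenced wanders around 3; it is an average, not a quota.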
## When Numbers Explode (Numerical Stability)
One sharp edge from the real machine. The clerks usually compute in standard 32-bit
floating-point numbers (float32), and in that format the largest representable value is
roughly e^88. Push past it and the number overflows to "inf," and the next operation on
it tends to produce "nan" (not-a-number). The exact threshold depends on the number format,
the toolbox, and the hardware -- a 64-bit float reaches far higher -- but float32 is the
common default for training, so this is the edge you will actually meet.
The S-curve needs e^(-Z). If a raw score Z reaches -500, we compute e^(-(-500)) = e^500.
Since 500 is far past 88, the gear shatters: e^500 overflows to "inf," the S-curve returns
1 / (1 + inf) = exactly 0, and the loss -ln(0) is "inf." The first slope that multiplies
that inf by a zero comes out "nan," and nan poisons everything downstream -- every slope is
nan, every dial becomes nan. Nothing recovers without a restart.
I hit this once by forgetting to humble the columns (Part 1). Raw radii in the hundreds,
times unlucky starting dials, sent a score past the overflow line on the very first forward
pass. The loss was nan before the first dial ever moved. Baffling until I checked whether
the inputs were scaled.
The fix is cheap: clip Z into a safe band before the S-curve.
if Z < -80, use -80 ; if Z > +80, use +80 ; otherwise leave Z alone
At Z = -80, S(-80) = 1 / (1 + e^80) ≈ 1.8 x 10^(-35) -- indistinguishable from 0 for any
medical decision. The clip changes the answer by less than one part in 10^34 and costs one
comparison. Always worth it. (This is the np.clip(Z, -80, 80) line in the Part 1 code, now
explained.)
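Here is the clip in action, as a small float32 sketch. The function name s_curve_clipped is mine; np.clip is the same call mentioned above.

```python
import numpy as np

def s_curve_clipped(z):
    """S-curve in float32 with Z clipped into [-80, +80] first."""
    z = np.clip(np.asarray(z, dtype=np.float32), -80.0, 80.0)
    return 1.0 / (1.0 + np.exp(-z))

# Without the clip, e^500 overflows in float32
with np.errstate(over="ignore"):
    raw = np.exp(np.float32(500.0))
print(raw)  # inf

# With the clip, every score gives a finite answer
p = s_curve_clipped([-500.0, 0.0, 500.0])
print(p)  # tiny but nonzero, then 0.5, then 1.0
```

The np.errstate block is there only to silence NumPy's overflow warning for the demonstration; in real code that warning is a useful alarm and worth leaving on.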
## Three Mistakes Worth Knowing
I have made all three. The first burned half a day.
--- Mistake 1: Humbling the wrong pile ---
My first version called scaler.fit_transform(X) on all 569 rows before splitting. Natural-
feeling -- humble, then split -- but the mean and spread were computed from all patients,
exam pile included. The building had absorbed a statistical whiff of the exam answers
before grading. My reported accuracy was slightly fake.
Right: scaler.fit_transform(X_train), then scaler.transform on val and test.
Wrong: scaler.fit_transform(X) -- the exam pile helps set the ruler.
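Spelled out in full, the right order looks like the sketch below. It assumes scikit-learn and its bundled breast-cancer table, and it assumes a 60/20/20 split made with two train_test_split calls; that happens to give 341 study rows, but the exact split call from Part 1 is not reproduced here, so treat the split lines as illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 569 rows, 30 measurements

# Split FIRST (assumed 60/20/20), so the exam pile never touches the ruler.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # mean and spread from the study pile only
X_val_s = scaler.transform(X_val)          # reuse the study pile's ruler
X_test_s = scaler.transform(X_test)        # reuse it again; never refit

print(X_train.shape, X_val.shape, X_test.shape)
```

After this, the study columns average zero exactly, while the practice and exam columns average only close to zero. That small mismatch is correct: it is what unseen patients look like through a ruler they did not help set.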
--- Mistake 2: Grading on the study pile ---
I ran model.evaluate(X_train_scaled, y_train) and saw 99.7% accuracy. I was thrilled for
about a minute, then I read what I had passed in. The building had spent 50 passes
memorising that exact pile. Scoring it there is a memory test, not a grade. Grade on the
sealed exam (X_test), never on the pile the network studied.
--- Mistake 3: Forgetting the S-curve at the exit ---
Room 3 emitted a raw 14.7. I fed it straight into the loss, which expects a number in
[0, 1]. -ln(14.7) is negative; the loss went negative; the slopes pointed the wrong way;
accuracy fell as "training" proceeded. Cause: I had put activation='relu' on the final
clerk instead of activation='sigmoid'. Zero-out belongs between rooms; the S-curve belongs
at the exit.
## Putting It All in Motion (The Real Run)
Everything above was one dial moving one step, by hand. A real run is just that same step
-- error born at the output, sent backward through every dial, each one nudged by step x
slope -- repeated for all 641 dials, over all 341 study patients, fifty times over. Nothing
new happens; it only happens faster and more often. Stack both posts together and let it
run. On my machine, 50 passes over the study pile gave:
train loss 0.07 . practice loss 0.15 . sealed-exam accuracy 0.974
The gap between train (0.07) and practice (0.15) was the overfitting tell from the dropout
section -- the building was starting to memorise. I raised dropout from 0.2 to 0.3 and
added 20 more passes; the gap closed to about 0.09 vs 0.14 with essentially the same exam
accuracy. Patient #203 in the exam pile drew a 0.91 malignant score but was benign -- she
had unusually high symmetry and concavity, and the building over-trusted those two
measurements. One odd case in a hundred is no reason to redesign the architecture, but it
is a standing reminder that 97.4% accuracy still means roughly three patients in every
hundred are told the wrong thing.
## Standard Names for Part 2
Plain term                          Standard label
----------------------------------  -------------------------------------------
slope of the loss for a dial        gradient (partial derivative)
sending the error backward          backpropagation
dial = dial - step x slope          gradient descent update
step size                           learning rate
the manager who sizes steps         Adam optimiser
one full pass over the study pile   epoch
a handful of patients at a time     mini-batch
error at the exit = guess - truth   delta = (y_hat - y) for sigmoid + cross-entropy
lazy clerks copying one detector    co-adaptation
coffee break                        dropout
gear shatter past e^88              float32 overflow
clipping Z to [-80, +80]            numerical stability / sigmoid clipping
## Code, If You Want It
Nothing above needed a computer: the chain rule, the error flows, and every slope calculation
fit on scratch paper. This section is for the day you meet one.
Part 1 stopped at a built-but-untrained model. Here is the rest: compile it (choose the
manager, the wrongness ruler, and what to report), study it, and grade it once on the
sealed exam.
>> NEW TO PYTHON? Each named once:
model.compile(optimizer='adam') -- hire Adam as the step manager
loss='binary_crossentropy' -- the -ln wrongness ruler from Part 1
model.fit(validation_data=...) -- study, watching the practice pile each pass
epochs=50 -- 50 full passes over the study pile
batch_size=32 -- adjust dials after every 32 patients
model.evaluate(X_test_s, y_test) -- the sealed exam, once, at the very end
# (continues directly from the Part 1 code: X_train_s, X_val_s, X_test_s, model)
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy'],
)
history = model.fit(
    X_train_s, y_train,
    epochs=50,
    batch_size=32,
    validation_data=(X_val_s, y_val),  # the practice pile, watched but never studied
    # TODO: add EarlyStopping on val_loss so it stops when practice loss turns upward
)
loss, acc = model.evaluate(X_test_s, y_test, verbose=0)
print(f"Sealed exam accuracy: {acc:.3f}")  # my run: 0.974
I left the random_state at 42 throughout so you can reproduce my exact numbers. Drop it and
your accuracy will jitter by a percent or so from run to run -- itself a useful reminder
that the starting dials matter, and that a single number from a single seed is never the
whole story. The honest report is a band, not a point -- but that is a lesson for another
chapter.
That is a neural network, end to end, by hand: forward in Part 1, backward here. Every step
was arithmetic a tireless clerk could do. No magic turned the dials -- only the chain rule,
run backward, one slope at a time.
One thing this network does too well, though, is learn. Push it far enough and it stops
finding real patterns and starts memorising the study pile's freckles -- and flunks every
patient it has not seen. Curing that is the next chapter.
--> Continue: Chapter 8: Five Machines Against Memorising
----------------------------------------------------------------------------------------------
(c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================