==============================================================================================
RAHUL'S ML BLOG -- notes on machine learning, worked out by hand est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------
CHAPTER 7 . BUILDING A NEURAL NETWORK FROM SCRATCH . PART 1 OF 2
Stacked Rooms and One Walk by Hand: How a Neural Network Computes a Guess
Posted: 2026-06-11 . Author: Rahul Rai . Tags: neural-network, deep-learning, relu, sigmoid
============================================================================================
This post stands completely alone. If you have never read anything about machine learning,
start here. If you are an expert who wants to watch arithmetic happen one step at a time,
start here. By the end you will have built a neural network on paper and walked one patient
through it by hand, multiplication by multiplication.
This is Part 1 of two. Part 1 builds the network and computes a guess -- the forward pass.
Part 2 shows how the network LEARNS: how the dials change, worked by hand with the chain
rule. If you only want to calculate how a network produces an answer, Part 1 is complete
on its own.
A doctor has photographed cells from a breast lump and measured thirty things about each
one: the radius, the texture, the smoothness, and twenty-seven other properties. She has
done this for 569 patients. Every patient's verdict is already confirmed by biopsy:
malignant or benign. She wants a machine that studies those 569 cases and then makes the
call on patient 570 -- before the biopsy comes back.
The simplest machine that does this is one room of clerks. Each of the thirty measurements
gets multiplied by its own DIAL -- and a dial is just a number the machine will learn,
saying how much that one measurement counts toward the verdict. The thirty products are
added into a single running total. That total can come out any size, positive or negative,
so a final step squeezes it into a probability between 0 and 1 (a 0.8 reads as "80% chance
malignant"). That squeeze is not obvious and it is not magic: I derive it by hand later in
this post, in the section "Squashing Any Score Into a Probability." For now, take it on
credit -- a total goes in, a probability comes out.
Trained on the Wisconsin Breast Cancer dataset (the one I used -- you can load it with
sklearn.datasets.load_breast_cancer()) this one-room machine often lands around 95%
accuracy, though the exact figure wobbles with the split and the random seed.
But 95% is not 97%. On 569 patients that gap is about eleven people, and at least some of
those eleven will be told the wrong thing. That is why we build something harder.
This post builds the next machine: several rooms of clerks stacked together. Room 1
processes the raw measurements. Room 2 processes what Room 1 found. A final lonely clerk
turns everything into a probability. Done right, this building reaches the upper-90s.
Done wrong -- and I have done it wrong in several memorable ways -- it produces garbage
confidently. I will show you both.
Everything is done by hand. The only tools on the desk are pencils, scratch paper, and
a room of tireless clerks who can add, subtract, multiply and divide on demand.
## What Problem We Are Actually Solving
Before building anything, let me be precise about what we want.
We have a sheet of numbers. Each ROW is one patient. Each COLUMN is one measurement --
radius, texture, smoothness, and so on, thirty columns in total. One extra column, the
answer column, holds either 1 (malignant) or 0 (benign).
               radius   texture   smoothness   ...   answer
patient 1:      17.99     10.38        0.118   ...   1 (malignant)
patient 2:      20.57     17.77        0.084   ...   1 (malignant)
patient 3:      19.69     21.25        0.110   ...   1 (malignant)
...
patient 569:     7.76     24.54        0.053   ...   0 (benign)
Machine learning, in one sentence: find a mathematical rule that maps the thirty
measurements to the answer, by studying examples where we already know the answer.
The rule we are going to find is a building full of clerks, working in groups we will call
ROOMS. A room is nothing more than a bunch of clerks working side by side at the same stage
-- the word "room" is just a name for one row of them. Every clerk in a room takes the
numbers handed to it, multiplies each by its own personal dial, adds a fixed NUDGE (a nudge
is one more number the machine learns -- added on at the very end to shift the clerk's
result up or down), and writes down a single number.
Here is the part that trips people up, so let me be exact about who hands what to whom.
Clerks hand numbers to CLERKS, not to rooms -- "room" is only the grouping. Each clerk in a
room writes ONE number, and hands a copy of that number to EVERY clerk in the next room. So
the clerks in Room 1 read the thirty raw measurements; each Room-1 clerk writes one number;
every clerk in Room 2 reads all of Room-1's written numbers; and so on down the line. Stack
three such rooms and you have a neural network -- the simplest kind.
Why clerks? Because that is all the machine does: arithmetic. No gut feeling, no mystical
pattern recognition -- just multiply, add, BEND (force any negative running total up to
zero -- a trick we need shortly and explain in full in "Bending Scores at Zero"), and
repeat. Understanding it at the arithmetic level means you can debug it at the arithmetic
level, which is a skill you will use constantly.
## Splitting Evidence Into Three Honest Piles
Before touching any measurements, cut the 569 patients into three piles. This is the
first thing to do and the most important rule to follow: never let the machine study
from the same pile you use to grade it.
Why three piles, not two. First, one word we will lean on hard: a KNOB is a setting YOU
choose by hand before the machine starts learning -- exactly like the dial on a washing
machine, where you pick "cottons, 40 degrees, fast spin" before you press start. A knob is
not the same as a dial. The machine learns its own dials by itself, turning them as it
studies; but the knobs are yours to set, and the machine never touches them.
A one-room machine has only a knob or two to pick. Our building has many: how many rooms,
how many clerks per room, how many passes of study, how aggressively to silence lazy
clerks. If we pick all those knobs while peeking at the sealed exam, we are quietly shaping
the building to pass that one specific exam -- and the grade stops measuring how well the
building handles new patients.
So: a second pile, a PRACTICE pile, used freely during tuning. The sealed exam is touched
exactly once, at the very end, and its score is the honest report.
The recipe, with real arithmetic on our 569 patients:
First, seal 20% as the exam pile.
569 x 0.20 = 113.8 -> 114 patients sealed.
569 - 114 = 455 patients remain open.
Then hide 25% of those 455 as the practice pile.
455 x 0.25 = 113.75 -> 114 patients in practice.
(25% of 80% equals 20% of the total -- same size as the exam pile.)
What is left is the study pile.
455 - 114 = 341 patients.
Check: 341 + 114 + 114 = 569. All patients accounted for, none double-counted.
A picture of the two cuts:
[ 569 patients ]
/ \
/ \
[ 455 open ] [ 114 exam (sealed, never peeked) ]
/ \
/ \
[ 341 study ] [ 114 practice (graded freely during tuning) ]
The 60/20/20 fractions are a choice, not a law. Some people use 70/15/15 or 80/10/10.
What is not a choice: the exam pile is sealed first, before you look at anything.
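The two cuts above can be checked in a few lines of Python. This is a minimal sketch, not the splitter a library would use: `three_piles` is a hypothetical helper name, and it rounds to the nearest whole patient the way the hand arithmetic does.

```python
def three_piles(n_patients, exam_frac=0.20, practice_frac=0.25):
    """Return (study, practice, exam) sizes for the two-cut recipe."""
    exam = round(n_patients * exam_frac)           # first cut: seal the exam pile
    still_open = n_patients - exam                 # patients left after sealing
    practice = round(still_open * practice_frac)   # second cut, taken from what remains
    study = still_open - practice                  # everything else is for studying
    return study, practice, exam


study, practice, exam = three_piles(569)
print(study, practice, exam)                       # 341 114 114
print(study + practice + exam)                     # 569
```

The order of the cuts is the point: the exam size is fixed before anything else is computed.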
--- A small problem for you ---
We have 455 patients and want the same 60/20/20 fractions.
How many go in each pile? Work it out before reading on.
...
Seal 20%: 455 x 0.20 = 91 in the exam pile. 455 - 91 = 364 remain.
Practice is 25% of those: 364 x 0.25 = 91. Study: 364 - 91 = 273.
Check: 273 + 91 + 91 = 455. Done.
## Putting Every Column on One Ruler
Look at two columns: radius (about 7 to 28) and smoothness (about 0.05 to 0.16). Radius
numbers are roughly 150 times larger than smoothness numbers (7 / 0.05 = 140 at the low
end, 28 / 0.16 = 175 at the high end).
When a clerk multiplies both by a dial, the radius dial must stay small to compensate for
those large numbers, and the smoothness dial must grow large to compensate for those tiny
ones. The machine spends its effort managing the scale gap rather than finding which
measurements matter. The smaller column struggles to get heard during the early going,
and the dials take far longer to settle. To be precise: the scale gap does not delete the
small feature -- the network can still learn from it -- but unscaled inputs make the
optimiser's job dramatically harder and slower.
The fix is to put every column on the same ruler:
humbled value = (raw value - study mean) / study spread
where "study mean" is the average of that column across the 341 study patients, and
"study spread" is how much that column varies (the standard deviation). After this,
every column of the study pile has average 0 and spread 1; the practice and exam piles
land close to that, though not exactly on it. The radius dial and smoothness dial now
multiply numbers of similar size.
The rule that must not be broken: compute the mean and spread ONLY from the 341 study
patients, then apply that same ruler to all three piles.
Why only the study pile? Because the exam pile is supposed to be patients the machine has
never seen. If we compute the mean across all 569 patients -- including the exam pile --
the exam patients' values have shaped the ruler. The transformation applied to the exam
pile is no longer blind: it was calibrated to the exam pile's own numbers. Subtle, but a
real form of cheating.
A concrete example. The study pile has mean 15.0 and spread 2.0 for the radius column.
A practice-pile patient has raw radius 17.0:
(17.0 - 15.0) / 2.0 = 2.0 / 2.0 = 1.0
One spread above the study average -- mildly large radius.
Try this: same study mean (15.0) and spread (2.0), but a patient with raw radius 11.0.
Humbled score?
...
(11.0 - 15.0) / 2.0 = -4.0 / 2.0 = -2.0. Two spreads below average.
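The same ruler can be sketched in NumPy. The two-patient study pile below is made up, chosen only so that its mean is 15.0 and its spread is 2.0 as in the worked example; `humble` is a hypothetical helper name. `np.std` with its default settings is the population standard deviation.

```python
import numpy as np

# Hypothetical study pile for the radius column: mean 15.0, spread 2.0
study_radius = np.array([13.0, 17.0])

study_mean = study_radius.mean()     # learned from the study pile only
study_spread = study_radius.std()    # population standard deviation


def humble(raw_value, mean, spread):
    """Put a raw measurement on the study pile's ruler."""
    return (raw_value - mean) / spread


# Patients from the other piles reuse the study pile's mean and spread.
print(humble(17.0, study_mean, study_spread))   # 1.0
print(humble(11.0, study_mean, study_spread))   # -2.0
```

Note that nothing about the practice or exam patients goes into `study_mean` or `study_spread`; they are only ever passed through `humble`.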
## Building a Room of Clerks
A single room has this shape. Say it has 16 clerks and each patient has 30 measurements.
Every clerk connects to ALL 30 measurements. Each connection has a dial (a number the
machine will learn). Each clerk also has one fixed nudge. Each clerk multiplies each
incoming measurement by its dial, adds the thirty products, adds its nudge, and outputs
one number:
raw score = (m1 x dial_1) + (m2 x dial_2) + ... + (m30 x dial_30) + nudge
With 16 clerks and 30 measurements:
Dials: 16 clerks x 30 dials each = 480 dials
Nudges: 16
Total numbers Room 1 must learn: 480 + 16 = 496
The 480 dials form a grid: 30 rows (one per measurement) by 16 columns (one per clerk).
This grid is the weight matrix. One patient's measurements flow in from the left, one per
row; 16 raw scores flow out the bottom, one per column.
                               dial grid (30 x 16)
                               +-+-+-+-+--   --+-+
m1:  radius             -----> |d|d|d|d|  ...  |d|
m2:  texture            -----> |d|d|d|d|  ...  |d|
...                     ----->        ...
m30: fractal dimension  -----> |d|d|d|d|  ...  |d|
                               +-+-+-+-+--   --+-+
                                | |             |
                               c1 c2    ...    c16   (16 raw scores out)
How much work? For one patient, Room 1 does:
30 multiplications x 16 clerks = 480 multiplications
30 additions per clerk x 16 clerks = 480 additions (29 to sum + 1 nudge)
Total for one patient: 960 arithmetic steps
For all 341 study patients in one pass: 341 x 960 = 327,360 operations. A room of tireless
clerks handles this before lunch. A single pencil would take months.
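Here is Room 1 as one matrix multiplication, as a rough sketch. The patient and the dials are random made-up numbers (the seed is arbitrary); only the shapes and the counts matter.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed, for repeatability only

n_measurements, n_clerks = 30, 16
patient = rng.normal(size=n_measurements)              # one made-up humbled patient
dials = rng.normal(size=(n_measurements, n_clerks))    # dial grid: 30 rows x 16 columns
nudges = rng.normal(size=n_clerks)                     # one nudge per clerk

# Every clerk's multiply-and-add, done in one line
raw_scores = patient @ dials + nudges

# Clerk 1 done the slow way, one product at a time, for comparison
clerk_1 = sum(patient[i] * dials[i, 0] for i in range(n_measurements)) + nudges[0]

learned_numbers = dials.size + nudges.size             # 480 + 16
steps_per_patient = 2 * n_measurements * n_clerks      # 480 multiplications + 480 additions

print(raw_scores.shape, learned_numbers, steps_per_patient)   # (16,) 496 960
print(np.isclose(clerk_1, raw_scores[0]))                     # True
```

Each column of `dials` belongs to one clerk, which is why clerk 1 reads `dials[i, 0]`.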
## Why Plain Stacking Falls Apart
IN HAND: Room 1 built (16 clerks, 30 inputs, 480 dials + 16 nudges = 496 learned numbers).
For one patient, Room 1 does 480 multiplications + 480 additions = 960 steps. The zero-out
rule will bend negatives.
This section shows what happens when two plain rooms are stacked without that bend.
Now stack two rooms. Room 1 feeds 16 numbers to Room 2, which has 8 clerks. Each Room 2
clerk connects to all 16 inputs from Room 1.
If both rooms do nothing but multiply and add, something quietly catastrophic happens.
Watch just two measurements and two clerks for simplicity.
Clerk A in Room 1 computes: zA = (m1 x w1) + (m2 x w2)
Clerk B in Room 2 takes zA and applies its dial v: zB = zA x v
Substituting zA:
zB = ((m1 x w1) + (m2 x w2)) x v
= (m1 x w1 x v) + (m2 x w2 x v)
That final expression is exactly what a SINGLE clerk would compute with combined dials
(w1 x v) and (w2 x v). Two rooms of arithmetic collapsed into one. Add a third room, a
hundredth -- same thing. The product of any number of dial grids is still just one dial
grid. Stack a hundred plain rooms and they are arithmetically identical to a single room.
Read this carefully, because the precise statement matters: stacking plain LINEAR rooms --
rooms that only multiply and add -- collapses into a single linear transformation. It is
not that "more layers are useless." It is that more layers are useless WITHOUT something
non-linear between them. That something is the next section.
The fix must be a rule that cannot be absorbed into a dial multiplication. Multiplying by
a dial always produces a smooth straight line through the origin. We need a BEND -- a sharp
corner no multiplication can reproduce.
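The collapse is easy to confirm numerically. In this sketch all patients, dials and nudges are random made-up numbers. It also includes the nudges, which the two-clerk algebra above left out; they fold into a single combined nudge the same way.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed

patients = rng.normal(size=(5, 30))      # five made-up humbled patients
W1, b1 = rng.normal(size=(30, 16)), rng.normal(size=16)   # plain Room 1
W2, b2 = rng.normal(size=(16, 8)), rng.normal(size=8)     # plain Room 2

# Two plain rooms, one after the other (multiply and add only, no bend)
two_rooms = (patients @ W1 + b1) @ W2 + b2

# A single room whose dials and nudges are built from the two above
W_combined = W1 @ W2                     # 30 x 8 dial grid
b_combined = b1 @ W2 + b2                # 8 nudges
one_room = patients @ W_combined + b_combined

print(np.allclose(two_rooms, one_room))  # True
```

The 16 clerks in the middle leave no trace: the combined grid is 30 by 8, exactly what one room of 8 clerks would have.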
## Bending Scores at Zero
Between each room, apply one rule to every score leaving the room:
If the score is below zero, replace it with zero.
If the score is zero or above, leave it alone.
Three examples so there is no ambiguity:
score = +2.4 -> keep 2.4
score = -0.7 -> write 0
score = 0.0 -> write 0 (the boundary counts as non-negative)
This rule is called ReLU in the literature (Rectified Linear Unit -- more on names at the
end). I will call it the zero-out rule until then.
Why does this fix the collapse? The zero-out rule has a sharp corner at zero: flat below,
diagonal above. No multiplication can produce that corner. The collapse argument required
every step to be a multiply-and-add; insert the zero-out rule between rooms and the
argument breaks.
A clean way to see it: take Z = -3 and scale it. Multiplication gives -3 x 2 = -6. But
zero-out then multiplication gives max(0, -3) x 2 = 0, not -6. The zero-out rule changed
the value in a way no dial can undo.
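The Z = -3 example, in code. This is a small sketch with made-up measurements and dials picked so the Room-1 score comes out to exactly -3; `zero_out` is my name for the rule, not a library function.

```python
import numpy as np


def zero_out(scores):
    """Replace negatives with zero; leave everything else alone."""
    return np.maximum(0.0, scores)


m = np.array([1.0, -2.0])     # made-up measurements
w = np.array([1.0, 2.0])      # Room-1 clerk's dials: 1*1 + (-2)*2 = -3
v = 2.0                       # Room-2 clerk's dial

zA = m @ w                    # -3.0
plain = zA * v                # multiply only
bent = zero_out(zA) * v       # bend first, then multiply

print(plain, bent)            # -6.0 0.0

# No single replacement dial covers both signs:
# z = +3 needs a combined dial of 2, z = -3 needs a combined dial of 0.
for z in (3.0, -3.0):
    print(z, zero_out(z) * v)
```

That last loop is the whole argument: one dial cannot be 2 and 0 at the same time, so the bent pair of rooms is not a single plain room.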
One failure mode worth knowing. If a clerk's raw score is negative for EVERY study patient,
that clerk always outputs zero. The next room never hears from it; the clerk is "dead." If
all 16 clerks in Room 1 die at once, the building is blind. I caused this once by setting
every starting dial to the same large negative value. Frameworks default to small random
dials specifically to avoid it.
Try this: four scores come out of Room 1: [3.1, -0.2, 0.0, -5.8]. Apply the zero-out rule.
...
[3.1, 0, 0, 0]. Only the positive score survives unchanged; the 0.0 stays 0 and the two
negative scores become zero.
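For checking by machine, here is a sketch of the zero-out rule on the four scores above, plus one possible way to spot a dead clerk. The small grid of raw scores is made up for illustration, and the dead-clerk test is my own illustrative check, not a standard library call.

```python
import numpy as np


def zero_out(scores):
    """Replace every negative score with zero; leave the rest alone."""
    return np.maximum(0.0, scores)


room_1_scores = np.array([3.1, -0.2, 0.0, -5.8])
print(zero_out(room_1_scores).tolist())       # [3.1, 0.0, 0.0, 0.0]

# Made-up raw scores: each row is a patient, each column is a clerk.
raw = np.array([[ 0.4, -1.0],
                [-0.3, -2.5],
                [ 1.2, -0.1]])

# A clerk is "dead" on this pile if its bent output is zero for every patient.
dead = (zero_out(raw) == 0).all(axis=0)
print(dead.tolist())                          # [False, True]
```

Clerk 1 goes quiet for one patient and is fine; clerk 2 is silent for all three, which is the case to worry about.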
## Squashing Any Score Into a Probability
Room 1 bends its outputs and passes them on; Room 2 does the same; the final lonely clerk
adds everything with its dials and nudge and produces one raw score -- any real number.
But the doctor wants a PROBABILITY between 0 and 1. The zero-out rule will not do it -- it
still allows any positive number. We need something that approaches 1 as the score grows
and 0 as it drops.
The obvious first try, 1 / |Z|, fails twice: it explodes at Z = 0 (division by zero), and
it gives the same answer for Z = 100 and Z = -100 -- one screaming malignant, one screaming
benign, both mapped to the same probability. Useless.
Here is an approach that works. Instead of the probability p directly, think about the
ODDS: chance of malignant divided by chance of not.
p = 0.75 (75% malignant): odds = 0.75 / 0.25 = 3
p = 0.50 (even money): odds = 0.50 / 0.50 = 1
p = 0.25: odds = 0.25 / 0.75 = 1/3
Odds run from 0 to infinity as p runs from 0 to 1 -- a nicer range.
Now one modelling choice: assume each +1 step in the raw score Z multiplies the odds by a
fixed amount. Call it e (the natural base, about 2.718). This is a decision, not a law --
it happens to make the learning calculus in Part 2 clean. So:
odds = e^Z
p / (1 - p) = e^Z
Solve for p. Four lines of algebra:
p = e^Z - p x e^Z (multiply both sides by (1 - p) and expand)
p + p x e^Z = e^Z (move the p term left)
p x (1 + e^Z) = e^Z (factor out p)
p = e^Z / (1 + e^Z) (divide)
Multiply top and bottom by e^(-Z):
p = 1 / (1 + e^(-Z))
That is the S-curve. Check it with three values (these hand values are rounded -- I mark
approximations with "≈"):
Z = 0:
e^(-0) = e^0 = 1. p = 1 / (1 + 1) = 1/2 = 0.50 (raw score zero -> 50/50, on the fence)
Z = +4:
e^2 = 2.718 x 2.718 ≈ 7.39. e^4 ≈ 7.39 x 7.39 ≈ 54.6. e^(-4) ≈ 1/54.6 ≈ 0.018.
p = 1 / (1 + 0.018) ≈ 0.982 (strongly malignant)
Z = -4:
e^(-(-4)) = e^4 ≈ 54.6.
p = 1 / (1 + 54.6) ≈ 0.018 (strongly benign)
As Z grows, p approaches 1; as Z shrinks, p approaches 0; at Z = 0, exactly 0.50. No
division by zero. A large positive and a large negative give opposite probabilities, as
they should. This squasher is applied ONLY to the final clerk's output -- never between
inner rooms, where the zero-out rule does the work.
Try the squasher for Z = 0 yourself. The formula is 1 / (1 + e^(-Z)). At Z = 0, e^(-0) = ?
...
e^0 = 1. So 1 / (1 + 1) = 1/2 = 0.50. Exactly on the fence.
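The three hand checks can be repeated with the standard library. This is a bare sketch: `squash` is a hypothetical name, and it ignores the overflow that `math.exp` would hit for extremely negative scores.

```python
import math


def squash(z):
    """S-curve: turn any raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (0.0, 4.0, -4.0):
    p = squash(z)
    odds = p / (1.0 - p)      # should match e^Z, the modelling choice we started from
    print(f"Z = {z:+.1f}   p = {p:.3f}   odds = {odds:.3f}   e^Z = {math.exp(z):.3f}")
```

The last two columns agree on every row, which confirms the algebra: squashing Z and then taking the odds gives back e^Z.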
## Walking One Patient Through, Start to Finish
The building is complete on paper. Let me walk one patient through it by hand.
I will use a toy building: 2 measurements, 2 clerks in Room 1, 1 final clerk. A real
building has 30 measurements and 16 clerks in Room 1 -- the arithmetic is identical, just
more rows. The toy keeps the numbers on one page.
Patient on the desk. Two humbled measurements: [0.5, -1.2]. True answer: 1 (malignant).
ROOM 1 has 2 clerks.
Clerk 1: dials [2.0, 1.0], nudge 0.1.
Clerk 2: dials [-1.0, 0.5], nudge -0.2.
--- Clerk 1 ---
first 0.5 x 2.0 = 1.0
then -1.2 x 1.0 = -1.2
add 1.0 + (-1.2) = -0.2
nudge -0.2 + 0.1 = -0.1
Clerk 1 raw score: -0.1
--- Clerk 2 ---
first 0.5 x (-1.0) = -0.5
then -1.2 x 0.5 = -0.6
add -0.5 + (-0.6) = -1.1
nudge -1.1 + (-0.2) = -1.3
Clerk 2 raw score: -1.3
Raw paper out of Room 1: [-0.1, -1.3]
--- Zero-out rule ---
-0.1 < 0 -> write 0
-1.3 < 0 -> write 0
Paper entering Room 2: [0, 0]
Both clerks went negative; both became zero. This is the dead-clerk situation in miniature:
Room 2 receives all zeros. The building is not broken -- the final clerk's nudge still
produces an output -- but everything specific to this patient was erased by Room 1. If it
happened to every patient we would have a real problem; here it happens to this one patient
with these specific starting dials.
After studying (Part 2), the dials adjust so patients get non-zero signals through. We are
only watching the first forward pass.
--- Final clerk ---
Room 2 has 1 clerk: dials [3.0, -2.0], nudge 0.5.
first 0 x 3.0 = 0
then 0 x (-2.0) = 0
add 0 + 0 = 0
nudge 0 + 0.5 = 0.5
Final raw score: 0.5
--- S-curve ---
p = 1 / (1 + e^(-0.5))
e^0.5 is sqrt(e). e ≈ 2.718. 1.649 x 1.649 ≈ 2.719, so e^0.5 ≈ 1.649, e^(-0.5) ≈ 0.607.
1 + 0.607 = 1.607. 1 / 1.607 ≈ 0.622.
Building's verdict: 0.622, or about 62.2% chance malignant. True answer: 1. Leaning the
right way -- just not confidently yet. After studying (Part 2), the verdict on this kind
of patient will sharpen.
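The whole walk, replayed in NumPy as a sketch. The numbers are the toy dials and nudges from above; the only choice of mine is the layout, with one column of `W1` per clerk to match the dial grid.

```python
import numpy as np

patient = np.array([0.5, -1.2])          # two humbled measurements

# Room 1: column 0 holds Clerk 1's dials, column 1 holds Clerk 2's
W1 = np.array([[2.0, -1.0],
               [1.0,  0.5]])
b1 = np.array([0.1, -0.2])               # Room 1 nudges

# Final clerk: two dials and one nudge
w_out = np.array([3.0, -2.0])
b_out = 0.5

raw_1 = patient @ W1 + b1                # Room 1 raw scores, about [-0.1, -1.3]
bent_1 = np.maximum(0.0, raw_1)          # zero-out rule -> [0, 0]
z_final = bent_1 @ w_out + b_out         # final raw score: 0.5
p = 1.0 / (1.0 + np.exp(-z_final))       # S-curve

print(np.round(raw_1, 3), bent_1, z_final, round(float(p), 3))
```

The last number printed is 0.622, the same verdict the pencil reached.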
--- Your turn ---
A slightly wider toy building. New patient: three humbled measurements [1, 2, 3].
Room 1 has 2 clerks with 3 dials each:
Clerk 1: dials [0.1, 0.2, 0.1], nudge 0.5.
Clerk 2: dials [-0.1, 0.0, 0.2], nudge 0.0.
Compute CLERK 2's raw score, line by line.
...
first 1 x (-0.1) = -0.1
then 2 x 0.0 = 0.0
then 3 x 0.2 = 0.6
sum: -0.1 + 0.0 + 0.6 = 0.5
add nudge: 0.5 + 0.0 = 0.5
Clerk 2's raw score = 0.5.
Clerk 1 would give (1x0.1)+(2x0.2)+(3x0.1)+0.5 = 0.1+0.4+0.3+0.5 = 1.3. Neither is
negative, so the zero-out rule keeps both. A good sign.
## How Wrong Was That Guess?
IN HAND: patient [0.5, -1.2], true answer 1. Room 1 raw scores [-0.1, -1.3]; zero-out
gave [0, 0]; final clerk raw score 0.5; S-curve gave p = 0.622.
This section measures the wrongness of guessing 0.622 when the truth is 1.
We have a guess (0.622) and a true answer (1). We need to measure the wrongness -- both to
report it and, in Part 2, to know which way to turn the dials.
The obvious ruler -- square the difference -- treats probabilities like plain numbers and
undersells confident wrong answers. A machine that says 0.001 (99.9% sure benign) on a
malignant patient should be punished far more than one that says 0.4. Squaring barely
separates them: (0.001 - 1)^2 ≈ 0.998 versus (0.4 - 1)^2 = 0.36 -- a factor of three, when
the first is a catastrophe and the second is a near-miss.
What we want: a wrongness that grows without limit as the machine gets more confident in
the wrong direction. The natural logarithm does exactly this.
When the true answer is 1: loss = -ln(guess).
guess = 0.999: loss = -ln(0.999) ≈ 0.001 (nearly right, tiny loss)
guess = 0.622: loss = -ln(0.622) ≈ ?
guess = 0.100: loss = -ln(0.100) ≈ 2.303 (badly wrong)
guess = 0.001: loss = -ln(0.001) ≈ 6.908 (catastrophic)
Now our patient (guess 0.622, true 1). I know ln(0.5) = -0.693 and ln(1.0) = 0. A finer
anchor: ln(0.6) ≈ -0.511. Since 0.622 / 0.6 ≈ 1.037 and ln(1 + x) ≈ x for small x, add
about 0.036: ln(0.622) ≈ -0.511 + 0.036 = -0.475.
loss = -(-0.475) = 0.475. (The exact value is 0.4748 -- my hand estimate is good to three
places. I am estimating logs by anchoring between values I know; do not mistake these for
exact figures.)
When the true answer is 0, the formula flips: loss = -ln(1 - guess), punishing high
guesses on benign patients.
Try this: a benign patient (true 0). Building guesses 0.30. Loss?
...
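loss = -ln(1 - 0.30) = -ln(0.70). Anchoring: ln(0.5) = -0.693, ln(1.0) = 0, and 0.7 is 40%
of the way up. A straight line between the anchors gives -0.693 x 0.6 ≈ -0.416, but the log
curve bows above that line, so the true value sits closer to zero: ln(0.7) ≈ -0.357.
loss ≈ 0.357. The building said 30% malignant on a well patient -- correct direction, wrong
confidence, moderate loss.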
The full wrongness across all 341 study patients is the average of these individual losses.
As the dials improve, this average falls. Watching it fall -- alongside the separate
practice-pile loss -- is how we know the building is learning.
And HOW the dials change to make that loss fall is the whole of Part 2.
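The wrongness ruler, as a short sketch for checking the hand estimates. `wrongness` is a hypothetical name, and the three-patient pile at the end is made up purely to show the averaging.

```python
import math


def wrongness(true_answer, guess):
    """Loss for one patient; true_answer is 1 (malignant) or 0 (benign)."""
    if true_answer == 1:
        return -math.log(guess)          # punishes low guesses on malignant patients
    return -math.log(1.0 - guess)        # punishes high guesses on benign patients


print(round(wrongness(1, 0.622), 3))     # 0.475
print(round(wrongness(0, 0.30), 3))      # 0.357

# Squared error barely separates a catastrophe from a near-miss; the log does.
for guess in (0.4, 0.001):
    print(guess, round((guess - 1) ** 2, 3), round(wrongness(1, guess), 3))

# Average wrongness over a tiny made-up pile of (true answer, guess) pairs
pile = [(1, 0.622), (0, 0.30), (1, 0.999)]
average = sum(wrongness(t, g) for t, g in pile) / len(pile)
print(round(average, 3))
```

On a real run the same average would be taken over all 341 study patients rather than three.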
## Standard Names for Part 1
Plain terms used above, translated to the labels you will meet in papers and docs:
Plain term Standard label
--------------------------------- -------------------------------------------
building full of clerks neural network / deep learning
room of clerks hidden layer
final lonely clerk output neuron
dial weight (W)
fixed nudge bias (b)
dial grid weight matrix
study / practice / sealed exam train / validation / test
humbling the numbers feature scaling / StandardScaler
three-cut split train-validation-test split
zero-out rule ReLU (Rectified Linear Unit)
S-curve squasher sigmoid activation
wrongness of one guess binary cross-entropy loss
passing a patient through forward propagation
(The learning words -- gradient, backpropagation, learning rate, Adam, dropout -- arrive
in Part 2, where they belong.)
## Code, If You Want It
Nothing above needed a computer: every step was pencil arithmetic a tireless clerk could do.
This section is for the day you meet one.
Here is the forward half in Python: load, split, scale, build, and the by-hand forward
pass in code so you can check the 0.622 we computed above. The training call lives in
Part 2.
>> NEW TO PYTHON? Each function named once:
load_breast_cancer() -- the 569-patient, 30-measurement dataset
train_test_split(test_size=0.20) -- seal 20% as the exam pile
scaler.fit_transform(X_train) -- learn the ruler from the study pile AND apply it
scaler.transform(X_val) -- apply the SAME ruler without re-learning
Dense(16, activation='relu') -- Room 1: 16 clerks with the zero-out rule
Dense(1, activation='sigmoid') -- final clerk with the S-curve squasher
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
SEED = 42
data = load_breast_cancer() # 569 patients, 30 measurements
X = data.data
y = 1 - data.target # sklearn stores 0 = malignant, 1 = benign; flip so 1 = malignant
# Three honest piles: 60% study, 20% practice, 20% exam
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=SEED)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=SEED)
# Humble the columns -- fit on the study pile ONLY
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train) # learn mean/spread from study + apply
X_val_s = scaler.transform(X_val) # same ruler, applied blindly
X_test_s = scaler.transform(X_test) # same ruler, exam never peeked
# Build the building: Room 1 (16 clerks), Room 2 (8 clerks), Room 3 (1 clerk)
model = Sequential([
    Dense(16, activation='relu', input_shape=(X_train_s.shape[1],)),
    Dropout(0.2), # coffee break -- explained in Part 2
    Dense(8, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid'), # S-curve exit: output is a probability
])
# The by-hand forward pass, in code (fed the toy dials, it reproduces the 0.622 from above)
# weights is a flat list [W1, b1, W2, b2, ...]; each W has one column per clerk
def forward_pass_by_hand(X, weights):
    A = X.copy()
    for i in range(0, len(weights) - 2, 2):
        W, b = weights[i], weights[i+1]
        Z = A @ W + b
        A = np.maximum(0.0, Z) # zero-out rule between rooms
    W, b = weights[-2], weights[-1]
    Z = A @ W + b
    Z = np.clip(Z, -80, 80) # clip before the S-curve (why: Part 2)
    return 1.0 / (1.0 + np.exp(-Z)) # S-curve at the exit
The model is built but not yet trained -- its dials are still random. Teaching those dials
to fall down the wrongness hill, worked by hand with the chain rule, is Part 2.