Stacked Rooms and One Walk by Hand: How a Neural Network Computes a Guess

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================

  CHAPTER 7 . BUILDING A NEURAL NETWORK FROM SCRATCH . PART 1 OF 2
  Posted: 2026-06-11 . Author: Rahul Rai . Tags: neural-network, deep-learning, relu, sigmoid
  ============================================================================================

  This post stands completely alone. If you have never read anything about machine learning,
  start here. If you are an expert who wants to watch arithmetic happen one step at a time,
  start here. By the end you will have built a neural network on paper and walked one patient
  through it by hand, multiplication by multiplication.

  This is Part 1 of two. Part 1 builds the network and computes a guess -- the forward pass.
  Part 2 shows how the network LEARNS: how the dials change, worked by hand with the chain
  rule. If you only want to calculate how a network produces an answer, Part 1 is complete
  on its own.


  A doctor has photographed cells from a breast lump and measured thirty things about each
  one: the radius, the texture, the smoothness, and twenty-seven other properties. She has
  done this for 569 patients. Every patient's verdict is already confirmed by biopsy:
  malignant or benign. She wants a machine that studies those 569 cases and then makes the
  call on patient 570 -- before the biopsy comes back.

  The simplest machine that does this is one room of clerks. Each of the thirty measurements
  gets multiplied by its own DIAL -- and a dial is just a number the machine will learn,
  saying how much that one measurement counts toward the verdict. The thirty products are
  added into a single running total. That total can come out any size, positive or negative,
  so a final step squeezes it into a probability between 0 and 1 (a 0.8 reads as "80% chance
  malignant"). That squeeze is not obvious and it is not magic: I derive it by hand later in
  this post, in the section "Squashing Any Score Into a Probability." For now, take it on
  credit -- a total goes in, a probability comes out.

  Trained on the Wisconsin Breast Cancer dataset (the one I used -- you can load it with
  sklearn.datasets.load_breast_cancer()) this one-room machine often lands around 95%
  accuracy, though the exact figure wobbles with the split and the random seed.

  But 95% is not 97%. On 569 patients that gap is about eleven people, and at least some of
  those eleven will be told the wrong thing. That is why we build something harder.

  This post builds the next machine: several rooms of clerks stacked together. Room 1
  processes the raw measurements. Room 2 processes what Room 1 found. A final lonely clerk
  turns everything into a probability. Done right, this building reaches the upper-90s.
  Done wrong -- and I have done it wrong in several memorable ways -- it produces garbage
  confidently. I will show you both.

  Everything is done by hand. The only tools on the desk are pencils, scratch paper, and
  a room of tireless clerks who can add, subtract, multiply and divide on demand.


  ## What Problem We Are Actually Solving

  Before building anything, let me be precise about what we want.

  We have a sheet of numbers. Each ROW is one patient. Each COLUMN is one measurement --
  radius, texture, smoothness, and so on, thirty columns in total. One extra column, the
  answer column, holds either 1 (malignant) or 0 (benign).

                 radius  texture  smoothness  ...  answer
    patient 1:    17.99    10.38       0.118  ...  1 (malignant)
    patient 2:    20.57    17.77       0.084  ...  1 (malignant)
    patient 3:    19.69    21.25       0.110  ...  1 (malignant)
    ...
    patient 569:   7.76    24.54       0.053  ...  0 (benign)

  Machine learning, in one sentence: find a mathematical rule that maps the thirty
  measurements to the answer, by studying examples where we already know the answer.

  The rule we are going to find is a building full of clerks, working in groups we will call
  ROOMS. A room is nothing more than a bunch of clerks working side by side at the same stage
  -- the word "room" is just a name for one row of them. Every clerk in a room takes the
  numbers handed to it, multiplies each by its own personal dial, adds a fixed NUDGE (a nudge
  is one more number the machine learns -- added on at the very end to shift the clerk's
  result up or down), and writes down a single number.

  Here is the part that trips people up, so let me be exact about who hands what to whom.
  Clerks hand numbers to CLERKS, not to rooms -- "room" is only the grouping. Each clerk in a
  room writes ONE number, and hands a copy of that number to EVERY clerk in the next room. So
  the clerks in Room 1 read the thirty raw measurements; each Room-1 clerk writes one number;
  every clerk in Room 2 reads all of Room-1's written numbers; and so on down the line. Stack
  three such rooms and you have a neural network -- the simplest kind.

  Why clerks? Because that is all the machine does: arithmetic. No gut feeling, no mystical
  pattern recognition -- just multiply, add, BEND (force any negative running total up to
  zero -- a trick we need shortly and explain in full in "Bending Scores at Zero"), and
  repeat. Understanding it at the arithmetic level means you can debug it at the arithmetic
  level, which is a skill you will use constantly.


  ## Splitting Evidence Into Three Honest Piles

  Before touching any measurements, cut the 569 patients into three piles. This is the
  first thing to do and the most important rule to follow: never let the machine study
  from the same pile you use to grade it.

  Why three piles, not two? First, one word we will lean on hard: a KNOB is a setting YOU
  choose by hand before the machine starts learning -- exactly like the settings on a washing
  machine, where you pick "cottons, 40 degrees, fast spin" before you press start. A knob is
  not the same as a dial. The machine learns its own dials by itself, turning them as it
  studies; but the knobs are yours to set, and the machine never touches them.

  A one-room machine has only a knob or two to pick. Our building has many: how many rooms,
  how many clerks per room, how many passes of study, how aggressively to silence lazy
  clerks. If we pick all those knobs while peeking at the sealed exam, we are quietly shaping
  the building to pass that one specific exam -- and the grade stops measuring how well the
  building handles new patients.

  So: a second pile, a PRACTICE pile, used freely during tuning. The sealed exam is touched
  exactly once, at the very end, and its score is the honest report.

  The recipe, with real arithmetic on our 569 patients:

    First, seal 20% as the exam pile.
      569 x 0.20 = 113.8  ->  114 patients sealed.
      569 - 114 = 455 patients remain open.

    Then hide 25% of those 455 as the practice pile.
      455 x 0.25 = 113.75  ->  114 patients in practice.
      (25% of 80% equals 20% of the total -- same size as the exam pile.)

    What is left is the study pile.
      455 - 114 = 341 patients.

    Check: 341 + 114 + 114 = 569. All patients accounted for, none double-counted.

  A picture of the two cuts:

                       [ 569 patients ]
                          /          \
                         /            \
                 [ 455 open ]      [ 114 exam (sealed, never peeked) ]
                    /        \
                   /          \
        [ 341 study ]    [ 114 practice (graded freely during tuning) ]

  The 60/20/20 fractions are a choice, not a law. Some people use 70/15/15 or 80/10/10.
  What is not a choice: the exam pile is sealed first, before you look at anything.
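
  If you would rather let a machine do the pile arithmetic, here is a minimal sketch in plain
  Python. The function name three_piles and its arguments are my own, made up for this
  illustration; it rounds to the nearest whole patient the way the recipe above does, and a
  real splitting tool may round slightly differently.

```python
def three_piles(n_patients, exam_frac=0.20, practice_frac_of_open=0.25):
    """Return (study, practice, exam) pile sizes for the two-cut recipe."""
    exam = round(n_patients * exam_frac)                  # first cut: seal the exam pile
    open_pile = n_patients - exam                         # everything not sealed
    practice = round(open_pile * practice_frac_of_open)   # second cut: practice pile
    study = open_pile - practice                          # what is left to study from
    return study, practice, exam


study, practice, exam = three_piles(569)
print(study, practice, exam)       # 341 114 114
print(study + practice + exam)     # 569
```

  Because the study pile is computed by subtraction rather than by a third rounding, the
  three piles always add back up to the full patient count.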

  --- A small problem for you ---

  Suppose a different, smaller dataset has 455 patients and we want the same 60/20/20 fractions.
  How many go in each pile? Work it out before reading on.

  ...

  Seal 20%: 455 x 0.20 = 91 in the exam pile. 455 - 91 = 364 remain.
  Practice is 25% of those: 364 x 0.25 = 91. Study: 364 - 91 = 273.
  Check: 273 + 91 + 91 = 455. Done.


  ## Putting Every Column on One Ruler

  Look at two columns: radius (about 7 to 28) and smoothness (about 0.05 to 0.16). Radius
  numbers are roughly 150 times larger than smoothness numbers (7 / 0.05 = 140 at the low
  end, 28 / 0.16 = 175 at the high end).

  When a clerk multiplies both by a dial, the radius dial must stay small to compensate for
  those large numbers, and the smoothness dial must grow large to compensate for those tiny
  ones. The machine spends its effort managing the scale gap rather than finding which
  measurements matter. The smaller column struggles to get heard during the early going,
  and the dials take far longer to settle. To be precise: the scale gap does not delete the
  small feature -- the network can still learn from it -- but unscaled inputs make the
  optimiser's job dramatically harder and slower.

  The fix is to put every column on the same ruler:

      humbled value = (raw value - study mean) / study spread

  where "study mean" is the average of that column across the 341 study patients, and
  "study spread" is how much that column varies (the standard deviation). After this,
  every column has average 0 and spread 1 across the study pile (the other two piles land
  close to that, not exactly on it). The radius dial and smoothness dial now multiply
  numbers of similar size.

  The rule that must not be broken: compute the mean and spread ONLY from the 341 study
  patients, then apply that same ruler to all three piles.

  Why only the study pile? Because the exam pile is supposed to be patients the machine has
  never seen. If we compute the mean across all 569 patients -- including the exam pile --
  the exam patients' values have shaped the ruler. The transformation applied to the exam
  pile is no longer blind: it was calibrated to the exam pile's own numbers. Subtle, but a
  real form of cheating.

  A concrete example. The study pile has mean 15.0 and spread 2.0 for the radius column.
  A practice-pile patient has raw radius 17.0:

      (17.0 - 15.0) / 2.0 = 2.0 / 2.0 = 1.0

  One spread above the study average -- mildly large radius.

  Try this: same study mean (15.0) and spread (2.0), but a patient with raw radius 11.0.
  Humbled score?

  ...

  (11.0 - 15.0) / 2.0 = -4.0 / 2.0 = -2.0. Two spreads below average.
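
  The same ruler in plain Python, as a small sketch. The four study radii are numbers I made
  up so that the mean comes out at 15.0 and the spread at 2.0, matching the worked example;
  humble is my own helper name, not a library function. The spread here is the population
  standard deviation (divide by n, not n - 1).

```python
from statistics import fmean, pstdev

# Made-up study-pile radii, chosen so the mean is 15.0 and the spread is 2.0
study_radii = [13.0, 17.0, 13.0, 17.0]

study_mean = fmean(study_radii)       # learned from the study pile only
study_spread = pstdev(study_radii)    # population standard deviation


def humble(raw_value, mean=study_mean, spread=study_spread):
    """Put one raw value on the study pile's ruler."""
    return (raw_value - mean) / spread


print(study_mean, study_spread)   # 15.0 2.0
print(humble(17.0))               # 1.0
print(humble(11.0))               # -2.0
```

  Note that humble never recomputes the mean or spread: a practice or exam patient is
  measured against the ruler the study pile produced, which is the whole point of the rule.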


  ## Building a Room of Clerks

  A single room has this shape. Say it has 16 clerks and each patient has 30 measurements.
  Every clerk connects to ALL 30 measurements. Each connection has a dial (a number the
  machine will learn). Each clerk also has one fixed nudge. Each clerk multiplies each
  incoming measurement by its dial, adds the thirty products, adds its nudge, and outputs
  one number:

      raw score = (m1 x dial_1) + (m2 x dial_2) + ... + (m30 x dial_30) + nudge

  With 16 clerks and 30 measurements:
    Dials: 16 clerks x 30 dials each = 480 dials
    Nudges: 16
    Total numbers Room 1 must learn: 480 + 16 = 496

  The 480 dials form a grid: 30 rows (one per measurement) by 16 columns (one per clerk).
  This grid is the weight matrix. One patient's measurements flow in from the left; 16
  numbers flow out the bottom, one per clerk.

                             dial grid (30 x 16)
                            +-+-+-+-+--   --+-+
    m1: radius      ----->  |d|d|d|d|  ...  |d|
    m2: texture     ----->  |d|d|d|d|  ...  |d|
    ...             ----->  ...
    m30: symmetry   ----->  |d|d|d|d|  ...  |d|
                            +-+-+-+-+--   --+-+
                              |  |           |
                             c1 c2  ...     c16   (16 raw scores out)

  How much work? For one patient, Room 1 does:
    30 multiplications x 16 clerks      = 480 multiplications
    30 additions per clerk x 16 clerks  = 480 additions (29 to sum + 1 nudge)
    Total for one patient: 960 arithmetic steps

  For all 341 study patients in one pass: 341 x 960 = 327,360 operations. A room of tireless
  clerks handles this before lunch. A single pencil would take months.
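
  Here is one way to sketch a room in plain Python, along with the counting above. Both
  function names (room_raw_scores, room_cost) are mine, and the tiny 2-measurement, 3-clerk
  room at the bottom uses made-up dials purely to show the shape of the grid.

```python
def room_raw_scores(measurements, dial_grid, nudges):
    """One room: every clerk multiplies, sums, and adds its nudge.

    dial_grid[i][j] is the dial from measurement i to clerk j, so the grid
    has one row per measurement and one column per clerk.
    """
    scores = []
    for j in range(len(nudges)):
        total = sum(m * row[j] for m, row in zip(measurements, dial_grid))
        scores.append(total + nudges[j])
    return scores


def room_cost(n_inputs, n_clerks):
    """Return (learned numbers, arithmetic steps per patient) for one room."""
    dials = n_inputs * n_clerks
    learned = dials + n_clerks             # dials plus one nudge per clerk
    multiplications = dials
    additions = n_inputs * n_clerks        # (n_inputs - 1) to sum, plus 1 for the nudge
    return learned, multiplications + additions


learned, steps = room_cost(30, 16)
print(learned, steps, 341 * steps)         # 496 960 327360

# A tiny made-up room: 2 measurements, 3 clerks
print(room_raw_scores([1.0, 2.0],
                      [[1.0, 0.0, -1.0],
                       [0.5, 1.0, 0.0]],
                      [0.0, 1.0, 2.0]))    # [2.0, 3.0, 1.0]
```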


  ## Why Plain Stacking Falls Apart

  IN HAND: Room 1 built (16 clerks, 30 inputs, 480 dials + 16 nudges = 496 learned numbers).
  For one patient, Room 1 does 480 multiplications + 480 additions = 960 steps. A bend at
  zero has been promised but not yet built.
  This section shows what happens when two plain rooms are stacked without that bend.

  Now stack two rooms. Room 1 feeds 16 numbers to Room 2, which has 8 clerks. Each Room 2
  clerk connects to all 16 inputs from Room 1.

  If both rooms do nothing but multiply and add, something quietly catastrophic happens.

  Watch just two measurements and two clerks for simplicity.
  Clerk A in Room 1 computes: zA = (m1 x w1) + (m2 x w2)
  Clerk B in Room 2 takes zA and applies its dial v: zB = zA x v

  Substituting zA:
    zB = ((m1 x w1) + (m2 x w2)) x v
       = (m1 x w1 x v) + (m2 x w2 x v)

  That final expression is exactly what a SINGLE clerk would compute with combined dials
  (w1 x v) and (w2 x v). Two rooms of arithmetic collapsed into one. Add a third room, a
  hundredth -- same thing. The product of any number of dial grids is still just one dial
  grid. Stack a hundred plain rooms and they are arithmetically identical to a single room.

  Read this carefully, because the precise statement matters: stacking plain LINEAR rooms --
  rooms that only multiply and add -- collapses into a single linear transformation. It is
  not that "more layers are useless." It is that more layers are useless WITHOUT something
  non-linear between them. That something is the next section.

  The fix must be a rule that cannot be absorbed into a dial multiplication. Multiplying by
  a dial always produces a smooth straight line through the origin. We need a BEND -- a sharp
  corner no multiplication can reproduce.
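
  The collapse is easy to check numerically. This is a small sketch with made-up numbers for
  the two measurements and the three dials; the variable names follow the algebra above.

```python
import math

m1, m2 = 0.5, -1.2        # two made-up measurements
w1, w2 = 2.0, 1.0         # Clerk A's dials (Room 1)
v = 3.0                   # Clerk B's single dial (Room 2)

# Two plain rooms, one after the other
zA = m1 * w1 + m2 * w2
zB_two_rooms = zA * v

# One room with the combined dials (w1 x v) and (w2 x v)
zB_one_room = m1 * (w1 * v) + m2 * (w2 * v)

# The two agree up to floating-point rounding
print(math.isclose(zB_two_rooms, zB_one_room))   # True
```

  I compare with math.isclose rather than == because the two routes round their
  intermediate products differently, even though the algebra is identical.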


  ## Bending Scores at Zero

  Between each room, apply one rule to every score leaving the room:

      If the score is below zero, replace it with zero.
      If the score is zero or above, leave it alone.

  Three examples so there is no ambiguity:

      score = +2.4  ->  keep 2.4
      score = -0.7  ->  write 0
      score =  0.0  ->  write 0   (the boundary counts as non-negative)

  This rule is called ReLU in the literature (Rectified Linear Unit -- more on names at the
  end). I will call it the zero-out rule until then.

  Why does this fix the collapse? The zero-out rule has a sharp corner at zero: flat below,
  diagonal above. No multiplication can produce that corner. The collapse argument required
  every step to be a multiplication; insert the zero-out rule and the argument breaks.

  A clean way to see it: take Z = -3 and scale it. Multiplication gives -3 x 2 = -6. But
  zero-out then multiplication gives max(0, -3) x 2 = 0, not -6. The zero-out rule changed
  the value in a way no dial can undo.

  One failure mode worth knowing. If a clerk's raw score is negative for EVERY study patient,
  that clerk always outputs zero. The next room never hears from it; the clerk is "dead." If
  all 16 clerks in Room 1 die at once, the building is blind. I caused this once by setting
  every starting dial to the same large negative value. Frameworks start from small random
  dials, some positive and some negative, which makes this far less likely.

  Try this: four scores come out of Room 1: [3.1, -0.2, 0.0, -5.8]. Apply the zero-out rule.

  ...

  [3.1, 0, 0, 0]. Only the positive score passes through unchanged; the 0.0 stays 0 and the
  two negatives become zero.
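
  The zero-out rule is one line of Python. A minimal sketch (zero_out is my own name for
  it), run on the examples from this section and on the Z = -3 comparison:

```python
def zero_out(scores):
    """Replace every negative score with 0; leave the rest alone."""
    return [max(0.0, s) for s in scores]


print(zero_out([2.4, -0.7, 0.0]))         # [2.4, 0.0, 0.0]
print(zero_out([3.1, -0.2, 0.0, -5.8]))   # [3.1, 0.0, 0.0, 0.0]

# The bend cannot be folded into a dial: compare the two orders for Z = -3.
Z, dial = -3.0, 2.0
print(Z * dial)                  # -6.0  (plain multiplication)
print(zero_out([Z])[0] * dial)   # 0.0   (zero-out first, then the dial)
```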


  ## Squashing Any Score Into a Probability

  Room 1 bends its outputs and passes them on; Room 2 does the same; the final lonely clerk
  adds everything with its dials and nudge and produces one raw score -- any real number.

  But the doctor wants a PROBABILITY between 0 and 1. The zero-out rule will not do it -- it
  still allows any positive number. We need something that approaches 1 as the score grows
  and 0 as it drops.

  The obvious first try, 1 / |Z|, fails twice: it explodes at Z = 0 (division by zero), and
  it gives the same answer for Z = 100 and Z = -100 -- one screaming malignant, one screaming
  benign, both mapped to the same probability. Useless.

  Here is an approach that works. Instead of the probability p directly, think about the
  ODDS: chance of malignant divided by chance of not.

      p = 0.75 (75% malignant): odds = 0.75 / 0.25 = 3
      p = 0.50 (even money):     odds = 0.50 / 0.50 = 1
      p = 0.25:                  odds = 0.25 / 0.75 = 1/3

  Odds run from 0 to infinity as p runs from 0 to 1 -- a nicer range.

  Now one modelling choice: assume each +1 step in the raw score Z multiplies the odds by a
  fixed amount. Call it e (the natural base, about 2.718). This is a decision, not a law --
  it happens to make the learning calculus in Part 2 clean. So:

      odds = e^Z
      p / (1 - p) = e^Z

  Solve for p. Four lines of algebra:

    p = e^Z - p x e^Z       (multiply both sides by (1 - p) and expand)
    p + p x e^Z = e^Z       (move the p term left)
    p x (1 + e^Z) = e^Z     (factor out p)
    p = e^Z / (1 + e^Z)     (divide)

  Multiply top and bottom by e^(-Z):

    p = 1 / (1 + e^(-Z))

  That is the S-curve. Check it with three values (these hand values are rounded -- I mark
  approximations with "≈"):

    Z = 0:
      e^(-0) = e^0 = 1.   p = 1 / (1 + 1) = 1/2 = 0.50   (raw score zero -> 50/50, on the fence)

    Z = +4:
      e^2 = 2.718 x 2.718 ≈ 7.39.  e^4 ≈ 7.39 x 7.39 ≈ 54.6.   e^(-4) ≈ 1/54.6 ≈ 0.018.
      p = 1 / (1 + 0.018) ≈ 0.982     (strongly malignant)

    Z = -4:
      e^(-(-4)) = e^4 ≈ 54.6.
      p = 1 / (1 + 54.6) ≈ 0.018      (strongly benign)

  As Z grows, p approaches 1; as Z shrinks, p approaches 0; at Z = 0, exactly 0.50. No
  division by zero. A large positive and a large negative give opposite probabilities, as
  they should. This squasher is applied ONLY to the final clerk's output -- never between
  inner rooms, where the zero-out rule does the work.

  Try the squasher for Z = 0 yourself. The formula is 1 / (1 + e^(-Z)). At Z = 0, e^(-0) = ?

  ...

  e^0 = 1. So 1 / (1 + 1) = 1/2 = 0.50. Exactly on the fence.
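
  The squasher in plain Python, as a sketch (squash is my own name). It prints the three
  hand-checked values and also confirms that the odds p / (1 - p) come back out as e^Z, which
  is the modelling choice the derivation started from.

```python
import math


def squash(z):
    """S-curve: turn any raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (0.0, 4.0, -4.0):
    p = squash(z)
    odds = p / (1.0 - p)
    # odds should match e^z up to floating-point rounding
    print(z, round(p, 3), math.isclose(odds, math.exp(z)))
# 0.0 0.5 True
# 4.0 0.982 True
# -4.0 0.018 True
```

  This plain form is fine for hand-sized scores. For extremely negative Z (roughly below
  -700) math.exp overflows and raises OverflowError, so production code guards against that.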


  ## Walking One Patient Through, Start to Finish

  The building is complete on paper. Let me walk one patient through it by hand.

  I will use a toy building: 2 measurements, 2 clerks in Room 1, 1 final clerk. A real
  building has 30 measurements and 16 clerks in Room 1 -- the arithmetic is identical, just
  more rows. The toy keeps the numbers on one page.

  Patient on the desk. Two humbled measurements: [0.5, -1.2]. True answer: 1 (malignant).

  ROOM 1 has 2 clerks.
    Clerk 1: dials [2.0, 1.0], nudge 0.1.
    Clerk 2: dials [-1.0, 0.5], nudge -0.2.

  --- Clerk 1 ---
      first  0.5 x 2.0  =  1.0
      then  -1.2 x 1.0 = -1.2
      add    1.0 + (-1.2) = -0.2
      nudge -0.2 + 0.1 = -0.1
      Clerk 1 raw score: -0.1

  --- Clerk 2 ---
      first  0.5 x (-1.0) = -0.5
      then  -1.2 x 0.5   = -0.6
      add   -0.5 + (-0.6) = -1.1
      nudge -1.1 + (-0.2) = -1.3
      Clerk 2 raw score: -1.3

  Raw paper out of Room 1: [-0.1, -1.3]

  --- Zero-out rule ---
    -0.1 < 0  ->  write 0
    -1.3 < 0  ->  write 0
    Paper entering Room 2: [0, 0]

  Both clerks went negative; both became zero. This is a one-patient preview of the
  dead-clerk situation. Room 2 receives all zeros, so the final answer will come from the
  final clerk's nudge alone and no longer depends on this patient's measurements. The
  building is not broken: a clerk only counts as dead if it outputs zero for EVERY patient,
  and here it happens to this one patient with these specific starting dials. After studying
  (Part 2), the dials adjust so patients get non-zero signals through. We are only watching
  the first forward pass.

  --- Final clerk ---
      Room 2 has 1 clerk: dials [3.0, -2.0], nudge 0.5.
      first  0 x 3.0  = 0
      then   0 x (-2.0) = 0
      add    0 + 0 = 0
      nudge  0 + 0.5 = 0.5
      Final raw score: 0.5

  --- S-curve ---
    p = 1 / (1 + e^(-0.5))
    e^0.5 is sqrt(e). e ≈ 2.718.  1.649 x 1.649 ≈ 2.719, so e^0.5 ≈ 1.649, e^(-0.5) ≈ 0.607.
    1 + 0.607 = 1.607.   1 / 1.607 ≈ 0.622.

  Building's verdict: 0.622, or about 62.2% chance malignant. True answer: 1. Leaning the
  right way -- just not confidently yet. After studying (Part 2), the verdict on this kind
  of patient will sharpen.
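
  The whole walk, step by step, in plain Python. This is a sketch that mirrors the hand
  arithmetic line for line, using the toy dials and nudges from above; the variable names
  are mine. Here each clerk's dials are kept together in one list.

```python
import math

patient = [0.5, -1.2]                       # two humbled measurements

room1_dials = [[2.0, 1.0], [-1.0, 0.5]]     # one list of dials per clerk
room1_nudges = [0.1, -0.2]
final_dials, final_nudge = [3.0, -2.0], 0.5

# Room 1: multiply, add, nudge
raw = [sum(m * d for m, d in zip(patient, dials)) + nudge
       for dials, nudge in zip(room1_dials, room1_nudges)]

# Zero-out rule between rooms
bent = [max(0.0, z) for z in raw]

# Final clerk, then the S-curve
final_raw = sum(a * d for a, d in zip(bent, final_dials)) + final_nudge
p = 1.0 / (1.0 + math.exp(-final_raw))

print([round(z, 3) for z in raw])   # [-0.1, -1.3]
print(bent)                         # [0.0, 0.0]
print(final_raw, round(p, 3))       # 0.5 0.622
```

  The raw scores are rounded only for printing; floating-point arithmetic gives values a
  hair away from -0.1 and -1.3, which does not change anything downstream.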

  --- Your turn ---

  A slightly wider toy building. New patient: three humbled measurements [1, 2, 3].
  Room 1 has 2 clerks with 3 dials each:
    Clerk 1: dials [0.1, 0.2, 0.1], nudge 0.5.
    Clerk 2: dials [-0.1, 0.0, 0.2], nudge 0.0.
  Compute CLERK 2's raw score, line by line.

  ...

    first  1 x (-0.1) = -0.1
    then   2 x 0.0    =  0.0
    then   3 x 0.2    =  0.6
    sum:   -0.1 + 0.0 + 0.6 = 0.5
    add nudge: 0.5 + 0.0 = 0.5
  Clerk 2's raw score = 0.5.

  Clerk 1 would give (1x0.1)+(2x0.2)+(3x0.1)+0.5 = 0.1+0.4+0.3+0.5 = 1.3. Neither is
  negative, so the zero-out rule keeps both. A good sign.


  ## How Wrong Was That Guess?

  IN HAND: patient [0.5, -1.2], true answer 1. Room 1 raw scores [-0.1, -1.3]; zero-out
  gave [0, 0]; final clerk raw score 0.5; S-curve gave p = 0.622.
  This section measures the wrongness of guessing 0.622 when the truth is 1.

  We have a guess (0.622) and a true answer (1). We need to measure the wrongness -- both to
  report it and, in Part 2, to know which way to turn the dials.

  The obvious ruler -- square the difference -- treats probabilities like plain numbers and
  undersells confident wrong answers. A machine that says 0.001 (99.9% sure benign) on a
  malignant patient should be punished far more than one that says 0.4. Squaring barely
  separates them: (0.001 - 1)^2 ≈ 0.998 versus (0.4 - 1)^2 = 0.36 -- a factor of three, when
  the first is a catastrophe and the second is a near-miss.

  What we want: a wrongness that grows without limit as the machine gets more confident in
  the wrong direction. The natural logarithm does exactly this.

  When the true answer is 1: loss = -ln(guess).

    guess = 0.999:  loss = -ln(0.999) ≈ 0.001   (nearly right, tiny loss)
    guess = 0.622:  loss = -ln(0.622) ≈ ?
    guess = 0.100:  loss = -ln(0.100) ≈ 2.303   (badly wrong)
    guess = 0.001:  loss = -ln(0.001) ≈ 6.908   (catastrophic)

  For our patient (guess 0.622, true 1): I know ln(0.5) = -0.693 and ln(1.0) = 0. A finer
  anchor: ln(0.6) ≈ -0.511, and 0.622 is about 3.7% above 0.6. For small steps ln(1 + x) ≈ x,
  so that adds about 0.036: ln(0.622) ≈ -0.511 + 0.036 = -0.475.
  loss = -(-0.475) = 0.475. (The exact value is 0.4748 -- my hand estimate is good to three
  places. I am estimating logs by anchoring between values I know; do not mistake these for
  exact figures.)

  When the true answer is 0, the formula flips: loss = -ln(1 - guess), punishing high
  guesses on benign patients.

  Try this: a benign patient (true 0). Building guesses 0.30. Loss?

  ...

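  loss = -ln(1 - 0.30) = -ln(0.70). A straight line between the anchors ln(0.5) = -0.693 and
  ln(1.0) = 0 (0.7 is 40% of the way up) gives about -0.416, but the log curve bows, so that
  overshoots. A tighter route: ln(0.7) = ln(7) - ln(10) ≈ 1.946 - 2.303 = -0.357.
  loss ≈ 0.357. The building said 30% malignant on a well patient -- correct direction,
  wrong confidence, moderate loss.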

  The full wrongness across all 341 study patients is the average of these individual losses.
  As the dials improve, this average falls. Watching it fall -- alongside the separate
  practice-pile loss -- is how we know the building is learning.
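
  The wrongness ruler in plain Python, as a sketch (wrongness is my own name for it). It
  checks the hand estimates from this section, then averages over a tiny made-up pile of
  three (guess, truth) pairs to show what "the average of these individual losses" means.

```python
import math


def wrongness(guess, truth):
    """Loss for one patient: -ln(guess) if truth is 1, -ln(1 - guess) if truth is 0."""
    return -math.log(guess) if truth == 1 else -math.log(1.0 - guess)


print(round(wrongness(0.622, 1), 4))   # 0.4748
print(round(wrongness(0.30, 0), 4))    # 0.3567
print(round(wrongness(0.001, 1), 3))   # 6.908

# Average over a pile: three made-up (guess, truth) pairs
pile = [(0.622, 1), (0.30, 0), (0.90, 1)]
average = sum(wrongness(g, t) for g, t in pile) / len(pile)
print(round(average, 3))
```

  A guess of exactly 0 or 1 on the wrong side makes math.log raise a ValueError (the loss is
  infinite), which is one reason real code keeps guesses a hair away from those extremes.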

  And HOW the dials change to make that loss fall is the whole of Part 2.


  ## Standard Names for Part 1

  Plain terms used above, translated to the labels you will meet in papers and docs:

    Plain term                         Standard label
    ---------------------------------  -------------------------------------------
    building full of clerks            neural network / deep learning
    room of clerks                     hidden layer
    final lonely clerk                 output neuron
    dial                               weight (W)
    fixed nudge                        bias (b)
    dial grid                          weight matrix
    study / practice / sealed exam     train / validation / test
    humbling the numbers               standardization (z-scores) / StandardScaler
    three-pile split                   train-validation-test split
    zero-out rule                      ReLU (Rectified Linear Unit)
    S-curve squasher                   sigmoid activation
    wrongness of one guess             binary cross-entropy loss
    passing a patient through          forward propagation

  (The learning words -- gradient, backpropagation, learning rate, Adam, dropout -- arrive
  in Part 2, where they belong.)


  ## Code, If You Want It

  Nothing above needed a computer: every step was pencil arithmetic a tireless clerk could do.
  This section is for the day you meet one.

  Here is the forward half in Python: load, split, scale, build, and the by-hand forward
  pass in code so you can check the 0.622 we computed above. The training call lives in
  Part 2.

  >> NEW TO PYTHON? Each function named once:
       load_breast_cancer()              -- the 569-patient, 30-measurement dataset
       train_test_split(test_size=0.20)  -- seal 20% as the exam pile
       scaler.fit_transform(X_train)     -- learn the ruler from the study pile AND apply it
       scaler.transform(X_val)           -- apply the SAME ruler without re-learning
       Dense(16, activation='relu')      -- Room 1: 16 clerks with the zero-out rule
       Dense(1, activation='sigmoid')    -- final clerk with the S-curve squasher

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout

    SEED = 42

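    data = load_breast_cancer()          # 569 patients, 30 measurements
    # sklearn codes this dataset as 0 = malignant, 1 = benign; flip it so that
    # 1 = malignant, matching the convention used throughout this post
    X, y = data.data, 1 - data.target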

    # Three honest piles: 60% study, 20% practice, 20% exam
    X_temp, X_test,  y_temp, y_test  = train_test_split(X, y,
                                            test_size=0.20, random_state=SEED)
    X_train, X_val,  y_train, y_val  = train_test_split(X_temp, y_temp,
                                            test_size=0.25, random_state=SEED)

    # Humble the columns -- fit on the study pile ONLY
    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)  # learn mean/spread from study + apply
    X_val_s   = scaler.transform(X_val)        # same ruler, applied blindly
    X_test_s  = scaler.transform(X_test)       # same ruler, exam never peeked

    # Build the building: Room 1 (16 clerks), Room 2 (8 clerks), Room 3 (1 clerk)
    model = Sequential([
        Dense(16, activation='relu', input_shape=(X_train_s.shape[1],)),
        Dropout(0.2),                          # coffee break -- explained in Part 2
        Dense(8,  activation='relu'),
        Dropout(0.2),
        Dense(1,  activation='sigmoid'),       # S-curve exit: output is a probability
    ])

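    # The by-hand forward pass, in code
    def forward_pass_by_hand(X, weights):
        # weights is a flat list [W1, b1, W2, b2, ...], as model.get_weights() returns it
        A = X.copy()
        for i in range(0, len(weights) - 2, 2):
            W, b = weights[i], weights[i+1]
            Z = A @ W + b
            A = np.maximum(0.0, Z)             # zero-out rule between rooms
        W, b = weights[-2], weights[-1]
        Z = A @ W + b
        Z = np.clip(Z, -80, 80)                # clip before the S-curve (why: Part 2)
        return 1.0 / (1.0 + np.exp(-Z))        # S-curve at the exit

    # Check it against the hand walk: toy dials, one COLUMN per clerk
    toy_weights = [np.array([[2.0, -1.0], [1.0, 0.5]]), np.array([0.1, -0.2]),
                   np.array([[3.0], [-2.0]]),           np.array([0.5])]
    print(forward_pass_by_hand(np.array([[0.5, -1.2]]), toy_weights))  # ≈ 0.622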

  The model is built but not yet trained -- its dials are still random. Teaching those dials
  to fall down the wrongness hill, worked by hand with the chain rule, is Part 2.

  --> Continue: Chapter 7, Part 2: How a Network Learns


----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================