Leash and Cloud: L2 Punishment and the Two-Cloud Wall

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
----------------------------------------------------------------------------------------------

  CHAPTER 3 . SORTING INTO BINS . PART 3 OF 4
  Leash and Cloud: L2 Punishment and the Two-Cloud Wall
  Posted: 2026-06-05 . Author: Rahul Rai . Tags: l2-regularization, lda, generative-models
  ============================================================================================

  An over-confident machine is a dangerous thing, and the S-curve machine from Part 1 --
  one dial per column, multiply and add, squash the sum into a 0-to-1 chance -- has
  a streak of it. It sets its dials by rolling downhill on the cross-entropy leftover
  (the score that charges it for each lump by how confidently wrong its chance was), and
  left to its own devices nothing stops a single dial from ballooning to 50 if that is
  what makes the study pile fit. A dial that large means the machine has fallen for one
  column and stopped listening to the other twenty-nine -- it aces the practice exam and
  then freezes on a lump it has never seen. Confident, and wrong, which is the worst way
  to be.

  This post is about humbling it, two different ways. The first keeps the same machine but
  puts it on a TIGHTER leash. The second throws the machine out entirely and tries a
  completely different temperament -- one that never rolls downhill at all, but instead
  steps back, looks at the SHAPE of the two groups, and draws the wall in a single stroke.

  >> NOTE: THE PART 1 BASELINE ALREADY HAS A LEASH
     Worth saying up front, because it surprises people: LogisticRegression() in sklearn is
     NOT a free machine. Its defaults are penalty='l2', C=1.0 -- so the Part 1 baseline
     already carries an L2 leash of medium strength. This post is therefore not "no leash
     vs leash"; it is MEDIUM LEASH (C=1.0) vs TIGHTER LEASH (C=0.1). To see a genuinely
     unleashed machine you would have to pass penalty=None explicitly.


  ## What a Loose Leash Lets Happen

    default machine (C=1.0):  medium leash -> dials kept modest, but a dominant
                              column can still pull its dial fairly large

    truly free (penalty=None): no leash -> one dial might reach 40 or -60
    patient row -> *dials -> add -> squash -> chance

    if column 7's dial = 40:
      a shift of 0.1 in column 7 swings the sum by 40 x 0.1 = 4.0
      -> chance jumps from 0.5 to 0.98
    the machine bets almost everything on one column
    -> cocky on study pile, shaky on new lumps

  So the question this post really asks is: the default C=1.0 leash is fine, but what
  happens if we pull it TIGHTER? Do the dials shrink further, and does the machine get more
  humble?
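
  The swing arithmetic in the dial-of-40 block above can be checked in a few lines. This
  is a minimal sketch in plain Python, not code from Part 1: the helper name squash is
  made up here, and the starting sum of 0 (an even 0.5 chance) is an assumed starting
  point.

```python
import math

def squash(z):
    """Logistic S-curve: turn any sum z into a chance between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

dial = 40.0    # the ballooned dial on column 7
shift = 0.1    # a small change in that column's value

swing = dial * shift                  # 40 x 0.1 = 4.0
chance_before = squash(0.0)           # assumed start: sum of 0, chance 0.5
chance_after = squash(0.0 + swing)    # same lump after the small shift

print(f"swing in the sum: {swing:.1f}")
print(f"chance before: {chance_before:.2f}, after: {chance_after:.2f}")
```

  A tenth of a unit in one column is enough to move the machine from a coin flip to
  near-certainty, which is the over-confidence the leash is meant to curb.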


  ## A Leash on the Dials (L2 Penalty)

  Add a price tag to large dials. Every extra unit of dial size costs something. The
  machine now minimises two things at once: fit the study pile well AND keep the dials
  small. The penalised objective is:

      L_L2(beta) = L(beta) + lambda * sum_j beta_j^2

  The first term is the cross-entropy from Part 1 -- the plain-fit score: for each lump,
  take the chance the machine gave the TRUE bin, and fine it by -log of that chance, so a
  confident wrong chance costs a lot and a hedged one costs little.  The second is the sum
  of squared dials times a strength lambda. When lambda is large, even a small sum of
  squared dials gets expensive, and the machine is forced to shrink everything toward
  zero. The gradient update picks up a pull-toward-zero term:

      dL_L2/dbeta_j = (1/n) sum_i ( sig(z_i) - y_i ) x_ij  +  2*lambda*beta_j

  The term 2*lambda*beta_j pulls every dial toward zero at every step. Result: dials spread
  across all 30 columns instead of concentrating on one.
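
  To see the pull-toward-zero term in isolation, here is a hedged sketch of the penalised
  leftover and its gradient in plain Python. The function names, the four-lump study pile
  and the starting dials are all made up for illustration; the fit term is assumed to be
  the MEAN cross-entropy, matching the (1/n) in the gradient formula above.

```python
import math

def squash(z):
    return 1.0 / (1.0 + math.exp(-z))

def penalised_loss(beta, X, y, lam):
    """Mean cross-entropy plus lam * sum of squared dials."""
    fit = 0.0
    for row, label in zip(X, y):
        p = squash(sum(b * x for b, x in zip(beta, row)))
        fit += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return fit / len(X) + lam * sum(b * b for b in beta)

def penalised_gradient(beta, X, y, lam):
    """(1/n) sum_i (sig(z_i) - y_i) x_ij + 2*lam*beta_j, one entry per dial."""
    n = len(X)
    grad = [0.0] * len(beta)
    for row, label in zip(X, y):
        err = squash(sum(b * x for b, x in zip(beta, row))) - label
        for j, x in enumerate(row):
            grad[j] += err * x / n
    return [g + 2.0 * lam * b for g, b in zip(grad, beta)]

# Made-up study pile: 4 lumps, 2 columns (illustrative numbers only).
X = [[0.5, 1.2], [1.5, 0.3], [-0.7, 0.8], [-1.1, -0.4]]
y = [1, 1, 0, 0]
beta = [2.0, -1.0]

grad_free = penalised_gradient(beta, X, y, lam=0.0)
grad_leashed = penalised_gradient(beta, X, y, lam=10.0)

# The only difference between the two gradients is the leash term 2*lam*beta_j.
pull = [gl - gf for gl, gf in zip(grad_leashed, grad_free)]
print([round(p, 6) for p in pull])
```

  The pull is proportional to the dial itself, so the biggest dials get dragged back the
  hardest at every step.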

  >> NOTE: BAYESIAN READING
     Adding a squared-dial penalty is the same as placing a zero-mean Gaussian prior on
     each dial and maximising the posterior instead of the likelihood. The prior says "a
     dial as large as 40 is very surprising; please explain." The data can override this
     prior if the evidence is strong enough; otherwise the dials stay modest.


  ## The C Parameter: Counter-Intuitive Direction

  sklearn spells the leash strength as C, not lambda. Up to a constant factor that depends
  on how the fit term is summed, they are reciprocals:

      C = 1 / lambda   <=>   lambda = 1 / C

    C = 0.1   ->   lambda = 10    heavy penalty -- dials squeezed hard
    C = 1.0   ->   lambda = 1     sklearn default
    C = 1000  ->   lambda = 0.001 barely any penalty -- nearly free machine

    check each on the slate:  1/0.1 = 10;  1/1.0 = 1;  1/1000 = 0.001
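
  The same slate check as a tiny Python sketch. The helper name lambda_from_c is
  hypothetical, and it encodes only the post's convention lambda = 1/C, not sklearn's
  internal scaling.

```python
def lambda_from_c(c):
    """Leash strength under the post's convention lambda = 1 / C."""
    return 1.0 / c

for c in (0.1, 1.0, 1000.0):
    print(f"C = {c:<7g} -> lambda = {lambda_from_c(c):g}")
```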

  !! WARN: C=0.1 IS HEAVY PUNISHMENT, NOT LIGHT
     C is the budget you give the machine to IGNORE the penalty. A small budget means
     little room to ignore it -- a hard squeeze. A large C means a large "ignore" budget --
     the penalty barely bites. "C small = heavy leash" is counter-intuitive until you
     remember C = 1/lambda.


  ## Did the Dials Shrink?

  IN HAND: one S-curve machine, its leftover now two-part -- plain fit plus a fine of
  lambda times the sum of squared dials -- and the dictionary C = 1/lambda, so the tight
  setting C = 0.1 means lambda = 1/0.1 = 10.  This section adds the receipt: proof the
  dials actually shrank.

  Compare the average absolute dial size for the default leash (C=1.0) against the tighter
  leash (C=0.1): if the tighter leash is doing its job, its average dial comes out smaller.
  (The two-line check is in the code at the end of the post.)

  A concrete 3-column example, by pencil.  Suppose only 3 columns -- x1, x2, x3 --
  and three machines trained on the same data:

    dial        free (no leash)   default C=1.0   tight C=0.1
    ------------------------------------------------------------
    x1 dial       +8.2               +3.1            +1.4
    x2 dial       -5.7               -2.4            -0.9
    x3 dial       +0.3               +0.2            +0.1

    |dial| average:
      free:   (8.2 + 5.7 + 0.3) / 3  =  14.2 / 3  =  4.73
      C=1.0:  (3.1 + 2.4 + 0.2) / 3  =   5.7 / 3  =  1.90
      C=0.1:  (1.4 + 0.9 + 0.1) / 3  =   2.4 / 3  =  0.80

    The free machine lets x1 balloon to 8.2 -- it bets heavily on one
    column.  C=1.0 pulls it to 3.1.  C=0.1 pulls it to 1.4.  The tighter
    the leash, the more the machine spreads trust across all 3 columns.
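
  The pencil averages can be reproduced with a short plain-Python sketch. The dial values
  are the illustrative ones from the table above; the helper name mean_abs is made up.

```python
dials = {
    "free (no leash)": [8.2, -5.7, 0.3],
    "default C=1.0":   [3.1, -2.4, 0.2],
    "tight C=0.1":     [1.4, -0.9, 0.1],
}

def mean_abs(values):
    """Average SIZE of the dials: drop the signs, then average."""
    return sum(abs(v) for v in values) / len(values)

averages = {name: mean_abs(vals) for name, vals in dials.items()}
for name, avg in averages.items():
    print(f"{name:<16} {avg:.2f}")
```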

  >> YOUR TURN
     Charge the FREE machine's dials (+8.2, -5.7, +0.3) the fine the tight leash
     charges: lambda = 10 times the sum of squared dials.  Work it on the slate
     before reading on.

     check your slate:  8.2^2 = 67.24;  5.7^2 = 32.49;  0.3^2 = 0.09;  sum = 67.24 +
     32.49 + 0.09 = 99.82;  fine = 10 x 99.82 = 998.2.  The tight machine's own dials
     cost only 10 x (1.4^2 + 0.9^2 + 0.1^2) = 10 x (1.96 + 0.81 + 0.01) = 10 x 2.78 =
     27.8.  Ballooned dials cost about 36 times as much -- that is exactly the pressure
     that makes the machine shrink them.
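
  The same fine, worked in Python rather than on the slate. A minimal sketch: l2_fine is
  a made-up helper name, and lambda = 10 is the tight-leash value from the exercise.

```python
def l2_fine(dials, lam):
    """lam times the sum of squared dials."""
    return lam * sum(d * d for d in dials)

lam = 10.0
fine_free = l2_fine([8.2, -5.7, 0.3], lam)    # the free machine's dials
fine_tight = l2_fine([1.4, -0.9, 0.1], lam)   # the tight machine's own dials
ratio = fine_free / fine_tight

print(round(fine_free, 1), round(fine_tight, 1), round(ratio))
```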

  >> NOTE: WHY ABSOLUTE VALUE BEFORE AVERAGING
     Dials can be positive or negative. A positive dial +3 and a negative dial -3 cancel to
     zero in a plain average, making the machine look like it has no signal at all. Taking
     the absolute value first measures the SIZE of each dial regardless of sign, then
     averages those sizes. What you want to compare is pull strength, not direction.
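
  A two-dial sketch of the cancelling problem, in plain Python with made-up dials of +3
  and -3:

```python
dials = [3.0, -3.0]

plain_average = sum(dials) / len(dials)                  # signs cancel
size_average = sum(abs(d) for d in dials) / len(dials)   # sizes survive

print(plain_average, size_average)   # 0.0 3.0
```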


  ## Scaling and the Leash Are Not the Same Fix

    Scaling (put columns on one ruler):  fixes the INPUTS
    L2 leash:                            fixes the DIALS

    Even after scaling, a dial can grow huge if the machine
    over-trusts one column. You need both. They solve different problems.

  Scaling makes the 30 input columns comparable before the machine sees them. The leash
  limits how large any individual dial grows during the dial-setting step. Removing either
  one leaves a different problem unsolved.


  ## A Completely Different Machine: The Two-Cloud Wall

  IN HAND: the leashed S-curve machine -- fit plus lambda times squared dials, tight
  setting C = 0.1 (lambda = 1/0.1 = 10) -- and the receipt that its dials shrank: average
  size (1.4 + 0.9 + 0.1)/3 = 2.4/3 = 0.80 against the default's (3.1 + 2.4 + 0.2)/3 =
  5.7/3 = 1.90.  This section adds a second machine of the opposite temperament.

  Everything so far kept the same machine and tightened its leash. Now we change the
  machine itself. The S-curve machine is a fidgeter -- it inches toward the answer by trial
  and error: adjust, check the leftover, adjust again. Linear discriminant analysis (LDA)
  has the opposite personality. It does not fiddle at all. It stands back, studies the
  SHAPE of the two groups of points, and lays down the wall between them in a single
  confident stroke.

    first  split the study pile into sick rows and well rows
    then   compute 30 averages per group -> two centres in 30-column space
           (60 averages total: 30 per class, 2 class centres)
    then   find the wall between the two centres
    so     a new lump -> which side of the wall? -> that is the label

    well centre #----------+----------# sick centre
                           ^
                         wall here


  ## Fisher's Criterion: Where to Aim the Wall

  A naive wall at the midpoint between the two centres works when both clouds are round.
  Real clouds are stretched -- some directions have more spread than others, and the two
  classes may share some of that stretch. Fisher's criterion asks: which direction w
  maximises the RATIO of between-class spread to within-class spread?

      maximise  (w^T S_b w) / (w^T S_W w)

  where S_b = (mu1 - mu0)(mu1 - mu0)^T is the between-class scatter and S_W is the pooled
  within-class scatter. The solution is closed-form -- no rolling downhill required:

      w  is proportional to  S_W^-1 (mu1 - mu0)

  Meaning: take the difference between the two class centres, then multiply it by the
  inverse of the pooled within-class spread matrix, which both rotates and rescales it.
  This adjusts for the tilt and shape of the clouds -- if both clouds are elongated
  diagonally, the wall tilts to match.
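
  A sketch of that closed-form direction for two columns, in plain Python. The four
  points per class are made up for illustration (they are not from the Wisconsin sheet),
  and the helper names class_mean, scatter and fisher_direction are hypothetical. Both
  clouds are stretched along the diagonal, and their centres differ only in the first
  column.

```python
def class_mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]

def scatter(points, mean):
    """2x2 sum of outer products of deviations from the class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mean[0], p[1] - mean[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += d[r] * d[c]
    return s

def fisher_direction(class0, class1):
    """w proportional to S_W^-1 (mu1 - mu0), for 2-column data."""
    mu0, mu1 = class_mean(class0), class_mean(class1)
    s0, s1 = scatter(class0, mu0), scatter(class1, mu1)
    a, b = s0[0][0] + s1[0][0], s0[0][1] + s1[0][1]   # pooled S_W, top row
    c, d = s0[1][0] + s1[1][0], s0[1][1] + s1[1][1]   # pooled S_W, bottom row
    det = a * d - b * c
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    # Inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]].
    return [(d * diff[0] - b * diff[1]) / det,
            (-c * diff[0] + a * diff[1]) / det]

# Made-up clouds, both elongated along the diagonal.
class0 = [(0, 0), (2, 2), (1, 0), (1, 2)]   # centre (1, 1)
class1 = [(3, 0), (5, 2), (4, 0), (4, 2)]   # centre (4, 1)

naive = [m1 - m0 for m1, m0 in zip(class_mean(class1), class_mean(class0))]
w = fisher_direction(class0, class1)
print("difference of centres:", naive)   # [3.0, 0.0]
print("Fisher direction:     ", w)       # [1.5, -0.75]
```

  The plain difference of centres points straight along the first column; once the
  shared diagonal stretch is divided out, the direction picks up a negative second
  component, which is the tilt described above.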

  Where does the wall sit along that direction? If the two classes are equally common, it
  sits exactly at the midpoint of the projected class means:

      threshold = w^T (mu0 + mu1) / 2      (only when the two classes are equally common)

  A concrete 2-column, 4-person LDA walkthrough, by pencil.

    Only 2 columns (bmi and bp) and 4 people:

      person   bmi    bp     truth
      -----------------------------
        A      0.04   0.90   sick (1)
        B      0.06   0.85   sick (1)
        C      0.12   0.50   well (0)
        D      0.18   0.45   well (0)

    First, the class averages (2 class means x 2 columns = 4 averages here, not 60):

      sick class (A, B):  mu1 = ( (0.04+0.06)/2 , (0.90+0.85)/2 ) = (0.05 , 0.875)
      well class (C, D):  mu0 = ( (0.12+0.18)/2 , (0.50+0.45)/2 ) = (0.15 , 0.475)

    Then the midpoint between the two centres (an imaginary point; no actual person sits
    there):

      midpoint = ((0.05+0.15)/2 , (0.875+0.475)/2 ) = (0.10 , 0.675)

    Then the difference between centres:

      mu1 - mu0 = (0.05-0.15 , 0.875-0.475) = (-0.10 , 0.400)

    Then the wall NORMAL vector w (ignoring S_W for this clean round-cloud
    picture; when S_W is a multiple of the identity, w is proportional to mu1 - mu0):

      w = (-0.10 , 0.400)

    Finally, project each person onto w and compare to the midpoint projection:

      w^T midpoint = -0.10*0.10 + 0.400*0.675 = -0.01 + 0.270 = 0.260

      A: w^T x = -0.10*0.04 + 0.400*0.90 = -0.004 + 0.360 = 0.356  > 0.260 -> sick
      B: w^T x = -0.10*0.06 + 0.400*0.85 = -0.006 + 0.340 = 0.334  > 0.260 -> sick
      C: w^T x = -0.10*0.12 + 0.400*0.50 = -0.012 + 0.200 = 0.188  < 0.260 -> well
      D: w^T x = -0.10*0.18 + 0.400*0.45 = -0.018 + 0.180 = 0.162  < 0.260 -> well

    All 4 classified correctly.  The wall sits at 0.260 along w.  New
    lump with bmi=0.10, bp=0.70: w^T x = -0.01 + 0.28 = 0.270 > 0.260 -> sick.
    The wall is the midpoint because both classes have equal counts here.
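
  The whole pencil walkthrough fits in a short plain-Python sketch. It makes the same
  simplification as the walkthrough (S_W ignored, wall at the midpoint), and the helper
  names centre, dot and call are made up for this example.

```python
people = {
    "A": (0.04, 0.90, 1),   # (bmi, bp, truth): 1 = sick, 0 = well
    "B": (0.06, 0.85, 1),
    "C": (0.12, 0.50, 0),
    "D": (0.18, 0.45, 0),
}

def centre(label):
    """Column-wise average of the rows carrying this label."""
    rows = [(bmi, bp) for bmi, bp, truth in people.values() if truth == label]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

mu1, mu0 = centre(1), centre(0)
w = tuple(a - b for a, b in zip(mu1, mu0))               # S_W ignored, as in the walkthrough
midpoint = tuple((a + b) / 2 for a, b in zip(mu1, mu0))
threshold = dot(w, midpoint)                             # where the wall sits along w

def call(bmi, bp):
    return "sick" if dot(w, (bmi, bp)) > threshold else "well"

for name, (bmi, bp, truth) in people.items():
    print(name, round(dot(w, (bmi, bp)), 3), call(bmi, bp))
print("threshold", round(threshold, 3))
print("new lump (0.10, 0.70):", call(0.10, 0.70))
```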

  >> YOUR TURN
     Same wall: w = (-0.10, 0.400), and a lump sits on the sick side when w^T x beats
     0.260.  A new lump walks in (made-up): bmi = 0.08, bp = 0.60.  Score it.

     check your slate:  w^T x = -0.10 * 0.08 + 0.400 * 0.60 = -0.008 + 0.240 = 0.232.
     0.232 < 0.260, so the lump falls on the WELL side of the wall -- called well.

  But the two classes are usually NOT equally common, and that pure midpoint is a special
  case. The full rule scores each new lump and adds a nudge for how common each class is:

      score(x) = w^T x + w0,
      w0 = -1/2 (mu0 + mu1)^T S_W^-1 (mu1 - mu0) + log(pi1 / pi0)

  The first piece of w0 is the midpoint; the extra log(pi1/pi0) term slides the wall toward
  the rarer class so the machine doesn't over-shout it. Here pi0 and pi1 are the class
  frequencies (the priors).

  !! WARN: THE WALL IS NOT EXACTLY HALFWAY ON IMBALANCED DATA
     The Wisconsin sheet is roughly 63% well and 37% sick -- not equal. sklearn's
     LinearDiscriminantAnalysis() uses EMPIRICAL priors by default (the actual class
     frequencies), so the boundary it fits carries the log(pi1/pi0) offset and sits OFF the
     midpoint. If you place a wall at the pure midpoint and expect to reproduce sklearn's
     predictions, you will be off. To get the clean halfway wall, force equal priors:
     LinearDiscriminantAnalysis(priors=[0.5, 0.5]).
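
  How big is the nudge? A hedged sketch in plain Python, taking class 1 as sick with the
  rough 37% / 63% split quoted above. The helper name prior_nudge and the example score
  of 0.3 are made up for illustration.

```python
import math

def prior_nudge(pi1, pi0):
    """The log(pi1 / pi0) term that gets added to w0."""
    return math.log(pi1 / pi0)

equal = prior_nudge(0.5, 0.5)       # equal priors: no nudge, wall at the midpoint
skewed = prior_nudge(0.37, 0.63)    # sick is the rarer class: negative nudge

print(f"equal priors:     {equal:.3f}")
print(f"37% / 63% priors: {skewed:.3f}")

# Hypothetical lump whose score before the prior term is +0.3 (just on the sick side).
score_before = 0.3
print("equal priors call: ", "sick" if score_before + equal > 0 else "well")
print("skewed priors call:", "sick" if score_before + skewed > 0 else "well")
```

  The same lump flips from sick to well once the prior is counted, because the wall has
  moved toward the rarer class's centre.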


  ## Generative vs Discriminative

  The S-curve machine and LDA arrive at the same final form -- a linear boundary through
  the 30-column space -- but they derive it through completely different reasoning:

    Property         S-curve machine                 Two-cloud wall (LDA)
    -------------    ---------------------------     -----------------------------------
    Approach         models P(sick|x) directly       models P(x|sick) and P(sick)
    Solution         iterative (roll downhill)       closed form (one shot)
    Assumption       no distribution on x            Gaussian columns, equal spread
    Breaks when      columns perfectly tangled       spread assumption badly violated
    Works better     large pile, noisy columns       small pile, Gaussian columns

  LDA is a GENERATIVE model: it imagines each class generates Gaussian-distributed data and
  uses Bayes' theorem to flip to P(sick|x). Via that flip, P(sick|x) works out to
  sig(w^T x + w0) -- the same S-curve form as logistic regression. Same decision-boundary
  shape, different route.
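
  The flip can be checked numerically in one column. This is a sketch under stated
  assumptions, not the post's own code: the two Gaussians share one spread sigma, the
  means, spread and priors are made-up numbers, and the function names are hypothetical.

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_by_bayes(x, mu0, mu1, sigma, pi0, pi1):
    """P(sick | x) built from the generative pieces P(x | class) and P(class)."""
    top = gaussian(x, mu1, sigma) * pi1
    return top / (top + gaussian(x, mu0, sigma) * pi0)

def posterior_by_s_curve(x, mu0, mu1, sigma, pi0, pi1):
    """The same chance written as sig(w*x + w0), one-column version."""
    w = (mu1 - mu0) / sigma ** 2
    w0 = -0.5 * (mu0 + mu1) * w + math.log(pi1 / pi0)
    return 1.0 / (1.0 + math.exp(-(w * x + w0)))

params = dict(mu0=1.0, mu1=3.0, sigma=1.5, pi0=0.63, pi1=0.37)
for x in (0.0, 2.0, 4.5):
    by_bayes = posterior_by_bayes(x, **params)
    by_curve = posterior_by_s_curve(x, **params)
    print(x, round(by_bayes, 6), round(by_curve, 6))
```

  The two columns of output match at every x, which is the "same S-curve form" claim in
  miniature.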


  ## Same Accuracy: What It Means

  On the Wisconsin breast cancer sheet, LDA and the S-curve machine give nearly identical
  accuracy. Two completely different approaches, same result. That is not a coincidence:

  ** KEY: AGREEMENT = CLEAN DATA, GENUINELY SEPARABLE CLASSES
     When the two methods agree, the data is telling you the answer. A logistic machine
     that maximises likelihood and an LDA that reads cloud shapes both find the same
     dividing line because that line is clearly written in the data. If the sheet were
     noisy or the two groups heavily overlapping, the two methods would diverge and their
     disagreement would tell you something important: the boundary is ambiguous.


  ## Why Scaling Also Matters for LDA

  LDA computes S_W, the pooled within-class spread matrix, and then inverts it. If column
  "area" runs in the thousands while column "smoothness" runs in hundredths, the entries
  of S_W span many orders of magnitude, and inverting such a lopsided matrix is
  numerically fragile. In exact arithmetic the inverse cancels the units, so the wall
  itself would not move; what scaling buys is a well-conditioned S_W and entries of w that
  can be compared across columns.

  The post in seven lines:

    1. The free S-curve machine lets dials grow without limit; L2 adds a squared-dial price
       that shrinks them toward zero.
    2. C = 1/lambda: small C (say 0.1) means large lambda (10) -- heavy squeeze.
       Counter-intuitive.
    3. The leash fixes dials; scaling fixes inputs. They solve different problems; you
       need both.
    4. LDA reads the two cloud centres and their shared spread, then finds the best wall
       in one closed-form step.
    5. w prop. S_W^-1 (mu1 - mu0): the difference of centres, rotated by the inverse
       within-class spread.
    6. LDA and the S-curve machine produce the same decision-boundary form; they derive it
       from opposite directions -- generative vs discriminative.
    7. When both machines agree on accuracy, the data is cleanly separable. Disagreement is
       diagnostic.


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  Three small things: fit the tighter-leashed machine, check its dials really did shrink,
  and fit the two-cloud wall (LDA) both ways -- empirical priors and forced-equal priors.

  >> NEW TO PYTHON? Each named once:
       np.abs(x)     -- the size of each number, sign thrown away (NumPy)
       np.mean(x)    -- the average of a row of numbers

  Tighter leash (C=0.1 is a HEAVY squeeze -- remember C = 1/lambda):

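    from sklearn.linear_model import LogisticRegression

    # X_train_scaled, X_test_scaled, y_train: assumed to be the scaled split from earlier parts
    log_reg_l2 = LogisticRegression(penalty='l2', C=0.1, random_state=42)
    log_reg_l2.fit(X_train_scaled, y_train)
    y_pred_l2 = log_reg_l2.predict(X_test_scaled)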

  Did the dials shrink? Compare the average absolute dial size, default vs tighter:

    import numpy as np

    # log_reg_baseline: the Part 1 machine, LogisticRegression() with its defaults
    avg_coef_baseline = np.mean(np.abs(log_reg_baseline.coef_))  # C=1.0 (default)
    avg_coef_l2       = np.mean(np.abs(log_reg_l2.coef_))         # C=0.1 (tighter)
    # expect: avg_coef_l2 < avg_coef_baseline (the tighter leash shrinks them further)

  The two-cloud wall, both ways -- sklearn's default uses empirical priors, so force equal
  priors if you want the exact halfway wall:

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # sklearn default: empirical priors -> wall shifted off the midpoint
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train_scaled, y_train)
    y_pred_lda = lda.predict(X_test_scaled)

    # to reproduce the exact "halfway between the centres" wall:
    lda_equal = LinearDiscriminantAnalysis(priors=[0.5, 0.5])
    lda_equal.fit(X_train_scaled, y_train)


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    leash on the dials                    L2 regularisation / ridge penalty
    dial-size price                       regularisation term lambda*sum(beta_j^2)
    C (sklearn parameter)                 inverse regularisation strength (C = 1/lambda)
    two-cloud midpoint wall               LDA (linear discriminant analysis)
    cloud centre                          class mean mu_k
    pooled within-class spread            within-class scatter matrix S_W
    Fisher's criterion                    maximise (w^T S_b w)/(w^T S_W w)
    models P(x|class)                     generative model
    models P(class|x) directly            discriminative model

----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================