==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================

  CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 1 OF 6
  Looking at a Sheet With No Answers: Means, Distance, and the Ruler Problem
  Posted: 2026-06-09 . Author: Rahul Rai . Tags: eda, distance, standardization, unsupervised
  ============================================================================================

  Every post so far had an answer column to check against -- house price, car mileage,
  sick or well, diabetes score.  This post has NO answer column at all.  Just 50 states
  and 3 measurements.  No right answer means no "did the machine get it right?"  Instead
  the question is: which states look like each other?  Which ones are strange?

  The tools are simple: mean, spread, and a ruler that measures the gap between two rows.
  The trap is also simple: one column measured in hundreds shouts louder than one measured
  in tens, and the gap gets hijacked by the loudest column unless you put every column on
  the SAME ruler first.

  ## The Sheet

    State       Murder    Assault    UrbanPop
    -------     ------    -------    --------
    Alabama      13.2       236         58
    Alaska       10.0       263         48
    Arizona       8.1       294         80
    Arkansas      8.8       190         50
    California    9.0       276         91
    Colorado      7.9       204         78
    ...           ...       ...        ...
    Wyoming       6.8       161         60

    50 states (rows).  3 measurements (columns).  Zero answer columns.

    Murder    = murder arrests per 100,000 people
    Assault   = assault arrests per 100,000 people
    UrbanPop  = percent of people living in cities

  No right answer.  The goal: find which states are most alike.

  ## The Middle and the Spread

  For each column, compute two numbers -- the middle (mean) and the spread
  (standard deviation).  Add up all 50 values, divide by 50 = mean.
  For the spread: measure each value's gap from the mean, square it, average
  the squares, take the square root.

  With the full sheet in front of you (all 50 states):

    column      mean      spread
    -------     -----     -------
    Murder       7.79      4.36
    Assault    170.76     83.34
    UrbanPop    65.54     14.47

  Assault's mean is ~171 -- about twenty-two times Murder's mean (170.76 / 7.79
  ~= 21.9).  Its spread is ~83 -- about nineteen times Murder's spread (83.34 / 4.36
  ~= 19.1).  Already a warning light: Assault's numbers are much bigger.  It will
  dominate any gap measurement unless we intervene.
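
  The mean-and-spread recipe can be sketched in a few lines of plain Python.  This is
  only an illustration: it uses just the seven Murder values printed in The Sheet, so
  its results will not match the full-sheet table, and the helper names `mean` and
  `spread` are made up for this sketch.

```python
import math

def mean(values):
    # add everything up, divide by how many there are
    return sum(values) / len(values)

def spread(values):
    # gap from the mean, squared, averaged, then square-rooted
    # (dividing by n, i.e. the population standard deviation)
    m = mean(values)
    avg_sq_gap = sum((v - m) ** 2 for v in values) / len(values)
    return math.sqrt(avg_sq_gap)

# Murder values for the seven states shown in The Sheet (NOT all 50)
murder_sample = [13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 6.8]

print(round(mean(murder_sample), 2))
print(round(spread(murder_sample), 2))
```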

  ## The Ruler Problem

  Look at the ranges (max - min):

    Murder:    17.4 - 0.8  = 16.6
    Assault:  337 - 45     = 292
    UrbanPop:  91 - 32     =  59

  Assault has the widest range by far.  When you measure "how far apart are two
  states", the Assault gap can be up to 292 units, while the Murder gap is at most
  16.6 units.  The Assault column shouts about 17.6 times louder than Murder
  (292 / 16.6 ~= 17.6).  The gap number mostly reflects Assault, not the full
  picture.
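
  The range comparison above, as a short sketch.  The minimum and maximum values are
  the ones quoted in this section; the dictionary layout is just one way to hold them.

```python
# (min, max) per column, as quoted above
extremes = {
    "Murder":   (0.8, 17.4),
    "Assault":  (45, 337),
    "UrbanPop": (32, 91),
}

ranges = {col: hi - lo for col, (lo, hi) in extremes.items()}

# how many times wider each range is than Murder's
loudness = {col: r / ranges["Murder"] for col, r in ranges.items()}

for col in ranges:
    print(col, round(ranges[col], 1), round(loudness[col], 1))
```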

  ## The Gap Between Two States: Two Rulers

  IN HAND: a sheet of 50 states x 3 columns, a middle and a spread for each column,
  and a warning: Assault's range (337 - 45 = 292) dwarfs Murder's (17.4 - 0.8 = 16.6).
  This section adds the ruler itself -- two ways to measure the gap between two rows.

  You have two states with three numbers each.  How far apart are they?

  --- STRAIGHT-LINE GAP (Euclidean distance) ---

    gap = sqrt( (m1 - m2)^2 + (a1 - a2)^2 + (u1 - u2)^2 )

  Square each difference so negative and positive gaps don't cancel.  Add them.
  Take the square root to get back to the original units.

  --- CITY-BLOCK GAP (Manhattan distance) ---

    gap = |m1 - m2| + |a1 - a2| + |u1 - u2|

  No squaring.  No square root.  Just absolute differences added up.
  Called "city-block" because it measures like walking around a grid
  instead of cutting diagonally through the block.
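
  Both rulers fit in a few lines of plain Python.  A minimal sketch: the function
  names are made up here, and the 3-4-5 triangle is only a sanity check, not data
  from the sheet.

```python
import math

def straight_line_gap(p, q):
    # Euclidean: square each difference, add, take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block_gap(p, q):
    # Manhattan: absolute differences, added up
    return sum(abs(a - b) for a, b in zip(p, q))

# sanity check on a 3-4-5 right triangle
print(straight_line_gap((0, 0), (3, 4)))   # 5.0
print(city_block_gap((0, 0), (3, 4)))      # 7
```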

  ## A Worked Example: Alabama vs Alaska, by Hand

    Alabama:  Murder=13.2   Assault=236   UrbanPop=58
    Alaska:   Murder=10.0   Assault=263   UrbanPop=48

  STRAIGHT-LINE:

    Murder gap:  13.2 - 10.0 = 3.2     squared = 3.2 * 3.2 = 10.24
    Assault gap: 236 - 263 = -27       squared = 27 * 27 = 729
    UrbanPop gap: 58 - 48 = 10         squared = 10 * 10 = 100

    sum of squares: 10.24 + 729 + 100 = 839.24
    sqrt:  sqrt(839.24) ~= 28.97

    Straight-line gap = 28.97

  CITY-BLOCK:

    Murder gap:  |13.2 - 10.0| = 3.2
    Assault gap: |236 - 263| = 27
    UrbanPop gap: |58 - 48| = 10

    sum: 3.2 + 27 + 10 = 40.2

    City-block gap = 40.2

  The city-block gap is never smaller than the straight-line gap for the same
  two points (the diagonal is never longer than the L-shaped walk).  The two are
  equal only when the points differ in at most one column.
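
  To check the hand arithmetic, here is a sketch using Python's standard library
  (`math.dist` needs Python 3.8 or newer).  It also shows each column's share of the
  sum of squares, which is where Assault's loudness becomes visible.

```python
import math

columns = ["Murder", "Assault", "UrbanPop"]
alabama = [13.2, 236, 58]
alaska  = [10.0, 263, 48]

straight = math.dist(alabama, alaska)
city     = sum(abs(a - b) for a, b in zip(alabama, alaska))

print(round(straight, 2))   # 28.97
print(round(city, 1))       # 40.2

# each column's share of the sum of squares, in percent
squares = [(a - b) ** 2 for a, b in zip(alabama, alaska)]
shares = {c: 100 * s / sum(squares) for c, s in zip(columns, squares)}
for c in columns:
    print(c, round(shares[c], 1))
```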

  >> YOUR TURN
     Two more states (made-up):  X has Murder 6.0, Assault 200, UrbanPop 50;  Y has
     Murder 9.0, Assault 240, UrbanPop 60.  Work BOTH gaps on the slate.

     check your slate:  STRAIGHT-LINE -- gaps 6 - 9 = -3 (sq 9), 200 - 240 = -40
     (sq 1600), 50 - 60 = -10 (sq 100);  sum 9 + 1600 + 100 = 1709;  sqrt(1709)
     ~= 41.3.  CITY-BLOCK -- 3 + 40 + 10 = 53.  City-block (53) tops straight-line
     (41.3), as the L-walk does whenever more than one column differs -- and
     Assault's 1600 drowns the other two.

  ## The Sheet of Gaps

  Now repeat the straight-line gap for ALL pairs of states.  50 states -> 50x50
  sheet where each cell holds the gap between state i and state j:

           AL     AK     AZ     AR     CA     CO    ...
    AL   [ 0    28.97  ...    ...    ...    ...   ]
    AK   [28.97   0     ...    ...    ...    ...   ]
    AZ   [ ...   ...     0     ...    ...    ...   ]
    AR   [ ...   ...    ...     0     ...    ...   ]
    CA   [ ...   ...    ...    ...     0     ...   ]
    CO   [ ...   ...    ...    ...    ...     0    ]
    ...

    Diagonal = 0 (a state is zero distance from itself).
    Symmetric: AL-AK = AK-AL.

  This sheet IS the foundation of distance-based clustering.  Distance-based
  grouping methods all start from some version of "who is closest to whom."
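
  A sketch of the sheet of gaps, built on only the seven rows printed in The Sheet
  (so a 7x7 corner of the real 50x50 sheet, using raw, unstandardised numbers).  The
  variable names are illustrative.

```python
import math

# the seven rows shown in The Sheet: (Murder, Assault, UrbanPop)
rows = {
    "Alabama":    (13.2, 236, 58),
    "Alaska":     (10.0, 263, 48),
    "Arizona":    (8.1, 294, 80),
    "Arkansas":   (8.8, 190, 50),
    "California": (9.0, 276, 91),
    "Colorado":   (7.9, 204, 78),
    "Wyoming":    (6.8, 161, 60),
}
names = list(rows)

# gaps[i][j] = straight-line gap between state i and state j
gaps = [[math.dist(rows[a], rows[b]) for b in names] for a in names]

# print the sheet, one row per state
for name, line in zip(names, gaps):
    print(f"{name:<11}", " ".join(f"{g:7.2f}" for g in line))
```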

  ## The Same Ruler: Standardisation

  The gap between Alabama and Alaska (28.97) is heavily influenced by Assault
  because Assault's numbers are huge: its 27-point gap contributes 729 of the
  839.24 in the sum of squares, about 87%.  The Murder gap (3.2) contributes
  10.24, barely 1%.

  Fix: put every column on the SAME ruler before measuring gaps.

    standardised value = (raw value - column mean) / column spread

    Murder:  mean=7.79,  spread=4.36
    Assault: mean=170.76, spread=83.34
    UrbanPop: mean=65.54, spread=14.47

  Alabama after standardisation:

    Murder z:   (13.2 - 7.79) / 4.36   = 5.41 / 4.36  = 1.24
    Assault z:  (236 - 170.76) / 83.34  = 65.24 / 83.34 = 0.78
    UrbanPop z: (58 - 65.54) / 14.47    = -7.54 / 14.47 = -0.52

  Alaska after standardisation:

    Murder z:   (10.0 - 7.79) / 4.36   = 2.21 / 4.36  = 0.51
    Assault z:  (263 - 170.76) / 83.34  = 92.24 / 83.34 = 1.11
    UrbanPop z: (48 - 65.54) / 14.47    = -17.54 / 14.47 = -1.21

  Repeat this for all 50 states and every column ends up with mean ~0 and
  spread ~1.  No column shouts louder.

  Standardised Alabama:
    Murder: 1.24    Assault: 0.78    UrbanPop: -0.52

  Standardised Alaska:
    Murder: 0.51    Assault: 1.11    UrbanPop: -1.21
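
  The same-ruler step as a sketch, using the full-sheet means and spreads quoted
  above.  The helper name `standardise` is made up for this example.

```python
# full-sheet (mean, spread) per column, as quoted above
stats = {
    "Murder":   (7.79, 4.36),
    "Assault":  (170.76, 83.34),
    "UrbanPop": (65.54, 14.47),
}

def standardise(row):
    # (raw value - column mean) / column spread, column by column
    return {col: (row[col] - m) / s for col, (m, s) in stats.items()}

alabama = {"Murder": 13.2, "Assault": 236, "UrbanPop": 58}
alaska  = {"Murder": 10.0, "Assault": 263, "UrbanPop": 48}

for name, row in [("Alabama", alabama), ("Alaska", alaska)]:
    z = standardise(row)
    print(name, {col: round(v, 2) for col, v in z.items()})
```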

  ## The Fair Gap (After Same-Ruler)

  Straight-line gap on standardised numbers:

    Murder gap:  1.24 - 0.51 = 0.73       squared = 0.73 * 0.73 = 0.53
    Assault gap: 0.78 - 1.11 = -0.33      squared = 0.33 * 0.33 = 0.11
    UrbanPop:   -0.52 - (-1.21) = 0.69    squared = 0.69 * 0.69 = 0.48

    sum of squares: 0.53 + 0.11 + 0.48 = 1.12
    sqrt:  sqrt(1.12) ~= 1.06

    Fair gap = 1.06

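  The raw gap was 28.97, dominated by Assault's units (hundreds).  The fair gap is
  1.06, measured in spreads, and its makeup is different: Murder (0.53) and UrbanPop
  (0.48) now carry most of the sum of squares, and Assault (0.11) the least.  The
  fair gap treats every column equally.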
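
  A sketch of the fair gap in code.  One detail worth seeing: when two standardised
  values are subtracted, the column mean cancels, so only the spreads are needed.  The
  spreads are the full-sheet values quoted earlier; working from unrounded z-values
  still lands on 1.06.

```python
import math

spreads = {"Murder": 4.36, "Assault": 83.34, "UrbanPop": 14.47}
alabama = {"Murder": 13.2, "Assault": 236, "UrbanPop": 58}
alaska  = {"Murder": 10.0, "Assault": 263, "UrbanPop": 48}

# (x - mean)/s - (y - mean)/s simplifies to (x - y)/s
z_gaps = {c: (alabama[c] - alaska[c]) / spreads[c] for c in spreads}

total = sum(g ** 2 for g in z_gaps.values())
fair_gap = math.sqrt(total)
print(round(fair_gap, 2))   # 1.06

# each column's share of the sum of squares, in percent
for c, g in z_gaps.items():
    print(c, round(100 * g ** 2 / total, 1))
```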

  >> YOUR TURN
     A state (made-up) has Murder = 16.0.  Murder's mean is 7.79 and its spread is
     4.36.  Put that one number on the same ruler.

     check your slate:  z = (16.0 - 7.79) / 4.36 = 8.21 / 4.36 ~= 1.88.  The state
     sits about 1.88 spreads ABOVE the average murder rate -- a high-murder state,
     now in plain ruler-units any other column can be compared against.

  ## The Closest Pair

  Build the 50x50 fair gap sheet (standardised distances instead of raw).
  Find the smallest non-zero entry.

  The two states with the smallest gap = most similar state-pair.  (Which pair?
  That is for you to look up from the full sheet.)
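
  A sketch of the closest-pair search on just the seven rows printed in The Sheet,
  scaled by the full-sheet spreads quoted earlier (the means cancel in a gap, so they
  are not needed).  Whatever it prints is the closest pair among those seven rows
  only; the answer for the full 50-state sheet may well differ.

```python
import math
from itertools import combinations

spreads = (4.36, 83.34, 14.47)   # Murder, Assault, UrbanPop

# the seven rows shown in The Sheet (NOT all 50)
rows = {
    "Alabama":    (13.2, 236, 58),
    "Alaska":     (10.0, 263, 48),
    "Arizona":    (8.1, 294, 80),
    "Arkansas":   (8.8, 190, 50),
    "California": (9.0, 276, 91),
    "Colorado":   (7.9, 204, 78),
    "Wyoming":    (6.8, 161, 60),
}

def fair_gap(p, q):
    # straight-line gap after dividing each column's difference by its spread
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(p, q, spreads)))

# combinations() never pairs a state with itself, so the zero
# diagonal is skipped without any extra work
pair = min(combinations(rows, 2),
           key=lambda ab: fair_gap(rows[ab[0]], rows[ab[1]]))
print(pair, round(fair_gap(rows[pair[0]], rows[pair[1]]), 2))
```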

  ## Ask the Closest Neighbors

  For any state, you can find its k nearest neighbors -- the k states with the
  smallest fair gaps to it.  For California's 3 nearest neighbors, your paper
  would list the three states with the smallest standardised gaps to California.

  This is the same "ask the closest rows" idea from Chapter 1 (k-nearest
  neighbours), but with NO answer column.  You just list who is nearby.
  That is the core of unsupervised learning -- looking at neighbours without
  a right answer to check.
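
  The neighbour lookup can be sketched the same way, again on only the seven printed
  rows and the full-sheet spreads, so the neighbours it lists are the nearest among
  those seven states, not among all 50.  `k_nearest` is a name invented for this
  sketch.

```python
import math

spreads = (4.36, 83.34, 14.47)   # Murder, Assault, UrbanPop

# the seven rows shown in The Sheet (NOT all 50)
rows = {
    "Alabama":    (13.2, 236, 58),
    "Alaska":     (10.0, 263, 48),
    "Arizona":    (8.1, 294, 80),
    "Arkansas":   (8.8, 190, 50),
    "California": (9.0, 276, 91),
    "Colorado":   (7.9, 204, 78),
    "Wyoming":    (6.8, 161, 60),
}

def fair_gap(p, q):
    # straight-line gap after dividing each column's difference by its spread
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(p, q, spreads)))

def k_nearest(anchor, k):
    # rank every OTHER state by its fair gap to the anchor, keep the first k
    others = [name for name in rows if name != anchor]
    others.sort(key=lambda name: fair_gap(rows[anchor], rows[name]))
    return others[:k]

print(k_nearest("California", 2))
```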


  ## Common Tripwires I Caught

    TRIPWIRE 1:  The raw distance is not "wrong," just unfair.
       The raw Alabama-Alaska gap (28.97) is mathematically correct but
       practically misleading -- Assault's large units dominate it.
       When columns sit on different scales, standardise before using any
       distance-based method.

    TRIPWIRE 2:  ddof=0 vs ddof=1 for spread.
       The assignment says ddof=0 (population standard deviation: divide
       by n).  This matches sklearn's StandardScaler.  ddof=1 (sample
       standard deviation: divide by n - 1) gives a slightly larger number
       that does not match the scaler.  pandas' df.std() defaults to ddof=1,
       so pass ddof=0 explicitly.  When standardising, the formula must agree.

    TRIPWIRE 3:  Missing index_col=0 when loading.
       If you forget index_col=0, the state names become a regular column
       instead of row labels.  The shape becomes (50, 4) instead of (50, 3),
       and the text column of state names gets fed into the distance
       calculations, which cannot do arithmetic on names.

    TRIPWIRE 4:  Euclidean vs Manhattan -- both are valid, different uses.
       Euclidean is the default, but Manhattan is less sensitive to one
       unusually large gap (no squaring, so a big gap is not amplified).
       Neither is "right" or "wrong" -- they measure different kinds of
       distance.

    TRIPWIRE 5:  The diagonal is zero but the minimum should not be zero.
       When searching for the closest pair, you must exclude the diagonal
       (state-to-self distance = 0).  Set diagonal entries to infinity
       before searching for the minimum.

    TRIPWIRE 6:  Manhattan >= Euclidean for the same two points.
       The straight-line diagonal is the shortest path.  City-block (walking
       the grid) is always longer or equal.  If your Manhattan is smaller
       than Euclidean, you made an arithmetic mistake.

    TRIPWIRE 7:  Standardising can change the relative ordering of distances.
       Two states that were closest in raw numbers might not be closest
       after standardisation.  A column with a large range dominated the
       raw gap; after standardisation, a different column might matter more.
       If the ordering changes, that is expected, not a bug.
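
  Tripwire 2 can be seen with nothing but Python's standard library:
  `statistics.pstdev` divides by n (the ddof=0 convention) and `statistics.stdev`
  divides by n - 1 (ddof=1).  A sketch on the seven Murder values printed in The
  Sheet, so the numbers are illustrative rather than the full-sheet spread.

```python
import math
import statistics

# Murder values for the seven states shown in The Sheet (NOT all 50)
murder_sample = [13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 6.8]
n = len(murder_sample)

population = statistics.pstdev(murder_sample)   # divide by n      (ddof=0)
sample     = statistics.stdev(murder_sample)    # divide by n - 1  (ddof=1)

print(round(population, 3), round(sample, 3))

# the two differ by a fixed factor that moves toward 1 as n grows
print(math.isclose(population / sample, math.sqrt((n - 1) / n)))   # True
print(round(math.sqrt(49 / 50), 3))   # the factor for 50 rows: 0.99
```

  With 50 rows the two conventions differ by about 1%, small but enough to break an
  exact match against a scaler.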


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    sheet with no answer column           unlabelled data / unsupervised learning
    middle of a column                    mean / average
    spread of a column                    standard deviation (std / sigma)
    range (max - min)                     feature range
    straight-line gap                     Euclidean distance (L2)
    city-block gap                        Manhattan distance (L1)
    sheet of gaps                         pairwise distance matrix
    same ruler / standardised             z-score / StandardScaler
    fair gap                              standardised Euclidean distance
    closest pair (smallest gap)           nearest neighbours
    closest k states to anchor            k-nearest neighbours (KNN lookup)
    column                                feature / variable
    row                                   observation / sample


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  >> NEW TO PYTHON? Each named once:
       pd.read_csv()         -- load a sheet from a .csv file
       df.shape              -- (rows, columns)
       df.columns.tolist()   -- list of column names
       df.mean()             -- mean of every column
       df.std(ddof=0)        -- population standard deviation
       df.max() - df.min()   -- range per column
       StandardScaler()      -- put every column on mean=0, spread=1
       np.sqrt()             -- square root

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # load the sheet
    df = pd.read_csv("us_arrests.csv", index_col=0)
    print(df.shape)              # (50, 3)
    print(df.columns.tolist())   # ['Murder', 'Assault', 'UrbanPop']

    # summary statistics
    means = df.mean().round(2).to_dict()
    stds  = df.std(ddof=0).round(2).to_dict()
    ranges = (df.max() - df.min()).round(2).to_dict()

    # Euclidean distance function
    def euclidean_distance(x, y):
        return float(np.sqrt(np.sum((x - y) ** 2)).round(3))

    # Manhattan distance function
    def manhattan_distance(x, y):
        return float(np.sum(np.abs(x - y)).round(3))

    # distance between Alabama and Alaska
    al = df.loc["Alabama"]
    ak = df.loc["Alaska"]
    print(euclidean_distance(al, ak))    # 28.97
    print(manhattan_distance(al, ak))    # 40.2

    # standardise
    scaler = StandardScaler()
    scaled = scaler.fit_transform(df)
    df_scaled = pd.DataFrame(scaled, index=df.index, columns=df.columns)

    # fair distance matrix (50x50)
    n = len(df_scaled)
    dist_mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist_mat[i, j] = euclidean_distance(df_scaled.iloc[i],
                                                df_scaled.iloc[j])

    # closest pair (exclude the diagonal by setting it to infinity)
    np.fill_diagonal(dist_mat, np.inf)
    min_idx = np.unravel_index(dist_mat.argmin(), dist_mat.shape)
    pair = (df_scaled.index[min_idx[0]], df_scaled.index[min_idx[1]])
    print(pair)

    # k-nearest neighbours
    # (the diagonal is already infinity, so a state never lists itself)
    def find_k_nearest(state_name, k):
        idx = df_scaled.index.get_loc(state_name)
        row = dist_mat[idx].copy()
        closest = np.argsort(row)[:k]    # positions of the k smallest gaps
        return [df_scaled.index[i] for i in closest]

    print(find_k_nearest("California", 3))


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 6 -- Finding Patterns Without Answers):
    Part 1 (this post) .
    Part 2 -- The Strongest Direction (PCA) .
    Part 3 -- Grouping by Nearest Centre (K-Means) .
    Part 4 -- The Family Tree (Hierarchical Clustering) .
    Part 5 -- Both Tools on NCI60 (Re-visited) .
    Part 6 -- Filling the Blanks (Recommender Systems)

----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
==============================================================================================