Filling the Blanks: Recommender Systems and Matrix Factorisation

==============================================================================================
  RAHUL'S ML BLOG -- notes on machine learning, worked out by hand                    est. 2026
==============================================================================================
  home | about | archive | glossary | contact
----------------------------------------------------------------------------------------------

  CHAPTER 6 . FINDING PATTERNS WITHOUT ANSWERS . PART 6 OF 6
  Filling the Blanks: Recommender Systems and Matrix Factorisation
  Posted: 2026-06-09 . Author: Rahul Rai . Tags: recommender-systems, matrix-factorisation, collaborative-filtering
  ============================================================================================

  PATH . post 21 of 28
    <- prev:  Chapter 6, Part 5: Both Tools on NCI60
       next:  Chapter 7, Part 1: How a Network Computes a Guess ->

  The last post in the book.  Every chapter until now had a full sheet of numbers -- maybe
  no answer column, but at least every cell was filled in.  This post starts with a sheet
  that is mostly HOLES, and the whole job is to guess what belongs in the blanks.

  This is the recommender problem: a grid of users down the side, movies across the top,
  and a star rating in a cell only where a user actually rated a movie.  Most cells are
  empty -- nobody watches everything.  Fill a blank well and you can say "you, who has not
  seen this film, will probably rate it 4.5 stars" -- a recommendation.

  ## The Sheet Full of Holes

           Matrix  Titanic  Alien  Notebook  Rambo
    Ann  [   5       .       4        .        . ]
    Ben  [   .       4       .        5        . ]
    Cal  [   4       .       5        .        4 ]
    Dee  [   .       5       .        4        . ]
    Eve  [   5       .       4        .        3 ]

    "." = blank (this user never rated this movie)

  The grid is mostly dots.  The goal: guess the dots.  If we can guess that Ben would rate
  Alien a 4, we recommend Alien to Ben.  There is no separate "answer column" -- the answers
  ARE the scattered ratings themselves, and the blanks are what we predict.

  ## The Old Idea in New Clothes: Hidden Tastes

  Stare at the filled cells and a pattern leaks out.  Ann and Eve both love Matrix and
  Alien -- sci-fi.  Ben and Dee both love Titanic and Notebook -- romance.  Nobody told us
  "sci-fi" and "romance" exist; they fall out of the ratings.  These are HIDDEN TASTES, and
  they are exactly PCA's shadows from Part 2, wearing a new coat.

  The trick (matrix factorisation): every rating is the match between a user's taste and a
  movie's flavour.  Give each user a short list of hidden-taste scores, and each movie the
  same short list of hidden-flavour scores.  A rating is their dot-product:

    rating(user, movie) = (user's taste) matched against (movie's flavour)

       Ann's tastes:    sci-fi 0.9,  romance 0.1
       Alien's flavour: sci-fi 0.8,  romance 0.0
       guess = 0.9 x 0.8 + 0.1 x 0.0 = 0.72

  (The actual numbers the machine finds won't be tidy 0.9 and 0.8; those
   are for illustration.  The optimiser picks whatever scale makes the
   predicted ratings match the true ones -- so if real ratings are 1-5
   the factors end up in a range that produces 1-5 predictions.)

  High when taste lines up with flavour, low when it does not.  That is the whole engine.

  ## Two Skinny Sheets Instead of One Holey One

  Matrix factorisation splits the big holey grid into two SKINNY full sheets -- one for
  users, one for movies -- with a few hidden columns each:

       big holey grid          user tastes        movie flavours
       (users x movies)    =   (users x K)   times  (K x movies)

       5 users x 5 movies  ~=  5 users x 2     x    2 x 5 movies
         (mostly blank)         (full)              (full)

  K is the number of hidden tastes (2 here: sci-fi, romance).  The two skinny sheets are
  full -- no holes.  Multiply them back together and you get a COMPLETE grid: every cell
  filled, including the ones that were blank.  Those filled-in blanks are the predictions.

  This is the same shape PCA used -- a short list of per-row scores multiplied by a shared
  recipe (loadings) rebuilds the full sheet.  PCA crushed a full
  sheet to a smaller one and blew it back up blurry; here we LEARN the two skinny sheets so
  that blowing them back up reproduces the ratings we DO have -- and fills the ones we don't.

  ## How the Skinny Sheets Are Learned

  We never saw the hidden tastes -- we invent them and tune them.  Start both skinny sheets
  with random numbers, then drive down one familiar score:

    for every rating we ACTUALLY have:
        guess = user-taste matched against movie-flavour
        miss  = real rating - guess
    total wrongness = sum of miss^2   (only over the cells we have!)
    nudge both skinny sheets to shrink that wrongness, and repeat

  The bottom-of-the-bowl, squared-miss idea from Chapter 1 -- one last time.  The one new
  wrinkle: we only sum the miss over the FILLED cells.  The blanks have no truth to compare
  against, so they sit out the scoring entirely.  Once the filled cells are well-predicted,
  the blanks come along for free when you multiply the two sheets.

  >> NOTE: A LEASH, BECAUSE THE SHEET IS MOSTLY HOLES
     With so few filled cells and freely-spinning hidden tastes, the machine will happily
     memorise the handful of ratings it has and predict nonsense in the blanks.  So we add
     the Chapter 4 leash -- a fine on big taste/flavour numbers -- to keep it humble.  Same
     leash, brand new place to need it.

  ## The Lab, End to End: 100 People, 50 Movies, 3,786 Holes

  Rule done -- now the real sheet from the lab.  100 people down the side, 50 movies
  across the top: 100 x 50 = 5,000 cells.  Only 1,214 hold a rating; the other
  5,000 - 1,214 = 3,786 are blank.

    blank fraction = 3786 / 5000 = 0.757     <- three-quarters of the sheet is holes

  The engine below fills every hole WITHOUT the nudging loop above -- no learning rate,
  no gradient.  It re-uses PCA's rebuild from Part 2 (crush to a few shadows, blow back
  up), run in rounds, and the blanks fill themselves.

  ## First Move: Humble Every Movie Column

  Subtract each movie's middle from every rating in its column.  The middle is computed
  from the PRESENT values only -- blanks do not vote, and blanks stay blank.

    one movie's 4 present ratings: 5, 4, 3, 4      (made-up)
    middle = 16 / 4 = 4

    the person who said 3  ->  3 - 4 = -1     liked it LESS than the crowd
    the person who said 5  ->  5 - 4 = +1     liked it MORE than the crowd
    the person who said 4  ->  4 - 4 =  0     exactly the crowd

  >> YOUR TURN
     A sixth person rated this same movie a 2.  Its middle is still 4 (the middle is
     fixed from the present values; a new low rating does not change the worked middle
     here).  Humble his 2.

     check your slate:  2 - 4 = -2.  He liked it two stars LESS than the crowd -- the
     most-below-the-middle voter so far.  The minus sign is information, not a penalty.

  Why by COLUMN and not by row?  A movie carries a baseline fame -- some films everyone
  rates high, some low.  Subtracting the movie's middle strips the fame off and leaves
  the pure signal: this person, more or less than the crowd, on this film.  After the
  subtraction, every column's middle parks at zero.

  ## Round 0: A Lazy Zero in Every Blank

  Write 0 into all 3,786 blanks (the 5,000 - 1,214 = 3,786 cells with no rating).  On the
  humbled sheet, 0 claims "this person would rate
  it exactly at the movie's middle" -- a guess that uses NO evidence about the person.
  Dumb, harmless, temporary: it budges no column's middle, and the rounds below will
  overwrite every one of these zeros with something earned.

  ## A Hand Recipe That Dies of Echo

  Before trusting any machine, I tried to fill a blank by hand.  Person 7 rated 5
  movies; his cell at movie m is blank.  The recipe:

    1. rank everyone by closeness to person 7 -- over the movies BOTH rated, take
       the differences, and average their sizes into one closeness number
    2. walk down the ranking; stop at the FIRST person who actually RATED movie m
    3. copy that person's humbled VALUE into person 7's blank

  One catch the recipe must dodge: trust whole-row closeness over one lucky shared
  movie.

                shared movies:    a     b     c     d     e
    person 11 differences:        0     3     4     2     3    <- perfect once, far elsewhere
    person 12 differences:        0.2   0.2   0.2   0.2   0.2  <- near on ALL -> copy HIM

  Person 11 agrees perfectly on one movie and disagrees everywhere else; person 12 is a
  little off on every movie.  Person 12 is the honest twin.

  And then the recipe dies -- caught by hand.  Say person 12 was the donor, and his
  values were copied verbatim into person 7's blanks.  Now re-rank everyone for the
  next blank: over those copied cells, the difference between person 7 and person 12
  is 0 BY CONSTRUCTION.  The ranking is no longer hearing evidence about person 7 --
  it is hearing its own echo, and each copy rigs the next ranking harder.  Dead end.

  The machine below never falls in this hole, because no blank is ever filled from ONE
  donor: every mark is blended from the entire sheet at once.

  ## The Machine Loop: Rebuild, Restore, Repeat

  IN HAND: a lab sheet of 100 people x 50 movies = 5,000 cells, of which 5,000 - 1,214 =
  3,786 are blank (3,786 / 5,000 = 0.757, three-quarters holes).  Each blank is humbled by
  its movie's middle, then started at a lazy zero.  This section turns those zeros into
  earned marks: rebuild the whole sheet, keep the blanks, restore the truth, repeat.

    round 0:  write the lazy zeros into the blanks
    round 1:  the machine re-writes the WHOLE sheet from its k strongest directions
              paste the machine's marks into the BLANKS ONLY
              put the rated cells back to the truth
    round 2:  the same, on the better sheet
    ...
    freeze:   this round's marks ~ last round's marks  ->  stop

  Two lines carry all the danger.

  First: the rebuild re-writes EVERY cell -- the rated ones included.  After one rebuild
  a true 0.30 comes back as 0.28.  Truth is not the machine's to edit.  So keep a MASK
  SHEET from the start -- one true/false per cell, "was this one blank?" -- and after
  every rebuild restore the rated cells from the original sheet.  On the working sheet
  your eye cannot tell a true value from a pencilled mark; only the mask knows.

  Second: why rounds at all?  Round 1's machine studied a sheet that was three-quarters
  fake zeros, so its marks are crude.  Round 2's machine studies a sheet whose blanks
  hold those crude-but-better-than-zero marks, and spits cleaner ones.  Each round eats
  a cleaner sheet and returns a cleaner sheet, until the marks stop moving.

  Count the clerk bill for this loop.  Each rebuild re-writes every cell: 100 people x 50
  movies = 5,000 cell-writes.  Run it for 20 rounds and the clerks lay down 5,000 x 20 =
  100,000 cell-writes before the marks freeze.  Heavy, but every stroke is plain
  arithmetic -- no magic, just patience.

  ## Inside the Machine, Knob Set to 1

  The machine has one knob: how many directions to keep.  Open the box at knob = 1.

  The bet at knob 1: every person is an ECHO of one shared taste-row.

    a made-up corner of a sheet:
       +2  +4  +6  +8      <- one row of real information
       +1  +2  +3  +4      <- the same row, at HALF strength
       -2  -4  -6  -8      <- the same row, at MINUS-ONE strength

  Three rows, but only ONE row of information.  Knob 1 stores exactly that: one
  BUILTROW (50 numbers, one per movie -- computed from the sheet, not stored in it,
  the same way the movie middles were) plus ONE STRENGTH per person (100 numbers).

    builtrow[m]       = sum over people( strength x his value at m ) / sum( strength^2 )
    person's strength = sum over movies( builtrow[m] x his value at m ) / sum( builtrow^2 )

  Look hard at those two lines: each one is the LADDER from Chapter 4, Part 3 -- the
  one-column slope sum(x*y)/sum(x^2) -- pointed at a different unknown.  The builtrow
  is fitted holding the strengths still; the strengths are fitted holding the builtrow
  still; alternate until neither moves.

  Check it by hand on the corner above.  Strengths (1, 1/2, -1); first column (2, 1, -2):

    top    = 1x2 + (1/2)x1 + (-1)x(-2) = 4.5
    bottom = 1 + 1/4 + 1               = 2.25
    builtrow entry = 4.5 / 2.25 = 2        <- the first entry of the real row.  correct.

  And a person whose row is (1, 2, 3, 4) against builtrow (2, 4, 6, 8):

    strength = (1x2 + 2x4 + 3x6 + 4x8) / (4 + 16 + 36 + 64) = 60 / 120 = 1/2   correct.

  >> YOUR TURN
     Same corner, same strengths (1, 1/2, -1).  Now build the SECOND entry of the
     builtrow from the second column (+4, +2, -4).  Do the top, the bottom, and divide.

     check your slate:
       top    = 1x4 + (1/2)x2 + (-1)x(-4) = 4 + 1 + 4 = 9
       bottom = 1 + 1/4 + 1               = 2.25
       entry  = 9 / 2.25 = 4
     The real row's second entry is +4 -- the ladder pulled it back out clean, just as
     it did the first entry (2).

  Once the two have settled, filling a blank costs ONE multiply:

    person 7's mark at movie m  =  strength_7  x  builtrow[m]

  Knob 2 stores TWO builtrows, and every person carries his OWN two strengths.  These
  are recipes, not camps -- nobody gets sorted into a group; each person mixes the
  builtrows in his own private proportions.  The knob = how many independent base rows
  you claim the sheet really has.

  ## Pick the Knob by Hiding What You Know

  IN HAND: a 100 x 50 = 5,000-cell lab sheet, 3,786 of them blank, humbled by movie
  middles and filled by the rebuild-restore loop -- whose one knob is how many builtrows
  to keep.  This section sets that knob honestly, without ever peeking at a true blank.

  A mark written into a blank can never be graded -- a blank has no answer behind it.
  So manufacture blanks whose answers you KEEP:

    hide 20 of every 100 RATED cells (pretend blank; pocket their truth)
    run the whole loop once per knob: 1, 2, 3, 5, 10
    uncover: compare each knob's marks against the pocketed truth -> one error per knob
    smallest error wins

  Hide only RATED cells.  Hiding a blank pockets nothing -- there is no truth to grade
  against later.  (Chapter 1's lock-away-the-exam habit, shrunk down to single cells.)

  >> NOTE: THIS LAB WAS SYNTHETIC, SO A FULL ANSWER KEY EXISTED
     The sheet was generated, and the true rating behind every blank sat in a second
     file.  Grading the finished marks against it: agreement around 0.89, typical miss
     around 0.24, on the humbled scale.  One trap inside that grading: humble the
     answer key with the SAME movie middles as the working sheet, not with its own --
     otherwise the two sheets sit in different frames and the comparison lies.

  ## Un-Humble and Recommend

  IN HAND: a full 100 x 50 = 5,000-cell sheet, every one of the 3,786 blanks now holding
  an earned mark, the knob picked by hiding rated cells -- but every number still in
  humbled units ("more or less than the crowd").  This section turns those marks back into
  stars a person can read.

  The finished sheet still speaks in humbled units ("more or less than the crowd").
  To speak in stars, add the movie's middle back, then clamp into the 1-to-5 range:

    predicted stars = mark + that movie's middle     (clamped into 1..5)

  Worked once: a blank came back with mark +1, and that movie's middle is 4.  Stars =
  1 + 4 = 5, already inside 1..5, so it stands at 5 -- a strong recommendation.

  >> YOUR TURN
     Another blank for the same movie (middle 4) came back with mark -3.5.  Un-humble it,
     then clamp into 1..5.

     check your slate:  stars = -3.5 + 4 = 0.5;  0.5 is below the floor of 1, so clamp
     up to 1.  Predicted 1 star -- the lowest the scale allows; do not recommend it.

  For one person: predict every movie he never rated, sort, and the top 5 is his
  recommendation list.  That single un-humble line is the point of the whole pipeline
  -- everything before it existed so this addition lands on a number worth showing him.

  ## Cold Start: The Honest Limit

  The method only works where there is something to chew on.  A brand-new user who has
  rated NOTHING has an all-blank row -- no tastes can be inferred, so no honest
  recommendation can be made.  Same for a movie nobody has rated yet.  This is the COLD
  START problem, and it is a real wall, not a bug you can code around: with zero ratings
  there is simply no signal.  Real systems paper over it with side facts (your age, the
  movie's genre) until enough ratings arrive.

  ## Where the Book Lands

  Look back at the whole arc.  We began in Chapter 1 guessing a house price with a straight
  stick and squared misses.  Six chapters later we are guessing a missing movie rating with
  two skinny sheets -- and it is the SAME squared-miss bowl, the SAME leash, the SAME
  scores-times-recipe shape as PCA.  The cast of tricks is small.  We just kept meeting them
  in new costumes: dials, leashes, wobble bands, question charts, committees, shadows,
  piles, and now hidden tastes.  Name the trick under the costume and the machine has no
  black boxes left in it.


  ## Common Tripwires I Caught

    TRIPWIRE 1:  Only score the cells you actually have.
       The wrongness sum runs over FILLED ratings only.  Counting the blanks (which
       have no truth) is meaningless and breaks the whole method.

    TRIPWIRE 2:  K (number of hidden tastes) is a knob, like everywhere else.
       Too few hidden tastes and the machine cannot tell sci-fi from romance; too
       many and it memorises noise in the handful of ratings.  Pick K by holding out
       some known ratings and checking the guesses -- the Chapter 1 habit, again.

    TRIPWIRE 3:  Hidden tastes are not labelled "sci-fi" by the machine.
       The machine finds K useful directions; it never names them.  "Sci-fi" and
       "romance" are stories WE tell after looking at which movies score high.  Same
       as PCA loadings -- the math finds the axis; the meaning is our interpretation.

    TRIPWIRE 4:  Cold start is a wall, not a tuning problem.
       No ratings means no tastes means no honest guess.  Do not fake it with the
       global average and call it personalised -- it is not.

    TRIPWIRE 5:  You still need a leash.
       A mostly-empty sheet is the easiest thing in the world to overfit.  Without
       the Chapter 4 penalty on big numbers, predictions in the blanks go wild.

    TRIPWIRE 6:  This is unsupervised in spirit, supervised in mechanics.
       There is no tidy answer column, yet we DO fit to known ratings.  It sits on
       the border -- learning hidden structure (unsupervised) by predicting observed
       values (supervised).  Do not box it too hard either way.

    TRIPWIRE 7:  Humbling SUBTRACTS, and the sign survives.
       3 - 4 = -1 means "liked it one star LESS than the crowd" -- information, not
       punishment.  And nobody "gets a negative for not going": blanks are skipped,
       not humbled.  Only present ratings move; blanks stay blank.

    TRIPWIRE 8:  The closeness number never gets written into a cell.
       It only ORDERS people -- most honest twin first.  What gets copied into the
       blank is the chosen person's VALUE.  Writing the closeness number itself into
       the sheet confuses the ruler with the thing it measured.

    TRIPWIRE 9:  Verbatim copying poisons the next ranking.
       Paste a donor's values into your blanks and your difference to him over those
       cells is 0 BY CONSTRUCTION -- the re-ranking now hears its own echo, not
       evidence.  This killed the hand recipe.  The machine survives because every
       mark is blended from the whole sheet, never copied from one donor.

    TRIPWIRE 10:  The rebuild edits EVERY cell -- restore the truth each round.
       After one rebuild a true 0.30 comes back 0.28.  Rated cells are not the
       machine's to edit: keep a mask sheet ("was this one blank?") and put the
       truth back after every rebuild.  On the working sheet, the eye cannot tell
       truth from pencil -- only the mask knows.

    TRIPWIRE 11:  Hide only RATED cells when grading a knob.
       A hidden blank pockets no truth, so there is nothing to grade against later.
       The 20% you hide must come from cells that actually hold ratings.


  ## The Labels, Last

    Plain term used above                 Standard label
    -----------------------------------   ------------------------------------------
    sheet full of holes                   sparse user-item rating matrix
    guess the blanks                      rating prediction / recommendation
    hidden tastes / flavours              latent factors
    number of hidden tastes (K)           latent dimensionality / rank
    two skinny sheets                     user and item factor matrices
    matched against (dot-product)         the inner product of factor vectors
    split the grid into two sheets        matrix factorisation
    learn from ratings of similar users   collaborative filtering
    score only the filled cells           loss over observed entries only
    the leash on big taste numbers        L2 regularisation (from Chapter 4)
    no ratings yet, no guess              the cold-start problem
    humble a movie column                 centring by the item (column) mean
    the movie's middle                    the item mean over observed ratings
    the lazy zero / round-0 fill          zero imputation (after centring)
    the mask sheet                        the boolean missing-data mask
    the machine's rebuild                 truncated SVD reconstruction
    rebuild-restore rounds                iterative SVD matrix completion
    the marks freeze                      convergence of the imputation loop
    the builtrow and the strengths        the rank-1 singular vector and its scores
    the knob (how many builtrows)         the rank
    hide rated cells and grade            validation masking / held-out cells
    un-humble (add the middle back)       un-centring / adding back the item mean


  ## The Code, If You Want It

  Nothing above needed a computer -- only pencils, clerks, and patience.  This last
  section is for the day you meet one: the same steps, spoken in Python.

  >> NEW TO PYTHON? Each named once:
       np.where(mask, a, b)   -- pick a where mask is true, else b (filled vs blank)
       X @ Y                  -- multiply the two skinny sheets back into a full grid
       np.isnan(R)            -- find the blank cells (NaN = "not a number" = blank)
       np.nanmean(R, axis=0)  -- column middles that SKIP the blanks
       np.linalg.svd(...)     -- the rebuild machinery (Part 2's shadows)
       np.allclose(a, b)      -- "are these two sheets the same, near enough?"
       np.clip(x, 1, 5)       -- clamp every number into a range

    import numpy as np

    # R = ratings grid, np.nan in the blanks
    mask = ~np.isnan(R)                 # True where we HAVE a rating
    n_users, n_movies = R.shape
    K = 2                               # hidden tastes

    rng = np.random.RandomState(42)
    U = rng.normal(size=(n_users, K))   # user tastes  (skinny, full)
    V = rng.normal(size=(n_movies, K))  # movie flavours (skinny, full)

    lr, leash = 0.01, 0.1
    for step in range(2000):
        pred = U @ V.T                  # blow the two sheets back to a full grid
        err = np.where(mask, R - pred, 0.0)   # miss on FILLED cells only
        U += lr * (err @ V - leash * U)       # nudge tastes (with the leash)
        V += lr * (err.T @ U - leash * V)     # nudge flavours

    full_grid = U @ V.T                 # every blank now filled -> recommendations

    # ---- the lab's engine: rebuild-restore rounds (no gradient) ----
    movie_middles = np.nanmean(R, axis=0)        # present values only
    C = R - movie_middles                        # humble; blanks stay NaN

    blank  = np.isnan(C)                         # the mask sheet
    filled = np.where(blank, 0.0, C)             # round-0 lazy zeros
    k = 3                                        # the knob (picked by hiding)

    for _ in range(20):
        Uu, s, Vt = np.linalg.svd(filled, full_matrices=False)
        rebuilt = (Uu[:, :k] * s[:k]) @ Vt[:k]   # keep k builtrows
        new = np.where(blank, rebuilt, C)        # blanks: marks; rated: truth restored
        if np.allclose(new, filled):             # freeze
            break
        filled = new

    stars = np.clip(filled + movie_middles, 1, 5)   # un-humble, clamp into 1..5

    # picking k: hide 20% of the RATED cells, run the loop per k in [1,2,3,5,10],
    # grade each k's marks against the pocketed truth, keep the smallest error.

    # the heavyweight toolbox route (the surprise package: from sklearn / surprise):
    #   from sklearn.decomposition import TruncatedSVD   # factor a filled matrix
    #   or a dedicated library (e.g. `surprise`) for the holey, regularised case


----------------------------------------------------------------------------------------------
  IN THIS CHAPTER (Chapter 6 -- Finding Patterns Without Answers):
    Part 1 -- Looking at a Sheet With No Answers .
    Part 2 -- The Strongest Direction (PCA) .
    Part 3 -- Grouping by Nearest Centre (K-Means) .
    Part 4 -- The Family Tree (Hierarchical Clustering) .
    Part 5 -- Both Tools on NCI60 (Re-visited) .
    Part 6 (this post)

  That is the end of the book.  For quick lookups, see the
  Appendix -- Classification Reference and the
  Glossary / Decoder Ring.
  <- Back to all posts
----------------------------------------------------------------------------------------------
  (c) 2026 Rahul Rai . pure HTML+CSS, no JavaScript, no trackers .
  home . source on GitHub
==============================================================================================