In 2006, Netflix launched an open competition that would transform machine learning practice. They offered a million dollars to anyone who could improve upon their in-house recommendation system by 10%.
Let Grandpa Ben over here tell the youths how Netflix worked in 2006. I was wearing an onion on my belt, which was the style at the time. You see, Netflix used to send you movies in the mail. You’d use the website to pick the movies you wanted to watch, and they would mail them to you, imprinted on this technology called DVDs. DVDs were shiny discs that could also be used to make mobiles. After you watched the movies, you’d tie an onion to your belt, which was the style at the time, and then walk over to a mailbox and send the DVDs back to Netflix. You’d then sit down at your VAX terminal and visit the Netflix website to rate the movies on a scale of 1 to 5. From these ratings, they’d know what to recommend to you next.
You see, in 2006, Netflix had an overabundance of movies. Over 20,000 of them. Finding ways to send rare and unusual suggestions your way seemed good for business.
The Netflix Prize competition was pretty simple. Netflix released a data set of 100 million ratings from around 500 thousand subscribers, covering about 18,000 movies. People rated movies on a scale from 1 to 5, and Netflix also provided the date of each review. The goal was to predict the ratings on a list of 3 million triples of the form (person_id, movie_id, date). The scale of the contest reflects the state of computing in 2006. The data was just large enough to be hard to process but still small enough to fit on a laptop.
A team’s score was the root-mean-squared error of their submitted rating predictions. Netflix’s initial system scored 0.95 on this metric. Teams could upload their 3 million guesses once a week through a web service (this was soon changed to once a day). Netflix used half of the uploaded predictions to compute a score for a public leaderboard, showing each team’s progress so far. They kept the score on the other half private and used it to determine the winner.
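To make the scoring concrete, here’s a toy sketch of the mechanics (my own illustration with made-up numbers, not Netflix’s actual evaluation pipeline): compute RMSE on one half of the submitted predictions for the public leaderboard and on the hidden other half to decide the winner.

import numpy as np

# Illustrative scoring only: the split, noise level, and names below are hypothetical.
def rmse(predicted, actual):
    return np.sqrt(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2))

rng = np.random.default_rng(0)
true_ratings = rng.integers(1, 6, size=3_000_000).astype(float)            # hidden true ratings
submission = np.clip(true_ratings + rng.normal(0, 0.9, size=true_ratings.size), 1, 5)

half = true_ratings.size // 2
public_score = rmse(submission[:half], true_ratings[:half])    # shown on the public leaderboard
private_score = rmse(submission[half:], true_ratings[half:])   # kept secret, decides the winner
print(f"public {public_score:.4f}, private {private_score:.4f}")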
The competition was wildly successful. All sorts of people, from industrial researchers to college undergraduates, threw their hats in the ring. In the end, Netflix received submissions from over five thousand different teams. Reed Hastings estimated they were getting PhD-level R&D for less than a dollar an hour. The hourly billing rate in machine learning is a bit more than that these days.
So what did we learn from the Netflix prize?
In less than a month, we learned that simple methods worked surprisingly well. A famous submission by “simonfunk,” an independent software developer from New Zealand, got to 9th place with the following code:
import numpy as np

# One pass of stochastic gradient descent over the observed ratings: nudge each
# embedding vector to shrink the squared error of the inner-product prediction.
for (user, movie, rating) in rating_tuples:
    err = rating - np.dot(movieEmbedding[movie], userEmbedding[user])
    uE = userEmbedding[user].copy()  # save the old user vector before updating it
    userEmbedding[user] += lrate * err * movieEmbedding[movie]
    movieEmbedding[movie] += lrate * err * uE
This submission was already worth more than a million dollars. Most recommendation systems still use some variant of this algorithm: a factor analysis of the ratings array. The code here computes a singular value decomposition of the matrix of all ratings using stochastic gradient descent. Interestingly, SGD on the SVD gets a good approximation well before the algorithm sees all of the matrix entries. I could write a lot about why, but that’s for another day.1
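If you want to see the whole pipeline in one place, here is a minimal self-contained sketch in the same spirit (my own toy version on synthetic data, not simonfunk’s actual program); the dimensions, learning rate, and epoch count are placeholders.

import numpy as np

# Minimal matrix-factorization-by-SGD sketch on fake data. All hyperparameters
# below are illustrative, not simonfunk's settings.
rng = np.random.default_rng(0)
n_users, n_movies, dim, lrate = 200, 100, 10, 0.01

# Fake observed ratings: random (user, movie, rating) triples on a 1-to-5 scale.
rating_tuples = [(rng.integers(n_users), rng.integers(n_movies), float(rng.integers(1, 6)))
                 for _ in range(20000)]

userEmbedding = 0.1 * rng.standard_normal((n_users, dim))
movieEmbedding = 0.1 * rng.standard_normal((n_movies, dim))

for epoch in range(10):
    for user, movie, rating in rating_tuples:
        err = rating - movieEmbedding[movie] @ userEmbedding[user]
        uE = userEmbedding[user].copy()
        userEmbedding[user] += lrate * err * movieEmbedding[movie]
        movieEmbedding[movie] += lrate * err * uE
    train_rmse = np.sqrt(np.mean([(r - movieEmbedding[m] @ userEmbedding[u]) ** 2
                                  for u, m, r in rating_tuples]))
    print(f"epoch {epoch}: training RMSE {train_rmse:.3f}")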
Given my last week of posting, you all know it’s my hobby horse, but the Netflix Prize also taught us a lot about the insignificance of overfitting. Constantly climbing the leaderboard did not lead to overfitting. The scores at the top of the board were pretty much the same on the private and public test sets. The evidence that leaderboard score was indicative of private score was clear by 2007. People ignored this for 15 years. Some still deny it today.
The Netflix Prize also provided evidence that making machine learning models bigger, whether through big ensembles or giant nonparametric models, improved scores. This is a plot from a retrospective by two of the winners at AT&T Labs, Yehuda Koren and Robert Bell:
Larger models get better scores. Such “scaling laws” are not some unique property of deep neural networks. The Netflix Prize provided yet another piece of evidence that having more parameters than data points does not imply overfitting.
Most importantly, we learned how well-run competitions drive progress. A transparent, unbiased leaderboard with clear rules of competition led not only to prize money but to widely applicable insights about recommendation systems. The rules and structure of the Netflix Prize are the same as those used in our modern machine learning competitions. There would be no Kaggle or ImageNet without the Netflix Prize. For better and for worse, machine learning competitions remain the core engine driving machine learning “progress.”
Finally, I’d be remiss if I didn’t end with a rant about the bitter lesson of the Netflix Prize. A famous paper by Arvind Narayanan and Vitaly Shmatikov showed that you could correlate the Netflix Prize data against public data from IMDB to identify subscribers in the released data set. There were then some very sketchy lawsuits about how plaintiffs were harmed by this reidentification. The FTC also wrote complaints to Netflix. The lawsuits were settled out of court, and Netflix announced that it was out of the competition game. They pulled their data and canceled future competitions.
This is a bitter lesson about the interplay between techlash activism and big tech power structures. Twenty years of privacy complaints have only made tech companies more powerful. In this case, there are no decent public data sets on recommendation systems today, even though recommendation systems power the revenue of trillion-dollar companies. The companies still slurp up all of our data, but our ability to audit their practices or create competing services has been blunted by fear of lawsuits and regulation. Tech’s stranglehold on our data and media consumption is at its highest point ever. Is all we got out of activism the honor of clicking to accept cookies?
By the way, there were many deep learning submissions, notably from Mnih, Salakhutdinov, and Hinton. They did not perform much better than simonfunk’s simple code and were not part of the winning team’s submission. ETA: they did appear in the final ensemble models used to win the prize, but were not alone sufficient to claim the reward. See Joe Sill’s comment.
This isn't a comment about the ML part but about movies. Netflix was part of the trends that killed movie stores. There are so many good movies out there, and movie stores were how you used to find them. Netflix, when it started, had a lot of good movies on the platform. Now they have a few gems that they keep recommending I rewatch. The overall set of quality films is now spread out over any number of competing platforms. You have to fight and search all over the place to find anything worth watching. It is a good example of enshittification caused by tech.
The only subscription I pay for these days is Criterion, on the recommendation of a cinephile friend. Most of their catalog is old, but I like discovering this stuff. We watched "The Devil's Eye" by Ingmar Bergman last night.
"Is all we got out of activism the honor of clicking to accept cookies?"
No, we also got increased barriers to entry, leading directly to centralization, censorship, and the destruction of all that was once good about the internet.