The Netflix Prize, launched in 2006, was an amazing example of open, collaborative research. Netflix wanted to improve its recommendation methods, so it released a giant, partially anonymized data set. The data set consisted of 500 thousand people's ratings of 18 thousand movies on a scale of 1 to 5. Most Netflix members only rated a small subset of the catalog, so there were only 100 million ratings (about 1% of the total possible ratings). The goal of the contest was to predict about 3 million held-out ratings by individuals in the data set. And if you could make those predictions 10% better than Netflix's internal method, you'd win a million dollars.
After 3 years of competition, a winner was crowned. And we learned a lot from the contest! First, just doing low-rank matrix completion on the data set outperformed the Netflix baseline by several percent. You could get about halfway to the goal from the ratings alone, using no information about the movies themselves. This was surprising in 2006. Surely content had to matter. But no: people's behaviors have patterns, and a few similar viewers are enough to predict your personal ratings. We are not that unique, at least not when we interact with the Netflix corporation. There is a lot of redundancy in the ratings matrix, and one can extract its principal components without seeing all of the entries. Your humble blogger here has written a lot about this…
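To make the matrix completion point concrete, here's a minimal sketch of low-rank completion by alternating least squares on a fabricated ratings matrix. Everything here (the data, the rank, the regularization, the 20% observation density) is invented for illustration; it is not code anyone ran in the competition.

```python
# Minimal sketch: low-rank matrix completion via alternating least squares (ALS).
# Toy data only; rank, density, and regularization are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, rank = 200, 50, 5
lam = 0.1  # ridge regularization

# Fabricate a genuinely low-rank "true" ratings matrix and observe ~20% of entries.
ratings = rng.normal(size=(n_users, rank)) @ rng.normal(size=(rank, n_movies))
observed = rng.random((n_users, n_movies)) < 0.20

U = rng.normal(scale=0.1, size=(n_users, rank))
V = rng.normal(scale=0.1, size=(n_movies, rank))

for _ in range(20):
    # Fix V, solve a small ridge regression for each user's factor vector.
    for i in range(n_users):
        cols = observed[i]
        Vi = V[cols]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(rank),
                               Vi.T @ ratings[i, cols])
    # Fix U, solve a small ridge regression for each movie's factor vector.
    for j in range(n_movies):
        rows = observed[:, j]
        Uj = U[rows]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(rank),
                               Uj.T @ ratings[rows, j])

# Evaluate the factorization on the entries we never saw.
rmse = np.sqrt(np.mean((U @ V.T - ratings)[~observed] ** 2))
print(f"held-out RMSE: {rmse:.3f}")
```

The point of the toy: even with 80% of the entries missing, the factorization recovers the held-out ratings, because the matrix really is low rank.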
The first mile was a breeze, but getting across the finish line was a mess. To the simple matrix completion algorithm, people added temporal features (when a rating was made), ensemble methods, and other tricks. Though no one was using neural networks, hugely overparameterized models were required to eke out the final 1% and win the prize pot.
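For a flavor of what "temporal features" looked like, here's a simplified sketch in the spirit of the time-aware baseline models the BellKor team described in their papers: user and movie biases on top of the factorization, with the movie bias allowed to drift across coarse time bins. The constants, bin width, and dictionary bookkeeping here are mine, not the winning team's.

```python
# Simplified sketch of a time-aware baseline predictor (not the actual prize code).
GLOBAL_MEAN = 3.6    # assumed average rating over the training set
DAYS_PER_BIN = 70    # coarse time bins over which the movie bias can drift

def predict(user, movie, day, user_bias, movie_bias, movie_bias_by_bin,
            user_factors, movie_factors):
    """Baseline-plus-factorization prediction for one (user, movie, day) triple."""
    time_bin = day // DAYS_PER_BIN
    baseline = (GLOBAL_MEAN
                + user_bias.get(user, 0.0)
                + movie_bias.get(movie, 0.0)
                + movie_bias_by_bin.get((movie, time_bin), 0.0))
    # Low-rank interaction term from the matrix factorization.
    interaction = sum(pu * qi for pu, qi in
                      zip(user_factors.get(user, []), movie_factors.get(movie, [])))
    return baseline + interaction

# Example with fabricated parameters:
print(predict("alice", "heat_1995", day=140,
              user_bias={"alice": 0.3}, movie_bias={"heat_1995": 0.5},
              movie_bias_by_bin={("heat_1995", 2): -0.1},
              user_factors={"alice": [0.2, -0.1]},
              movie_factors={"heat_1995": [0.4, 0.3]}))
```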
But, to the surprise of information retrieval experts, the winning solution used almost nothing about the movies themselves, only adding the year of the movie’s release and whether the title had a number in it. That’s it! The rest of the predictions were made entirely using the interaction history of Netflix subscribers.
Let’s flash forward 17 years. What did we learn? I know that I wouldn’t have a job if not for this contest. But I worry that the recommender systems and machine learning communities learned a bunch of bad lessons.
First, did we learn whether improving Netflix's internal recommender system improved the quality of movie recommendations on Netflix? I don't think we did! Netflix removed its five-star rating system soon after the competition. Moreover, with the rise of streaming, competitive platforms, and stingy media rights holders, Netflix's catalog collapsed. Though they featured over 18000 titles in 2006, they now only offer around 3600 movies in the United States. The catalog is over five times smaller. For a variety of reasons, the company decided to roll with less content, more in-house produced content, and sloppier recommender systems. But what's also interesting to me is that the computational demand of their recommender system should be substantially lower in 2023 than it was in 2006. In 2006, computing the principal components of an 18000 x 18000 matrix was a real challenge, but a 3600 x 3600 matrix was very manageable on a modest workstation. Would the winning solution have been different with a smaller catalog?
Second, we learned from the Netflix Prize that mass surveillance was a surprisingly good way to recommend content to people. And hence, the great swallowing of our data commenced. The winning solutions suggested that you could drive more engagement just by looking at a user base's interactions with a service. What people were actually consuming didn't matter for engagement; behavioral cues alone were enough to serve them more content. And so big companies record every interaction we make with their products, and then data mine the hell out of this surveillance to "improve" their products (a.k.a. sell more ads). But did we overlearn this lesson from this one competition? Since the Netflix Prize, there have been zero other large-scale open data sets that capture the essence of recommendation. We'll never know.
Why have there not been more open competitions? Well, that's lesson three. We learned that releasing any data could reveal private information about people. Narayanan and Shmatikov showed that you could identify people by matching their Netflix ratings against their IMDB profiles. And hence a public IMDB profile could reveal private Netflix data. The logical leap is that we'd then know all sorts of things about people based on their private viewing history. I don't know; this still seems weirdly far-fetched to me. Given the actual cyberstalking and doxxing of the past 20 years, the harm model of having public recommender system data sets seems pretty remote. But after Narayanan and Shmatikov's paper, four people sued Netflix, Netflix settled and pulled the Prize data, and no one ever dared assemble a public recommender system dataset again.
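For intuition, here's a toy sketch of that kind of linkage attack: score each anonymized subscriber by how well their ratings overlap a public profile, weighting obscure movies more heavily. This is a deliberate simplification, not Narayanan and Shmatikov's actual algorithm (which also uses rating dates and a statistical confidence measure), and the data below is fabricated.

```python
# Toy sketch of a linkage (de-anonymization) attack; illustrative only.
from collections import defaultdict

def match_profile(public_profile, anonymized_data):
    """Rank anonymized subscribers by overlap with a public ratings profile."""
    # Weight each movie by rarity: matching on an obscure film is far more
    # identifying than matching on a blockbuster everyone has rated.
    movie_counts = defaultdict(int)
    for user_ratings in anonymized_data.values():
        for movie in user_ratings:
            movie_counts[movie] += 1

    def score(user_ratings):
        s = 0.0
        for movie, public_rating in public_profile.items():
            if movie in user_ratings and abs(user_ratings[movie] - public_rating) <= 1:
                s += 1.0 / movie_counts[movie]
        return s

    # A large gap between the best and second-best score is what makes
    # the re-identification convincing.
    return sorted(anonymized_data,
                  key=lambda sid: score(anonymized_data[sid]), reverse=True)

# Example usage with fabricated data:
anon = {
    "user_17": {101: 5, 202: 2, 303: 4},
    "user_42": {101: 5, 404: 1},
}
print(match_profile({101: 5, 303: 4}, anon))  # "user_17" should rank first
```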
This is such a pyrrhic victory for privacy advocates. We decided that corporate power was more important than open data. We now live in a world where companies spy on everything we do at all times. They use this information to make money from advertising. Their products are only getting worse. They feed our input into giant LLMs that they try to sell back to us. Why exactly do we trust these companies with our data? Is it really better that a few giant companies get to know everything about us? We know they can be compelled to share that information with the government. But we can’t release it publicly because there will be lawsuits. Ah, the joys of our bizarre relationship with “privacy.”
Anyway, sometimes my blogs don’t have happy endings or positive solutions (I’ll have a fun and positive blog for you tomorrow! I promise). This was just something that has been bugging me since a chat with a friend yesterday. And I’ve also been thinking about these privacy issues as they relate to medical data, where so much shoddy research is uncheckable because of rough claims about patient privacy. I wish we had a better solution to balance our idealistic views of human-subject research with the reality of the harms of closed data. If you know of any efforts toward such a balance, please let me know in the comments.
I had some tangential info about the competition as it was going on, as I know an anthropologist who was doing some work with Netflix at the time. What I will note is that the contextless recommendation setups (i.e. content doesn't matter) didn't really do much for users. And this basically leads us to where we are with streaming today. Oversimplifying to keep it short: "finding something people will watch" is very different from "helping people find something they want" - but the latter only matters when there is real competition (which Netflix, at the time, did not suffer from).
Interesting rant! A relevant dimension to the Netflix Prize aftermath is the quirky fact that movie rental/viewing records are one of the few things that actually have a privacy law in the United States. See the Wikipedia page for the so-called Video Privacy Protection Act (VPPA) and the bits about Robert Bork. https://en.wikipedia.org/wiki/Video_Privacy_Protection_Act
I am not a lawyer, but I've come to understand that the VPPA was the main legal hook for many of the lawsuits against Facebook Beacon as well as many other US privacy lawsuits in the modern era, most of which have settled. To be a bit simplistic, Facebook had a privacy issue and it sorta didn't matter under US law, but for the fact that it was leaking video viewing history.