I appreciate everyone’s feedback on Monday’s post. It’s helpful when we all stare at the magic mirror together to see what we see. I’m now compelled to spend the rest of the week on this Bitter Lesson as there’s a lot to unpack. And also because I like bitter things.
I want to first pick on some peculiarities of the essay, and then turn to why I think Sutton, who was an expert in games but not pattern recognition, misunderstood the transformations that occurred in supervised learning in the 2010s. In subsequent posts, I’ll turn to my own bitter lessons from the 2010s.
One thing that always bothered me about this essay was Sutton’s examples conflating parlor games/Gesellschaftsspiele and pattern recognition. His four examples are chess, Go, speech recognition, and computer vision.
Board games and pattern recognition are fundamentally different problems. And their computational solutions were also quite different, even though both use “neural networks” as part of the solution. Chess and Go are both perfect information games. They have clear, unambiguous rules and clear definitions of a win, a loss, and a draw. John von Neumann and Oskar Morgenstern went so far as to call chess trivial because there is definitely a deterministic algorithm that solves the game.1 You could either write a codebook to predict the optimal next move from any board state, or you could enumerate the game tree down to 40 moves and backtrack to find an optimal move. There is undoubtedly an algorithm, though one of unfathomable complexity, that solves chess.
Building a computer program to approximately solve chess is, of course, a daunting engineering challenge. Claude Shannon proposed the algorithm that would eventually beat chess in 1950, before we even had computers that could attempt his approximations. Shannon proposed performing a truncated tree search with a reasonable function at the leaf nodes to evaluate the quality of a board position. It would take decades for computers to be fast enough for Shannon’s methods to play competitive chess. But riding the wave of Moore’s Law, it was only a matter of time before Shannon search would be closer to the optimal playbook than any person’s strategy. Shannon’s algorithm would form the basis of the strategy that IBM programmed into Deep Blue to beat Kasparov in 1997.
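To make the shape of Shannon’s proposal concrete, here is a minimal sketch of a depth-limited minimax search that scores leaf positions with a hand-tuned heuristic. The `evaluate`, `legal_moves`, `apply_move`, and `is_terminal` functions are hypothetical placeholders for whatever board representation and evaluation function you choose; this is a sketch of the idea, not Deep Blue.

```python
# Minimal sketch of Shannon-style truncated game-tree search (minimax).
# `evaluate`, `legal_moves`, `apply_move`, and `is_terminal` are hypothetical
# placeholders for a real board representation and a hand-tuned heuristic.

def minimax(board, depth, maximizing):
    """Search the game tree to a fixed depth, scoring leaves heuristically."""
    if depth == 0 or is_terminal(board):
        return evaluate(board)  # heuristic score of the position
    scores = (minimax(apply_move(board, m), depth - 1, not maximizing)
              for m in legal_moves(board))
    return max(scores) if maximizing else min(scores)

def best_move(board, depth=4):
    """Pick the move whose subtree has the best minimax score."""
    return max(legal_moves(board),
               key=lambda m: minimax(apply_move(board, m), depth - 1, False))
```

With more compute, you search deeper and the heuristic at the leaves matters less; that is the sense in which Shannon search rides Moore’s Law.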
However, that Shannon’s approach worked is only a bitter lesson if you were committed to the Newell-Simon program of using chess to understand human expertise. People do not play chess the way computers do, but the idea that a computer with enough power could beat a human was known in the 1940s! Though the algorithms are a bit more complicated and the computing hardware a few generations further along in Moore’s Law’s exponential wake, the solution to Go wasn’t a huge leap from this basic strategy of tree search and heuristic board evaluation.
Pattern recognition, on the other hand, is fundamentally a problem of induction. What is the goal of pattern recognition? I give you a sequence of examples and ask you to infer a property of the next element of the sequence. There is not and cannot be a codebook that mathematically provably solves this problem. We have known this for three hundred years. Unlike in chess, there cannot be a mathematical proof that a pattern recognition system works.
Perfect information games are trivial. Pattern recognition is not.
I don’t think I fully appreciated this point until recently, so let me not pick on Sutton for not realizing it in 2019. How pattern recognition differs from perfect information games helps illustrate why the current landscape of machine learning engineering design is frustratingly convoluted.
Sutton writes about computer vision:
“Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.”
SIFT was from 1999. I know this is “ancient history” now, but I would not call it “early” computer vision. SIFT features were invented after convolutional neural networks, and the two co-existed for different purposes in the computer vision community. Yann LeCun, no fan of shallow learning methods, would always sing the praises of SIFT, not least because SIFT had been rejected three times by illustrious computer vision conferences.
Having looked at SIFT features myself, I’m not sure I’d say that they leverage human knowledge of the domain more than convolutional neural networks do. As Yann LeCun also liked to point out, SIFT features are a convolutional neural network, just one where David Lowe crudely hand-coded the weights. If you look at the lowest layers of a trained convolutional neural net, you always see simple SIFT-like edge detectors.
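To illustrate what “a convolutional network with hand-coded weights” looks like in miniature, here is a hedged sketch of a fixed Sobel edge filter applied by convolution. This is not Lowe’s actual SIFT pipeline, just the kind of human-written gradient filter that both SIFT and the first layer of a trained conv-net end up computing.

```python
import numpy as np
from scipy.signal import convolve2d

# A hand-coded edge detector: a fixed 3x3 Sobel kernel applied by convolution.
# Illustrative only; real SIFT does much more (scale space, orientation
# histograms), but it starts from human-written gradient filters like this.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def edge_response(image):
    """Horizontal-gradient map of a 2D grayscale image (a float array)."""
    return convolve2d(image, sobel_x, mode="same", boundary="symm")
```

A learned conv-net’s first-layer filters typically look a lot like `sobel_x`, just discovered from data rather than written down by a person.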
Moreover, if you want to solve pattern recognition, it isn’t sufficient to merely output SIFT features. You have to apply some sort of classification function to them, be it linear or nonlinear. Nonlinear classification on SIFT features “leverages” computation in the sense that if you have enough data, its error rates decrease.
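For concreteness, the pre-AlexNet recipe looked roughly like the sketch below: fixed feature vectors in, a learned (here nonlinear) classifier out. The random `X` and `y` are placeholders standing in for pooled SIFT descriptors and image labels; this is a minimal sketch, not the actual 2012 pipeline.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of the pre-AlexNet recipe: fixed features in, learned classifier out.
# `X` stands in for precomputed descriptors (e.g., pooled SIFT histograms),
# `y` for image labels; both are random placeholders here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))   # 200 "images", 128-dim feature vectors
y = rng.integers(0, 2, size=200)  # binary labels

clf = SVC(kernel="rbf", C=1.0).fit(X, y)  # nonlinear classifier on top of the features
print(clf.score(X, y))                    # training accuracy of the pipeline
```

The classifier on top is the part that “leverages” computation and data; the features underneath stay fixed no matter how many examples you collect.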
However, convolutional neural networks undoubtedly outperformed SIFT-based methods in the 2012 ImageNet competition. And people immediately saw how to “scale” AlexNet, whereas it was not clear how to move beyond the 2nd place, SIFT-based entry.2
Now, Sutton is vaguely suggesting that SIFT used more human knowledge than conv-nets and that this hindered progress. So we should learn a lesson. Let’s try to apply it: a multilayer perceptron with one hidden layer, a convolutional neural network, and a transformer all “leverage” computation and don’t “leverage their human knowledge of the domain.” Why do we use transformers instead of MLPs or conv-nets?
The Bitter Lesson has nothing to say, but the answer is clear. It’s because transformers seem to work better on evaluations people care about at the scale of data they care about. Is that a fully rigorous statement? No. Are the rules of evaluation as cut and dried as they are in parlor games? No. If pattern recognition is a game, it’s Calvinball. But Calvinball competitive testing is enough for machine learning people to keep hammering away, collecting more papers, VC funding, nine-figure salaries, and commitments to protectionist trade policies.
What unites parlor games and pattern recognition is evaluation by competitive testing. Whatever seems to work best today on a particular leaderboard is what people run with tomorrow. It is undeniable that competitive testing has been the primary driver of the academic and industrial branches of machine learning, which has consumed the rest of artificial intelligence. That said, I worry we have lost something in the embrace of pure scaling through Calvinball. This design pattern can’t tell us what to do after we hit the inevitable wall of diminishing returns.
I write about this history of computational game play in The Irrational Decision. Shameless plugs will abound until the book comes out. Preorder links: [Amazon], [Bookshop]
The 2nd place team did not do themselves any favors by having a closed code base that was infuriatingly challenging to replicate.
I am puzzled by this post. My high-level take on the Bitter Lesson's key claim is "for problems where scaling in terms of data and compute is possible, this will eventually outperform any inductive bias you hand-design." And I think this looks pretty plausible!
On games, agreed that while chess (and also Go) definitely have deterministic algorithms that solve them, these algorithms are so complex that they're essentially irrelevant. Instead, strong progress was made on these games when people deployed pattern recognition techniques by building increasingly sophisticated position evaluators. And here, you see that decades of work on HPC and building customized position evaluators (fun fact, I had a UROP at MIT writing assembly language to speed up position evaluators for chess in Charles Leiserson's group) were sufficient to beat human champs at chess, but were nowhere near sufficient at Go. We only got champion computer Go players (and much better and cheaper computer chess players) when we learned the position evaluators from data, turning the core of the game into a pattern recognition problem.
Similarly, for vision, I think there's at least a reasonable argument to be made that vision transformers have weaker inductive biases than CNNs, although admittedly it's not totally obvious?
So I'm not sure what you're actually arguing.
> So we should learn a lesson. Let’s try to apply it: a multilayer perceptron with one hidden layer, a convolutional neural network, and a transformer all “leverage” computation and don’t “leverage their human knowledge of the domain.”
I would say an MLP doesn't leverage computation (in Sutton's sense of "using a lot of computation"). Do you mean maybe an extremely wide one?
Between a CNN and a Transformer, arguably the CNN uses more human knowledge, in that it explicitly encodes which pixels "talk to" each other, whereas the Transformer allows a little more of this to be learned. But ok, again this is not a rigorous statement. In fact I would say the Transformer is not really that much better than other architectures, except that in practice it is easier to parallelise and scale training, which is Sutton's point.
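To make that contrast concrete, here's a hedged toy sketch (all sizes and layers are arbitrary placeholders): a small convolution hard-codes which nearby positions interact, while self-attention computes data-dependent weights between every pair of positions.

```python
import torch
import torch.nn as nn

# Toy contrast between fixed and learned connectivity. Sizes are placeholders.
x = torch.randn(1, 16, 8)  # (batch, sequence length, embedding dim)

conv = nn.Conv1d(8, 8, kernel_size=3, padding=1)  # each output mixes 3 fixed neighbors
attn = nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)

y_conv = conv(x.transpose(1, 2)).transpose(1, 2)  # hand-coded local connectivity
y_attn, weights = attn(x, x, x)                   # learned pairwise connectivity
print(y_conv.shape, y_attn.shape, weights.shape)  # weights is (1, 16, 16), one per pair
```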