In the context of pattern recognition, I find some amusing irony in the Bitter Lesson's exhortation to disregard human knowledge-based algorithmic heuristics in favor of data scaling, as if the output labels were physical facts or measurements.
Yes! The output labels, the data sources, the problem domains...
I am puzzled by this post. My high-level take on the Bitter Lesson's key claim is "for problems where scaling in terms of data and compute is possible, this will eventually outperform any inductive bias you hand-design." And I think this looks pretty plausible!
On games, agreed that while chess (and also Go) definitely have deterministic algorithms that solve them, these algorithms are so complex that they're essentially irrelevant. Instead, strong progress was made on these games when we deployed pattern recognition techniques by building increasingly sophisticated position evaluators. And here, you see that decades of work on HPC and building customized position evaluators (fun fact, I had a UROP at MIT writing assembly language to speed up position evaluators for chess in Charles Leiserson's group) were sufficient to beat human champs at chess, but were nowhere near sufficient at Go, and we only got champion computer Go players (and much better and cheaper computer chess players) when we learned the position evaluators from data --- turning the core of the game into a pattern recognition problem.
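To make the contrast concrete, here is a rough Python sketch (purely illustrative; the feature encoding and the linear model are made up for the example, nothing like Deep Blue's or AlphaGo's actual evaluators): a hand-crafted material-count heuristic versus an evaluator whose scoring rule is estimated from game outcomes.

```python
import numpy as np

# Hand-crafted heuristic: a classical material count, the kind of rule
# human experts refined by hand in the Deep Blue era.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}

def material_eval(board):
    """board: dict of square -> piece letter; uppercase = white, lowercase = black."""
    score = 0
    for piece in board.values():
        value = PIECE_VALUES[piece.upper()]
        score += value if piece.isupper() else -value
    return score

# Learned evaluator: a toy linear model fit to game outcomes in [-1, 1].
# Real systems use deep networks and self-play data; the only point here
# is that the scoring rule is estimated from data, not written by hand.
class LearnedEval:
    def __init__(self, n_features):
        self.w = np.zeros(n_features)

    def fit(self, X, outcomes, lr=0.01, epochs=200):
        # X: (n_positions, n_features) encoded positions
        for _ in range(epochs):
            preds = X @ self.w
            self.w -= lr * X.T @ (preds - outcomes) / len(outcomes)

    def __call__(self, x):
        return x @ self.w
```

The only point is where the scoring rule comes from: in the first case a human writes it down, in the second it is fit to data.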
Similarly, for vision, I think there's at least a reasonable argument to be made that vision transformers have weaker inductive biases than CNNs, although admittedly it's not totally obvious?
So I'm not sure what you're actually arguing.
What is inductive bias?
I'm not convinced this comment is entirely in good faith, but I'll try and respond.
I (roughly, this is not a precise math definition) think of the inductive bias as the choices you make in selecting your function classes for learning. In some sense you have to do this to avoid No Free Lunch theorem kinds of problems.
Comparing CNNs and vision transformers (ViTs), intuitively, CNNs are much more heavily biased towards highly localized features than ViTs. ViTs can pay global attention to patch-level tokens, so they can aggregate full-image information quite early, whereas CNNs are forced to slowly increase the receptive field layer by layer. CNNs were *very* roughly designed to loosely mimic known details of early vision (local receptive fields, convolution and pooling), whereas ViTs took much more of a "let's just throw a complex function class at this and let it go" approach, which I consider a weaker inductive bias. I believe it is much easier to port a function implemented on a CNN onto a ViT than vice versa (e.g., https://arxiv.org/abs/1911.03584). Taking these together, I'd say the ViT learns over a larger, more expressive class of functions that was less explicitly inspired by how humans think vision should work, and this is (again roughly) what I mean by "less inductive bias."
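As a rough PyTorch sketch of what I mean (layer sizes and patch size are arbitrary, purely illustrative): a conv layer can only mix information inside its small kernel, so global context accumulates slowly over depth, while a single self-attention layer over patch tokens can mix any patch with any other from the very first layer.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # a single RGB image

# CNN layer: each output position only sees a 3x3 local neighbourhood,
# so global information accumulates slowly, layer by layer.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
local_features = conv(x)  # (1, 64, 224, 224), receptive field is 3x3

# ViT-style: cut the image into 16x16 patches, embed them as tokens,
# and let one self-attention layer mix *all* patches with each other.
patches = x.unfold(2, 16, 16).unfold(3, 16, 16)           # (1, 3, 14, 14, 16, 16)
tokens = patches.reshape(1, 3, 14 * 14, 16 * 16)          # flatten the patch grid
tokens = tokens.permute(0, 2, 1, 3).reshape(1, 196, 768)  # (batch, 196 tokens, 3*16*16)

embed = nn.Linear(768, 64)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
t = embed(tokens)
global_features, _ = attn(t, t, t)  # every patch attends to every other patch
```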
This is all above my pay grade and I have the pleasure of having graduated middle of the pack from Canada's worst-ranked university. But is the point that there is a difference between "games" and non-game scenarios? I.e., cases where the rules are defined and created explicitly by humans and the worlds are "small" (a la Savage, Binmore, or the Ludic Fallacy of N. Taleb). These games are tough for humans (which is why we like 'em), but some of our current methods may not be suited for "large" non-artificial worlds (reality), where the rules aren't even truly defined (someone has to choose the rules to optimize over), so you don't even really know if everyone else is playing the same game ... or if it's really a game at all. But perhaps I'm missing the point ...
I am not sure the distinction between deterministic and non-deterministic is meaningful. The Game of Life is fully deterministic and yet there is no practical way of predicting the outcome except for running the simulation. You can say that philosophically the nature of uncertainty in TGOL is different from that of the real world, as we know the exact underlying rules of one but not the other. This might be true, but I don't see many real implications of that distinction. If the dreams of Wolfram were to come true and the world turned out to be a deterministic cellular automaton, would it change anything fundamental in our interactions with reality?
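To be concrete about "no practical way except running the simulation": the full rule set fits in a few lines of NumPy, yet the only general way to know the grid after n steps is to iterate n times. This is just the standard update, written for illustration.

```python
import numpy as np

def life_step(grid):
    """One deterministic Game of Life update (toroidal wrap-around)."""
    # Count the 8 neighbours of every cell by summing shifted copies of the grid.
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell lives next step iff it has 3 neighbours, or 2 and it is already alive.
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(int)

# The rules are completely known, yet predicting the state after n steps
# generally requires running all n steps.
grid = np.random.randint(0, 2, size=(64, 64))
for _ in range(1000):
    grid = life_step(grid)
```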
Sure, abstractly Go can be solved by a lookup table. But if the size of that table exceeds the storage available in our universe, this seems like a meaningless claim. As Rif mentioned above, the progress in Go came from rapid generation of automated heuristics based on ML with minimal human input. In that sense, it is quite different from Deep Blue, where much of the effort went into the formalization of human intuitions. I take "the bitter lesson" to mean that human intuitions are far less useful than we were led/would like to believe; this seems to be a valid demonstration of that principle.
I'm not making a distinction between deterministic and nondeterministic. I'm making a distinction between board games (where rules are well defined constructs and never change) and pattern recognition more broadly.
I agree that there is a distinction. I am just not sure how important it is for sufficiently complex games. Suppose we discovered Wolfram-type rules for the universe, would it change much about anything above the level of particle physics?
I am probably not understanding. But is it obvious that there is a win condition for TGOL and the universe, even if the latter were found to have some set of rules? Like, who wins at the end of the universe? Is finding the mapping of how the CA transitions from one state to the next the same problem as finding the optimal trajectory in a sufficiently complex game? I guess you may be able to cast, say, TGOL in terms of a game (find the optimal trajectory to win), but isn't it the case that here there is no optimal trajectory other than "one tick forward"? It seems games by definition are winnable, and tractable with a limited amount of attention and memory.
Just to add: the difference between AlphaGo and AlphaZero is the amount of human knowledge injected into the program. AlphaZero knows only the game rules and improves through self-play (a lonely child over the summer vacation!); it exceeded human-master play in a few hours. There are documentaries about this on YouTube for the interested. While searching for the documentaries, I found this 2018 New Yorker article, which is perhaps better. https://www.newyorker.com/science/elements/how-the-artificial-intelligence-program-alphazero-mastered-its-games
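Schematically, the self-play loop looks like the sketch below. Every function here is a stand-in stub made up to show the structure, not DeepMind's code; the real AlphaZero adds Monte Carlo tree search, a deep residual policy/value network, and massive parallel self-play.

```python
import random

def legal_moves(state):
    # Stub: pretend every state has nine legal moves.
    return list(range(9))

def apply_move(state, move):
    # Stub: append the move to the state history.
    return state + (move,)

def game_over(state):
    # Stub: end after nine moves.
    return len(state) >= 9

def outcome(state):
    # Stub: a random result in {-1, 0, +1} standing in for win/draw/loss.
    return random.choice([-1, 0, 1])

def self_play_game(policy):
    """Play one game in which the same policy chooses moves for both sides."""
    state, history = (), []
    while not game_over(state):
        move = policy(state, legal_moves(state))
        history.append((state, move))
        state = apply_move(state, move)
    return history, outcome(state)

def train(policy_params, games):
    # Stub: a real system fits the policy/value network to the recorded
    # states, chosen moves, and final outcomes.
    return policy_params

policy_params = {}
policy = lambda state, moves: random.choice(moves)
for _ in range(100):
    games = [self_play_game(policy) for _ in range(10)]
    policy_params = train(policy_params, games)
```

The only human knowledge in the loop is the rules themselves (the `legal_moves` / `apply_move` / `game_over` interface); everything else is learned from the self-generated games.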
I see the bitter truth as a focus on “hardware” or architecture, not “software”. A good architecture (a rich class, capable of approximating the optimal solution as well as possible) together with optimization potential is more important than problem-specific, hand-crafted solutions.
This is indeed bitter for an educator in a field giving its highest degree to the top humans working very hard for many years to specialize in a very narrow area.
> So we should learn a lesson. Let’s try to apply it: a multilayer perceptron with one hidden layer, a convolutional neural network, and a transformer all “leverage” computation and don’t “leverage their human knowledge of the domain.”
I would say an MLP doesn't leverage computation (in Sutton's sense of "using a lot of computation"). Do you mean maybe an extremely wide one?
Between a CNN and a Transformer, arguably the CNN uses more human knowledge, in that it explicitly encodes which pixels "talk to" each other, whereas the Transformer allows a little more of this to be learned. But ok, again this is not a rigorous statement. In fact I would say the Transformer is not really that much better than other architectures, except that in practice it is easier to parallelise and scale training - which is Sutton's point.
Your remarks on “evaluation by competitive testing” ring true.
Do you think it’s an accident that evaluation by competitive testing has been so effective with pattern-recognition problems?
Or do you think (say) physics or chemistry could also be improved by competitive testing?
In practice, doesn't pattern recognition mean classification that matches some set of predetermined judgements, for which the basic method is linear discriminant analysis?
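For what that looks like in practice, here is a tiny scikit-learn example (the toy data and labels are invented): the "ground truth" is just whatever predetermined judgement someone attached to the points.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: two Gaussian blobs whose labels are predetermined judgements,
# not physical measurements of anything.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict([[1.5, 1.5]]))  # assigns whichever predetermined label fits best
```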
Always happy to see the Prophet of AI, David Hume, making an appearance on Substack. He is the muse of Systematica - https://billatsystematica.substack.com/p/brute-force-hume-and-human-ai