Discussion about this post

rif a saurous

I am puzzled by this post. My high-level take on the Bitter Lesson's key claim is "for problems where scaling in terms of data and compute is possible, this will eventually outperform any inductive bias you hand-design." And I think this looks pretty plausible!

On games, agreed that while chess (and also go) definitely have deterministic algorithms that solve them, these algorithms are so complex that they're essentially irrelevant. Instead, strong progress was made on these games when we deployed pattern recognition techniques, building increasingly sophisticated position evaluators. And here, you see that decades of work on HPC and building customized position evaluators (fun fact, I had a UROP at MIT writing assembly language to speed up position evaluators for chess in Charles Leiserson's group) were sufficient to beat human champions at chess, but were nowhere near sufficient at Go, and we only got champion computer Go players (and much better and cheaper computer chess players) when we learned the position evaluators from data --- turning the core of the game into a pattern recognition problem.
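
As a minimal sketch of that shift (assuming PyTorch; the board encoding, layer sizes, and piece values here are illustrative, not any engine's actual ones), the two approaches compute the same kind of thing, a position-to-score map, but one is hand-tuned and the other is fit from data:

```python
import torch
import torch.nn as nn

# Hand-designed evaluator: material weights picked by a human expert.
# Signed counts: positive for our pieces, negative for the opponent's.
PIECE_VALUES = {"pawn": 1.0, "knight": 3.0, "bishop": 3.0,
                "rook": 5.0, "queen": 9.0}

def handcrafted_eval(signed_piece_counts: dict) -> float:
    return sum(PIECE_VALUES[p] * n for p, n in signed_piece_counts.items())

# Learned evaluator: the same mapping (position -> score), but every
# weight is fit from game data instead of being hand-tuned.
class LearnedEval(nn.Module):
    def __init__(self, planes: int = 12):  # e.g. one plane per piece type and side
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 1),  # scalar position score
        )

    def forward(self, board: torch.Tensor) -> torch.Tensor:
        return self.net(board)

score = LearnedEval()(torch.zeros(1, 12, 8, 8))  # dummy 8x8 board encoding
```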

Similarly, for vision, I think there's at least a reasonable argument to be made that vision transformers have weaker inductive biases than CNNs, although admittedly it's not totally obvious.

So I'm not sure what you're actually arguing.

James McDermott

> So we should learn a lesson. Let’s try to apply it: a multilayer perceptron with one hidden layer, a convolutional neural network, and a transformer all “leverage” computation and don’t “leverage their human knowledge of the domain.”

I would say an MLP doesn't leverage computation (in Sutton's sense of "using a lot of computation"). Do you maybe mean an extremely wide one?

Between a CNN and a Transformer, arguably the CNN uses more human knowledge, in that it explicitly encodes which pixels "talk to" each other, whereas the Transformer allows a little more of this to be learned. But ok, again this is not a rigorous statement. In fact I would say the Transformer is not really that much better than other architectures, except that in practice it is easier to parallelise and scale in training, which is Sutton's point.
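
A toy illustration of that difference (a rough sketch, assuming PyTorch; the shapes are chosen only to keep it small): the convolution's connectivity is fixed by its kernel size before any data is seen, while self-attention computes its connectivity from the input.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 3)  # a sequence of 16 "pixels" with 3 channels

# CNN: which positions talk to each other is hard-coded by kernel_size.
# Position i only ever sees positions i-1, i, i+1, regardless of the data.
conv = nn.Conv1d(in_channels=3, out_channels=3, kernel_size=3, padding=1)
y_conv = conv(x.transpose(1, 2)).transpose(1, 2)

# Transformer self-attention: connectivity is computed from the input.
# Every position can attend to every other, with data-dependent weights.
attn = nn.MultiheadAttention(embed_dim=3, num_heads=1, batch_first=True)
y_attn, weights = attn(x, x, x)
print(weights.shape)  # (1, 16, 16): a full learned interaction pattern
```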
