13 Comments
Yuval Rabani

"Everyone in computer science has bought into the pact of evaporating oceans for better coding assistants."

That's because nobody in computer science likes coding, or teaching programming courses. Evaporating an ocean after teaching a class using Java is a mild reaction.

But some of us still think that better scaling means something other than the exponential growth of energy bills or Nvidia's valuation.

Ben Recht

Yes. There are dozens of us! :)

Yuval Rabani

"It is not in numbers, but in unity, that our great strength lies; yet our present numbers are sufficient to repel the force of all the world."

Padarn Wilson

I like this post and this reply, but do you mind elaborating on what you mean by “scaling”? I.e., what is the alternative?

Kshitij Parikh

"I could write another thousand words about the confusing examples of parlor games and supervised learning that Sutton presents as evidence of what taught him his lesson" -> Please do.

Taorui Wang

I don't mean to be aggressive, but I wonder whether you have tested OpenAI's most advanced models yourself, on your own problems. While Sutton's argument unfairly attaches his weird assumptions to many researchers, my personal experience is that LLMs have improved greatly since GPT-3.5. I do computational math with applications in various fields. I have tested GPT-5 Thinking and Pro on long proof statements and concepts in stochastic control. They can consistently generalize mathematical deduction patterns from control problems to HJB equations, and even consider the mathematical constraints and potential weaknesses of their own arguments. There are some minor mistakes and errors, but o1 and o3 could not do this. You can argue that all those patterns occur in the training set and can be simulated with enough computation. But the "emergence" of this capability from scaling those "stupid" algorithms is very strange. Maybe we still don't know how far they can get with scaling.
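For context, the HJB (Hamilton–Jacobi–Bellman) equation mentioned above is the optimality condition at the heart of stochastic control. A standard scalar form (the textbook statement, not anything specific to the tests described here) is:

% Value function V(t,x) for minimizing E[ \int_t^T f(x_s,u_s)\,ds + g(x_T) ]
% subject to the controlled diffusion dx_s = b(x_s,u_s)\,ds + \sigma(x_s,u_s)\,dW_s.
\[
  -\partial_t V(t,x) \;=\; \min_{u}\Big\{ f(x,u) + b(x,u)\,\partial_x V(t,x)
    + \tfrac{1}{2}\,\sigma^2(x,u)\,\partial_{xx} V(t,x) \Big\},
  \qquad V(T,x) = g(x).
\]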

Dhruva Kashyap

As someone quite uneducated in alcoholic beverages, I feel quite left out by Ben's probably clever title. Could someone explain what a Negroni Variation is here?

John Quiggin

Another way of looking at Moore's Law: once you have an exponential-time algorithm, your problem is solved in linear real time, since each doubling of compute buys you a constant increment in problem size.

But there are lots of problems that don't have an exponential-time solution.
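To make the first point concrete, here is a back-of-the-envelope sketch in Python (assuming, purely for illustration, an O(2^n) algorithm and compute that doubles every two years):

import math

def solvable_size(years, base_ops=1e9, doubling_period=2.0):
    """Largest n an O(2^n) algorithm can handle when the affordable
    operation count starts at base_ops and doubles every doubling_period years."""
    ops = base_ops * 2 ** (years / doubling_period)
    return math.floor(math.log2(ops))

# Feasible problem size grows linearly in real time: about one unit every two years.
for t in (0, 10, 20, 40):
    print(t, solvable_size(t))  # -> 29, 34, 39, 49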

Roy Fox

Sutton gave an intriguing keynote at RLC last Friday, in which he made his meaning very clear, I think: he wants fully domain-general agents. He wants agents to go out into the world and learn from scratch in any environment without any prior knowledge, and he doesn't care how much compute they need for this. I think that's all he meant, or at least all he'd mean today.

Cagatay Candan

A bitter lesson interpretation: speaking as a person in signal processing, we spent years developing better features for classification problems. Yet AdaBoost, a meta-algorithm, says that if you have features that are just barely useful (weak learners), but very many of them, you can run a simple four-or-five-step algorithm on your training data and get a perfect classifier, if one exists. Keep running it even after zero training error and the generalization gets even better.
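A minimal sketch of the loop described above, with decision stumps as the weak learners (an illustrative Python rendition under standard assumptions, not anyone's production code):

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 trees are the "stumps"

def adaboost(X, y, n_rounds=50):
    """Classic AdaBoost. Labels y must be in {-1, +1}.
    Returns a list of (weight, stump) pairs defining the ensemble."""
    X, y = np.asarray(X), np.asarray(y)
    w = np.full(len(y), 1.0 / len(y))      # start with uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()           # weighted training error
        if err >= 0.5:                     # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        ensemble.append((alpha, stump))
        w *= np.exp(-alpha * y * pred)     # upweight the mistakes
        w /= w.sum()
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * s.predict(X) for a, s in ensemble))

The "keep running after zero training error" part corresponds to AdaBoost continuing to grow the classification margin even once the ensemble's training error hits zero, which is the usual explanation for its surprising generalization.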

This is quite amazing (unsettling), especially to EE people raised to appreciate the importance of models, say, in circuits, electromagnetics, electronics, control, and every other course in the curriculum. There is no engineering without models. This is not only bitter but also sour…

Maxim Raginsky

On the one hand, true. On the other hand, there has always been a great deal of work in communications and information theory on universal (i.e., model-free or at least very weakly model-dependent) schemes for compression, error correction, equalization, etc.
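As a concrete instance of such a universal scheme: Lempel-Ziv parsing compresses any stationary ergodic source at its entropy rate without any model of the source. Here is a toy LZ78 parser in Python (a sketch for illustration, not an optimized coder):

def lz78_parse(s):
    """LZ78 incremental parsing: split s into phrases, each equal to a
    previously seen phrase plus one new symbol. No probabilistic model of
    the source is used; the phrase count governs the compression rate."""
    dictionary = {"": 0}
    phrases, current = [], ""
    for ch in s:
        if current + ch in dictionary:
            current += ch
        else:
            phrases.append((dictionary[current], ch))   # (prefix index, new symbol)
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:  # flush a trailing phrase that is already in the dictionary
        phrases.append((dictionary[current[:-1]], current[-1]))
    return phrases

# A highly repetitive string parses into few phrases, hence high compressibility.
print(len(lz78_parse("abab" * 100)))  # far fewer than 400 phrases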

Ben Recht

I agree with Max here, and also want to be very clear that Sutton's Bitter Lesson is not the one Cagatay is writing about. That said, I do think people in our community did have to come to terms with the limits of "handcrafted" signal processing and well-behaved optimization solvers. I will write about this in a follow-up...

Seth

Coming from statistical modeling rather than comp sci, I always took the "bitter lesson" to be about scaling 'with something' but not necessarily computing power per se. In statistics, for example, you often want a model that is scalable in the sense that it can incorporate different data sources and types.

This can be 'bitter' in the sense that the most fun model that you'd really like to write is a very complicated one that shows off how clever you are. Sadly, such models tend not to scale well with anything.
