21 Comments
Kshitij Parikh:

"I could write another thousand words about the confusing examples of parlor games and supervised learning that Sutton presents as evidence of what taught him his lesson" -> Please do.

Yuval Rabani:

"Everyone in computer science has bought into the pact of evaporating oceans for better coding assistants."

That's because nobody in computer science likes coding, or teaching programming courses. Evaporating an ocean after teaching a class using Java is a mild reaction.

But some of us still think that better scaling means something other than exponential growth rate of energy bills or Nvidia's valuation.

Ben Recht:

Yes. There are dozens of us! :)

Yuval Rabani:

"It is not in numbers, but in unity, that our great strength lies; yet our present numbers are sufficient to repel the force of all the world."

Padarn Wilson:

I like this post and this reply, but do you mind elaborating on what you mean by “scaling”? I.e., what is the alternative?

Yuval Rabani:

The traditional meaning of scalability is computational solutions whose resource requirements grow modestly with problem size. Quantitatively, this could mean different things in different contexts.

Now, we've actually had exponential growth of resources for decades, through Moore's law. That must have contributed to scalability. Moore's law started (1965) with the number of transistors on a chip doubling each year (with Intel's CEO predicting doubling every 18 months in the same year). It supposedly implies a similar increase in the speed of computation and memory, along with a decrease in their relative cost, and this has indeed happened. Since 1965, the cost of computing (per FLOP) and the cost of memory (per byte) have decreased by roughly 8 to 10 orders of magnitude (a factor of 100 million to 10 billion; 2^30 is roughly a billion). Energy efficiency has doubled roughly every 18 months (Koomey's law: FLOPS per joule doubles every 18 months).

What's happening now appears to contradict this historical trend. It seems that Moore's law and Koomey's law are broken, so scaling means spending more money on hardware and more energy on computing. I don't know if this implies that the algorithms are not scalable. Linear time/space means that if you double the size of the input, you double the resources needed. Stored data doubles every few years. LLM model size also grows at a terrific rate (though there are only a few years to compare). This can't go on forever, or even for very long (using more and more data to train larger and larger models), especially at a time when Moore's law has slowed down considerably. The money supply is limited, and so is the energy supply. They grow, but at a much smaller rate.
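[Editor's note: a back-of-the-envelope check of the orders-of-magnitude figures above; an illustrative sketch only, using the doubling periods and the roughly 60-year window quoted in the comment.]

```python
# Rough arithmetic behind the orders-of-magnitude claims above (illustrative only).
import math

years = 60  # roughly 1965 to now
for label, months_per_doubling in [("two-year doubling", 24),
                                   ("18-month doubling (Koomey)", 18)]:
    doublings = years * 12 / months_per_doubling
    print(f"{label}: {doublings:.0f} doublings "
          f"= about {doublings * math.log10(2):.0f} orders of magnitude")
# Two-year doubling over 60 years gives ~30 doublings, i.e. 2^30 ≈ 1e9,
# consistent with the "8 to 10 orders of magnitude" figure for cost per FLOP.
```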

Padarn Wilson:

Thanks, very clear

Yuval Rabani:

Just to fix an error above: I meant to say that in 1965 Moore predicted doubling each year, and in 1975 revised it to doubling every two years. Intel's CEO in 1975 was quoted as saying doubling every 18 months.

Cagatay Candan:

A bitter lesson interpretation: as people in signal processing, we spent years developing better features for classification problems. Yet AdaBoost, a meta-algorithm, says that if you have barely useful features (weak learners), but very many of them, you can run a simple 4-5 step algorithm on your training data and get a perfect classifier, if one exists. Keep running it even after reaching zero training error and the generalization gets even better.

This is quite amazing (unsettling), especially to EE people raised on the importance of models, say in circuits, electromagnetics, electronics, control, and every other course in the curriculum. There is no engineering without models. This is not only bitter but also sour…
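[Editor's note: to make the 4-5 step loop concrete, here is a minimal sketch of discrete AdaBoost with decision stumps as the barely useful weak learners; an illustrative implementation, not code from the comment or the post.]

```python
# Minimal AdaBoost sketch: decision stumps as weak learners, labels in {-1, +1}.
import numpy as np

def fit_stump(X, y, w):
    """Pick the single-feature threshold rule with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)  # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # 1. start with uniform weights
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = fit_stump(X, y, w)     # 2. fit a weak learner to the weighted data
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # 3. give it a vote proportional to its edge
        pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
        w *= np.exp(-alpha * y * pred)            # 4. upweight the examples it got wrong
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))     # 5. add it to the ensemble and repeat
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(p * (X[:, j] - t) > 0, 1, -1) for a, j, t, p in ensemble)
    return np.sign(score)
```

The margin story is the usual explanation for the generalization behavior mentioned above: the weighted vote keeps widening the margins of the training points even after the training error hits zero.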

Maxim Raginsky:

On the one hand, true. On the other hand, there has always been a great deal of work in communications and information theory on universal (i.e., model-free or at least very weakly model-dependent) schemes for compression, error correction, equalization, etc.

Ben Recht:

I agree with Max here, and also want to be very clear that Sutton's Bitter Lesson is not the one Cagatay is writing about. That said, I do think people in our community did have to come to terms with the limits of "handcrafted" signal processing and well-behaved optimization solvers. I will write about this in a follow up...

Seth:

Coming from statistical modeling rather than comp sci, I always took the "bitter lesson" to be about scaling 'with something' but not necessarily computing power per se. In statistics, for example, you often want a model that is scalable in the sense that it can incorporate different data sources and types.

This can be 'bitter' in the sense that the most fun model that you'd really like to write is a very complicated one that shows off how clever you are. Sadly, such models tend not to scale well with anything.

Joe Jordan:

The bitter lesson is wrong because it doesn't take into account the profit motive. If anyone can get better scaling by spending more money, then there will be very little profit to be made (if the only barrier to entry is money, then there is functionally no barrier to entry). Companies tend not to invest in areas where they can't make a profit, and low entry barriers mean more competition and thus less profit.

Matt Hoffman:

> Every theoretical and practical result in reinforcement learning shows that it doesn’t leverage computation.

Can you elaborate on this statement? RL performance improves as a function of the number of rollouts, and most of the big RL success stories I can think of involve throwing a ton of computation at a huge number of rollouts. What am I missing?

Ben Recht:

Enumerating pure random evaluations also improves as you increase the number of random guesses. Is that leveraging computation?

I will write more about reinforcement learning at some point... I have written about it before, and nothing has changed in the ensuing decade.

https://archives.argmin.net/2018/06/25/outsider-rl/

https://arxiv.org/abs/1806.09460

But the ethos persists.
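[Editor's note: a toy sketch of the random-guessing point above, not from the thread: best-of-N random search on an arbitrary black-box objective also "improves with compute."]

```python
# Best-of-N random search: the best value found improves as you spend more
# evaluations, with no algorithmic insight beyond "guess more."
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: -np.sum((x - 0.3) ** 2)   # any black-box objective to maximize

for n_guesses in (10, 100, 1_000, 10_000):
    best = max(f(x) for x in rng.uniform(-1, 1, size=(n_guesses, 5)))
    print(n_guesses, round(best, 4))
# In expectation, the best value found improves monotonically with n_guesses.
```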

Matt Hoffman:

> Enumerating pure random evaluations also improves as you increase the number of random guesses. Is that leveraging computation?

I think most people might say yes? I guess it depends on whether you see compute as the lever or as the force being applied. RL is a bad lever in the sense that it doesn't multiply the effectiveness of your compute (at least in situations where better alternatives exist). But it _uses_ compute as a lever that multiplies the effectiveness of simple methods, and in that sense RL can be very high-leverage.

I suspect that this "utilize" sense of "leverage" is what Sutton had in mind, both because it makes more sense (IMO) and because it seems more consistent with its usage in phrases like "researchers seek to leverage their human knowledge of the domain".

Ben Recht:

I really don't think most people would say yes. Fortunately or unfortunately, I don't know how to make a poll on Substack.

If this is what Sutton had in mind, then the lesson was only bitter because he refused to listen to anyone who had looked at machine learning before 2019. 1-nearest-neighbor also leverages computation in the way you describe, and it even comes with proofs of optimality from the 1960s.

To be fair, that might be the case: the fact that he uses chess as an example suggests that he was after something very alien to me. The algorithm that beat Kasparov was a variant of the one proposed by Claude Shannon in 1950.
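[Editor's note: for reference, a sketch of the 1-nearest-neighbor rule mentioned above. There is no model and no training, just more stored data and more distance computations per query; the 1960s optimality result referred to is Cover and Hart's bound of twice the Bayes error.]

```python
# 1-nearest-neighbor: it "leverages computation" purely by storing more data
# and computing more distances at query time.
import numpy as np

def one_nn_predict(X_train, y_train, X_query):
    # O(n) distance computations per query; accuracy improves as n grows.
    dists = np.linalg.norm(X_train[None, :, :] - X_query[:, None, :], axis=-1)
    return y_train[np.argmin(dists, axis=1)]
```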

Dhruva Kashyap:

As someone quite uneducated in alcoholic beverages, I feel quite left out, missing Ben's probably clever title. Could someone explain what a Negroni Variation is here?

Ben Recht:

A negroni is a cocktail: equal parts gin, sweet vermouth, and Campari.

A negroni variation is a cocktail with three components: a high-alcohol spirit, a fortified wine or liqueur, and a bitter liqueur (called an amaro in Italian).

The bitter component explains the title.

Negroni variations are also my personal favorite genre of cocktails.

John Quiggin:

Another way of looking at Moore's law: once you have an exponential-time algorithm, the problem size you can handle grows linearly in real time.

But there are lots of problems that don't even have an exponential-time solution.
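[Editor's note: a small worked example of that observation, under the assumption discussed earlier in the thread that available compute doubles every 18 months or so.]

```python
# With compute doubling every 18 months, an O(2^n) algorithm gains one unit of
# feasible input size per doubling, i.e. the tractable n grows linearly in time.
doubling_months = 18
years = 30
extra_doublings = years * 12 / doubling_months
print(f"After {years} years: feasible n for a 2^n algorithm grows by about {extra_doublings:.0f}")
# Exponential hardware growth buys only linear growth in tractable problem size
# for exponential-time algorithms, and essentially nothing for problems whose
# best algorithms grow even faster.
```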

Roy Fox:

Sutton gave an intriguing keynote at RLC last Friday, in which he made his meaning very clear, I think: he wants fully domain-general agents. He wants agents to go out into the world and learn from scratch in any environment without any prior knowledge, and he doesn't care how much compute they need for this. I think that's all he meant, or at least all he'd mean today.
