If you’re out there on the socials, everyone is worried that the GPT-5 release shows we’re finally “hitting a wall” with LLM progress. My dear friend and Twitter addict Dimitris Papailiopoulos exclaimed:
“It is funny that we've accepted the diminishing returns of scaling laws as the only roadmap to intelligence. And by funny I mean sad.”
Shital Shah devastatingly replied to Dimitris with the pedagogical thought experiment:
“It's year 2018. You walk into Algorithms and Data Structures class. You tell students to just use whatever algorithm first comes to their mind, throw in a ton of compute and call it scaling. ‘That's a bitter lesson for you all’, you say, and leave the classroom.”
Shital’s tweet had me reeling for multiple reasons. It’s August, so I have been thinking about what I am going to teach this fall semester. I’m a bit worried I agreed more than disagreed with his assessment of the future of computer science pedagogy. To explain why, I first need to unpack the bitter lesson meme and how it has captured computer science.
In case you don’t know, “The Bitter Lesson” is the title of a short blog post by Turing Laureate Richard Sutton from 2019. The bitter lesson is one of those researcher magic mirrors. People see whatever they want to see. Like much of Sutton’s writing, it’s vague and nontechnical, allowing for many interpretations. Shital motivated me to go back and stare in that mirror, actually reading it for the first time in years.
Sutton begins:
“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”
What does it mean to leverage computation? He never says. What he does say is “The ultimate reason for this is Moore's law,...” which is an interesting lesson for us in 2025 since we’re well past Moore’s Law actually delivering. Moore’s Law used to mean that if you had an algorithm that needed 2x your current computing, you just had to wait two years, and that computing would appear magically at the Apple Store. Today, we’re in the interesting phase where spending on AI infrastructure has to grow exponentially to keep up with the expectations set by Moore’s Law. Given that we’re in such a competitive bubble driving up NVIDIA stock prices, the costs probably have to more than double every two years. That’s a different sort of bitter lesson that, like with all speculative bubbles, we’ll have to deal with eventually.
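To put some hedged numbers on that back-of-envelope claim, here’s a minimal sketch in Python. The two-year doubling period and the assumed 1.1x price-performance improvement are illustrative assumptions of mine, not data; the point is only that once the free doublings stop, the same growth curve has to be bought with an exponentially growing bill.

```python
# Back-of-envelope sketch: compute growth under Moore's-Law-style doubling
# versus the spending required to buy that growth once price-performance
# improvements slow down. All numbers are illustrative assumptions.

def compute_at_fixed_cost(years, doubling_period=2.0):
    """Old regime: a fixed budget buys 2x the compute every doubling_period years."""
    return 2.0 ** (years / doubling_period)

def spending_to_match(years, doubling_period=2.0, price_improvement=1.1):
    """New regime (assumed): price-performance only improves by price_improvement
    per doubling period, so matching the old curve takes exponentially more money."""
    wanted = 2.0 ** (years / doubling_period)                 # compute the old curve promised
    afforded = price_improvement ** (years / doubling_period)  # what flat spending now buys
    return wanted / afforded                                   # required spending multiplier

for t in (2, 4, 6, 8, 10):
    print(f"year {t:2d}: promised compute x{compute_at_fixed_cost(t):6.1f}, "
          f"required spending x{spending_to_match(t):6.1f}")
```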
Sutton continues:
“Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance)”
This passage reveals a salient perspective of old school AI researchers that most people miss when reading the essay. I’ve been engaged in machine learning research for over twenty years. I never thought this way. No one I ever talked to at N(eur)IPS ever thought this way. However, when I was in grad school, “AI” researchers did think this way. I always thought those people were weird (yes, I’m talking about Marvin Minsky and his acolytes).
Indeed, that Sutton’s Lesson was Bitter suggests he wasn’t a fan of computer science. The only intellectual through line in computer science is that a faster computer is always a better computer. Every course in computer science, except for HCI and good old-fashioned AI, is obsessed with computation being a variable that grows. “Scaling is all you need” is the motto of computer science.
This motto doesn’t mean you use crappy algorithms, but it is why computing researchers spend so much time thinking about scaling. Except in very special resource-constrained cases, the goal is always to build algorithms that minimize computation time when presented with larger tasks.
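As a toy illustration of what that kind of scaling means in an algorithms course (my example, not Sutton’s or Shital’s): two correct solutions to the same problem, distinguished only by how their running time grows with the input size.

```python
# Toy illustration of algorithmic scaling: two correct ways to detect a
# duplicate in a list, one quadratic in n, one n log n.
import random
import time

def has_duplicate_quadratic(xs):
    """O(n^2): compare every pair of elements."""
    return any(xs[i] == xs[j] for i in range(len(xs)) for j in range(i + 1, len(xs)))

def has_duplicate_sorted(xs):
    """O(n log n): sort, then compare adjacent elements."""
    ys = sorted(xs)
    return any(a == b for a, b in zip(ys, ys[1:]))

for n in (1_000, 2_000, 4_000):
    xs = [random.randrange(10 * n) for _ in range(n)]
    t0 = time.perf_counter(); has_duplicate_quadratic(xs); t1 = time.perf_counter()
    has_duplicate_sorted(xs); t2 = time.perf_counter()
    print(f"n={n}: quadratic {t1 - t0:.3f}s, sort-based {t2 - t1:.3f}s")
```

Doubling the input size roughly quadruples the quadratic version’s running time while barely moving the sort-based one, which is the whole sense in which computer science has always cared about “scaling.”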
Leaving this particular quirk of thinking aside, the heart of Sutton’s argument is the next passage:
“Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other.”
I’m not sure how you leverage computation without leveraging human knowledge. Certainly, if you look at the language people use in machine learning, it’s loaded with all sorts of speculations about human knowledge. The field still loves anthropomorphically loaded words like intelligence, learning, critic, teacher, student, curriculum, etc. The best evidence that hyperscaling AI researchers aren’t thinking seriously about leveraging computation is their love for reinforcement learning. Every theoretical and practical result in reinforcement learning shows that it doesn’t leverage computation. People like reinforcement learning because of the ethos, not because it scales well.
I could write another thousand words about the confusing examples of parlor games and supervised learning that Sutton presents as evidence of what taught him his lesson. I’d be happy to do this in another post if people are interested. However, I want to come back to the points raised by Shital and Dimitris.
The Bitter Lesson has been used to motivate the scale at all costs mentality behind AI companies. It’s undeniable that scale has been a dominant trend and that computational search is powerful. Devoting effort to scale isn’t the worst thing in the world. I am guilty of devoting a lot of time to this in the 2010s.
It’s only been 6 years since Sutton’s essay, and I’m sure we’ll continue to hyperscale until everyone runs out of money. Maybe we’re in the MMT era of computing and we can keep building bigger data centers and pursuing worse algorithms because that’s what the market rewards. Everyone in computer science has bought into the pact of evaporating oceans for better coding assistants. But scaling is all you need until you hit diminishing returns. Overindexing on the bitter lesson is a commitment to nihilism. I’m not sure this is a sustainable path for industry. It’s assuredly not the way to sustain a field.
"Everyone in computer science has bought into the pact of evaporating oceans for better coding assistants."
That's because nobody in computer science likes coding, or teaching programming courses. Evaporating an ocean after teaching a class using Java is a mild reaction.
But some of us still think that better scaling means something other than exponential growth rate of energy bills or Nvidia's valuation.
I don't mean to be aggressive, but I wonder if you have tested the most advanced OpenAI models yourself, on your own problems. While Sutton's argument unfairly attaches his weird assumptions to many researchers, my personal experience is that LLMs have improved greatly since GPT-3.5. I do computational math with applications in various fields. I have tested GPT-5 Thinking and Pro on long proof statements and concepts in stochastic control. They can consistently generalize mathematical deduction patterns from a control problem to the HJB equation, and even account for the mathematical constraints and point out potential weaknesses in their own arguments. There are some minor mistakes and errors, but o1 and o3 could not do this. You can argue that all those patterns occur in the training set and can be simulated with enough computation. But the "emergence" of this capability from scaling those "stupid" algorithms is very strange. Maybe we still don't know how far they can get with scaling.