Savage's shadow looms large. "Mathematically rational behavior" is a small-world notion. We live in a large world, and when LLMs respond to and generate natural language, they operate there too. In our large world, "rational behavior" still vaguely points at something, but the emphasis is on vaguely: the term is usually deployed to ask whether someone is acting "mathematically rationally" in some conjectured projection of the large world into a particular small world, and the choice of projection is often itself contentious.
So of course LLMs aren't "mathematically rational" in the large world, since that isn't even a thing. The issue is (as usual) pretending that a term which makes sense in the small world carries over uncritically to the large one.
One thing I've read that I find so interesting is that it's the Chinese companies pursuing the much more industrial applications of these systems, in robotics and manufacturing, while the hypercapitalists have either become obsessed with this vague notion of AGI and superintelligence, building ever more complex LLMs, or merely pay lip service to such a goal. I can't quite articulate the thought, but for America, you'd think the actual industrial applications would drive the innovation. Instead we get GPT-5, which makes the sort of mistake you highlight -- the kind that seems to definitively demonstrate that LLMs are just stochastic parrots and don't understand the text they produce, any more than the image-producing systems understand the images they output.
Personalization, flattery, and selling anxiety were all incipient before LLMs, but once these became the business model for the industry, the consequences were horrifying.
I wrote something about my (ahem) personal connection to this here: https://thenewcuriosityshop.substack.com/p/two-mirrors-of-the-world
> For example, they are incredible at making code faster, outperforming any autotuning tool I’ve ever tried.
I've never used an auto-tuning tool, so I can't comment on this, but I've not found the code generated by LLMs to be particularly efficient. TBH, I find the claims about the programming tools a little mystifying -- and they've made me question the competence of many in the industry.
They're good at churning out copy/paste code and boilerplate.
I don't think their code is necessarily efficient, but I've had good luck getting them to optimize existing code.
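To make that concrete, the wins I've seen are mostly of the loop-to-vectorization flavor sketched below. This is a hypothetical illustration of the pattern (the function names and the NumPy approach are my own, not anything from the post):

```python
import numpy as np

# Before: the kind of pure-Python nested loop an LLM will readily flag.
def pairwise_sq_dists_slow(x):
    n = len(x)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = x[i] - x[j]
            out[i][j] = d * d
    return out

# After: the vectorized rewrite it typically proposes.
# Broadcasting builds the full n-by-n difference matrix in one step.
def pairwise_sq_dists_fast(x):
    x = np.asarray(x, dtype=float)
    d = x[:, None] - x[None, :]
    return d * d
```

In my experience it's exactly this mechanical kind of rewrite that the models get right most reliably.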
Looking forward to reading the rest in this series.
Just one comment: shouldn't we be differentiating the generative model from post-training processes like RLHF, DPO, RAG, etc.?
Aren't those processes akin to the "parlor tricks" you mentioned in this post?
I'd put that soup of acronyms into the general stack of what I'd call a language machine. Disambiguating the post-training model from the pre-training model is next to impossible with our current API access, and I'm happy to consider what we can learn by considering the entire assembled artifact.
“. . . the very unnatural model of statistically summarizing language and code via maximum likelihood estimation. People don’t do this.”
Are you sure? I'm not saying that in learning to catch a fly ball one's doing calculus, but it seems to me that we are statistically summarizing (not necessarily properly or correctly) and that we do some type of likelihood estimation. In the worst case, our estimation confirms what we want the observed data to be -- our most probable way of being right (even when we're not).
All I was saying there is that humans certainly do not acquire language by maximum likelihood estimation of a probabilistic model with respect to a terascale corpus of tokenized text.
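To spell out what that objective is (standard next-token maximum likelihood; the notation here is mine, not anything from the post):

$$\hat{\theta} \;=\; \arg\max_{\theta} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \log p_{\theta}\!\left(x^{(i)}_{t} \,\middle|\, x^{(i)}_{<t}\right)$$

where $i$ ranges over the tokenized documents in the corpus and $t$ over token positions within each document. Whatever children are doing when they learn to talk, it is not fitting this objective to terabytes of text.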
Sure. I agree. I just thought it was interesting that the way we might acquire language is through some type of trial-and-error likelihood estimation while dealing with the terascale amount of information out there. Anyway, you've got some great sentences in this post.