Savage's shadow looms large. "Mathematically rational behavior" is a small-world notion. We live in a large world. When LLMs respond to and generate natural language, they also operate there. In our large world, "rational behavior" is still vaguely pointing at something, but the emphasis is on vaguely: it's often used to analyze whether someone is acting "mathematically rationally" in some particular conjectured projection of the large world into a particular small world, and often the choice of projection is itself contentious.
So of course LLMs aren't "mathematically rational" in the large world, since that's not even a thing. The issue is (as usual) pretending a term that makes sense in the small world carries over uncritically to the large.
One thing I've read that I find so interesting is that it is the Chinese companies that are pursuing much more industrial applications of these systems, in robotics and manufacturing, while it's the hyper-capitalists who have either become obsessed with this vague notion of AGI and superintelligence, building ever more complex LLMs, or who pay lip service to such a goal. I can't quite articulate the thought, but for America, you'd think the actual industrial application would drive the innovation. Instead we get ChatGPT 5, which makes the sort of mistake you highlight -- which seems to definitively demonstrate that LLMs are just stochastic parrots and don't understand the text they produce, any more than the image-producing systems understand the images they output.
Personalization, flattery, and selling anxiety were incipient before LLMs, but when these became the business model for the industry the consequences were horrifying.
I wrote something about my (ahem) personal connection to this here https://thenewcuriosityshop.substack.com/p/two-mirrors-of-the-world
> For example, they are incredible at making code faster, outperforming any autotuning tool I’ve ever tried.
I've never used an auto-tuning tool so can't comment on this, but I've not found the code generated by LLMs to be particularly efficient. TBH, I find the claims about the programming tools a little mystifying - and it has made me question the competence of many in the industry.
They're good at churning out copy/paste code and boilerplate.
I don't think their code is necessarily efficient, but I've had good luck getting them to optimize existing code.
Looking forward to reading the rest in this series.
Just one comment: Shouldn't we be differentiating between the generative model and the post-training processes like RLHF, DPO, RAG, etc.?
Aren't these processes akin to "parlor tricks" you mentioned in this post?
I'd put that soup of acronyms into the general stack of what I'd call a language machine. Disambiguating the post-training model from the pre-training model is next to impossible with our current API access, and I'm happy to consider what we can learn by considering the entire assembled artifact.
Here's another framing of your question: will a statistical correlation model "trained" (ie, fit) only on data from valid deductive arguments predict or generate only valid deductive arguments? There is a simple case where this is so: the average of a constant is the constant itself, eg, a Monte Carlo method for computing the mean of the "random variable" X=5 will always produce E[X]=5. Interestingly, (i) the mechanical hardware implementing the correlation model is mathematically rational (when functioning properly), (ii) the statistical prediction is mathematically rational (because it is an algorithm), and yet prompting a language correlation model (aka LLM) to "do math" sometimes returns (what we interpret to be) a wrong/invalid/irrational answer. In fact, the language model's response is mathematically rational---in the sense of the statistical prediction's execution on the hardware. However, the specific "training" data that the model combines to produce its response includes instances of invalid deduction (some cases of wrong answers). If we could filter out wrong answers from the data, would the LLM's math be better (ie, more often deductively valid)? Perhaps something is provable here for linear-in-data predictors, which LLMs are not.
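To make the constant-mean case concrete, here is a minimal sketch in Python. It assumes a toy setup that is not in the comment above: the "training data" is a list of answers where 5 is the correct one and 7 is a hypothetical wrong one, and the "predictor" is just the sample mean of Monte Carlo draws from that list. That predictor is linear in the data, so it says nothing direct about LLMs, which are not.

```python
import random


def monte_carlo_mean(sample, n_draws, seed=0):
    """Estimate the mean by drawing n_draws values uniformly from sample."""
    rng = random.Random(seed)
    draws = [rng.choice(sample) for _ in range(n_draws)]
    return sum(draws) / len(draws)


# Hypothetical data: every recorded answer is the correct value 5.
clean = [5] * 100

# Hypothetical data: 5% of recorded answers are the wrong value 7.
contaminated = [5] * 95 + [7] * 5

# Filtering step: drop the wrong answers before estimating.
filtered = [x for x in contaminated if x == 5]

clean_estimate = monte_carlo_mean(clean, 10_000)
contaminated_estimate = monte_carlo_mean(contaminated, 10_000)
filtered_estimate = monte_carlo_mean(filtered, 10_000)

print(clean_estimate, contaminated_estimate, filtered_estimate)
```

With only correct answers in the data, every draw is 5 and the estimate is exactly 5. Once a few wrong answers are mixed in, the estimate drifts away from 5, and filtering them out restores it. Whether anything similar holds for a nonlinear model fit to text is exactly the open question.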
I think this framing offers some light on why language correlation models often code well (or usefully). Presumably, the vast majority of code posted on public repos compiles and executes correctly (given the right environment) and thus represents valid deductive arguments. Coding language is more regular than natural language. Nevertheless, there is sufficient regularity in natural language such that the language correlation model compels many human operators---at least as parlor tricks (cf Shannon and NLP in CH5 of The Irrational Decision). Also, see 'The debate over understanding in AI’s large language models' by Mitchell and Krakauer (https://www.pnas.org/doi/10.1073/pnas.2215907120).
“. . . the very unnatural model of statistically summarizing language and code via maximum likelihood estimation. People don’t do this.”
Are you sure? I’m not saying that in learning to catch a fly ball one’s doing calculus, but it seems to me that we are statistically summarizing (not necessarily properly or correctly) and that we do some type of likelihood estimation. In the worst case, our estimation confirms what we want observed data to be, our most probable way of being right (even when we’re not).
All I was saying there is that humans certainly do not acquire language by maximum likelihood estimation of a probabilistic model with respect to a terascale corpus of tokenized text.
Sure. I agree. I just thought it was interesting that we might acquire language through some type of trial/error likelihood estimation while dealing with the terascale amount of information out there. Anyway, you have got some great sentences in this post.