I thought the Baker piece was fascinating and appreciated this. I'm also noticing the extent to which, even though this goal displacement is obviously something that's happened and is a huge deal, it's far from totalizing --- I still encounter people every day who care deeply about knowing things, advancing knowledge, and sharing that knowledge with others, and who have given in only a little to the bureaucratic pressures.
Also, people adopted those proxy goals for what were, in fact, pretty good reasons at the time! And it is still certainly possible to do good work, and many people do do good work, even while they pursue those proxy goals. But proxy goals outlive their usefulness as a) the low-hanging fruit around those goals gets picked, and b) people get better and better at hitting the proxy goal without actually picking any fruit.
The bull case for LLMs is that they are SO EFFICIENT at foraging around those proxy goals that we collectively have no choice but to move camp towards a newer, more fruitful set of proxy goals.
Then we will pick those clean, and our children's children can all have this same conversation again in 100 years.
"Is science the noble pursuit of fundamental truth by lone geniuses in their laboratories? Or is science a bunch of nerds who mail each other PDFs and compute page rank on the cross-references between them to determine worth?"
It's the noble pursuit of fundamental truth by a bunch of nerds who mail each other PDFs, of course.
Dialectics happening in the comment section.
It is quite an interesting piece; it's hard to disagree that the incentive system is fundamentally problematic. On the other hand, one can take a step back and look at the process as a whole. In the end, the goal of the research effort, at least in scientific domains, is to make progress in science. It is hard not to think that in AI/ML, despite the overwhelming mess, the progress has been truly spectacular. Perhaps we should think of the modern scientific process not as an orderly procession of deep insights but as an evolutionary mess of conflicting and mostly dead-end efforts leading, nevertheless, to progress?
"this system still works for many people" is definitely why academia is going to take a long time to change.
Thanks for highlighting the piece by Baker! Thanks for at least thinking of ways to evolve things too.
As someone working on LLMs for science now, I wouldn't make any of your hyperbolic claims. Rather, I think the experiment-theory-refine process has a lot of drudgery in it, which we can get LLMs to help accelerate, automate, or in some cases just do.
I haven't done proper science in a long time, but I did get a PhD. At least the science I am a little familiar with involves a loop of throwing a bunch of reasonable-sounding alternatives from the literature at a relevant problem, often mixing them with each other a bit, to build an intuition of what works well where.
And when you think about it, this sounds a lot like the basic requirements for reform RL to work: you have a bunch of reasonable strategies you can initialize a model with, applying a reasonable strategy to an in-domain problem has a nonzero success rate, and mixing and matching pieces of reasonable strategies has a single-digit-percentage chance of improving something.
Separately from this, a lot of the problems involved in science are, in my opinion, constraint-satisfaction / search / optimization problems with the flavor of easy to verify, hard to invent, and I think that with some dataset / RL-environment engineering, reform RL is a great tool for getting monotonically improving sequences of solutions to problems of this shape; a toy sketch of what I mean is at the end of this comment.
So no, I don't care whether LLMs will "do science" by some ontological definition, but I think we can use them to productively accelerate the process of solving some scientific problems, in a way that we couldn't really do five years ago.
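To make that concrete, here is a minimal, hypothetical sketch of the propose-and-verify loop I have in mind. Everything in it is made up for illustration: `score` stands in for whatever cheap verifier the problem admits, and `mutate`/`crossover` stand in for mixing and matching pieces of known-reasonable strategies. It's a caricature of the search dynamics, not an actual RL training setup.

```python
import random

def score(solution):
    # Cheap-to-evaluate stand-in for a verifier; higher is better.
    return -sum((x - 0.5) ** 2 for x in solution)

def mutate(solution):
    # Perturb one piece of an existing strategy.
    out = list(solution)
    i = random.randrange(len(out))
    out[i] += random.gauss(0, 0.1)
    return out

def crossover(a, b):
    # Mix and match pieces of two reasonable strategies.
    return [random.choice(pair) for pair in zip(a, b)]

def improve(initial_strategies, iterations=1000):
    # Keep a pool of verified-good strategies and a best-so-far solution.
    pool = [list(s) for s in initial_strategies]
    best = max(pool, key=score)
    for _ in range(iterations):
        a, b = random.sample(pool, 2)
        candidate = mutate(crossover(a, b))
        # Accept only verified improvements, so the best-so-far sequence
        # is monotonically non-decreasing in score.
        if score(candidate) > score(best):
            best = candidate
            pool.append(candidate)
    return best

if __name__ == "__main__":
    seeds = [[random.random() for _ in range(5)] for _ in range(4)]
    print("best seed:", round(score(max(seeds, key=score)), 4))
    print("after search:", round(score(improve(seeds)), 4))
```

The only point is the accept-if-verified-better rule: cheap verification is what makes the improving sequence monotone.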
What you are working on is also a valuable contribution to science, but what Ben and Baker are talking about seems to be more about the macro picture of doing science. At that level, whether a piece of research is good or not doesn't always have clear measures that can serve as the starting point for optimization, and they argue that bureaucratic goal-setting and optimization actually distort the scientific process.
IME, the one sensible way to prevent bureaucratic optimization is to satisfice (draw a minimum threshold for receiving budget) and then randomize (to actively regularize against optimization).
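In toy form, that allocation rule is just a threshold plus a lottery. The review scores, threshold, and slot count below are all made up for illustration:

```python
import random

def allocate(applications, threshold=0.6, slots=10, seed=None):
    """applications: dict mapping applicant -> reviewed quality score in [0, 1]."""
    rng = random.Random(seed)
    # Satisfice: anyone above the minimum bar is eligible.
    eligible = [name for name, quality in applications.items() if quality >= threshold]
    # Randomize: past the bar, rank no longer matters, which removes the
    # incentive to keep optimizing the metric beyond the threshold.
    rng.shuffle(eligible)
    return eligible[:slots]

if __name__ == "__main__":
    apps = {f"lab_{i:02d}": random.random() for i in range(50)}
    print(allocate(apps, threshold=0.6, slots=10, seed=0))
```

The shuffle is the regularizer: past the bar, extra metric-optimization buys you nothing.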
Kevin's point, which is solid, is that there is no way to prevent bureaucratic optimization. Any system of reform creates new means and ends.
Eli's suggestion is how NSERC works in Canada, and I like this system a lot: if you can argue in your grant application that you did a decent job during the last period, you get renewed for another 5 years. The grant is sized so that it is enough to support a small group. Everyone gets about the same amount (in our field at least; I don't know about the hard sciences). This is "satisficing." So yes, you could game the system to reach the threshold, but since the threshold is not too high, the incentive to do so is not high. I am not saying this is a perfect system for everything, but for CS it seems to work pretty well.
Canada is certainly not immune to considering changing its academic optimization targets. I'm curious what you think about this investment. https://www.canada.ca/en/innovation-science-economic-development/news/2025/12/government-of-canada-launches-new-initiative-to-recruit-world-leading-researchers.html
This is meant to grow the base, not to increase competitiveness (within Canada), so on its own it's not a problem IMHO. Having said this, the CIFAR AI Chair program is a better example of something that could ruin our little paradise, because that program is meant to be selective. But even there, during the last couple of years the program has been widened quite considerably, which IMHO is the right move.
I haven't read Kevin's analysis yet, only yours. But as I read yours, my question increasingly became whether "bureaucratic" is a good descriptor. Academia, especially in CS, now takes its cues from industry (as you note). This doesn't mean that other parts of the academy are fabulously free of bureaucracy or of some bad incentives; only that the incentives for a high h-index just aren't operative there, because capitalism is far less interested in what those fields do and so doesn't supply these high-stakes rewards. (I'm a successful academic in my field and I honestly don't know what my h-index is. If I sought more attention and prestige, I'd be better off winning a fellowship at a prestigious center--or maybe starting a substack :)
So unsurprisingly, y'all are playing an optimization game much like the one the novelist Ted Chiang attributed to the consulting firm McKinsey--i.e., a game that capitalism finds useful, or at least legible.
(Note that when he speaks of "AI" in this article, he really means ML, not gen AI or AI for science.) https://www.newyorker.com/science/annals-of-artificial-intelligence/will-ai-become-the-new-mckinsey
Finally, for a super brilliant article on the problems of AI for science, read this article by two anthropologists:
https://www.nature.com/articles/s41586-024-07146-0
If science is not the "noble pursuit of fundamental truth by lone geniuses in their laboratories," then AI should not try to automate it. How much of the current system is worth it?
Goal displacement is an interesting concept in that sometimes displacement is just the thing to reach your goal: the tennis player who focuses less on winning and more on process. It’s tempting to game the metrics, but a scientist who tries mightily to “just do good science” has a good chance (maybe not the best chance given the context) of meeting their inevitably human career goals. And meanwhile, the societal goal of science as a whole will be served.
This may not be THE answer, but it's AN answer and (notably) it's the answer that Weizenbaum chose. Work with humanities professors--by which I don't mean, say, philosophers who work on "alignment" or any other post hoc pseudo-solution.
Check out the editor's inaugural introduction to the recent Duke UP journal CRITICAL AI.
https://read.dukeupress.edu/critical-ai/article/doi/10.1215/2834703X-10734016/382460/Editor-s-Introduction-Humanities-in-the-Loop
I sense some tension between this blog's tone and the tone of "Benchmark studies" as in:
> "This ruthless capitalist view of human organization is depressing but clarifying. Once the measure inevitably becomes the optimization target, it not only takes on a new meaning but also changes the nature of behavior. Citations move from pointers to evidence to value judgements about credit assignment. P-values move from estimates of statistical noise to bureaucratic games of approval. New metrics and rules don’t solve this; they just add to a Kafkaesque absurdity."
vs.
>"Despite statistical arguments declaring it fundamentally flawed, the culture of competitive testing on benchmarks has driven and still drives the engine of what the field defines as progress."
Am I drawing a false equivalence between ML and 'scientometry'? Maybe you can help clarify for me.
Yes, I wrote about this tension in a longer piece about frictionless reproducibility:
https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1
Competitive testing is the engine of the field, but its fuel has high human costs.
Actually, this reminds me of a lecture I saw a long while ago at the Simons Institute about the DARPA "common task method"; probably you know of it: https://www.simonsfoundation.org/event/reproducible-research-and-the-common-task-method/
OK, great! I read it. I still sense an unresolved tension, like in your last paragraph:
> What exactly is it about benchmarks that measure progress? Why is it that certain benchmarks carry more weight than others?
To me this current post suggests that the benchmarks and the ML field's "progress" are just whatever the "ruthless capitalist" engine incentivizes, which seems...not great for a field, unless one pre-commits to some truly naive positivism. Am I on the right track?
One thing I’ve wondered is if there’s a way to force all citations in peer-reviewed papers to actually point back to the true origin of the idea that the author is referring to. It’s easy for a lazy author to cite the first paper they find expressing something close to what they want to express. If they were forced to not be lazy and go back and find the actually correct citation, it may reduce citation counts of incremental or slop papers, and consequently the incentive to publish those.
Before citation indices, we just stolidly worked through the references at the end of a paper and hoped to find useful insight there. The problem is that scientists thought lots of references gave a paper more "gravitas" and showed more "scholarship." Baker seems to do the same, as his text does not tag these references, leaving it as a puzzle to check them. Maybe just treat the references whose authors are named in the text as the important ones?
LLMs and AI generally have a role to play, but they will not provide the insights needed to generate new, interesting, testable hypotheses that move science forward. They can provide some of the "filling in," but they won't solve big questions if these require new thinking to pursue. LLMs as the base will not produce breakthrough science. It isn't even clear that AlphaFold will do that, despite the triumph of protein folding it accomplished. Will it lead to new medications or advance the science of how medications work? Or will it prove a cul-de-sac, as the 3D chemical-design software of the 1990s did? In the past, computation solved certain math problems by brute force, e.g., the four-color map theorem. AIs may be similar, but they do not pose new problems, have the insight to show how one mathematical domain maps to another, or provide new ideas that have value in extending mathematics.
It would be an interesting test to train AIs on curated knowledge and data up to some date and see whether the actual new science or math (or any new idea) that emerged afterward would be discovered by the AI from that material. My guess is that it would not. If that hypothesis is correct, then the claims of AIs solving many of our problems today and developing new science and engineering are just hype, and nothing more.
There is also evidence that using AI makes human minds less powerful. If so, we could be on the road to Idiocracy. My reference sci-fi short story is C. M. Kornbluth's "The Little Black Bag" (1950), in which human genetics leads to lowered IQs that advanced technology mitigates. In our case, it is advanced "thinking" technology that damages our minds and makes us less intelligent and capable. This may be similar to the claims that settled agricultural life domesticated us and reduced our brain-to-body-weight ratio compared to our hunter-gatherer ancestors. Yes, our civilizational lifestyle created new technologies that allowed us to do more, but has our raw IQ declined (as measured by the context needed)?
Long story short, is AI just a new level of software akin to a spreadsheet, or does it really represent a new technology that boosts our thinking beyond just assimilating, summarizing, and report-writing through brute-force computation? And is the issue of performance measurement becoming a goal just a red herring about the sociology of the practice of science, rather than about the goal of science: to produce more useful knowledge that extends our power over nature?
"countervailing power" -- does that explain Jay Bhattacharya and Vinay Prasad as the new heads of NHS research?
Your point that the aphorism is about the gap between the metric and a latent state helped me resolve an unease I had felt for a while in trying to reconcile it with the fact that some metrics do seem to work well. For example, the effectiveness of a pharmacological intervention measured through an RCT is a good metric because we have formal arguments showing why it ought to be closely related to changes in the states we care about.