Measures as ends
Some initial thoughts on Kevin Baker's provocative "Context Windows."
Kevin Baker has written an essential critique of the absurd current push to use AI to automate science by pulling back the curtain on what modern science actually is. You really should go read the whole thing. I haven’t been able to get it out of my head, as it succinctly reframes many of the frustrated half-thoughts on this blog. Since I can’t shake his essay, I’m going to blog about it with more half-thoughts. I guess this is what I do.
The proponents of AI-automated science argue that LLMs will superintelligently come up with amazing new hypotheses, propose new experiments, execute them flawlessly, and write clear analyses of their breakthroughs. This betrays an idealized, romantic view of science. Is science the noble pursuit of fundamental truth by lone geniuses in their laboratories? Or is science a bunch of nerds who mail each other PDFs and compute PageRank on the cross-references between them to determine worth?
This second perspective offers a dim, science-studies view of the field, but it’s impossible to look at the current state of academic science and not subscribe to at least part of it. We all know people who maximize their h-indices and obsess about acceptance rates on their CVs. We know of citation and collusion rings. We know that universities and companies use Google Scholar metrics in their hiring decisions.
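For anyone who hasn’t had to compute one for a dean, the statistic being maximized is almost embarrassingly simple. Here’s a quick sketch with made-up citation counts:

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that at least h papers have at least h citations."""
    counts = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Made-up citation counts for one researcher's papers.
print(h_index([120, 45, 30, 12, 9, 4, 2, 1]))  # -> 5
```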
Kevin’s telling of how we got here is fascinating, though not surprising. As the academic literature ballooned exponentially after World War II, people like Eugene Garfield constructed high-level views to make sense of it. These new metrics were intended to give scholars a better sense of the literature. But the scientific community decided they could also distinguish good scientists from bad ones, and then everyone started optimizing them.
Kevin artfully argues that this goal displacement, the process by which measures become ends, is an inevitable consequence of bureaucratic systems. He draws a sharp distinction between goal displacement and Goodhart’s law. As Dan Davies likes to remind us, Goodhart’s law is misunderstood: “When a technical point begins to be used as an aphorism, it ceases to be a good technical point.” Goodhart was after the idea that measurements are attempts at extracting information about latent states that aren’t directly measurable. Once the measurement is repurposed as an optimization target, it stops providing information about those latent states. Sociologist Robert Merton, who coined the term goal displacement, argued that all measurements become targets in bureaucratic systems. This is far more concerning. And yet it is so obviously true.
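To make that information-loss claim concrete, here’s a toy simulation (mine, not Kevin’s or Goodhart’s): quality is the latent state, the metric starts out as quality plus noise, and gaming effort, independent of quality, gets mixed in once the metric becomes a target. As the gaming weight grows, the metric tells you less and less about the thing you actually cared about.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

quality = rng.normal(size=n)          # latent state we actually care about
noise   = rng.normal(size=n) * 0.5    # honest measurement error
gaming  = rng.normal(size=n)          # gaming skill, independent of quality

# Before goal displacement, the metric is a noisy readout of quality.
# After, scores also reflect effort spent gaming the metric itself.
for k in [0.0, 1.0, 3.0, 10.0]:
    metric = quality + noise + k * gaming
    corr = np.corrcoef(quality, metric)[0, 1]
    print(f"gaming weight {k:>4}: corr(metric, quality) = {corr:.2f}")
```

Nothing about the formula for the metric changed; what changed is that people started producing the gaming term.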
Some smart person comes up with a clever scheme to quantify a concept of interest. Because the concept has been declared interesting, everyone then tries to maximize their score on the measurement. Any measurement that helps bean counters see what’s happening on the ground becomes optimized. In massive bureaucratic systems, there are no good ways to measure, only ways to optimize.
This ruthless capitalist view of human organization is depressing but clarifying.
Once the measure inevitably becomes the optimization target, it not only takes on a new meaning but also changes the nature of behavior. Citations move from pointers to evidence to value judgements about credit assignment. P-values move from rough checks against statistical noise to bureaucratic games of approval. New metrics and rules don’t solve this; they just add to a Kafkaesque absurdity.
The “reform” structures proposed to battle overpublication in the machine learning and AI communities clearly illustrate this cycle of absurdity. I’ve written before about NeurIPS forcing people to fill out mandatory paper checklists. CVPR, the leviathan computer vision conference, now requires that every person whose name appears on a submitted paper serve as a reviewer. IJCAI has decided to charge authors fees just to submit to the conference in the first place. And ICML is greenlighting mass LLM reviewing to encourage an even larger cascade of h-indices. Kevin writes, “systems can persist in dysfunction indefinitely, and absurdity is not self-correcting.” The science reformers in AI, faced with a growth in word count never before seen in science, are indeed not correcting the absurdity. What we see instead is a positive feedback loop of absurdity, where the process becomes more unwieldy and ludicrous every year while the information becomes increasingly illegible.
However, this system still works for many people. You can write a bunch of meaningless papers at a maximal rate, maximize the metrics, and find nothing but success. In our current irrational exuberance, those in the upper echelons of h-index become billionaires for that fact alone.
Which brings us back to automated science. LLMs already replicate and accelerate the irrational bureaucratic structures of science. They help us write more papers faster with better English. They let us flood our preprint servers with spam that juices up people’s citation counts. They are ideal for generating incremental results, optimizing within the intellectual frames in which we are already comfortable.
Is there a way out? Though it’s a common question on this blog, the answer is still not clear to me. Certainly, identifying the absurdity is part of the process, even if the inertia of these exponentially metastasizing systems is too great to pull them back from the edge. Everyone knows we should write fewer papers, and yet it feels like career suicide to do so. Sometimes unstable systems can’t be stabilized, and all we can do is let the explosion burn itself out.


IME, the one sensible way to prevent bureaucratic optimization is to satisfice (set a minimum threshold for receiving budget) and then randomize (to actively regularize against optimization).
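For concreteness, here’s a minimal sketch of that two-step rule, with a made-up scoring scale and budget (none of this comes from Kevin’s essay):

```python
import random

def satisfice_then_randomize(proposals, threshold, budget, seed=None):
    """Fund proposals by lottery among those clearing a quality bar.

    proposals: dict mapping proposal id -> review score.
    threshold: minimum score needed to be eligible (the satisficing step).
    budget:    number of proposals that can be funded.
    """
    eligible = [pid for pid, score in proposals.items() if score >= threshold]
    rng = random.Random(seed)
    rng.shuffle(eligible)   # the randomizing step: no rank-ordering above the bar
    return eligible[:budget]

# Hypothetical scores on a 0-10 scale; only 3 grants can be funded.
scores = {"A": 9.1, "B": 7.4, "C": 6.8, "D": 8.9, "E": 5.2, "F": 7.0}
print(satisfice_then_randomize(scores, threshold=7.0, budget=3, seed=42))
```

Since nothing above the bar is rank-ordered, there is no marginal payoff for nudging a 9.1 up to a 9.3, which is precisely the optimization pressure the lottery is meant to dissipate.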