Bah, I made a mistake. In Tuesday’s post, I tried to take some poetic license and propose a broader definition of “peer review” to encompass the complexity of the system of academic knowledge curation. Unfortunately, almost all of the feedback I received latched onto the more common usage, the flawed system we use to vet which manuscripts get published.1
Anyway, that’s my bad. Loaded terms can’t get overloaded definitions. I should have instead called academia a “Bureaucracy of Expertise.” Peer review could then maintain its original common definition as the subsystem of complex, obfuscated, administrative rules, procedures, and paperwork that go into accepting and rejecting papers.
Fine, you all win. And if we’re going with that definition, I won't defend peer review. It’s not defensible. However, I will use it as an interesting case study to exemplify how change happens (very, very slowly) in The Academic Bureau of Expertise.
If you don’t know how pre-publication peer review is supposed to work, it’s like this. An academic team writes a paper and then sends it to some venue to be published. There, an editor skims the paper and finds three people to write reviews of the submission. Those three people write something about the paper and send it back to the editor. Based on what they write, the editor decides whether or not to publish it. The editor then sends their decision and the reviews, with the reviewer names redacted to protect the innocent, back to the academic team.
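If it helps to see the moving parts laid out, here is a minimal sketch of that workflow in code. Everything here is an illustrative assumption: the names (`Submission`, `Review`, `run_review_cycle`), the three-reviewer quota, and the coin-flip recommendations stand in for whatever a real venue actually does.

```python
from dataclasses import dataclass, field
import random

@dataclass
class Review:
    reviewer: str            # redacted before anything goes back to the authors
    text: str
    recommends_accept: bool

@dataclass
class Submission:
    title: str
    authors: list[str]
    reviews: list[Review] = field(default_factory=list)

def run_review_cycle(paper: Submission, reviewer_pool: list[str]) -> tuple[bool, list[str]]:
    """One pass of pre-publication review: the editor finds three people,
    they each write something, and the editor decides based on what they wrote."""
    for name in random.sample(reviewer_pool, 3):
        # Placeholder report: each reviewer "writes something about the paper."
        paper.reviews.append(
            Review(name, f"Some thoughts on '{paper.title}'", random.random() < 0.5)
        )
    accept = sum(r.recommends_accept for r in paper.reviews) >= 2
    anonymized_reviews = [r.text for r in paper.reviews]   # names redacted to protect the innocent
    return accept, anonymized_reviews
```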
This is the system we have, and no one is happy with it. I fully reject that we can repurpose Churchill and say, “Our system of pre-publication manuscript review is terrible, but it’s better than everything else.” Others have eloquently written elsewhere that this is a failed post-war experiment, and we’d be best off abandoning it as part of any academic reform. I highly recommend Adam Mastroianni’s viral screed for an accessible introduction. Everyone knows what it doesn’t do: it doesn’t elevate great work, it doesn’t catch errors, it doesn’t catch fraud, it is susceptible to collusion rings. No one has a good argument for what it does well, and there are a trillion other ways to get feedback on your research. Those who say there is no better alternative don’t provide evidence when pressed.
Besides, how would you prove it's the best system in the first place? Given the current provisions in the Acceptable Procedures Manual of The Academic Bureau of Expertise, you would run an RCT, of course! Evidence-based academia. Someone would set up a contrived randomized controlled experiment and show that some particular group fares better with one particular setup of the peer review system when compared to another. People have already conducted many such studies, and they don’t tell us anything. I saw a new RCT yesterday claiming that academics can’t even evaluate reviews of their own paper. Part of the problem with academia is that we have blessed this idea of RCTs as a way of sensemaking in social science, and it simply doesn’t work.
Importantly, Paul Beame correctly notes that these RCTs have virtually no external validity. Generalizing from machine learning conferences to everything else is a fool’s errand. I couldn’t agree more. Last month, our AI Winter conference received thirty thousand submissions! Our Summer and Spring conferences receive similar numbers. Reviewers are supposed to evaluate these submissions to determine whether they deserve to be considered “top-tier” publications. But they only get six weeks to review six or more papers. Most communities would be horrified by our system, and rightfully so. With the way the marketplace of ideas now functions in machine learning, literally in the bubbly capitalist marketplace, it’s unclear why we keep this sideshow of review cycles that consume a third of the lives of graduate students (and the many undergrads who are reviewing too).
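To put the load in perspective, here is the back-of-the-envelope arithmetic, using the numbers above plus the three-reviews-per-paper convention from the workflow description. The exact figures vary by venue and year, so treat this as a rough sketch rather than official statistics.

```python
# Rough reviewer-load arithmetic for one conference cycle.
# Inputs are the figures quoted in this post; real numbers vary by venue and year.
submissions = 30_000
reviews_per_paper = 3          # the three reviewers from the workflow above
papers_per_reviewer = 6        # "six weeks to review six or more papers"
review_window_weeks = 6

total_reviews = submissions * reviews_per_paper             # 90,000 review reports
reviewers_needed = total_reviews // papers_per_reviewer     # 15,000 reviewers per cycle
pace = papers_per_reviewer / review_window_weeks            # one review per week, sustained

print(f"{total_reviews:,} reviews written by {reviewers_needed:,} reviewers, "
      f"at {pace:.0f} review per week each")
```

And that is one cycle; the Summer and Spring conferences run the same exercise again.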
A more organic fix that wouldn’t require explicit rule changes would be to devalue the currency. We could collectively just agree that NeurIPS and ICML papers are valueless. That they no longer count towards any of the advancement steps. In many ways, just as Adam Smith said it should, this is happening. People are grumbling about how the undergrad with five machine learning publications didn’t turn into a good graduate student. Faculty recruiting committees, baffled by candidates with 50 papers on their CVs, often move on to those with clearer, more focused records, frequently outside the field of artificial intelligence entirely. The self-correction happens slowly. Maybe there will still be a hundred thousand machine learning papers submitted every year, but AI paper count is slowly beginning to be treated as equivalent to the number of tweets a candidate has posted.
A general dissatisfaction with peer review is why social media and preprint servers are winning. As my mad-tweeting friend Dimitris Papailiopoulos frequently points out, the ratio of arXiv papers to peer-reviewed papers of any machine learning researcher is very close to one. We do the peer-reviewed part because we “have to,” not because anyone thinks it matters for anything related to knowledge dissemination. It has become divorced from the aforementioned marketplace of ideas and now just serves as a bean on a CV to be counted.
And yet, full openness on social media is not a panacea. We have to be careful as we move away from pre-publication manuscript peer review. The frenzied, decentralized gaming of these systems to maximize visibility in the marketplace of ideas doesn’t always take us to good outcomes.
For example, there are downsides to every paper needing to be accompanied by a Twitter thread. Kevin Munger brilliantly lays out the argument against academic Twitter.
Scientists were tricked by the logic of the platform into having a single, public-facing account where they did three very different kinds of things:
Had intense debates about the status of scientific research, often criticizing scientific practice.
Shared their own scientific findings with the public, towards the goal of “impact” so prized by today’s hiring committees and grant awarding institutions.
Spouted off about their private political opinions.
Each of these things is a perfectly reasonable thing to be doing — just not all of them, at once, from the same account.
I can’t recommend Kevin’s post enough. You can be all for open science and believe academic Twitter was a mistake. Academics making sausage and posting their whole selves to Twitter was, for the most part, a very bad look.
More importantly, Kevin articulates why all of these changes to academic systems have to be made organically. Making explicit rules and then not providing additional resources ends up strangling academic work. More mandatory paperwork—whether it be forcing people to write ethics statements, funding statements, or preregistration plans—with no increase (in fact, a decrease!) in resources isn’t going to lead to more creativity and breakthroughs. And worse, it provides openings for opportunists to exert authoritarian control.
Notable exceptions were the two awesome Robs, Nelson and Nowak.
For the record, I liked your previous post. Broadening the scope of peer review to encompass all of the ways in which peers review each other was thought-provoking.
"it doesn’t catch errors"
The description in Adam Mastroianni’s blog and the papers cited within may be misleading and may understate peer review’s performance. The experiments in those papers (and others as well) insert *multiple* major errors into each paper. The papers and the blog then report the *fraction of errors* caught across all reviewers.
However, it is conceivable that when a reviewer finds a paper badly flawed (e.g., the paper says it is an RCT when in reality it is not), the reviewer may simply report it as a bad paper and stop reading further (to save their own time), thereby missing the subsequent errors. The fraction-of-errors-found metric would thus be low.
An alternative metric is to check what fraction of reviews detected at least one error. I was able to get the dataset from the Schroter et al. 2008 paper thanks to the very helpful Sara Schroter. It turns out that 90.94% of the reviews detected at least one of the major errors. That isn’t too bad.
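A toy simulation shows how far apart these two metrics can be under the stop-after-the-first-flaw hypothesis. The parameters below (eight inserted errors, a 40% chance of spotting any given one) are made up purely for illustration and are not taken from Schroter et al.

```python
import random

def simulate(num_reviews=10_000, errors_per_paper=8, p_spot=0.4, seed=0):
    """Toy model: a reviewer scans the inserted errors in order, notices each
    with probability p_spot, and stops reading once they have flagged one
    (the 'this paper is bad, why keep going' hypothesis)."""
    rng = random.Random(seed)
    total_errors = total_caught = reviews_with_a_catch = 0
    for _ in range(num_reviews):
        caught = 0
        for _ in range(errors_per_paper):
            if rng.random() < p_spot:
                caught = 1
                break                      # reviewer stops after the first flagged error
        total_errors += errors_per_paper
        total_caught += caught
        reviews_with_a_catch += caught
    return total_caught / total_errors, reviews_with_a_catch / num_reviews

frac_of_all_errors, frac_reviews_with_catch = simulate()
print(f"fraction of all inserted errors caught: {frac_of_all_errors:.0%}")      # roughly 12%
print(f"reviews catching at least one error:    {frac_reviews_with_catch:.0%}")  # roughly 98%
```

The same simulated reviewers look hopeless on the first metric and quite competent on the second, which is exactly the distinction being drawn above.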
PS: I too am not in favor of the ML/AI conference review approach (https://researchonresearch.blog/2024/06/21/building-walls-in-academia-and-making-researchers-pay-for-it/), but my "positive" comments above are meant to add some clarity to the interpretation of the peer-review literature.