Discussion about this post

Tom Dietterich:

For the record, I liked your previous post. Broadening the scope of peer review to encompass all of the ways in which peers review each other was thought-provoking.

Nihar B. Shah:

"it doesn’t catch errors"

The description in Adam Mastroianni's blog and the papers cited within may be misleading and may underestimate reviewers' performance. The experiments in those papers (and others as well) insert *multiple* major errors into each paper. The papers and the blog then report the *fraction of errors* caught across all reviewers.

However, it is conceivable that when a reviewer reads a paper they find badly flawed (e.g., the paper says it is an RCT but in reality it is not), the reviewer may simply report it as a bad paper and stop reading further (to save their own time), thereby missing the subsequent errors. The fraction-of-errors-found metric would thus be low.

An alternative metric is to check what fraction of reviews detected at least one error. I was able to get the dataset of the Schroter et al. 2008 paper from the very helpful Sara Schroter. It turns out that 90.94% of the reviews detect at least one of the major errors. That isn't too bad.
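A minimal sketch of the distinction between the two metrics (the numbers below are hypothetical, not the actual Schroter et al. data), assuming a 0/1 matrix recording which review caught which inserted error:

```python
import numpy as np

# Hypothetical detections: rows = reviews, columns = inserted major errors.
# Here each reviewer catches only one error (e.g., stopping after the first
# serious flaw), so per-error recall looks poor even though every review
# flags at least one major problem.
detections = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
])

# Metric reported in the cited papers: fraction of all inserted errors caught.
fraction_of_errors_caught = detections.mean()

# Alternative metric: fraction of reviews that detect at least one error.
fraction_with_at_least_one = (detections.sum(axis=1) > 0).mean()

print(f"Fraction of errors caught:            {fraction_of_errors_caught:.0%}")   # 25%
print(f"Reviews detecting at least one error: {fraction_with_at_least_one:.0%}")  # 100%
```

The same detection data can thus look like a failure under the first metric and a success under the second, which is the interpretive point being made about the literature.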

PS: I too am not in favor of the ML/AI conference review approach (https://researchonresearch.blog/2024/06/21/building-walls-in-academia-and-making-researchers-pay-for-it/), but my "positive" comments above are meant to add some clarity to the interpretation of the peer-review literature.
