22 Comments
Michael:

These days, as a PhD student, publishing your first NeurIPS/ICLR/ICML paper is like getting your SAG card or something: it makes you eligible for big tech internships. That's probably worth a couple hundred thousand dollars in expectation (a low estimate).

Those incentives are just too strong; it's inevitable that things will become strange and distorted. Adding more and more bureaucratic process won't actually help, except perhaps by increasing desk rejections and reducing submissions on the margin.

I don’t really see a way to stop it. It’s public record who publishes at NeurIPS and there’s no way to stop third parties from using that information to make hiring decisions. But the end result is that we’ve all been conscripted to be first-pass recruiters for Google and Meta.

Ben Recht:

There's no way to stop it, but we could all strive to be more honest about the situation and say it more loudly!

Michael:

Another thing that comes to mind is the paper "Academic journals, incentives, and the quality of peer review: a model" by Zollman et al.

They have some toy economic models for how journals and authors might behave when the "prestige" of a journal is measured by acceptance rates.

"...This results in a strange process whereby journals make peer review worse in an attempt to induce bad papers to submit, but maintain sufficiently good peer review to ensure that a large proportion of those bad papers will probably be rejected."

The "stylized facts" they invent are unfortunately pretty compelling as a description of CS conference incentives. Prestige = acceptance rate is pretty much true, high-variance peer review does incentivize more submissions which can then be rejected, and many conferences (i.e. AAAI and IJCAI come to mind) seem like they explicitly want to lower their acceptance rates.

It's not part of their model, but complex and inconsistently enforced checklist rules fit nicely into the same story.

Ben Recht:

One of the more perplexing aspects of the growth of NeurIPS is that they kept the acceptance rate constant. It was a deliberate decision, and one I never understood.

Misha Belkin:

It is not just statistics. Why is it necessary to have a reviewer checklist with 9 (!) different items for ethical concerns?

Ben Recht:

Yes, totally. I zeroed in on statistics because that particular checklist item further illustrates how statistics is used as part of arbitrary rulemaking. But I would strongly argue in favor of removing the checklist altogether.

Misha Belkin:

I would bet that not a single submission was flagged under some of those categories. Certainly any problematic issues can be reported to ACs/PCs directly.

Seth:

I'm going to steelman the checklist. If LLM outputs are variable by nature, isn't it all the more important to... measure and describe that variability? My impression is that in CS, many people are not used to thinking about variability and uncertainty quantification. In that case, a checklist reminding people about uncertainty quantification seems reasonable.

Of course, an appropriate checklist item in this case is, "hey, did you remember to think about outcome variability and/or uncertainty quantification?"
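To make that concrete, here's a minimal sketch of the kind of thing I have in mind, with a hypothetical run_eval standing in for one full benchmark run with sampling turned on: rerun it a handful of times and report the spread, not just a single number.

```python
import numpy as np

# Hypothetical stand-in for one full benchmark run with sampling enabled
# (temperature > 0); returns an overall accuracy in [0, 1].
def run_eval(seed: int) -> float:
    rng = np.random.default_rng(seed)
    return float(np.clip(0.72 + 0.02 * rng.standard_normal(), 0.0, 1.0))

# Rerun the same evaluation several times and describe the variability.
scores = np.array([run_eval(seed) for seed in range(10)])
print(f"mean accuracy:     {scores.mean():.3f}")
print(f"std across reruns: {scores.std(ddof=1):.3f}")
print(f"min / max:         {scores.min():.3f} / {scores.max():.3f}")
```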

Ben Recht:

Your checklist item is fine (though I would greatly prefer it with the word "quantification" deleted). But I just can't imagine a person doing good work on LLMs who wouldn't think about that! I don't see why it has to be written down, and I don't see why someone needs to check that they agree before submitting a paper.

Seth:

Yes, Andrew Gelman talks about this a lot in social science; what you actually want is for people to do good science, and no checklist has yet been invented that accomplishes this.

The best a checklist can do is flag certain common manifestations of bad science. That may be worthwhile, but perhaps mostly as a peculiar ritual for establishing social norms and building consensus. The direct effect on the quality of a given manuscript is likely nil, maybe even slightly negative.

Ben Recht:

Yes, agree that it's negative. That said, I think the question of "what makes good science?" has no easy answers.

https://www.argmin.net/p/the-good-the-bad-and-the-science

Eli:

I see NeurIPS is well on its way to becoming a systems neuroscience meeting (derogatory). Statistical bureaucracy is how you perform the appearance of rigor when your field mostly consists of stamp-collecting and doesn't have a coherent organizing theory across labs.

Ben Recht:

The thing that's crazy is that machine learning does have a coherent organizing theory: more data, more compute, throw spaghetti at the wall and see what sticks. Donoho's Data Science at the Singularity cleans up my glib version and makes the whole enterprise sound more principled.

Eli:

And systems neuroscience could have a coherent organizing theory, namely brain evolution, if experimenters gave a damn about lining up the areas we record from with ecologically valid tasks that test what those areas evolved to do. Stamp collecting and statistical bureaucracy are choices.

Johan Ugander:

The checklist bureaucracy has precedent in the journal world. Nature journals, for example, have had checklists since ca. 2013; see [1]. I actually find the Nature checklist on data presentation [2] to be quite reasonable. It is only four bullets:

- Individual data points are shown when possible, and always for n ≤ 10

- The format shows data distribution clearly (e.g. dot plots, box-and-whisker plots)

- Box-plot elements are defined (e.g. center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers)

- Clearly defined error bars are present and what they represent (SD, SE, CI) is noted

Basically, you can use whatever error bars you want, but it's just asking you to define your visual elements. Feels like a reasonable ask.
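As a toy illustration of those four bullets (my own invented data, not Nature's example), something like this would satisfy them: show the individual points when n ≤ 10, and state in the caption exactly what the error bars are.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
groups = {"control": rng.normal(1.0, 0.3, 8), "treated": rng.normal(1.4, 0.3, 8)}

fig, ax = plt.subplots()
for i, (name, vals) in enumerate(groups.items()):
    # n <= 10, so plot every individual data point (jittered for visibility).
    jitter = rng.uniform(-0.06, 0.06, vals.size)
    ax.scatter(np.full(vals.size, i) + jitter, vals, color="gray", zorder=3)
    # Error bars: mean +/- 1 SD, stated explicitly in the caption below.
    ax.errorbar(i, vals.mean(), yerr=vals.std(ddof=1), fmt="o",
                color="black", capsize=4, zorder=4)

ax.set_xticks(range(len(groups)))
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("measurement (arbitrary units)")
ax.set_title("Points: individual samples (n = 8). Error bars: mean ± 1 SD.")
plt.show()
```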

Looking this up just now, I'm seeing there's also a "Reporting Summary" [3] that appears to be required as well [4]. It does look a little heavy-handed at first glance, but I don't think I'm opposed to it. More to your point, they've also gotten quite strict on the data and code requirements, or so I've heard.

[1] "Checklists work to improve science" (2018)

https://www.nature.com/articles/d41586-018-04590-7

[2] https://www.nature.com/documents/nr-editorial-policy-checklist-flat.pdf

[3] https://www.nature.com/documents/nr-reporting-summary-flat.pdf

[4] https://www.nature.com/nature-portfolio/editorial-policies/reporting-standards

Ben Recht:

Those data viz rules are just a "house style." They're like the umlauts in the New Yorker.

The reporting summary is a whole other matter. And that editorial [1] is hot garbage. The Nature/Science journal empires embody much of what's wrong with academic science.

Johan Ugander:

I don't know... having read many papers whose figures have error bars but give no way to tell what those bars represent (1 SD? 1.96 SD? A 5/95 interval computed some other way?), I actually find it useful that they require authors to state what error bars they're using! But the reporting summary seems overblown.

Which is to say, I think you're 100% right about your bureaucracy-theory lens on how these checklists are metastasizing in length and complexity. I just wanted to chime in with some perspective from outside ML's recent adoption of them!

Ben Recht:

I would argue in favor of maximum flexibility in visualization, but put the burden on the authors to explain what they are showing. The tricky part is that fields develop visualization conventions that quickly feel like second nature, and then authors stop explaining their plots because such explanations feel redundant.

Alex Tolley:

That last point is worrying: unless these papers are just transient, future investigators wishing to replicate a study or do a retrospective will have little idea what those error bars mean. I am not even sure why an experiment would use SE rather than 2 SD (and preferably a p-value). And as you say, what is an "experiment" without a control? It is just an observation with little context.

Tom Dietterich:

I agree that the checklist and the current UQ item are out of hand. But there are a lot of bad papers claiming small improvements submitted to ML conferences and journals. If the authors computed even the simplest error estimates, they would see that those small improvements are not real. Hopefully, they would then not submit their paper. Maybe the call for papers should just say "We will reject any paper claiming small improvements" and leave it at that?
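To be concrete about "the simplest error estimates" (with made-up numbers, not from any particular paper): score both methods on the same seeds and compare the mean improvement to its standard error.

```python
import numpy as np

# Hypothetical accuracies of a baseline and a "new" method on the same 5 seeds.
baseline = np.array([0.712, 0.698, 0.705, 0.721, 0.709])
proposed = np.array([0.715, 0.704, 0.701, 0.724, 0.712])

diff = proposed - baseline
mean_diff = diff.mean()
se_diff = diff.std(ddof=1) / np.sqrt(len(diff))

print(f"mean improvement: {mean_diff:+.4f}")
print(f"standard error:   {se_diff:.4f}")
# If the improvement is within ~2 standard errors of zero, it is
# indistinguishable from run-to-run noise.
print("within noise" if abs(mean_diff) < 2 * se_diff else "plausibly real")
```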

Apratim Dey:

I value your suggestion. However, my view is that ideas need to be presented, and even if they don't achieve an improvement, they should still be publishable. It doesn't help that so many people keep reinventing the wheel. A not-so-great idea for one problem might generate a great idea for another, totally unrelated problem. I don't think publication should hinge on "right now, for our subject, this is what is important"; it should focus on the worthiness of ideas. Of course, reviewers should also be well equipped to critique ideas they feel are weak - that's their job! Unfortunately, we are in a situation where authors overclaim and write WAY too many papers, and reviewers simply don't have the means to verify a claim without running the experiments themselves. The culture needs to shift, although I have no idea what it should shift to.

Nico Formanek:

Apart from your critique of NeurIPS, I think it is key to recognize that error bars involve an additional inferential step: you have to model the errors/uncertainty. This has been implicitly recognized by communities with a longer tradition of expressing measurement uncertainty (e.g., physics), and I would not hold this additional inference against error bars per se. They can be informative.
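To make that extra step concrete, here is a small sketch with invented numbers: even for a single eval run, the interval you report depends on which error model you choose, i.e., on what you decide to treat as random.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical per-question correctness from a single eval run (n = 200).
correct = rng.random(200) < 0.7
acc = correct.mean()

# Error model 1: normal approximation to the binomial over questions.
se = np.sqrt(acc * (1 - acc) / correct.size)
normal_ci = (acc - 1.96 * se, acc + 1.96 * se)

# Error model 2: bootstrap over questions.
boot = np.array([rng.choice(correct, correct.size, replace=True).mean()
                 for _ in range(2000)])
boot_ci = np.percentile(boot, [2.5, 97.5])

print(f"accuracy: {acc:.3f}")
print(f"normal-approximation 95% CI: ({normal_ci[0]:.3f}, {normal_ci[1]:.3f})")
print(f"bootstrap 95% CI:            ({boot_ci[0]:.3f}, {boot_ci[1]:.3f})")
# Both models treat only the question sample as random; neither captures
# variability across reruns with different sampling or prompts, which would
# require yet another modeling choice.
```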

I do think, though, that at some point your uncertainty modeling can collapse because it is itself too uncertain. Do you think this is the case in ML? Is it just too hard to get a grip on uncertainty there?
