these days as a PhD student, publishing your first NeurIPS/ICLR/ICML paper is like getting your SAG card or something. it makes you eligible for big tech internships. that’s probably worth a couple hundred thousand $ in expectation (low estimate).
those incentives are just too strong, it’s inevitable that things will become strange and distorted. adding more and more bureaucratic process won’t actually help, except perhaps by increasing desk rejections and reducing submissions on the margin.
I don’t really see a way to stop it. It’s public record who publishes at NeurIPS and there’s no way to stop third parties from using that information to make hiring decisions. But the end result is that we’ve all been conscripted to be first-pass recruiters for Google and Meta.
There's no way to stop it, but we could all strive to be more honest about the situation and say it more loudly!
another thing that comes to mind is the paper "Academic journals, incentives, and the quality of peer review: a model" by Zollman et al.
They have some toy economic models for how journals and authors might behave when the "prestige" of a journal is measured by acceptance rates.
"...This results in a strange process whereby journals make peer review worse in an attempt to induce bad papers to submit, but maintain sufficiently good peer review to ensure that a large proportion of those bad papers will probably be rejected."
The "stylized facts" they invent are unfortunately pretty compelling as a description of CS conference incentives. Prestige = acceptance rate is pretty much true, high-variance peer review does incentivize more submissions which can then be rejected, and many conferences (i.e. AAAI and IJCAI come to mind) seem like they explicitly want to lower their acceptance rates.
It's not part of their model but complex and inconsistently-enforced checklist rules fit nicely into the same story.
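To make the quoted mechanism vivid, here is a toy simulation (emphatically not the paper's actual model; every parameter and threshold here is made up): when review gets noisier, weak papers find it worth a shot, submissions swell, and the acceptance rate among submissions still falls.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def toy_journal(review_noise, n_authors=100_000, threshold=0.7, min_chance=0.05):
    """Cartoon of the quoted mechanism, not the paper's model; all numbers made up.

    Papers have a true quality in [0, 1]; the journal accepts when
    quality + Gaussian review noise clears `threshold`. Authors only
    submit if their chance of sneaking in exceeds `min_chance`.
    """
    quality = rng.uniform(0, 1, n_authors)
    p_accept = 1 - norm.cdf(threshold - quality, scale=review_noise)
    submits = p_accept > min_chance
    accepted = submits & (quality + rng.normal(0, review_noise, n_authors) > threshold)
    return submits.mean(), accepted.sum() / submits.sum()

for noise in (0.01, 0.1, 0.3):
    share_submitting, acceptance_rate = toy_journal(noise)
    print(f"review noise {noise:.2f}: {share_submitting:.0%} submit, "
          f"acceptance rate {acceptance_rate:.0%}")
```

With near-perfect review only the near-threshold papers bother submitting and almost all of them get in; crank up the noise and everyone takes a lottery ticket, which is exactly the low-acceptance-rate, high-"prestige" outcome.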
One of the more perplexing aspects of the growth of NeurIPS is that they kept the acceptance rate constant. It was a deliberate decision, and one I never understood.
It is not just statistics. Why is it necessary to have a reviewer checklist with 9 (!) different items for ethical concerns?
Yes, totally. I zeroed in on statistics because that particular checklist item illustrates further how statistics is used as part of arbitrary rulemaking. But I would strongly argue in favor of removing the checklist altogether.
I would bet that not a single submission was flagged under some of those categories. Certainly any problematic issues can be reported to ACs/PCs directly.
I'm going to steelman the checklist. If LLM outputs are variable by nature, isn't it all the more important to... measure and describe that variability? My impression is that in CS, many people are not used to thinking about variability and uncertainty quantification. In that case, a checklist reminding people about uncertainty quantification seems reasonable.
Of course, an appropriate checklist item in this case is, "hey, did you remember to think about outcome variability and/or uncertainty quantification?"
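For what it's worth, the minimal version of this is cheap. Here's a sketch (the accuracy numbers are made up; in practice they would come from re-running the stochastic pipeline with different sampling seeds): report the mean over runs with a percentile bootstrap interval.

```python
import numpy as np

rng = np.random.default_rng(0)

def report_variability(scores, n_boot=10_000):
    """Mean plus a percentile bootstrap interval over repeated runs.

    `scores` holds one number per independent run of the stochastic
    pipeline (different sampling seeds, shuffled prompts, etc.).
    """
    scores = np.asarray(scores, dtype=float)
    boot_means = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return scores.mean(), (lo, hi)

# e.g. accuracy from 8 runs with different sampling seeds (made-up numbers):
scores = [0.71, 0.68, 0.74, 0.70, 0.66, 0.73, 0.69, 0.72]
mean, (lo, hi) = report_variability(scores)
print(f"accuracy {mean:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```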
Your checklist item is fine (though I would greatly prefer it with the word "quantification" deleted). But I just can't imagine a person doing good work on LLMs who wouldn't think about that! I don't see why it has to be written down, and I don't see why someone needs to check that they agree before submitting a paper.
Yes, Andrew Gelman talks about this a lot in social science; what you actually want is for people to do good science, and no checklist has yet been invented that accomplishes this.
The best a checklist can do is flag certain common manifestations of bad science. Which may be worthwhile, but perhaps mostly as a peculiar ritual for establishing social norms and building consensus. The direct effect on the quality of a given manuscript is likely nil, maybe even slightly negative.
Yes, agree that it's negative. That said, I think the question of "what makes good science?" has no easy answers.
https://www.argmin.net/p/the-good-the-bad-and-the-science
I see NeurIPS is well on its way to becoming a systems neuroscience meeting (derogatory). Statistical bureaucracy is how you perform the appearance of rigor when your field mostly consists of stamp-collecting and doesn't have a coherent organizing theory across labs.
The thing that's crazy is that machine learning does have a coherent organizing theory! More data, more compute, throw spaghetti at the wall and see what sticks. Donoho's "Data Science at the Singularity" cleans up my glib version and makes the whole enterprise sound more principled.
And systems neuroscience could have a coherent organizing theory (brain evolution) if experimenters gave a damn and tried to line up the areas we record from with ecologically valid tasks testing what those areas evolved to do. Stamp collecting and statistical bureaucracy are choices.
The checklist bureaucracy has precedent in the journal world. Nature journals, e.g., have had them since ca. 2013. See [1]. I actually find the Nature checklist on data presentation [2] to be quite reasonable. It is only four bullets:
- Individual data points are shown when possible, and always for n ≤ 10
- The format shows data distribution clearly (e.g. dot plots, box-and-whisker plots)
- Box-plot elements are defined (e.g. center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers)
- Clearly defined error bars are present and what they represent (SD, SE, CI) is noted
Basically, you can use whatever error bars you want; it's just asking you to define your visual elements. Feels like a reasonable ask (a sketch of what compliance might look like follows below).
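As a concrete illustration (not Nature's code, just a hypothetical matplotlib sketch with made-up data), satisfying all four bullets takes about a dozen lines: plot every point when n is small, overlay a summary, and say in the caption what the bars are.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Made-up measurements for two conditions, n = 8 each (small n, so show every point).
groups = {"control": rng.normal(1.0, 0.2, 8), "treated": rng.normal(1.3, 0.2, 8)}

fig, ax = plt.subplots()
for i, (name, vals) in enumerate(groups.items()):
    # Individual data points, lightly jittered so they don't overlap.
    ax.scatter(i + rng.uniform(-0.08, 0.08, len(vals)), vals, color="black", zorder=3)
    # Summary marker with an error bar: mean +/- 1 SD, stated explicitly in the title.
    ax.errorbar(i, vals.mean(), yerr=vals.std(ddof=1), fmt="o", capsize=5)

ax.set_xticks([0, 1])
ax.set_xticklabels(list(groups))
ax.set_ylabel("measurement (arbitrary units)")
ax.set_title("points: individual samples (n=8); bars: mean ± 1 SD")
plt.show()
```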
Looking this up just now, I'm seeing there's also a "Reporting Summary" [3] that appears to be required as well [4], which does look a little heavy-handed at first glance, but I don't think I'm opposed to it. More to your point, they've also gotten quite strict on the data and code requirements, or so I've heard.
[1] "Checklists work to improve science" (2018)
https://www.nature.com/articles/d41586-018-04590-7
[2] https://www.nature.com/documents/nr-editorial-policy-checklist-flat.pdf
[3] https://www.nature.com/documents/nr-reporting-summary-flat.pdf
[4] https://www.nature.com/nature-portfolio/editorial-policies/reporting-standards
Those data viz rules are just a "house style." It's like the diaereses in the New Yorker.
The reporting summary is a whole other matter. And that editorial [1] is hot garbage. The Nature/Science journal empires embody much of what's wrong with academic science.
I don't know... having read many papers with figures that have error bars but been unable to tell what the error bars represent (1 SD? 1.96 SD? 5th/95th percentiles? something else?), I actually find it useful that they require authors to state what error bars they're using! But the reporting summary seems overblown.
Which is to say, I think you're 100% right about your bureaucracy-theory lens on how these checklists are metastasizing in length and complexity. I just wanted to chime in with some perspective from outside the recent adoption within ML!
I would argue in favor of maximum flexibility in visualization, but put the burden on the authors to explain what they are showing. The tricky part is that fields develop visualization conventions that quickly feel like second nature, and then they stop explaining their plots as they feel such explanations are redundant.
That last point is worrying: unless these papers are just transient, future investigators wishing to replicate some study or do a retrospective will have little idea what these error bars mean. I am not even sure why an experiment would use SE rather than 2 SD (and preferably a p-value). And as you say, what is an "experiment" without a control? It is just an observation with little context.
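For what it's worth, SE and 2 SD answer different questions: SD describes the spread of individual measurements, SE the uncertainty in the estimated mean, and the gap between them grows with n. A small numeric sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (10, 100, 1000):
    x = rng.normal(loc=5.0, scale=2.0, size=n)  # made-up measurements
    sd = x.std(ddof=1)          # spread of individual measurements
    se = sd / np.sqrt(n)        # uncertainty in the estimate of the mean
    print(f"n={n:5d}  mean={x.mean():.2f}  2*SD={2*sd:.2f}  2*SE={2*se:.2f}")
```

The 2*SD column stays around 4 regardless of n, while 2*SE shrinks like 1/sqrt(n), which is exactly why an unlabeled bar is uninterpretable.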
I agree that the checklist and the current UQ item are out of hand. But there are a lot of bad papers claiming small improvements submitted to ML conferences and journals. If the authors computed even the simplest error estimates, they would see that those small improvements are not real. Hopefully, they would then not submit their paper. Maybe the call for papers should just say "We will reject any paper claiming small improvements" and leave it at that?
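Concretely, the "simplest error estimate" can be as little as re-running both methods over a handful of seeds and asking whether the gap clears the seed-to-seed noise. A sketch with made-up accuracies:

```python
import numpy as np
from scipy import stats

# Accuracy of a baseline and a "+0.3%" method across 5 random seeds (made-up numbers).
baseline = np.array([0.842, 0.851, 0.838, 0.847, 0.845])
proposed = np.array([0.848, 0.845, 0.852, 0.841, 0.851])

diff = proposed - baseline
t, p = stats.ttest_rel(proposed, baseline)  # paired t-test across seeds
print(f"mean improvement = {diff.mean():+.4f} ± {diff.std(ddof=1) / np.sqrt(len(diff)):.4f} (SE)")
print(f"paired t-test: t = {t:.2f}, p = {p:.2f}")
```

Here the "improvement" is well inside the seed-to-seed noise, which is the whole point.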
I value your suggestion. However, my view is that ideas need to be presented, and even if they don't achieve an improvement, they should still be publishable. It doesn't help that so many people keep reinventing the wheel. A not-so-great idea for one problem might generate a great idea for another, totally unrelated problem. I don't think publications should focus on "right now, for our subject, this is what is important" but rather on the worthiness of ideas. Of course, reviewers should also be well equipped to critique ideas they feel are weak - that's their job! Unfortunately, we are in a situation where authors overclaim and write WAY too many papers, and reviewers simply don't have the means to verify claims without running the experiments themselves. The culture needs to shift, although I have no idea what it should shift to.
Apart from your critique of NeurIPS, I think it is key to recognize that error bars involve an additional inferential step: you have to model the errors/uncertainty. This has been implicitly recognized by communities with a longer tradition of expressing measurement uncertainty (e.g. physics), and I would not hold this additional inference against error bars per se. They can be informative.
I do think, though, that at some point your uncertainty modeling can collapse because it is itself too uncertain. Do you think this is the case in ML? Is it just too hard to get a grip on uncertainty there?
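To make the "additional inferential step" concrete: the same scores can yield very different error bars depending on the assumed error model, e.g. treating every example as independent versus acknowledging that examples sharing a prompt template are correlated. A sketch with made-up data and a hypothetical templates-by-examples setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up setup: 20 prompt templates, 50 examples each; scores share a
# per-template offset, so examples within a template are correlated.
n_templates, n_per = 20, 50
template_effect = rng.normal(0, 0.05, size=(n_templates, 1))
scores = 0.7 + template_effect + rng.normal(0, 0.05, size=(n_templates, n_per))

# Error model 1: pretend every example is independent.
flat = scores.ravel()
se_iid = flat.std(ddof=1) / np.sqrt(flat.size)

# Error model 2: aggregate to template means first, then take the SE across templates.
cluster_means = scores.mean(axis=1)
se_cluster = cluster_means.std(ddof=1) / np.sqrt(n_templates)

print(f"mean score {flat.mean():.3f}")
print(f"SE assuming i.i.d. examples:       {se_iid:.4f}")
print(f"SE respecting template clustering: {se_cluster:.4f}")
```

The clustered estimate comes out several times wider than the naive one, so the error bar you report is only as good as the error model behind it.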