these days as a PhD student, publishing your first NeurIPS/ICLR/ICML paper is like getting your SAG card or something. it makes you eligible for big tech internships. that’s probably worth a couple hundred thousand $ in expectation (low estimate).
those incentives are just too strong, it’s inevitable that things will become strange and distorted. adding more and more bureaucratic process won’t actually help, except perhaps by increasing desk rejections and reducing submissions on the margin.
I don’t really see a way to stop it. It’s public record who publishes at NeurIPS and there’s no way to stop third parties from using that information to make hiring decisions. But the end result is that we’ve all been conscripted to be first-pass recruiters for Google and Meta.
There's no way to stop it, but we could all strive to be more honest about the situation and say it more loudly!
another thing that comes to mind is the paper "Academic journals, incentives, and the quality of peer review: a model" by Zollman et al.
They have some toy economic models for how journals and authors might behave when the "prestige" of a journal is measured by acceptance rates.
"...This results in a strange process whereby journals make peer review worse in an attempt to induce bad papers to submit, but maintain sufficiently good peer review to ensure that a large proportion of those bad papers will probably be rejected."
The "stylized facts" they invent are unfortunately pretty compelling as a description of CS conference incentives. Prestige = acceptance rate is pretty much true, high-variance peer review does incentivize more submissions which can then be rejected, and many conferences (i.e. AAAI and IJCAI come to mind) seem like they explicitly want to lower their acceptance rates.
It's not part of their model but complex and inconsistently-enforced checklist rules fit nicely into the same story.
One of the more perplexing aspects of the growth of NeurIPS is they kept the acceptance rate constant. It was a deliberate decision and one I never understood.
I'm going to steelman the checklist. If LLM outputs are variable by nature, isn't it all the more important to... measure and describe that variability? My impression is that in CS, many people are not used to thinking about variability and uncertainty quantification. In that case, a checklist reminding people about uncertainty quantification seems reasonable.
Of course, an appropriate checklist item in this case is, "hey, did you remember to think about outcome variability and/or uncertainty quantification?"
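To be concrete, the kind of reporting I have in mind is cheap: rerun the evaluation a handful of times and show the spread. A minimal sketch of what I mean, with a made-up `run_eval` standing in for whatever stochastic evaluation a paper actually uses:

```python
import random
import statistics

def run_eval(seed: int) -> float:
    """Hypothetical stand-in for one evaluation run of a stochastic system
    (e.g. scoring sampled LLM outputs on a fixed benchmark)."""
    rng = random.Random(seed)
    return 0.70 + rng.gauss(0.0, 0.02)  # simulated score with run-to-run noise

def summarize_variability(n_runs: int = 5) -> dict:
    """Rerun the evaluation and report the spread, not just a single number."""
    scores = [run_eval(seed) for seed in range(n_runs)]
    return {
        "runs": scores,                  # the individual points
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),  # sample SD across runs
    }

print(summarize_variability())
```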
Your checklist item is fine (though I would greatly prefer it with the word "quantification" deleted). But I just can't imagine a person doing good work on LLMs who wouldn't think about that! I don't see why it has to be written down, and I don't see why someone needs to check that they agree before submitting a paper.
Yes, Andrew Gelman talks about this a lot in social science; what you actually want is for people to do good science, and no checklist has yet been invented that accomplishes this.
The best a checklist can do is flag certain common manifestations of bad science. Which may be worthwhile, but perhaps mostly as a peculiar ritual for establishing social norms and building consensus. The direct effect on the quality of a given manuscript is likely nil, maybe even slightly negative.
Yes, agree that it's negative. That said, I think the question of "what makes good science?" has no easy answers.
https://www.argmin.net/p/the-good-the-bad-and-the-science
Apart from your critique of NeurIPS, I think it is key to recognize that error bars involve an additional inferential step: you have to model the errors/uncertainty. This has been implicitly recognized by communities with a longer tradition of expressing measurement uncertainty (e.g. physics), and I would not hold this additional inference against error bars per se. They can be informative.
I do think though, that at some point your uncertainty modeling can collapse because it itself is too uncertain. Do you think this is the case in ML? Is it just too hard to get a grip on uncertainty there?
Yes, and this is one of those cases where I side with the Bayesians over the Frequentists. If you have a full forward model and an observation, then credible intervals are inferentially meaningful. The issue is just whether you can accurately compute the posterior.
Specifically with regards to ML, the question is always uncertainty with respect to what? With respect to the data? With respect to the algorithm? With respect to the frantic pace a paper is thrown together to meet a conference deadline? I'm not sure *what* uncertainty we want to or need to quantify. I can't think of a single breakthrough in machine learning that needed error bars to justify its significance.
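To be fair to the Bayesian side of my own argument, the clean case looks something like this: a toy forward model (Binomial correct/incorrect counts with a Beta prior) where the posterior, and hence the credible interval, is exact. The counts below are made up for illustration.

```python
from scipy.stats import beta

# Toy forward model: each of n benchmark items is answered correctly with
# probability p; we observe k correct. With a Beta(1, 1) prior on p the
# posterior is Beta(1 + k, 1 + n - k), so the credible interval is exact.
n, k = 200, 143  # made-up counts
posterior = beta(1 + k, 1 + n - k)
lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"posterior mean = {posterior.mean():.3f}, "
      f"95% credible interval = ({lo:.3f}, {hi:.3f})")
```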
It is not just statistics. Why is it necessary to have a reviewer checklist with 9 (!) different items for ethical concerns?
Yes, totally. I zeroed in on statistics because that particular checklist item illustrates further how statistics is used as part of arbitrary rulemaking. But I would strongly argue in favor of removing the checklist altogether.
I would bet that not a single submission was flagged under some of those categories. Certainly any problematic issues can be reported to ACs/PCs directly.
I see NeurIPS is well on its way to becoming a systems neuroscience meeting (derogatory). Statistical bureaucracy is how you perform the appearance of rigor when your field mostly consists of stamp-collecting and doesn't have a coherent organizing theory across labs.
The thing that's crazy is machine learning does have a coherent organizing theory! More data, more compute, throw spaghetti at the wall and see what sticks. Donoho's Data Science At The Singularity cleans up my glib version and makes the whole enterprise sound more principled.
And systems neuroscience could have a coherent organizing theory, brain evolution, if experimenters gave a damn to try and line up the areas we record in with ecologically valid tasks testing what those areas evolved to do. Stamp collecting and statistical bureaucracy are choices.
The checklist bureaucracy has precedent in the journal world. Nature journals, e.g., have had them since ca. 2013. See [1]. I actually find the Nature checklist on data presentation [2] to be quite reasonable. It is only four bullets:
- Individual data points are shown when possible, and always for n ≤ 10
- The format shows data distribution clearly (e.g. dot plots, box-and-whisker plots)
- Box-plot elements are defined (e.g. center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers)
- Clearly defined error bars are present and what they represent (SD, SE, CI) is noted
Basically, you can use whatever error bars you want, but it's just asking you to define your visual elements. Feels like a reasonable ask.
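For what it's worth, satisfying all four bullets is only a few lines of matplotlib. A sketch on simulated data (everything below is illustrative, not from any real paper):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = {"A": rng.normal(1.0, 0.3, 8), "B": rng.normal(1.4, 0.3, 8)}  # n <= 10 each

fig, ax = plt.subplots()
for i, (name, vals) in enumerate(groups.items()):
    # Show the individual data points, per the first bullet.
    ax.scatter(np.full(vals.size, i), vals, color="gray", zorder=3)
    # Overlay mean +/- 1 SD, and say so explicitly in the figure.
    ax.errorbar(i, vals.mean(), yerr=vals.std(ddof=1), fmt="o", capsize=4, color="black")
ax.set_xticks(range(len(groups)), list(groups.keys()))
ax.set_ylabel("measurement (a.u.)")
ax.set_title("Error bars: mean ± 1 SD; individual points shown (n = 8 per group)")
plt.show()
```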
Looking this up just now, I'm seeing there's also a "Reporting Summary" [3] that appears to be required as well [4], which does look a little heavy-handed at first glance, but I don't think I'm opposed to it. More to your point, they've also gotten quite strict on the data and code requirements, or so I've heard.
[1] "Checklists work to improve science" (2018)
https://www.nature.com/articles/d41586-018-04590-7
[2] https://www.nature.com/documents/nr-editorial-policy-checklist-flat.pdf
[3] https://www.nature.com/documents/nr-reporting-summary-flat.pdf
[4] https://www.nature.com/nature-portfolio/editorial-policies/reporting-standards
Those data viz rules are just a "house style." It's like umlauts in the New Yorker.
The reporting summary is a whole other matter. And that editorial [1] is hot garbage. The Nature/Science journal empires embody much of what's wrong with academic science.
I don't know... having read many papers with figures that have error bars but been unable to tell what the error bars represent (1 SD? 1.96 SD? 5th/95th percentiles? something else?), I actually find it useful that they require authors to state what error bars they're using! But the reporting summary seems overblown.
Which is to say, I think you're 100% right about your bureaucracy-theory lens on how these checklists are metastasizing in length and complexity. I just wanted to chime in with some perspective outside the recent adoption within ML!
I would argue in favor of maximum flexibility in visualization, but put the burden on the authors to explain what they are showing. The tricky part is that fields develop visualization conventions that quickly feel like second nature, and then they stop explaining their plots as they feel such explanations are redundant.
That last point is worrying: unless these papers are just transient, future investigators wishing to replicate some study or do a retrospective will have little idea what these error bars mean. I am not even sure why an experiment would use SE rather than 2 SD (and preferably a p-value). And as you say, what is an "experiment" without a control? It is just an observation with little context.
I have several biology colleagues who greatly prefer SE because it makes the variability look smaller! They also mumble something about the variability of the sample mean. But if that were the concern, just report a confidence interval.
I personally prefer 1SD, because I'm interested in the variability of the data, and SD gives a rough idea of what I'd see if I repeated the experiment, even with a different number of samples.
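To spell out the distinction with a toy numerical example (nothing more than that): the SD describes the spread of the data and stays roughly constant as n grows, the SE = SD/sqrt(n) describes uncertainty in the sample mean and shrinks with n, and a 95% CI is roughly mean ± 1.96 SE.

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (10, 100, 1000):
    x = rng.normal(loc=5.0, scale=2.0, size=n)  # simulated data, true SD = 2
    sd = x.std(ddof=1)                          # spread of the data (stays near 2)
    se = sd / np.sqrt(n)                        # uncertainty of the mean (shrinks)
    ci = (x.mean() - 1.96 * se, x.mean() + 1.96 * se)
    print(f"n={n:4d}  SD={sd:.2f}  SE={se:.3f}  95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```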
All this is ironic because Bayesianism has driven much of ML in the past couple of decades, but this nonsense of checklists and error bars is attributable to the frequentists, who invented the whole fantasy of a "correct" recipe.
1SD is just fine, as it is easy to double the length of the bar if needed.
What has been tempting in medicine/biology is data mining: extracting the significant p-values without applying a Bonferroni multiple-testing correction.
One is supposed to know how to use a new toy before playing with it.
While most people do statistics with the assumption that the data follow a Gaussian distribution, one can always use non-parametric tests, which, while less sensitive, don't require such a distribution. It is funny how, in the pre-personal-computer age, we biologists used chi-squared tests for simplicity, a test that seems to have all but disappeared with the arrival of cheap computers and software.
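Both of those are a line or two in scipy these days; a quick sketch on simulated data, just to show the calls (the numbers are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.4, 1.0, size=30)

t_p = stats.ttest_ind(a, b).pvalue       # parametric: assumes roughly Gaussian data
u_p = stats.mannwhitneyu(a, b).pvalue    # rank-based: no Gaussian assumption

# The old workhorse: chi-squared test on a 2x2 contingency table.
_, chi2_p, _, _ = stats.chi2_contingency([[18, 12], [9, 21]])

# Bonferroni: having run m tests, compare each p-value against alpha / m.
m, alpha = 3, 0.05
for name, p in [("t-test", t_p), ("Mann-Whitney U", u_p), ("chi-squared", chi2_p)]:
    print(f"{name}: p = {p:.4f}, significant after Bonferroni: {p < alpha / m}")
```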
I agree that the checklist and the current UQ item are out of hand. But there are a lot of bad papers claiming small improvements submitted to ML conferences and journals. If the authors computed even the simplest error estimates, they would see that those small improvements are not real. Hopefully, they would then not submit their paper. Maybe the call for papers should just say "We will reject any paper claiming small improvements" and leave it at that?
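Even the crudest version of that check is only a few lines, assuming per-seed scores are available for the baseline and the proposed method (the numbers below are invented for illustration):

```python
import statistics

# Invented per-seed scores for a baseline and an "improved" method.
baseline = [71.2, 70.5, 71.9, 70.8, 71.4]
proposed = [71.6, 70.3, 71.7, 71.5, 71.2]  # claims a ~+0.1 average improvement

diffs = [p - b for p, b in zip(proposed, baseline)]
mean_gain = statistics.mean(diffs)
sd_gain = statistics.stdev(diffs)
print(f"mean gain = {mean_gain:.2f}, SD across seeds = {sd_gain:.2f}")
# If the mean gain is comparable to (or smaller than) the seed-to-seed SD,
# the "improvement" is indistinguishable from noise and probably should not
# headline a paper.
```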
I value your suggestion. However, my view is that ideas need to be presented, and even if they don't achieve an improvement, they should be allowed to be published. It doesn't help that so many people keep reinventing the wheel. A not-so-great idea for one problem might generate a great idea for another, totally unrelated problem. I don't think publications should focus on "right now, for our subject, this is what is important" but rather on the worthiness of the ideas. Of course, reviewers should also be well equipped to critique ideas they feel are weak - that's their job! Unfortunately, we are in a situation where authors overclaim and write WAY too many papers, and reviewers simply don't have the means to verify some claims without running the experiments themselves. The culture needs to shift, although I have no idea what it should shift to.
If the paper is claiming that the new ideas are an improvement, it needs to provide convincing evidence. If the paper is claiming an alternative way of achieving existing performance, but, say, 100x faster, the evidence will be CPU time not AUROC (or other metric). In short, an interesting idea is not enough. It needs to be backed by convincing experimental or analytical results.
Yes I agree with this.