Last week, I wrote two posts on a related theme, but didn’t fully connect them to my earlier thoughts on the topic. I have a particular drum I’ve been beating consistently on this blog since I moved to Substack (and I have even older posts on it, too):
Though machine learning is statistical prediction, classical inferential statistics has nothing interesting to say about the field.
In fact, lessons from classical inferential statistics have historically provided poor, misleading guidance for machine learning practice.
A culture of frictionless reproducibility has been the primary driver of machine learning progress.
I use the term classical inferential statistics loosely here, and Siva Balakrishnan is going to get mad at me about it. He and I cannot agree on a term to describe the narrow statistical subfield that I want to call out: frequentist claims about out-of-sample behavior derived from laws of large numbers and the union bound. This includes statistical significance, null hypothesis significance testing, and error bars.1
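To make concrete what I’m calling out, here’s a toy sketch, purely my own illustration rather than anyone’s recommended protocol, of the kind of out-of-sample guarantee this subfield produces: a Hoeffding-style error bar on test error, inflated by a union bound when several models share the same test set.

```python
import numpy as np

def hoeffding_error_bar(n_test, delta=0.05, n_models=1):
    """Half-width of a (1 - delta) confidence interval on test error.

    Hoeffding's inequality bounds the deviation of the empirical test error
    from the true error: P(|empirical - true| > eps) <= 2 exp(-2 n eps^2).
    A union bound over n_models evaluated on the same test set inflates the
    failure probability to 2 * n_models * exp(-2 n eps^2); solve for eps.
    """
    return np.sqrt(np.log(2 * n_models / delta) / (2 * n_test))

# 10,000 test examples: one model vs. twenty models sharing the test set.
print(hoeffding_error_bar(10_000))               # ~0.014
print(hoeffding_error_bar(10_000, n_models=20))  # ~0.018
```

That’s essentially the whole toolkit: a law of large numbers for the error bar, and a union bound to pay for asking multiple questions of the same data.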
I’ve decided to try out “classical” because Jessica Hullman used it in her blog post advocating for more statistics in machine learning. I always find Jessica thought-provoking, and at the end of the post she links to me and asks:
“But others refer to attempts to incentivize more thorough reporting of uncertainty in ML evaluation as ‘a weird obsession with statistics.’ What’s up with that, I wonder?”
I’ll reply here, though most of what I wanted to say was already written in the comments under Jessica’s post by frequent stat modeling interlocutor Anoneuoid:
Mandating null hypothesis significance testing and related ideas is a fast way to stifle progress.
Systematic errors are more important than sampling errors.
Since everyone started deep learning, machine learning papers have been reduced to advertisements of pull requests.
You don’t need 90 pages of robustness checks or proofs, since you can simply try the code and decide for yourself whether to accept it.
I swear I am not Anoneuoid. We just happen to strongly agree about everything.
But yes, I have made the same arguments on the blog many times. If the machine learning community had listened to the guidelines from classical inferential statistics, nothing would have ever gotten built. Machine learning was largely developed without the use of classical inferential statistics. In fact, the influence has gone in the opposite direction: since 1960, statistically minded researchers have been trying to bootstrap a “statistical learning theory” by chasing practice.
The problem is that theories derived by shoe-horning practice into classical statistical footwear haven’t been productive. Every time this narrow view of classical statistics is applied, it gives the wrong advice! It’s been actively harmful to the field. It makes incorrect predictions and obsesses about the wrong type of error.
The part of practice that most resembles classical statistics is the train-test paradigm.2 Statistics doesn’t explain why this is successful at all! If anything, this has polarized me against other conclusions drawn from statistical theory. Indeed, it makes me believe that claims about the scientific detriment of p-hacking and uncorrected multiple hypothesis testing are wildly overstated.
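For concreteness, the entire paradigm fits in a few lines. Here’s a schematic sketch using scikit-learn; the dataset and model are arbitrary stand-ins, not anything from a particular paper.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the data once, fit on the training portion, report error on the holdout.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

Fit on one split, report a number on the held-out split; that’s the whole protocol.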
Another post critiquing statistics is all well and good, but I have a more important point to make here about what we might consider doing instead. Jessica writes to Anoneuoid:
“But I guess part of what I find confusing about dismissing attempts to hold people accountable for what they are claiming (like the crazy long NeurIPS checklist that the linked post complains about) is that in the absence of suggesting an alternative way to better align the claims and the evidence, it reads as though we’re saying all is ok as is and uncertainty can continue to be an afterthought when presenting evaluation results.”
I often feel like the solutionist burden placed on critics is an easy way to wave off critique. But on this issue, I am actually proposing an alternative! Open models, open data, and open code are a clear, feasible alternative requirement. These are not major asks. As I said in a previous post, the only reason we don’t require these is that there are still a lot of corporate interests at the big machine learning conferences, and these places always have some argument for why they can’t release code and data. David Pfau noted on Twitter that this is rapidly changing with the LLM arms race, with the companies moving towards abandoning publishing altogether. He might be right, but that doesn’t mean we have to encourage their nefarious behavior by embracing their move to proprietary secrecy.
Jessica admits the problem herself in her review of Weijie Su’s argument for more statistics in machine learning research.
"Something a bit cringey that becomes clearer when you see the various statistical challenges laid out like this is that sometimes they arise not just because LLMs are too complex for us to understand, but also because they are proprietary objects.”
The frustration with most modern published LLM papers is that industrial closedness reduces open research to ephemeral flailing. If you are taking a proprietary, closed model and doing some prompt engineering to elicit funny answers, you are doing HCI research on proprietary software. If you train a transformer model from scratch on the orbits of planets and don’t use all of the language on the internet, your result says nothing about LLMs. Even if you are fine-tuning an open-weights model, there’s only so much we can learn, because we have no idea what the training corpus was.
Machine learning has thrived in its embrace of frictionless reproducibility: open, shareable data, the ability to re-execute code, and competitive testing. These are powerful tools to mitigate uncertainty. I’ve written some thoughts about why this is, drawing analogies to distributed optimization. I still think this is an excellent direction for meta-scientific study. But for whatever reason, many in the orbit of machine learning seem more interested in developing more statistical tests than in understanding why exactly this practice works so well.
Let me close by quoting Andrew Gelman, who replied to Jessica:
“On the other side, Recht presents a very reasonable engineering perspective that is anti-bureaucratic: we've already made a lot of progress and continue to do so, so don't tell us what to do. Or, to put it more carefully, you can tell us what to do for safety or public policy reasons, but it seems like a mistake to try to restrict researchers' freedom in the belief that this will improve research progress. This general position makes sense to me, and it is similar to many things I've said and written regarding science reform: I don't want to tell people what to do, and I also don't want criticism to be suppressed. That doesn't mean that science-reform proposals are necessarily bad. For example, I find preregistration to be valuable (for the science, not the p-values), but I wouldn't want it to be a requirement.”
I agree strongly here with Andrew. Our mandates should be few and far between. I advocate for only one: stronger norms about openness in code, data, and models.
I’ll leave the Bayesians alone today because no one is proposing that LDA be on the NeurIPS checklist yet.
Though the train-test paradigm was invented by practice, not by a careful application of statistics.
Hi Ben,
To be clear, my comments to Anoneuoid should in no way be taken as advocating for mindless application of frequentist statistics or blanket reinforcement of reform heuristics that someone has decided must be applied -- I would hope this is obvious, as it would go against so much of my prior research and all of my blog posts on science reform being misguided. I think the NeurIPS checklist is a mess (as it seems you do).
What motivated my responses to Anoneuoid (as should be clear from reading them) was my difficulty with the logical implication of your dismissal of statistics in ML: that authors should be absolved of having to match their evidence to their claims. Insisting on certain norms for communicating error is silly. However, presenting data from some experiment you ran without explaining enough about the process that generated those results for someone to evaluate your claims is also silly. At that point, the experiments are simply performative, in which case I'd prefer not to see them at all.
I think it's interesting to consider what would happen to ML if Moore's law were not there and compute gains stalled.
I would predict that progress would quickly stall and interest in rigorous frequentist statistics would steadily rise, until the field looked like psychology.
Machine learning can work like it does, with an open system of pull requests and without "statistics," because most improvements do work and are easily discernible as working (and maybe that is just because all improvements work "on average," since benefits from compute are rising every year).