7 Comments
Jessica Hullman

Hi Ben,

To be clear, my comments to Anoneuid should in no way be taken as advocating for mindless application of frequentist statistics or blanket enforcement of reform heuristics that someone has decided must be applied -- I would hope this is obvious, as it would go against so much of my prior research and all of my blog posts on the ways science reform has been misguided. I think the NeurIPS checklist is a mess (as it seems you do).

What motivated my responses to Anoneuid (as should be clear from reading them) was my difficulty with the logical implication of your dismissal of statistics in ML: that authors should be absolved of having to match their evidence to their claims. Insisting on certain norms for communicating error is silly. However, presenting data from some experiment you ran without explaining enough about the process that generated those results for someone to evaluate your claims is also silly. At that point, the experiments are simply performative, in which case I'd prefer not to see them at all.

Ben Recht

Yes, although I think we have different conceptions of what constitutes "explaining enough." Machine learning has the advantage over most sciences in that it is computer science. Most results are claims solely about in silico behavior. For in silico behavior, a Python notebook that allows every step to be re-executed provides a different form of explanation from statistical summaries of a field experiment.
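As a toy sketch of the distinction (the dataset, model, and seed below are arbitrary placeholders, not anything from the post), the artifact is a script whose claim is nothing more than its own output, and rerunning it is the audit:

```python
# A minimal sketch of "re-executable evidence." Dataset, model, and seed are
# arbitrary stand-ins chosen only to make the script self-contained.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

SEED = 0  # fixing the seed makes every step repeatable

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED
)

model = LogisticRegression(max_iter=2000).fit(X_train, y_train)

# The claim is simply "this code prints this number."
print("test accuracy:", model.score(X_test, y_test))
```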

Onid

Working at one of these industry labs, I’ve had the experience multiple times of being told that an interesting algorithmic idea could not be published because it was being used (or might be used) in our production system.

You talk about how openness is a necessary principle for scientific progress, and I strongly agree. But the problem is that these industry labs aren’t optimizing for scientific progress - that ended the moment their research programs stopped being about prestige and started being about profit and competition. It’s a sad dynamic but that’s where we are right now.

Ben Recht

I agree, and the machine learning research community is going to be best served by not following them down that hole.

I remain optimistic about open models!

mirrormere

I think it's interesting to consider what would happen to ML if Moore's law were to end and compute gains stalled.

I would predict that progress would quickly stall and interest in rigorous frequentist statistics would steadily rise, until the field looked like psychology.

Machine learning can work the way it does, as an open system of pull requests without "statistics," because most improvements do work and are easily discernible as working (and maybe that is only because improvements work "on average" while the benefits from compute rise every year).

Ben Recht

Machine learning progress does pretty cleanly track with Moore's law. Now, Zuckerberg is proposing data centers the size of Manhattan. Exponential scaling must end eventually, but I can't predict when or what happens once it does.

I do think that predictive statistics tends to "work" because it's more goal-directed than science (i.e., it's engineering) and hence more amenable to optimization. If all you need is to optimize a benchmark, even grossly inefficient RL can hill climb.
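As a caricature of that point, even a deliberately crude random search climbs steadily when the only objective is a single number. The "benchmark" below is a made-up function, not any real leaderboard:

```python
# Crude hill climbing: propose a random perturbation, keep it if the
# benchmark number goes up. The benchmark here is an arbitrary stand-in.
import random

random.seed(0)

def benchmark_score(params):
    # stand-in for "run the system, read off the leaderboard number"
    return -sum((p - 0.7) ** 2 for p in params)

best = [random.random() for _ in range(5)]
best_score = benchmark_score(best)

for _ in range(1000):
    candidate = [p + random.gauss(0, 0.05) for p in best]
    if benchmark_score(candidate) > best_score:
        best, best_score = candidate, benchmark_score(candidate)

print("final benchmark score:", round(best_score, 4))
```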

Alex Tolley

I must be missing something. Why can't older ML techniques like decision trees be evaluated with statistics, with those evaluations used to help progress? Didn't we have ROC curves to show improvements?
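The kind of evaluation I have in mind is something like this rough sketch (the dataset and tree settings are arbitrary):

```python
# Rough sketch: cross-validated ROC AUC for a decision tree on a stock
# binary-classification dataset. Dataset and hyperparameters are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0)

aucs = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
print(f"ROC AUC: {aucs.mean():.3f} +/- {aucs.std():.3f} across 10 folds")
```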

Is the problem scaling, which makes runs extremely expensive? Is it the flexibility of prompts? As someone suggested in the comments, aren't LLM experiments really just psychology experiments now, requiring N instances of the trained LLM to do the statistics?

I do not understand this argument.

I do understand the concern that requirements can be constraining. However, as we are seeing in science publishing, bad actors are polluting the scientific record with bogus papers, poor analysis, cherry-picked data, and bad or faked peer review. LLMs are expensive, and a lot rides on success, so we see similar bad behavior used to boost an LLM's apparent performance in the "league tables".

Would it not be better to focus on making small ML models that work well, trained on different inputs, so that we have a competitive landscape that looks more like book publishing than "the one AI to rule them all"?
