Selecting for complexity
Do machine learning researchers actually care about simple baselines?
Here’s a question that puzzles me: You are assigned a machine prediction task at work. You build a representative train and test set. You train up a bespoke transformer, inventing a few new architectural modifications that get you a lower test error. You also fit a linear classifier to your data. Both models give the same prediction accuracy. Which should you use?
I’m not sure the answer is so straightforward. Sometimes, management demands fancy stuff. Maybe you could convince yourself that the transformer is more general and might be useful for future downstream tasks. Everyone knows transformers are general-purpose models that leverage computation to scale with data, right?
There are, of course, advantages to linear models. It’s easy to explain the implementation details of linear models. The code to build and deploy the models is usually pretty concise and easy to read. Many different packages support training linear models. Linear models typically don’t require a ton of computation to train. They are easy to debug. They are easy to maintain within a codebase. People feel like they are easier to interpret (I’m not one of those people).
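To make the “concise and easy to read” point concrete, here’s a minimal sketch of the linear side of the thought experiment above, using scikit-learn. The dataset is a synthetic placeholder standing in for whatever your task actually provides.

```python
# A minimal sketch of the linear baseline from the thought experiment above.
# The data is synthetic; swap in your own representative train/test split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for whatever your prediction task provides.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The entire model-development step for the linear classifier.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"linear baseline test accuracy: {clf.score(X_test, y_test):.3f}")
```

That’s more or less the whole pipeline. Whatever the bespoke transformer buys you has to justify everything it adds on top of these dozen lines.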
It seems like for most people, linear models are preferable if they achieve comparable prediction performance on the associated machine learning task. I can’t run polls on Substack, so I’m basing my assessment on the vibes of my social network. You can tell me otherwise in the comments. I still see plenty of dunk papers showing that complex machine learning models are outperformed by linear prediction. For example, last week I stumbled across this paper showing that “foundation” models for predicting the effects of genetic modifications did not outperform linear baselines. In this paper, the linear baseline was a simple, hand-coded rule, not even trained on data. This heuristic outperformed a transformer model, previously featured in Nature, that had been pretrained for three days on twelve GPUs over 30 million human single-cell transcriptomes.
Here’s another example. Last month, I wrote about a hedge fund claiming to be using cutting-edge AI. It turned out they were just using random Fourier features. But they were marketing this as “heavyweight machine learning” that signaled the future of advanced computational finance. They proudly claimed that more complexity in prediction models led to better investment returns. Of course, it turns out that linear prediction outperforms their heavyweight machine learning.
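For anyone who hasn’t run into them, random Fourier features are an old trick for approximating a kernel machine: push the inputs through random sinusoids, then fit an ordinary linear model on top. Here’s a minimal sketch on synthetic data (not the fund’s, obviously); the toy target and parameters are just placeholders.

```python
# Random Fourier features: approximate an RBF kernel machine by fitting a
# plain linear model on random cosine features of the inputs.
# All data below is synthetic and only illustrates the construction.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d, D, gamma = 2000, 10, 500, 0.5   # samples, input dim, feature count, RBF bandwidth

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # toy nonlinear target

# z(x) = sqrt(2/D) * cos(Wx + b), with W ~ N(0, 2*gamma*I), approximates the RBF kernel.
W = np.sqrt(2 * gamma) * rng.standard_normal((d, D))
b = rng.uniform(0, 2 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)

X_tr, X_te, Z_tr, Z_te, y_tr, y_te = train_test_split(
    X, Z, y, test_size=0.25, random_state=0
)

# "Heavyweight machine learning": ridge regression on the random features...
print("RFF ridge R^2:   ", Ridge(alpha=1.0).fit(Z_tr, y_tr).score(Z_te, y_te))
# ...versus the boring baseline: ridge regression on the raw inputs.
print("linear ridge R^2:", Ridge(alpha=1.0).fit(X_tr, y_tr).score(X_te, y_te))
```

A toy example like this says nothing about what happens on financial returns. The point is only that the “heavyweight machine learning” is itself a linear fit on a fixed random feature map.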
These are just two examples I’ve come across recently. Send me more examples if you’ve seen them. I know they’re out there.
The question remains, however: do we care that linear baselines are often all you need? There is certainly social signaling that we don’t care that much. The Geneformer paper was in Nature. The linear baselines paper was in Nature Methods. The random feature finance paper has been all over the financial press. I haven’t seen anyone mention Buncic’s SSRN rebuttal demonstrating the value of linear baselines.
I get why we are enthralled with the big neural models. Whether it's AlexNet, AlphaGo, AlphaFold, or ChatGPT, the most impressive results in machine learning have required scaling neural models. And if these neural models work on the hard problems, why do we care what happens on easy problems? Perhaps it makes sense to only invest in “general purpose” algorithms that work on the most challenging problems in predictive engineering.
To build predictive systems, we first have to understand where our raw data comes from, what it is, what it means, which parts are predictable from other parts, and how we can get more of it. The big (bitter?) lesson of the last two decades—though it was staring us in the face for decades before that—is that once you specify these parts of the problem, anything goes in building the function that solves your specified prediction problem. You can run REINFORCE gradients and do LoRA updates and have different normalizers and use triple u-neck autoencoders. I don’t know what any of that stuff is. The important part is to be able to run something that quickly gets your training error to zero and has good out-of-sample error.
Why would we prefer something other than peak out-of-sample performance?
I almost don’t want to try to answer this question, so I can get your hot takes first. But I know I should try to make a compelling case, even if I’m not sure I’ve convinced myself it’s right yet.
First, more complex codebases necessarily incorporate more inefficiencies, technical debt, and path dependence. Perhaps we don’t care about such issues, but leaner software should be cheaper to develop, maintain, and deploy. Unfortunately, it’s hard to articulate these requirements in key performance indicators. I don’t know how to build a leaderboard that selects for a codebase with no security holes that any competent software engineer can modify.
More importantly, simpler models mean wider access to methods. They mean you don’t need a cluster of thousands of GPUs to build predictive models. They mean taking power back from hyperscalers. As I’ve written before, one of the most important challenges in machine learning is building high-performing, open-corpus, open-source language models. Ironically, whereas industry argues that progress follows from massive capital expenditures on data centers, accelerating open-source development means looking to methods that outperform pure nihilistic scaling. We have to think about our design principles and how to evaluate our artifacts if our goal is developing simpler, more accessible machine learning models.
The Geneformer example is interesting: a transformer failing to beat a linear baseline can also be a sign that the problem is _too_ hard. You don’t need a transformer to do linear regression, but when you lack the data or grounding to solve a problem meaningfully, you perhaps don’t need a transformer either. I suspect the financial market prediction example falls into the “too hard” case as well. The two cases seem meaningfully distinct, even though they can look similar in the results.
I’d love to hear what you think of the rebuttal: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5346842