Selecting for complexity
Do machine learning researchers actually care about simple baselines?
Here’s a question that puzzles me: You are assigned a machine prediction task at work. You build a representative train and test set. You train up a bespoke transformer, inventing a few new architectural modifications that get you a lower test error. You also fit a linear classifier to your data. Both models give the same prediction accuracy. Which should you use?
I’m not sure the answer is so straightforward. Sometimes, management demands fancy stuff. Maybe you could convince yourself that the transformer is more general and might be useful for future downstream tasks. Everyone knows transformers are general-purpose models that leverage computation to scale with data, right?
There are, of course, advantages to linear models. It’s easy to explain the implementation details of linear models. The code to build and deploy the models is usually pretty concise and easy to read. Many different packages support training linear models. Linear models typically don’t require a ton of computation to train. They are easy to debug. They are easy to maintain within a codebase. People feel like they are easier to interpret (I’m not one of those people).
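To make the “concise and easy to read” point concrete, here’s a minimal sketch of the linear side of the thought experiment above, using scikit-learn. The dataset is a synthetic placeholder standing in for whatever your task actually provides.

```python
# A minimal sketch of the linear baseline from the thought experiment above.
# The data is synthetic; swap in your own representative train/test split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for whatever your prediction task provides.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The entire model-development step for the linear classifier.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"linear baseline test accuracy: {clf.score(X_test, y_test):.3f}")
```

That’s more or less the whole pipeline. Whatever the bespoke transformer buys you has to justify everything it adds on top of these dozen lines.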
It seems like for most people, linear models are preferable if they achieve comparable prediction performance on the associated machine learning task. I can’t run polls on Substack, so I’m basing my assessment on the vibes of my social network. You can tell me otherwise in the comments. I still see plenty of dunk papers showing that complex machine learning models are outperformed by linear prediction. For example, last week I stumbled across this paper showing that “foundation” models for predicting the effects of genetic modifications did not outperform linear baselines. In this paper, the linear baseline was a simple, hand-coded rule, not even trained on data. This heuristic outperformed a transformer model, previously featured in Nature, that had been pretrained for three days on twelve GPUs over 30 million human single-cell transcriptomes.
Here’s another example. Last month, I wrote about a hedge fund claiming to be using cutting-edge AI. It turned out they were just using random Fourier features. But they were marketing this as “heavyweight machine learning” that signaled the future of advanced computational finance. They proudly claimed that more complexity in prediction models led to better investment returns. Of course, it turns out that linear prediction outperforms their heavyweight machine learning.
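For anyone who hasn’t run into them, random Fourier features are an old trick for approximating a kernel machine: push the inputs through random sinusoids, then fit an ordinary linear model on top. Here’s a minimal sketch on synthetic data (not the fund’s, obviously); the toy target and parameters are just placeholders.

```python
# Random Fourier features: approximate an RBF kernel machine by fitting a
# plain linear model on random cosine features of the inputs.
# All data below is synthetic and only illustrates the construction.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d, D, gamma = 2000, 10, 500, 0.5   # samples, input dim, feature count, RBF bandwidth

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # toy nonlinear target

# z(x) = sqrt(2/D) * cos(Wx + b), with W ~ N(0, 2*gamma*I), approximates the RBF kernel.
W = np.sqrt(2 * gamma) * rng.standard_normal((d, D))
b = rng.uniform(0, 2 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)

X_tr, X_te, Z_tr, Z_te, y_tr, y_te = train_test_split(
    X, Z, y, test_size=0.25, random_state=0
)

# "Heavyweight machine learning": ridge regression on the random features...
print("RFF ridge R^2:   ", Ridge(alpha=1.0).fit(Z_tr, y_tr).score(Z_te, y_te))
# ...versus the boring baseline: ridge regression on the raw inputs.
print("linear ridge R^2:", Ridge(alpha=1.0).fit(X_tr, y_tr).score(X_te, y_te))
```

A toy example like this says nothing about what happens on financial returns. The point is only that the “heavyweight machine learning” is itself a linear fit on a fixed random feature map.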
These are just two examples I’ve come across recently. Send me more examples if you’ve seen them. I know they’re out there.
The question remains, however: do we care that linear baselines are often all you need? There is certainly social signaling that we don’t care that much. The Geneformer paper was in Nature. The linear baselines paper was in Nature Methods. The random feature finance paper has been all over the financial press. I haven’t seen anyone mention Buncic’s SSRN rebuttal demonstrating the value of linear baselines.
I get why we are enthralled with the big neural models. Whether it's AlexNet, AlphaGo, AlphaFold, or ChatGPT, the most impressive results in machine learning have required scaling neural models. And if these neural models work on the hard problems, why do we care what happens on easy problems? Perhaps it makes sense to only invest in “general purpose” algorithms that work on the most challenging problems in predictive engineering.
To build predictive systems, we first have to understand where our raw data comes from, what it is, what it means, which parts are predictable from other parts, and how we can get more of it. The big (bitter?) lesson of the last two decades—though it was staring us in the face for decades before that—is that once you specify these parts of the problem, anything goes in building the function that solves your specified prediction problem. You can run REINFORCE gradients and do LoRA updates and have different normalizers and use triple u-neck autoencoders. I don’t know what any of that stuff is. The important part is to be able to run something that quickly gets your training error to zero and has good out-of-sample error.
Why would we prefer something other than peak out-of-sample performance?
I almost don’t want to try to answer this question, so I can get your hot takes first. But I know I should try to make a compelling case, even if I’m not sure I’ve convinced myself it’s right yet.
First, more complex codebases necessarily incorporate more inefficiencies, technical debt, and path dependence. Perhaps we don’t care about such issues, but leaner software should be cheaper to develop, maintain, and deploy. Unfortunately, it’s hard to articulate these requirements in key performance indicators. I don’t know how to build a leaderboard that selects for a codebase with no security holes that any competent software engineer can modify.
More importantly, simpler models mean wider access to methods. They mean you don’t need a cluster of thousands of GPUs to build predictive models. They mean taking power back from hyperscalers. As I’ve written before, one of the most important challenges in machine learning is building high-performing, open-corpus, open-source language models. Ironically, whereas industry argues that progress follows from massive capital expenditures on data centers, accelerating open-source development means looking to methods that outperform pure nihilistic scaling. We have to think about our design principles and how to evaluate our artifacts if our goal is developing simpler, more accessible machine learning models.
The Geneformer example is interesting: a transformer failing to beat a linear baseline can also be a sign that the problem is _too_ hard. You don’t need a transformer to do linear regression, but when you lack the data or grounding to solve a problem meaningfully, you perhaps don’t need a transformer either. I suspect the financial market prediction example falls into the “too hard” case as well. The two cases seem meaningfully distinct, even though they can look similar in the results.
I’d love to hear what you think of the rebuttal: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5346842