Doesn't seem like a new issue. Back in the 2000s, most of those fancy graphical model papers performed worse than nearest neighbor baselines and no one cared. Another great example I always quote is Viola & Jones (2001) getting much more fame than Rowley, Baluja, and Kanade (1998) just because the former used a fancy novel algorithm.
This is partly in jest, but the graphical model people were so self-righteous about their "principled approach" that one felt it would have been better to just leave them to bask in their right-mindedness.
This is an interesting provocation! Here are two takes on it that I don't think entirely duplicate what's been said so far:
In the specific case of high-profile applied deep learning papers, I read these papers as implicitly making a kind of historical claim: The Deep Learning Revolution, which led to major progress on problems like facial recognition, protein folding, and machine translation, could revolutionize [our field X] too, if we adopt the new techniques. And [paper] presents evidence that it does. For that to hold up, the new techniques have to actually outperform old ones. Any given paper author can of course say they aren't actually arguing this. But I think it's in the background of why people care so much about these papers in the first place.
In a more general sense, I think it is just genuinely interesting to know when you might "need" or even "want" a more complex model. Maybe in somewhat the same way that reverse mathematics is interesting, although the analogy isn't great because provable results are hard to come by.
Thanks for taking on the bitter lesson.
As I was reading this post, I felt that a better version of the bitter lesson is actually Breiman's "Statistical Modeling: The Two Cultures" and the follow-up (dare I say mea culpa?) to it almost two decades later, Efron's "Prediction, Estimation, and Attribution".
I've never read Efron's piece. Will check it out!
I cannot recommend that paper highly enough. The single most interesting thing I have read in more than a decade. It changed my thinking about "modern" machine learning.
I read the Breiman paper. It does make some excellent points about data models and their not-so-good explanatory power. The paper is nearly 25 years old, and it shows: he dismisses concerns about explanatory power in favor of predictive power. This was a failure to understand that scientists want explanations for phenomena. There has been a lot of effort in that regard with ANNs. He was enamoured with random forests, which are good predictors, especially compared with more brittle decision trees. But decision trees are easily interpreted, whilst random forests are not. Because he was not so much interested in explanations, he favored random forests over decision trees and other approaches that are more explanatory.
The way I read the Efron and Breiman papers is that, if you want to do science and find physical laws, then explanations (or, as Efron calls it, attribution) are a thing.
If prediction is the game you are playing, then demanding that the model have an explanation just holds you back.
I read the Efron paper. Interestingly, it included a microarray data example. I had some experience with this at a biotech company I worked for. The problem with microarray data is that it is very noisy, both from biological differences between organisms and from the technique itself. Two samples, one from each of two organisms treated the same way, can look very poorly correlated.
As with attribution, we were using support vector machines to reduce the number of genes by two orders of magnitude in order to design a small, simpler array to predict disease. As with the story in the paper, genes that looked important could be removed and still the SVM would provide low error rates.
More recently, I was trying to replicate a paper that used random forests on mass spec data. The authors managed an error rate of less than 10%. I couldn't get mine more than a few percentage points below 20%.
If one is building a tool, like a diagnostic test, then prediction is all one needs, and attribution of variables for explanation is not needed. However, unless the data replicates well, this can result in a model that is a poor predictor. Early microarray experiments on yeast with a different microarray type, plotted as a dendrogram, clearly showed that the yeast samples aggregated by date of sample collection and even by who was running the microarrays.
LLMs munge huge amounts of data together to produce the proverbial "lossy JPG" of the corpus. Unfortunately, LLMs don't make predictions, although different LLMs will produce different outputs depending on the questions asked from a particular corpus. LLMs might be a contemporary example of "garbage in, garbage out."
As the Efron paper suggests, prediction engines might only have short-term value, whilst scientific explanations tend to have long-term validity.
Hi Ben, I’m a late fan of your blogs, and as a PhD candidate in systems biology wrestling with foundation models (vision transformers, to be specific), I would like to give my two cents.
1. You said that “And if these neural models work on the hard problems, why do we care what happens on easy problems?” — however, we are not sure the models really work that well yet. From my field, I will give an example: in digital pathology, there’s the classical task of subtyping tumors based on their morphology. For prostate and lung carcinoma, the tumors can be so variable in morphology that even human pathologists don’t always agree in their annotations, and the performance metrics reported in these foundation model papers are not really that high compared to their CNN predecessors. You might even say the labels are noisy or the problem is a bit ill-defined.
2. You asked “Why would we prefer something other than peak out-of-sample performance?” Here, in systems biology, the traditional way was to use ordinary differential equations and stochastic modeling to describe biological systems. These models are interpretable and testable, in that experimental perturbations often have a direct connection to changes in model parameters, so we can test if the model really holds and, if so, use the model as a quantitative description of the complex biochemical mechanisms at work. Deep learning models published in biological journals, especially the off-the-shelf ones using the hottest architecture of the time, don’t yet give the same level of insight as the previous models, even though we might more efficiently make predictions (that does help our field too).
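To make that concrete, here is a minimal sketch of the kind of mechanistic model meant above: a two-state mRNA/protein ODE in which an experimental perturbation (say, a transcription inhibitor) maps directly onto a single parameter. The species, rate constants, and scenario are made up for illustration.

```python
# A toy mRNA/protein model: the "inhibitor" experiment changes only k_tx.
import numpy as np
from scipy.integrate import solve_ivp

def mrna_protein(t, state, k_tx, k_tl, d_m, d_p):
    m, p = state
    dm = k_tx - d_m * m          # transcription minus mRNA degradation
    dp = k_tl * m - d_p * p      # translation minus protein degradation
    return [dm, dp]

t_eval = np.linspace(0, 24, 100)        # hours
baseline = solve_ivp(mrna_protein, (0, 24), [0.0, 0.0], t_eval=t_eval,
                     args=(2.0, 1.5, 0.5, 0.1))
# Perturbation experiment: the only change is a lower transcription rate k_tx.
inhibited = solve_ivp(mrna_protein, (0, 24), [0.0, 0.0], t_eval=t_eval,
                      args=(0.4, 1.5, 0.5, 0.1))

print("protein at 24h, baseline :", baseline.y[1, -1])
print("protein at 24h, inhibited:", inhibited.y[1, -1])
```

The interpretability the comment describes is exactly this: the experimental intervention has a one-to-one correspondence with a model parameter, so the model makes a testable quantitative claim about the perturbed system.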
Figure 2 of https://arxiv.org/abs/2309.06979 shows decent text generation using a **linear** next token predictor.
Linear transformers (https://manifestai.com/blogposts/faster-after-all/) are behind, but I can't tell if they are behind because linear is bad, or because more effort is spent on tuning regular transformers.
It's unfortunately impossible to know whether linear is bad or just underexplored, but more exploration can't hurt!
Great post! This is related to one of our recommendations in a position paper we drafted about review standards in "AI for social impact" (broadly defined).
Would love to see more discussion of simple baselines in the research community. On the one hand, good performance of a simple baseline is great news for the problem/achieving impacts: much simpler to deploy, much more widely available, can be given away to the masses.
On the other hand, it's bad news for making a compelling case for the superiority of a particular (usually new) method. I suppose this is why it comes up so often in "dunking" on complex methods, in pushing back on a suggestion (implicit or explicit) that the complexity is necessary for good performance - it's a rhetorical defense against overclaiming/AI hype.
Another example is a recent intensive effort on predicting police misconduct: https://www.nber.org/papers/w32432
A lot of the gains from ML can be achieved by using simpler predictors, e.g., just the number of prior misconduct complaints.
We recommend that review norms need to change in order to normalize simple baselines. Reviewers are used to asking for strong baselines, e.g., "SOTA". They should also ask for simple baselines. But papers also shouldn't necessarily be penalized if simple baselines work well for a particular dataset -- again, this is great news for machine learning & achieving impact in the field! It's just one evidence point in a broader argument that the paper should make to justify the existence of a new method, whether it be via a more extensive evaluation on more/different datasets, arguing for intrinsic interestingness, etc.
Without changes in review norms, it seems likely that otherwise simple baselines will join the "file drawer". Good performance of baselines shouldn't be viewed as intrinsically negative results - they're great for ML and achieving impact on relevant tasks. Not so great for advertising new complex approaches.
Would you draw a distinction between a purely mechanical linear baseline (adding up some factors in a table, like your police misconduct example) and a statistical linear baseline (fitting logistic regression or some other machine learning model based on past cases)?
Good point! Yeah, I would. I think a purely mechanical linear baseline like in this case (i.e. without trying to fit coefficients) works and is more distinct because the goal is ranking/triage rather than predicting absolute magnitude of risk. (For absolute magnitude, might as well just take averages over historical cases!)
I think we do learn more about the domain by comparing to purely mechanical linear baselines.
It's not obvious before the fact that "common sense" is predictive enough -- I think there's procedural benefits to thoroughly establishing that this is the case by comparing to best-effort ML.
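To illustrate the distinction being drawn in this exchange, here is a toy comparison (all variable names and data are hypothetical, not from the NBER study) between a mechanical score that just counts prior complaints and a fitted logistic regression. Since the goal is ranking/triage, a rank metric like AUC is the natural point of comparison.

```python
# Mechanical baseline (no fitting) vs. fitted logistic regression on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
prior_complaints = rng.poisson(1.5, n)
years_on_force = rng.integers(1, 30, n)
# Hypothetical outcome: risk mostly driven by prior complaints.
logit = -2.0 + 0.6 * prior_complaints - 0.02 * years_on_force
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([prior_complaints, years_on_force])
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

mechanical_score = Xte[:, 0]                     # just the complaint count
fitted = LogisticRegression().fit(Xtr, ytr)      # statistical linear baseline
fitted_score = fitted.predict_proba(Xte)[:, 1]

print("AUC, mechanical count:", round(roc_auc_score(yte, mechanical_score), 3))
print("AUC, fitted logistic :", round(roc_auc_score(yte, fitted_score), 3))
```

When the rankings are nearly identical, the mechanical rule is the stronger "simple baseline" in the sense above: it needs no training data at all, only the decision to count.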
Four quick virtues of linear models for prediction you didn’t mention:
- the artifacts themselves are lighter weight. You can store a jillion per gigabyte and move them around easily.
- determinism of solution radically increases auditability, reproducibility for others with the same data, etc.
- fast training (and few hyperparameters) means your predictor will be done this afternoon rather than whenever you get your compute up and running and your hyperparameters tuned. Combined with determinism you will spend more time on task than if you had to wonder about whether your training run worked or was any good.
- you understand the span of the output space. If you care about orthogonality of predictions to some factor space (very common in finance), you know that orthogonal predictors will give orthogonal predictions.
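On the last point, here is a small numpy sketch (synthetic data, made-up dimensions) of why the guarantee holds: a linear prediction lies in the column span of the features, so features residualized against a factor space give predictions exactly orthogonal to that space, while a nonlinear transform of the same features carries no such guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 5000, 3, 10                        # samples, factors, features

F = rng.normal(size=(n, k))                   # factor returns
X_raw = F @ rng.normal(size=(k, p)) + rng.normal(size=(n, p))
# Residualize the features against the factor space.
X = X_raw - F @ np.linalg.lstsq(F, X_raw, rcond=None)[0]
# Target with a small nonlinear component, so the nonlinear fit has work to do.
y = X @ rng.normal(size=p) + 0.5 * X[:, 0] ** 2 + 0.1 * rng.normal(size=n)

# Linear predictor: prediction lies in span(X), which is orthogonal to F by
# construction, so F' * yhat is zero up to floating point.
yhat_lin = X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Nonlinear predictor on the same features (added squared terms): no such guarantee.
Z = np.hstack([X, X ** 2])
yhat_nl = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]

print("max |<factor, prediction>|, linear   :", np.abs(F.T @ yhat_lin).max())
print("max |<factor, prediction>|, nonlinear:", np.abs(F.T @ yhat_nl).max())
```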
I think there are at least two reasons to prefer simpler models when they offer comparable performance. The first is that if we convince ourselves that the only way to solve an interesting problem is to train for a month on thousands of GPUs, we are accepting and assuming that the only useful research on that problem can be done by GPU-rich organizations. As you say, this is essentially handing power to the hyperscalers. This is unavoidable if training the biggest model you can, on the most data you can find, on as many GPUs as you have is really the only way to solve the problem. But if there is a simpler way to solve the same problem using much less data and much less compute, the problem becomes much more accessible, and many more people without access to the same resources can work on it, which may potentially lead to a faster solution.
The other, and I think even more important, reason is that always preferring the bigger deep learning model may lead to progress in the short term but could actually slow scientific progress in the long term. Maybe this is an odd example, but it's the first one I think of. Consider Newton's theory of gravity. If the goal was merely to predict the motions of celestial objects and if Newton had had access to cloud compute, he could undoubtedly have done just as well by training a large transformer to predict where a planet would be at a given point in time. But the theory that Newton did come up with (on the basis of very limited data) turned out to have astonishing predictive power (even if it was later superseded by Einstein's theory) and revolutionized our understanding of the universe, in turn leading to further breakthroughs. Training a transformer might have led to accurate predictions but would not have led to this simple but surprisingly useful theory.
When we say that a problem can ONLY be solved using what are essentially black-box models trained on the most massive datasets we can find, we are essentially assuming it is too complicated for humans to build an accurate mental model of the system. To me this feels a little like how humans pre-scientific revolution assumed the universe was too complicated for mere mortals to understand. It may be that this is true for some problems we care about, but I think this is a dangerous assumption to make because it can actually slow progress by preventing us from developing a mental model that could have led to further advances. If a simpler model can make predictions that are nearly as accurate as the transformer, it may provide us with some insights that may be useful in understanding the system (instead of assuming that we just need more data and more GPUs).
With that said...if the black-box model provides much greater predictive power than the interpretable one...of course, it's what you should use. But the fact that things like Geneformer are popular even though they do not improve on simpler baselines is sort of astonishing.
The Geneformer example is interesting --- the underperformance of a linear baseline shows that the problem is _too_ hard. You don't need a transformer to do linear regression, but when you lack data or grounding to solve a problem meaningfully, you perhaps don't need a transformer either. I suspect the predicting financial markets example may fall into the "too hard" case as well. It seems that these cases are meaningfully distinct even as they can look similar?
I think there's something to this. Deep models shine when the conditional distribution Pr[y | x] is well-approximated by a delta function, i.e., when the prediction function is nearly deterministic. When Pr[y | x] is a mess, the interpolation capabilities of deep models are less valuable. I'm not sure this is true, but it matches my experience.
So to take this a step further, if I give you only a subset of the {x} variables, {x}_i, then Pr[y|{x}_i] could be much messier than the delta function we see with Pr[y|{x}], making deep models much less useful?
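One way to poke at this claim is a toy simulation (synthetic data, arbitrary architecture choices, not a proof): generate a nonlinear signal, flip labels with increasing probability so that Pr[y | x] gets messier, and watch the gap between a flexible model and a linear one shrink.

```python
# As label noise grows, the MLP's advantage over the linear model shrinks.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 4000, 10
X = rng.normal(size=(n, d))
clean = (np.sin(3 * X[:, 0]) + X[:, 1] ** 2 - 1 > 0).astype(int)  # nonlinear signal

for noise in [0.0, 0.2, 0.4]:
    flip = rng.random(n) < noise
    y = np.where(flip, 1 - clean, clean)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

    lin = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000,
                        random_state=0).fit(Xtr, ytr)
    print(f"noise={noise:.1f}  linear acc={lin.score(Xte, yte):.3f}  "
          f"mlp acc={mlp.score(Xte, yte):.3f}")
```

Dropping informative variables, as in the question above, acts much like raising the noise level: Pr[y | x_subset] becomes a mixture over the hidden variables, which is exactly the "messy" regime.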
Some of Cynthia Rudin’s recent work develops some interesting theory showing that the relative advantage of very complex models goes down exactly when a problem becomes harder because of irreducible error (https://arxiv.org/abs/2310.19726). I haven’t fully absorbed these papers yet, but I find them a very interesting and useful way to think about when we should expect simple models to be as good as one can get.
Hi Ben, another paper for your list of examples would be "Are Transformers Effective for Time Series Forecasting?" by Zeng et al. (2022). Ironically (or maybe tragically?), some of the follow-up works came up with even more complex models — PatchTST, iTransformer, just to name a couple — to beat the linear baseline.
p.s. It's my first comment, but I've been an avid reader of your blog for quite a while, and I've been very influenced by your work in my own research. Reading these notes is very refreshing in this AI insanity we're witnessing
Hmmm, can we take a more sociological view of your first paragraph?
1. "You are assigned a machine prediction task at work." Why were we assigned this machine prediction task? Were we assigned it because our superiors care about solving it well, or because our superiors care about writing a high-profile paper about it? The former argues for good engineering practice, the latter for conforming to what's currently publishable, which is bespoke transformers or other deep learning shticks.
2. "You train up a bespoke transformer, inventing a few new architectural modifications that get you a lower test error. You also fit a linear classifier to your data. Both models give the same prediction accuracy." To first order and second order I think this story is entirely a story about people who want to publish high-profile papers rather than people who are actually trying to solve a problem. Inside Google in the mid-2020's at least, people who care about solving a problem for business or even science reasons would be vanishingly unlikely to train up a bespoke transformer with problem-specific architectural modifications for any problem linear classifiers could solve. Instead, they'd (a) possibly add their training data and eval data to the overall data mix (advantages: opening up the possibility of "sharing strength" with the model that's already being trained on a gazillion other tasks, no engineering; disadvantages: long-time scale, political negotiations with owners of the data mixture), (b) possibly do a very small amount of fine-tuning, or (c) possibly just stuff their data into a transformer as a prompt consisting of input-output pairs and ask the transformer to solve it.
3. "Which should you use?" I dunno, depends. If we have to maintain and serve the model with code we write ourselves and we're starting from scratch, probably a linear model. If our superiors already pointed us at a deployed transformer stack such that we can "solve the problem" by pushing a new text prompt into a config file, then probably the pretrained transformer from (2c) above. (There was no bespoke transformer.) Even if we have to fine-tune a model and get a bunch of computers to solve it, we're unfortunately deep into the "nobody ever gets fired for using transformers" era.
I agree and like your more realistic hypothetical in 3.
I certainly want to know if a simpler model can accomplish a task that I care about as well as a more complex model. (As you point out, the “world at large” has many different motivations).
Our models have inductive biases, even if we might not be able to characterise them. We learn something about a task when we learn that certain kinds of models can accomplish that task with certain training. It’s not always very insightful, but it’s often all we have …
One practical argument in favor of using a transformer: a bunch of organizations are spending like a trillion dollars to optimize the performance of these models. You might expect your transformer-based solution to gain faster in performance/ease-of-deployment/etc. relative to the linear baseline. Following industry trends can be rational!
"People feel like [linear models] are easier to interpret (I’m not one of those people)." I'd be interested in some explanation of this. Linear models have coefficents which can be directly interpreted as the effects of explanatory variables. What is the corresponding interpretation of highly nonlinear models with hidden layers.
I'd love to hear what you think of their rebuttal:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5346842
That opening quote is laughable. I am going to write an angry email to Misha Belkin.
“It is an empirical fact that more is better in modern machine learning." Except when it isn't, I guess.
Well, it depends on what "more" referred to. Maybe "more simplicity" also counts. :-)
Interesting piece as always. Quick question for you. Why do you think linear models are not interpretable? Is it because of this somewhat oversimplified analysis of the log odds and the effect of any one weight on the model outcome?
We’ve been observing a somewhat similar effect in a different context, predicting power grid time series load data, where we find that linear models are just as good at load forecasting (24-hour periods) as our LSTM models. I still don’t fully understand why, but I conjecture it’s because of the predictable 12- and 24-hour cycles of the data.
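If it helps, here is a minimal sketch of that conjecture on synthetic data (the series, its harmonics, and the noise level are all made up): when load is dominated by 12- and 24-hour cycles, a linear model on harmonic features plus a daily lag recovers nearly all of the predictable structure, leaving little for an LSTM to improve on.

```python
# Linear forecast of a synthetic hourly load series driven by daily cycles.
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 365)
load = (50
        + 10 * np.sin(2 * np.pi * hours / 24)
        + 4 * np.sin(2 * np.pi * hours / 12 + 1.0)
        + rng.normal(scale=2.0, size=hours.size))

# Features: daily and half-daily harmonics plus the load 24 hours earlier.
X = np.column_stack([
    np.sin(2 * np.pi * hours / 24), np.cos(2 * np.pi * hours / 24),
    np.sin(2 * np.pi * hours / 12), np.cos(2 * np.pi * hours / 12),
    np.roll(load, 24),                      # naive seasonal lag
    np.ones_like(hours, dtype=float),
])[24:]                                      # drop the wrapped-around lag rows
y = load[24:]

split = len(y) * 3 // 4
beta, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
pred = X[split:] @ beta
rmse = np.sqrt(np.mean((pred - y[split:]) ** 2))
print(f"test RMSE: {rmse:.2f}  (irreducible noise std was 2.0)")
```

When the test RMSE sits at the noise floor, the remaining error is irreducible, which would explain why a more flexible model buys nothing here.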
I wonder if this is the same phenomenon that Jacob describes in his comment?
Is it that it's very easy to forecast load or is it that forecasting load is too hard for any model?