The Geneformer example is interesting: the failure to beat a linear baseline shows that the problem is _too_ hard. You don't need a transformer to do linear regression, but when you lack the data or grounding to solve a problem meaningfully, you perhaps don't need a transformer either. I suspect the financial-markets prediction example may fall into the "too hard" case as well. It seems that these cases are meaningfully distinct even as they can look similar?
I think there's something to this. Deep models shine when Pr[y | x], the conditional distribution of the label, is well-approximated by a delta function. When Pr[y | x] is a mess, the interpolation capabilities of deep models are less valuable. I'm not sure this is true, but it matches my experience.
So to take this a step further, if I give you only a subset of the {x} variables, {x}_i, then Pr[y | {x}_i] could be much messier than the delta function we see with Pr[y | {x}], making deep models much less useful?
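To make that intuition concrete, here's a minimal sketch (my own synthetic construction, assuming scikit-learn is available; none of it comes from the post): when y is a deterministic nonlinear function of all the features, a small MLP clearly beats linear regression, but once an informative feature is hidden, Pr[y | {x}_i] becomes diffuse and the gap largely disappears.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5000, 2))

# y is a deterministic nonlinear function of both features, so Pr[y | x]
# is (numerically) a delta function.
y = np.sin(3 * X[:, 0]) * X[:, 1]

def test_r2(features, target):
    """Held-out R^2 for a linear model and a small MLP on the same split."""
    X_tr, X_te, y_tr, y_te = train_test_split(features, target, random_state=0)
    lin = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
    mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                       random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    return round(lin, 3), round(mlp, 3)

# With all of {x}, the MLP should win by a wide margin.
print("all of {x}:   ", test_r2(X, y))

# Hide the second feature: Pr[y | {x}_i] is now a wide, noisy distribution,
# and neither model has much signal to interpolate.
print("subset {x}_i: ", test_r2(X[:, :1], y))
```

(The exact numbers aren't the point; the point is that the linear/MLP gap is large in the first case and roughly vanishes in the second.)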
I'd love to hear what you think of their rebuttal:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5346842
That opening quote is laughable. I am going to write an angry email to Misha Belkin.
“It is an empirical fact that more is better in modern machine learning." Except when it isn't, I guess.
Interesting piece as always. Quick question for you: why do you think linear models are not interpretable? Is it because of this somewhat oversimplified analysis of the log odds and the effect of any one weight on the model outcome?
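For what it's worth, here is the textbook reading I take that question to be poking at, written as a tiny illustrative script (my own hypothetical example, not from the post): in logistic regression, adding one unit to feature j while holding everything else fixed changes the log odds by exactly the weight w_j.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = (X @ np.array([1.5, -0.5, 0.0]) + rng.logistic(size=2000) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Take one example and bump feature 0 by a single unit, all else fixed.
x_before = X[:1].copy()
x_after = x_before.copy()
x_after[0, 0] += 1.0

log_odds = lambda p: np.log(p / (1 - p))
p_before = clf.predict_proba(x_before)[0, 1]
p_after = clf.predict_proba(x_after)[0, 1]

# The shift in log odds equals the fitted weight on feature 0.
print(log_odds(p_after) - log_odds(p_before), "vs", clf.coef_[0, 0])
```

That identity is exact, which is the usual basis for calling linear models interpretable; whether it is too simple a notion of interpretability is, I take it, what the comment is asking.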
We’ve been observing a somewhat similar effect in a different context: predicting power grid time-series load data, where we find that linear models are just as good at load forecasting (24-hour horizons) as our LSTM models. I still don’t fully understand why, but I conjecture it’s because of the predictable 12- and 24-hour cycles in the data.
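As a rough sketch of that conjecture (entirely synthetic and my own, assuming hourly load in a NumPy array and scikit-learn; this is not the commenter's actual pipeline): if the predictable structure really is the 12- and 24-hour cycle, then sine/cosine features at those periods plus the current load already hand a linear model most of what an LSTM could extract.

```python
import numpy as np
from sklearn.linear_model import Ridge

def seasonal_features(hours):
    """Fourier features for the daily (24 h) and half-daily (12 h) cycles."""
    return np.column_stack([
        np.sin(2 * np.pi * hours / 24), np.cos(2 * np.pi * hours / 24),
        np.sin(2 * np.pi * hours / 12), np.cos(2 * np.pi * hours / 12),
    ])

def make_dataset(load, horizon=24):
    """Predict load `horizon` hours ahead from seasonal features plus current load."""
    hours = np.arange(len(load))
    X = np.column_stack([seasonal_features(hours), load])
    return X[:-horizon], load[horizon:]

# Synthetic hourly load with 24 h and 12 h cycles plus noise, standing in for real data.
t = np.arange(24 * 365)
load = (100 + 20 * np.sin(2 * np.pi * t / 24)
        + 5 * np.sin(2 * np.pi * t / 12)
        + np.random.default_rng(0).normal(0, 2, size=t.size))

X, y = make_dataset(load)
split = int(0.8 * len(X))
model = Ridge().fit(X[:split], y[:split])
print("24-hour-ahead test R^2:", model.score(X[split:], y[split:]))
```

If real load is mostly these cycles plus noise, an LSTM has little extra to learn, which would explain the parity.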
I wonder if this is the same phenomenon that Jacob describes in his comment?
Is it that it's very easy to forecast load, or is it that forecasting load is too hard for any model?