7 Comments
Jacob N Oppenheim:

The Geneformer example is interesting: the underperformance relative to a linear baseline suggests that the problem is _too_ hard. You don't need a transformer to do linear regression, but when you lack the data or grounding to solve a problem meaningfully, you perhaps don't need a transformer either. I suspect the financial-market prediction example falls into the "too hard" case as well. These cases seem meaningfully distinct even though they can look similar?

Ben Recht:

I think there's something to this. Deep models shine when Pr[y | x], the conditional distribution of the label given the input, is well approximated by a delta function. When Pr[y | x] is a mess, the interpolation capabilities of deep models are less valuable. I'm not sure this is true, but it matches my experience.
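
A toy version of this claim (my sketch, not anything rigorous): fit a linear model and a small MLP on a nonlinear target, once with no label noise, where Pr[y | x] is a delta function, and once with heavy label noise, where it's a mess. The target f, the model sizes, and the noise scale below are arbitrary choices for illustration.

```python
# Sketch: deep vs linear models as Pr[y|x] goes from a delta function
# (noiseless labels) to a mess (heavy label noise). Illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, d = 2000, 5
X_train, X_test = rng.normal(size=(n, d)), rng.normal(size=(n, d))

def f(X):
    # A nonlinear target that a linear model cannot represent.
    return np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2]

for noise in [0.0, 2.0]:  # delta-function vs messy Pr[y|x]
    y_train = f(X_train) + noise * rng.normal(size=n)
    y_test = f(X_test)  # score both models against the noiseless signal
    lin = LinearRegression().fit(X_train, y_train)
    mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                       random_state=0).fit(X_train, y_train)
    print(f"noise={noise}: linear R2={lin.score(X_test, y_test):.2f}, "
          f"MLP R2={mlp.score(X_test, y_test):.2f}")
```

In the noiseless case the MLP's interpolation buys a lot over the linear fit; once the conditional is noisy, the gap should mostly close.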

Jacob N Oppenheim:

So to take this a step further: if I give you only a subset {x}_i of the {x} variables, then Pr[y | {x}_i] could be much messier than the delta function we see with Pr[y | {x}], making deep models much less useful?
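
A minimal concrete case (my toy example): let y = x1 XOR x2 with fair coin flips. Given both features, Pr[y | x1, x2] is a point mass; given x1 alone, Pr[y | x1] is uniform, as messy as it gets.

```python
# Sketch: dropping features turns a delta-function conditional into a
# maximally messy one. Toy example: y = x1 XOR x2 over fair coin flips.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 100_000)
x2 = rng.integers(0, 2, 100_000)
y = x1 ^ x2

# Conditioned on both features, y is deterministic:
print(y[(x1 == 1) & (x2 == 0)].mean())  # -> 1.0 exactly
# Conditioned on x1 alone, y is a coin flip:
print(y[x1 == 1].mean())  # -> ~0.5; Pr[y | x1] is uniform
```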

JS:

I'd love to hear what you think of their rebuttal:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5346842

Ben Recht:

That opening quote is laughable. I am going to write an angry email to Misha Belkin.

“It is an empirical fact that more is better in modern machine learning.” Except when it isn't, I guess.

Lalitha Sankar:

Interesting piece as always. Quick question for you: why do you think linear models are not interpretable? Is it because of this somewhat oversimplified analysis of the log odds and the effect of any one weight on the model outcome?

We’ve been observing a somewhat similar effect in a different context, predicting power grid time-series load data, where we find that linear models are just as good at load forecasting (24-hour periods) as our LSTM models. I still don’t fully understand why, but I conjecture it’s because of the predictable 12- and 24-hour cycles in the data.
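
For what it's worth, those cycles are easy for a linear model to capture directly with harmonic and lag features. A rough sketch of that kind of baseline on synthetic data (the series and the feature choices are made up for illustration, not our actual setup):

```python
# Sketch: a linear baseline with harmonic + lag features for load data
# with 12h and 24h cycles. Synthetic series; illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours = np.arange(24 * 365)
# Synthetic hourly load: 24h and 12h cycles plus noise.
load = (10 + 3 * np.sin(2 * np.pi * hours / 24)
        + 1 * np.sin(2 * np.pi * hours / 12)
        + 0.5 * rng.normal(size=hours.size))

def features(t):
    # Harmonics at the known 24h and 12h periods, plus a day-ago lag.
    return np.column_stack([
        np.sin(2 * np.pi * t / 24), np.cos(2 * np.pi * t / 24),
        np.sin(2 * np.pi * t / 12), np.cos(2 * np.pi * t / 12),
        load[t - 24],  # load at the same hour yesterday
    ])

t = hours[24:]  # skip the first day so the 24h lag exists
split = len(t) * 3 // 4
model = LinearRegression().fit(features(t[:split]), load[t[:split]])
print("test R2:", model.score(features(t[split:]), load[t[split:]]))
```

When the dominant structure is a fixed set of known periods, an LSTM has little left to learn beyond what these few features already encode.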

Ben Recht:

I wonder if this is the same phenomenon that Jacob describes in his comment?

Is it that load is very easy to forecast, or is it that forecasting load is too hard for any model?
