9 Comments
Maxim Raginsky

If you want, you can even drill down to the transistor level to marvel at all the layers of physical nonlinearity that have to be composed before you get to the abstraction of digital logic, which you can wrap in further layers of abstraction to get to Python code that you use to implement your linear predictor.

Ben Recht

100%. I mean, dude, what is a pixel?

Maxim Raginsky

Pixels are Alyosha’s thing.

Michael A. Alcorn

You might like this blog post of mine from 2017: "Are Linear Models *Actually* 'Easily Interpretable'?" --> https://www.linkedin.com/pulse/linear-models-actually-easily-interpretable-michael-a-alcorn/. I focused specifically on how tempting it is for people to interpret linear models causally even when it's completely unjustified.

Ben Recht

Yes. People have been interpreting linear models causally ever since Yule first applied regression to the social sciences!

You'll dig this paper by David Freedman on this topic: https://projecteuclid.org/journals/statistical-science/volume-14/issue-3/From-association-to-causation--some-remarks-on-the-history/10.1214/ss/1009212409.full

Misha Belkin

I think the basic issue here is that it is unclear what linearity means. We have a process that produces some complex embedding (possibly infinite-dimensional, as is often the case with kernels), and then we apply a linear predictor to its output. While linear functions are nice and all, the “interpretability” or “simplicity” fundamentally depends on the details of that embedding. Perhaps one can argue that the linearity really is an illusion with respect to the underlying process (but real with respect to the algorithm).
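A toy numpy sketch of what I mean (a made-up random Fourier feature embedding standing in for the kernel map, with arbitrary sizes): the predictor is literally linear in the features, but there is nothing linear about it as a function of the input.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)[:, None]   # 1-D inputs
y = np.sin(2 * x).ravel()              # a plainly nonlinear target

# random Fourier feature embedding (a finite stand-in for an RBF kernel map)
D = 300
W = rng.normal(scale=2.0, size=(1, D))
b = rng.uniform(0, 2 * np.pi, size=D)

def phi(z):
    return np.sqrt(2.0 / D) * np.cos(z @ W + b)

# ridge regression: the predictor f(x) = phi(x) @ theta is linear in the features
Phi = phi(x)
theta = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(D), Phi.T @ y)

def f(z):
    return phi(z) @ theta

print(np.abs(f(x) - y).max())   # should be small: the "linear" predictor fits sin just fine
print(f(np.array([[2.0]])), 2 * f(np.array([[1.0]])))   # f(2x) != 2 f(x): not linear in x
```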

Furthermore, even in the most physically grounded cases, when the features have actual physical meanings, taking arbitrary linear combinations of, say, mass, acceleration, and density is quite strange from the physics point of view because of unit considerations.

On the other hand, there is the fundamental reality that most processes can be locally approximated by linear functions, at least when measured in certain “correct” ways. Some of the confusion may come from the fact that this type of linearity (local smoothness) is quite different from the linearity of “linear” predictors.
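(Toy illustration of that distinction, with an arbitrary smooth function: the first-order Taylor expansion is the “local” kind of linearity, and it is only trustworthy near the expansion point, which is a very different property from a predictor being linear in some fixed features.)

```python
import numpy as np

def g(x):
    return np.sin(x)            # a smooth nonlinear "process"

x0 = 0.5
slope = np.cos(x0)              # derivative of sin at x0

def g_local(x):
    return g(x0) + slope * (x - x0)   # first-order Taylor expansion around x0

for dx in (0.01, 0.1, 1.0):
    print(dx, abs(g(x0 + dx) - g_local(x0 + dx)))   # error grows roughly like dx**2
```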

Hostile Replicator

Interesting how much discussion there still is around “simplicity” and “interpretability”. Zach Lipton made some similar points nearly 10 years ago, and I’m sure he wasn’t the first.

Great series. I would love to hear more about the 2nd-place 2012 ILSVRC entry; I wasn't following CV at the time, and it was all about CNNs by the time I started in that area!

Michael Craig

Great read. I've always found it difficult to interpret learned linear models.

Although I would say linear models are more explainable when used for control: assuming certain constraints on the feature transforms, the gradients of your output with respect to the control variables are fixed.

Conversely, with NNs the gradients vary and often change sign in ways I can't explain, so I can't confidently stand behind using such models for MPC unless they are far more accurate. Even then, I could imagine situations where the predictable linear model would be better, if you wanted to do inverse control/RL or something.
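A minimal numpy sketch of that distinction (made-up sizes, and a tanh net standing in for whatever NN you would actually fit): the linear model's gradient with respect to the inputs is one fixed vector, while the net's gradient depends on the operating point and its entries can change sign.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 32
w_lin = rng.standard_normal(D)             # linear model: f(x) = w_lin @ x
W1 = rng.standard_normal((H, D))           # one-hidden-layer tanh net:
w2 = rng.standard_normal(H)                #   f(x) = w2 @ tanh(W1 @ x)

def grad_linear(x):
    return w_lin                           # constant: independent of x

def grad_net(x):
    h = np.tanh(W1 @ x)
    return W1.T @ (w2 * (1.0 - h ** 2))    # depends on the operating point x

x_a, x_b = rng.standard_normal(D), rng.standard_normal(D)
print(grad_linear(x_a), grad_linear(x_b))  # identical at both operating points
print(grad_net(x_a), grad_net(x_b))        # different, and entries may flip sign
```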

James Golden

This is well known, but I love the Foerster post on forcing deep linear networks to exploit floating-point nonlinearities (trained with an evolutionary algorithm devised for this purpose, not standard SGD or similar). It boosts test-set performance on MNIST to 96%, while the same network with weights at a normal scale, which is actually linear, only reaches 92%.

https://openai.com/index/nonlinear-computation-in-deep-linear-networks/
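A much cruder toy version of the underlying point (this is not the post's construction, which pushes activations toward the float32 underflow regime and trains with an evolution strategy): a map that is algebraically the identity, hence linear, stops behaving linearly once float rounding enters.

```python
import numpy as np

c = np.float32(1.0)

def f(x):
    # algebraically this is the identity map, so it "should" be linear
    return (np.float32(x) + c) - c

print(f(np.float32(1e-8)))                     # 0.0: the small input is absorbed by rounding
print(np.float32(1e4) * f(np.float32(1e-8)))   # 0.0
print(f(np.float32(1e-4)))                     # ~1e-4, so homogeneity f(a*x) = a*f(x) fails
```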

Also well known, but I really like the Simoncelli group using the numerical Jacobian to exactly reconstruct, as a linear map, the output of a ReLU convolutional network for image diffusion, and interpreting that Jacobian as a set of adaptive linear image filters.

https://arxiv.org/abs/2310.02557
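A toy dense version of that property (my own made-up sizes, not their convolutional architecture): for a bias-free ReLU network, the Jacobian at x reconstructs the output exactly, f(x) = J(x) x.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16
W1 = rng.standard_normal((H, D))
W2 = rng.standard_normal((1, H))

def f(x):
    # bias-free ReLU network: piecewise linear in x, with no constant offset
    return W2 @ np.maximum(W1 @ x, 0.0)

def jacobian(x):
    # on the activation pattern at x, f is exactly the linear map W2 diag(m) W1
    m = (W1 @ x > 0).astype(float)
    return W2 @ (m[:, None] * W1)

x = rng.standard_normal(D)
J = jacobian(x)
print(np.allclose(f(x), J @ x))   # True: the Jacobian alone reconstructs the output
```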

Linear can be nonlinear, and nonlinear can be locally linear?
