12 Comments
Maxim Raginsky:

If you want, you can even drill down to the transistor level to marvel at all the layers of physical nonlinearity that have to be composed before you get to the abstraction of digital logic, which you can wrap in further layers of abstraction to get to Python code that you use to implement your linear predictor.

Ben Recht:

100%. I mean, dude, what is a pixel?

Maxim Raginsky:

Pixels are Alyosha’s thing.

Michael A. Alcorn:

You might like this blog post of mine from 2017: "Are Linear Models *Actually* 'Easily Interpretable'?" --> https://www.linkedin.com/pulse/linear-models-actually-easily-interpretable-michael-a-alcorn/. I focused specifically on how tempting it is for people to interpret linear models causally even when it's completely unjustified.
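To make that concrete, here's a minimal sketch (synthetic data, all numbers made up) of the classic failure mode: a confounder z drives both the feature x and the outcome y, and the fitted coefficient on x flips sign when z is omitted, even though x's true causal effect is positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                       # unobserved confounder
x = z + 0.1 * rng.normal(size=n)             # x is mostly a noisy copy of z
y = 1.0 * x - 3.0 * z + rng.normal(size=n)   # x's true causal effect is +1

# Regress y on x alone: the coefficient absorbs z's effect and goes negative.
b_short = np.linalg.lstsq(np.c_[x, np.ones(n)], y, rcond=None)[0][0]
# Regress y on x and z: the coefficient on x recovers roughly +1.
b_long = np.linalg.lstsq(np.c_[x, z, np.ones(n)], y, rcond=None)[0][0]
print(f"without z: {b_short:+.2f}, with z: {b_long:+.2f}")
```

Reading "x has a negative effect" off the short regression would be exactly the unjustified causal interpretation.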

Ben Recht:

Yes. People have been interpreting linear models causally since Yule first applied regression to the social sciences!

You'll dig this paper by David Freedman on this topic: https://projecteuclid.org/journals/statistical-science/volume-14/issue-3/From-association-to-causation--some-remarks-on-the-history/10.1214/ss/1009212409.full

Daniel Maturana:

Well put. Deep learning happened to explode in computer vision about a year after I started my PhD, so (unlike some youngins) I'm pretty familiar with "classical" methods. But these days, if a task needs more than one step of classical computer vision techniques, I often put down OpenCV and reach for PyTorch instead, because these OpenCV chains usually devolve into what we used to call the "hierarchical bag of hacks" approach. I'll often get better results from a neural network fine-tuned on a few dozen hand-labeled examples, and overall it'll take less time and energy than faffing around with a bunch of non-learnable parameters.

Misha Belkin:

I think the basic issue here is that it is unclear what linearity means. We have a process that produces some complex embedding (possibly infinite-dimensional, as is often the case with kernels), and then we apply a linear predictor to the output. While linear functions are nice and all, the "interpretability" or "simplicity" fundamentally depends on the details of that embedding. Perhaps one can argue that linearity really is an illusion with respect to the underlying process (but real with respect to the algorithm).
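As a toy illustration of that first point (a hypothetical 1-d kernel ridge example): the predictor below is linear in its coefficients, but through the RBF embedding it is thoroughly nonlinear in the input.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=40)
y = np.sin(2 * X) + 0.1 * rng.normal(size=40)

def rbf(a, b, bandwidth=0.5):
    # Gram matrix of the RBF kernel, i.e. inner products in an
    # infinite-dimensional feature space
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * bandwidth**2))

K = rbf(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)  # kernel ridge fit

f = lambda t: rbf(np.atleast_1d(t), X) @ alpha  # linear in alpha, not in t
print(f(0.3), np.sin(0.6))  # the "linear" model happily fits a sine
```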

Furthermore, even in the most physically grounded cases, when features have actual physical meanings, taking arbitrary linear combinations of, say, mass, acceleration, and density is quite strange from the physics point of view because of unit considerations.

On the other hand, there is the fundamental reality that most processes can be locally approximated by linear functions, at least when measured in certain “correct” ways. It seems that there is some confusion there because that type of linearity (local smoothness) seems quite different from the linearity of “linear” predictors.
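The local story can be made just as concrete. Here's a minimal sketch (toy function, forward differences) of that other sense of "linear": any smooth map is approximated near a point by its Jacobian, f(x0 + d) ≈ f(x0) + J(x0) d, which is linear in the perturbation d rather than in the weights of a predictor.

```python
import numpy as np

f = lambda x: np.array([np.sin(x[0]) * x[1], x[0] ** 2 + np.exp(x[1])])

def num_jacobian(f, x0, eps=1e-6):
    # forward-difference Jacobian, one column per input coordinate
    fx = f(x0)
    cols = []
    for i in range(len(x0)):
        e = np.zeros_like(x0)
        e[i] = eps
        cols.append((f(x0 + e) - fx) / eps)
    return np.stack(cols, axis=1)

x0 = np.array([0.5, 1.0])
d = np.array([1e-3, -2e-3])
J = num_jacobian(f, x0)
print(f(x0 + d))        # the true value
print(f(x0) + J @ d)    # the local linear approximation, nearly identical
```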

Chris Peterson:

I loved this framing around feature engineering! At least for recommendation systems, some complicated transformer-based models offer appealing simplifications to feature engineering. Instead of computing and serving a bunch of counts around previous interactions by author or genre, you just send the model each user's recent interaction history.

A bit much, but: https://arxiv.org/abs/2402.17152
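Roughly, the contrast looks like this (a hedged sketch; every feature name and hyperparameter here is hypothetical, and the paper's actual architecture differs):

```python
import torch
import torch.nn as nn

# Classical route: compute, store, and serve one aggregate count per signal.
user_features = {
    "n_clicks_author_42_7d": 3,
    "n_plays_genre_jazz_30d": 11,
    # ... one pipeline per count, each with its own freshness and serving logic
}

# Sequence route: just hand the model the user's recent item IDs, in order.
recent_items = torch.tensor([[5, 812, 42, 42, 7, 310]])  # (batch, seq_len)

embed = nn.Embedding(num_embeddings=1000, embedding_dim=32)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    num_layers=2,
)
user_vector = encoder(embed(recent_items)).mean(dim=1)  # learned "features"
print(user_vector.shape)  # torch.Size([1, 32])
```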

Allen Schmaltz:

In machine learning, for models with non-identifiable parameters, the unified language and foundation we use is Similarity-Distance-Magnitude (SDM) partitioning and the effective sample size therein, as with SDM estimators and networks. From these we additionally obtain interpretability-by-exemplar, and in that context the single objective we care about is maximizing the proportion of admitted points at the desired probability threshold.

Mark Johnson:

I think you’re dancing around important topics here. It would be great to have a theory that could explain which kinds of phenomena can be captured by which kinds of models. But as far as I know, the best we have is “try it and see”.

At least in NLP, it really does seem that multi-layer models are better at capturing linguistic generalisations than the linear models over products of features that you mention, even if both become non-parametric as the number of features (e.g., n-grams for larger n) grows.

A few months ago you were discussing stuff like bias-variance trade-offs and double descent. Convexity is perhaps something you could add to that list. We used linear models because we worried about non-convexity, but these days that just seems quaint.

Hostile Replicator:

Interesting how much discussion there still is around “simplicity” and “interpretability”. Zach Lipton made some similar points nearly 10 years ago, and I’m sure he wasn’t the first.

Great series! I'd love to hear more about the 2nd-place entry in the 2012 ILSVRC. I wasn't following CV at the time, and it was all about CNNs by the time I started in that area!

James Golden:

This is well known, but I love the Foerster post on forcing deep linear networks to exploit floating-point nonlinearities. Trained with an evolutionary algorithm devised for that purpose (not standard SGD or the like), such a network reaches 96% test accuracy on MNIST, while the same network with weights at a normal scale, which really is linear, only reaches 92%.

https://openai.com/index/nonlinear-computation-in-deep-linear-networks/
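The underlying trick is easy to reproduce at home: a mathematically linear map stops being linear in float64 once intermediate values underflow. A minimal sketch (nothing like Foerster's actual network, just the mechanism):

```python
def f(x):
    # scale way down, then back up: the identity map over the reals
    return (x * 2.0**-900) * 2.0**900

print(f(1.0))        # 1.0, as expected at a normal scale
print(f(2.0**-200))  # 0.0: the intermediate value underflowed to zero
# Homogeneity fails: f(2.0**-200) != (2.0**-200) * f(1.0), so the computed
# map is not linear, even though the formula is.
```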

Also well known, but I really like the Simoncelli group's use of the numerical Jacobian to exactly and linearly reconstruct the output of a ReLU convolutional network for image diffusion, interpreting the Jacobian as a set of adaptive linear image filters.

https://arxiv.org/abs/2310.02557
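For a bias-free ReLU network the locally-linear reading is exact rather than approximate: the network is piecewise linear and positively homogeneous, so f(x) = J(x) x, with the Jacobian playing the role of an input-adaptive filter. A toy check (stand-in architecture, not the paper's denoiser):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(16, 32, bias=False), nn.ReLU(),
    nn.Linear(32, 32, bias=False), nn.ReLU(),
    nn.Linear(32, 16, bias=False),
)

x = torch.randn(16)
J = torch.autograd.functional.jacobian(net, x)   # (16, 16), depends on x
print(torch.allclose(net(x), J @ x, atol=1e-5))  # True: f(x) = J(x) x exactly
```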

Linear can be nonlinear, and nonlinear can be locally linear?
