I really enjoyed the lively conversation about simple baselines in the comments of Monday’s post. Commenters flagged many interesting examples in computer vision, social science, systems biology, and energy. They also offered some excellent further motivations for simple baselines, including transportability of implementation, appeals to “Occam’s Razor,” and the power of multiply predictive mechanistic models. Regardless of the reasons, it’s not controversial that people would prefer simpler code bases if that were possible. Even if everything you read on AI Twitter suggests otherwise.
One of my ulterior motives in stoking this conversation was to draw out why most people believe linear statistical models are simple baselines. I wrote on Monday:
People feel like [linear models] are easier to interpret (I’m not one of those people).
A few folks didn’t agree. Let me use this as a starting point to ask what we mean by “simple.”
Probably the simplest nonlinear statistical models that we study are polynomials. Rather than define polynomials using the conventions we learned in algebra, let me pedantically motivate them from the perspective of feature engineering. If you multiply any two numerical variables, you get a new variable. Sometimes this multiplication has semantics: if A equals one when a cat is orange and B equals one when a cat is crazy, then A times B equals one when a cat is both orange and crazy. Multiplication of variables serves as a logical AND of the associated features of the individual. For real-valued features, these products take on new semantics not captured by models that include only the individual features. Hence, you can recursively select and multiply variables to build a long list of features from an initial basis set. If you fit a linear model to this long list, you have yourself a polynomial in the original features.
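Here’s a minimal sketch of that recursive construction in code, assuming products of distinct columns; the helper name product_features is something I’m making up for illustration:

```python
import numpy as np
from itertools import combinations

def product_features(X, degree):
    """Products of up to `degree` distinct columns of X (hypothetical helper)."""
    n_samples, n_features = X.shape
    columns = [np.ones(n_samples)]  # constant column for the intercept
    for d in range(1, degree + 1):
        for subset in combinations(range(n_features), d):
            columns.append(np.prod(X[:, list(subset)], axis=1))
    return np.column_stack(columns)

# A = 1 if the cat is orange, B = 1 if the cat is crazy.
X = np.array([[1, 0],
              [1, 1],
              [0, 1]])
Phi = product_features(X, degree=2)
# Columns of Phi: [1, A, B, A*B]; the last is 1 only for the orange AND crazy cat.
# Fit a linear model to Phi and you have a polynomial in A and B.
```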
Is this polynomial model interpretable? Even if you started off with a bunch of simple binary features (like cat colors and personality traits), you’ll get so many coefficients for your predictive model that it’s not clear to me you can read off a meaningful answer. The combinatorics of polynomials quickly begets enormous lists of features. There are nearly 5,000 combinations of four variables chosen from an initial set of 20 features. Is a linear model with 5,000 nonzero coefficients interpretable?
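If you want to check that arithmetic, it’s a single binomial coefficient:

```python
from math import comb

print(comb(20, 4))  # 4845 distinct products of four features drawn from twenty
```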
Maybe you say yes. Then let me give you another linear model that is popular in machine learning. You want to classify some text documents. For features, you use a pretrained language model to create an embedding of the text in a 256-dimensional space. You train a linear prediction function on top of these 256 features. Is this model interpretable?
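Here’s roughly what that pipeline looks like; the embed function below is a stand-in for whatever pretrained encoder you would actually call, and the documents, labels, and 256 dimensions are all made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(docs, dim=256):
    """Stand-in for a pretrained language model's embedding call."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(docs), dim))

docs = ["the cat is orange", "quarterly earnings rose", "the cat is crazy"]
labels = [1, 0, 1]  # 1 = about cats, 0 = not

X = embed(docs)  # shape (n_docs, 256): these are the "features"
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# The model is linear in the embedding coordinates, but each coordinate is an
# opaque nonlinear function of the text. Which of these 256 coefficients
# would you "interpret"?
print(clf.coef_.shape)  # (1, 256)
```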
What about if I’m fitting a kernel machine to my data? Kernel methods fit linear models. The coefficient set just happens to usually be infinite. We surely wouldn’t say that an infinite set of coefficients is simple and interpretable, would we?
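To spell out what “an infinite set of coefficients” means here (this is the standard representer theorem story, stated loosely): a kernel machine fits a predictor that is linear in a feature map φ, which for kernels like the Gaussian lives in an infinite-dimensional space, and the fitted function collapses to a weighted sum of kernel evaluations at the training points:

$$
f(x) = \langle w, \varphi(x)\rangle = \sum_{i=1}^{n} \alpha_i\, k(x_i, x).
$$

Linear in w, linear in α. But nobody proposes reading off w.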
I don’t even have to be so weird about my models. If you are running a clinical trial and do an exploratory analysis fitting logistic regression on 100 biomarkers, and you find large coefficients on six markers characteristic of inflammation, what exactly have you found? You have discovered that sick people have inflammation, but any further interpretation is quite suspect. In many problems where we fit linear models, the features themselves are complex nonlinear functions of what we actually care about. Moreover, since everything is correlated with everything, our coefficient sets don’t tell us much other than that one variable is predictable from all the rest.
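If you want to see that coefficient ambiguity in miniature, here’s a toy simulation, entirely synthetic and only meant to illustrate the instability: two “biomarkers” that both track the same underlying inflammation signal, refit on bootstrap resamples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200

# Two synthetic biomarkers that both track the same inflammation signal.
inflammation = rng.normal(size=n)
x1 = inflammation + 0.05 * rng.normal(size=n)
x2 = inflammation + 0.05 * rng.normal(size=n)
y = (inflammation + 0.5 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([x1, x2])

# Across bootstrap refits, the individual coefficients swing around while
# their sum (the weight on the shared signal) stays roughly put.
for _ in range(3):
    idx = rng.integers(0, n, size=n)
    coef = LogisticRegression(C=1e6, max_iter=5000).fit(X[idx], y[idx]).coef_[0]
    print(np.round(coef, 2), "sum:", np.round(coef.sum(), 2))
```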
What we mean by interpretable and simple is very subjective!
I write this because there was a time in the 2010s when I thought linear models were simple, too. I’ve since changed my mind. I liked linear models because they came with a convenient abstraction boundary. If you provided me with the features you liked, I had a clean, algorithmic framework to build a high-performing nonlinear classifier with them. This was true whether the model on top was a linear function, a polynomial, a kernel machine, or a random forest. Even though the last three in that list are nonlinear models of the features, they are all linear models from the perspective of numerical optimization.
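Here’s the one-formula version of what I mean by “linear from the perspective of numerical optimization”: all of these predictors can be written as

$$
\hat{y}(x) = \sum_{j} w_j\, \varphi_j(x),
$$

where the basis functions φⱼ (monomials, kernel evaluations at training points, individual trees) can be as nonlinear in x as you like, while the top-level fit is linear in the weights w. How the basis functions themselves get chosen is, of course, another story.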
However, because I hadn’t really dug too deeply into applications like computer vision or language processing, I didn’t realize that the features themselves were complicated nonlinear models. Moreover, those feature models were not clean and tidy at all. Edge detectors are not guaranteed to detect all edges. Edges are not well-defined concepts in images anyway. Modules that take the output of edge detectors and predict higher-order shapes were similarly ill-defined and came with no formal guarantees. You could certainly take a lot of modules in OpenCV, chain them together, and train a linear model on top. But you could also spend the weeks before your paper deadline sweating over how best to tune the hyperparameters of each individual box to maximize performance. Was this sort of illusion of modularity better than combining more abstract modules of matrix multiplication and nonlinearity that mimicked the spatial and sequential structure of the data? That is, was stitching together pipelines of features and fine-tuning their parameters “more interpretable” or “more principled” than building differentiable “deep networks”? I can blog more about the details of the second-place entry in the ImageNet competition if anyone is interested. In hindsight, it’s far more complicated than a ResNet.
I still think people jump to deep nets far too soon on many problems. Nonetheless, I can make a case for deep nets being useful and principled beyond that they “leverage computation.” They provide a standard language. They help inform how to write programming languages that support nonparametric pipelines trained on gazillions of examples. They narrow the set of design choices by (mostly) forcing people to think about desirable properties of chain rule cascades. These aren’t nothing.
And for principled engineering systems, the illusion of abstraction boundaries might be worse than no abstraction boundaries at all. That’s a reformulation of the bitter lesson (notably, not at all what Sutton said) that I agree with: we fooled ourselves into believing that pipelines for machine learning stood upon rigorous foundations. The details were always a mess, and there wasn’t a unified language of what was acceptable. When it comes to features, “anything goes” has always been the design principle.
That said, it would be nice if we had something a little bit clearer to offer people than “anything goes.” A downside to competitive testing at all costs is that there are other design objectives that we often want from our software beyond univariate performance. Wouldn’t it be nice if there were a way to think about the differentiable programming languages we’ve built and then motivate how to build things from scratch rather than promoting a “git pull and patch” evolutionary design?
If you want, you can even drill down to the transistor level to marvel at all the layers of physical nonlinearity that have to be composed before you get to the abstraction of digital logic, which you can wrap in further layers of abstraction to get to Python code that you use to implement your linear predictor.