In 2017, Dave Donoho taught a course on “theories of deep learning.” His first lecture highlights the technical and sociological advances of the twentieth century, connecting the information technology explosion of the 1990s to the inevitability of extracting patterns from data for profit margins. He sees “deep learning” as a complex endpoint of many historical and sociological trends. But then he asks: is there a place for mathematics in all of this? Slide 46 features Dave’s deathly metaphor: Deep Learning as a Magic Mirror.
Every Theorist Who Looks At It Sees What They Wish.
He’s right. We stare into the void where our math fails us and try to write math papers anyway. Ironically, Dave’s lecture ends with a survey of functional analysis, approximation theory, and wavelets. No one is immune from the magic mirror.
I have a few historical nitpicks, but overall agree with everything up to slide 46. My biggest disagreement is that I think non-deep learning (whatever that means) is no better understood than deep learning. It’s all a mess.
I wrote this in a comment on Max Raginsky’s substack, but I’ll say it here. Why is it that there are some things we teach children (arithmetic, spelling) that we can write simple code for, and others (identifying flash cards) where we have to rely on blind, inefficient, data-driven pattern recognition approaches? Move over, P=NP: this is now the most vexing question in computer science.
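To make the contrast concrete, here’s a minimal sketch, using scikit-learn and its bundled handwritten-digits data as a stand-in for flash cards. Addition is an exact two-line program; for the recognition task, the only recipe we have is to collect labeled examples and fit a model.

```python
# Arithmetic: an exact program, two lines, no data required.
def add(a: int, b: int) -> int:
    return a + b

# "Flash cards": no comparably short exact program is known. The working
# recipe is statistical pattern recognition: gather labeled examples and
# fit a model. Sketch using scikit-learn's bundled digits as a stand-in.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(f"holdout accuracy: {clf.score(X_test, y_test):.3f}")
```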
As I’ve argued so far in this class, mathematical theory hasn’t helped address this question. We can write very technical papers on optimization, but optimization in machine learning doesn’t matter much. We only care about generalization. To tackle generalization, we can write complex theories of empirical processes, but these don’t matter because our samples aren’t random, and the theory provides no guidance for external validity. We could then turn to the deepness itself and prove things about batch norm or dropout or whatever, but these just give us some nonpredictive post hoc justifications. And, as evinced in some of the comments here and private correspondence, deep learning also seems to drive people completely insane.
So what does this say about the role of theory in machine learning? I have three final points to close out this section of the course and look ahead to the remainder of the semester.
First, there is some credence to the argument that perhaps you don’t need to know any theory to do machine learning. Ludwig Schmidt, who spent a lot of time doing a lot of theory, maintains that the only things needed for pattern recognition are large data sets and massive computing budgets. The big tech companies are certainly all buying into this bet. Data has been the new oil for a decade. AI the new electricity. Where’s my rocket emoji?
Second, I am not against mathematical theory in general. Optimization theory is useful if you actually care about optimizing! And statistics, when properly used, can provide predictive, informative error quantification. In appropriate contexts, mathematical theory is empowering and can guide practice. I pointed to some theory that I found reasonable in the first half of the course: regret bounds for linear predictions, simple generalization bounds justifying the train-test split (a quick sketch of the latter follows below). These give us reasonable, though minimal, advice. As we move into the second half of the course, we’ll see other examples in experiment design, dynamic programming, and optimal control where theory offers actionable directions. But in the case of pattern recognition, we’re weirdly stuck.
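To spell out the train-test split bound I have in mind: the textbook version is a Hoeffding bound on the holdout error, which needs only i.i.d. test samples and a bounded loss. A minimal sketch (the function name and constants are just for illustration):

```python
import math

def holdout_deviation(n: int, delta: float = 0.05) -> float:
    """Hoeffding-style bound for a holdout set of n i.i.d. samples and 0-1 loss:
    with probability at least 1 - delta over the draw of the test set,
    |test error - population error| <= sqrt(ln(2 / delta) / (2 * n)).
    The i.i.d. assumption is doing all of the work here.
    """
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# A 10,000-example test set pins down the population error to within
# roughly 1.4 percentage points at the 95% level.
print(f"{holdout_deviation(10_000):.4f}")  # 0.0136
```

Minimal advice, as promised: it tells you how large a test set you need, and says nothing about why the learned predictor works.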
Third, maybe it’s the mathematics that’s holding us back. Why does theory necessarily need to be mathematical? This leads us to the next three lectures. Today’s lecture will be a pivot, talking about the longstanding historical disconnect between theory and practice in machine learning. In the following lecture, we’ll discuss the main evaluation paradigm of competitive testing and its history in the field (Donoho calls this the Common Task Framework, but competitive testing is a much larger subject). And then we’ll try to understand why and how competitive testing on fixed benchmarks advances machine learning technology. Let’s see if we can build up any theory from a deep dive into history and sociology.