Would you say deep learning is mature enough to be taught in high schools?
Here’s why I ask. Some time ago, I received an email from a product manager at a very large company. I love sharing personal emails publicly, so I’ll post it here:
I sat on this email for days. I couldn’t come up with a constructive answer.
If anything, I wanted to reply that maybe her engineers should be scared.
Imagine you’re an engineer, you’re given this net, and you’re asked to make it work better on a dataset. You might presume each of these layers is there for a reason. But as a field, we don’t yet have a common language to express these reasons. The way we teach deep learning is very different from the way we teach other technical disciplines.
A few years ago, I got into optics. In optics, you also build stacks of components that process inputs. Here’s a camera lens.
To design something like this, you start with a basic stackup, usually named after the famous person who invented it. You simulate it to determine how it falls short of your requirements, then insert additional components to correct the shortcomings.
You then run the system through a numerical optimizer to tune its parameters, like the shape of the surfaces, their locations, and their inclinations, to maximize some design objective. Then you simulate, again, modify, optimize, and repeat.
Just like with a deep net.
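The shape of that loop is easy to sketch. Everything below is hypothetical: the merit function, the surface parameters, and the step size are toy stand-ins, not what a real lens-design tool would provide.

```python
# A toy stand-in for lens design: tune surface parameters to minimize a
# merit function returned by a (fake) simulator. Real design tools use far
# richer models; this only sketches the shape of the loop.

def simulate(params):
    """Hypothetical merit function: pretend this measures how badly the
    stack with the given surface curvatures misses spec. Lower is better."""
    c1, c2 = params
    return (c1 - 0.8) ** 2 + (c2 + 0.3) ** 2 + 0.5 * c1 * c2

def numerical_gradient(f, params, eps=1e-6):
    """Finite-difference gradient, since the simulator is a black box."""
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grads.append((f(bumped) - f(params)) / eps)
    return grads

params = [0.0, 0.0]          # initial curvatures of two surfaces
for step in range(200):      # optimize: follow the simulator downhill
    g = numerical_gradient(simulate, params)
    params = [p - 0.1 * gi for p, gi in zip(params, g)]

# The designer now inspects the residual shortfall, inserts a new element
# if the merit is still too high, and runs the loop again.
```

The outer "inspect, insert an element, re-optimize" step is the part a numerical optimizer can't do for you; that's where the designer's mental model earns its keep.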
Each of the 36 elements in the stackup is deliberately inserted to correct an aberration of some kind. This requires a very clear mental model for what each component does to the light that passes through it. That mental model is usually in terms of an action, like refraction, reflection, diffraction, dispersion, or wavefront correction.
People don’t fear this design process. Each year, the US graduates hundreds of optical engineers who go on to design lenses that work. And they’re not scared of their job.
That’s not because optics is easy. It’s because the mental models in optics are organized well.
Modern optics is taught in layers of abstraction.
At the simplest layer, at the top, is ray optics. Ray optics is a simplification of wave optics, where the rays represent the normal vectors of the wavefronts in wave optics. Wave optics is an approximate solution to Maxwell’s equations. And Maxwell’s equations can be derived from quantum physics, which I don’t actually understand.
Each layer is derived from the layer below it, by making simplifying assumptions. As such, each layer can explain a more sophisticated set of phenomena than the layer above it.
I spend most of my time designing systems at the top four levels of abstraction.
This is how we teach optics today. But these theories weren’t always organized in a stack like this. Until a hundred years ago, some of these theories co-existed in a conflicting state. Practitioners relied on folkloric theories of light.
That didn’t stop Galileo from building a pretty good telescope nearly a hundred years before Newton formalized ray optics. Galileo had a good enough mental model of light to build a telescope that could magnify 10x. But his understanding wasn’t good enough to correct chromatic aberrations, or to get a wide field of view.
Before these theories of light were unified into a stack of abstractions, each theory had to build its fundamental concept of light from the ground up. This involved making up a fresh set of unrealistic assumptions. Newton’s ray optics modeled light as a spray of particles that could be attracted to, or repelled from, solid matter. Huygens modeled light as a longitudinal pressure wave through a mystical medium called “ether”. He modeled light like sound. Maxwell also assumed light propagates through the ether. You can still see the remnants of this in the names of the coefficients in Maxwell’s equations.
Silly models, yes. But quantitative, and predictive.
As silly as these assumptions may sound today, these models were quantitative and they had predictive power. You could plug numbers into them and get numerical predictions out. That’s tremendously useful to an engineer.
Seeking: A modular language for deep learning to describe the action of each layer.
It would be easier to design deep nets if we could talk about the action of each of its layers the way we talk about the action of an optical element on the light that passes through it.
We talk about convolutional layers as running matched filters against their inputs, and the subsequent nonlinearities as pooling. This is a relatively low-level description, akin to describing the action of a lens in terms of Maxwell’s equations.
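As a toy instance of that low-level description (the filter and signal here are made-up numbers, and this is plain Python rather than any real framework), a convolutional filter responds most strongly where the input matches the filter's pattern:

```python
# A convolutional layer as a matched filter: slide the filter over the
# signal and record the dot product at each offset. The response peaks
# where the signal locally matches the filter's pattern.

def cross_correlate(signal, filt):
    n, k = len(signal), len(filt)
    return [sum(signal[i + j] * filt[j] for j in range(k))
            for i in range(n - k + 1)]

filt = [1.0, -1.0, 1.0]                    # the pattern we search for
signal = [0.0, 0.0, 1.0, -1.0, 1.0, 0.0]   # contains the pattern at offset 2

response = cross_correlate(signal, filt)
best = max(range(len(response)), key=lambda i: response[i])
# best == 2: the filter "fires" where its pattern occurs in the signal
```

A subsequent nonlinearity would then keep only the strong matches, which is one reading of how the nonlinearity pools the filter responses.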
Maybe there are higher level abstractions to rely on, in terms of a quantity that is modified as it passes through the layers of a net, akin to describing the action of a lens in terms of how it bends rays.
And it would be nice if this abstraction were quantitative so you could plug numbers into a formula to run back-of-the-envelope analyses to help you design your network.
But I’m getting carried away with fantasies.
Let’s start with something simpler. We have lots of mental models of how deep net training works. I’ve collected a sampling of phenomena worth explaining. Let’s see how well these mental models explain these phenomena.
Before I get too far, I admit that this exercise is doomed. Optics had 300 years to do this; I spent a Saturday afternoon on it. In exchange, I’m only reporting my findings in a blog post.
Phenomenon: A random initializer for SGD is good enough. But small numerical errors thereafter, or bad step sizes, break SGD.
Some practitioners have observed that small changes in how gradients are accumulated can result in vastly different performance on the test set. This comes up, for example, when you train with a GPU instead of a CPU (1, 2).
Do you think this is a legitimate observation that’s worth explaining?
Or do you think this is a spurious observation, and it’s probably untrue?
Or maybe you think there is something wrong with this observation, like it’s logically self-contradictory in some way, or it’s a malformed claim?
Sentiments are mixed, I’m sure, but let me record this as a phenomenon and move on.
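Whatever your stance, the arithmetic underneath is real: floating-point addition is not associative, so accumulating the same gradients in a different order, as a GPU's parallel reduction does, can produce different numbers. A minimal demonstration:

```python
# Floating-point addition is not associative. The same three numbers,
# summed in two different orders, produce two different doubles. A GPU's
# parallel reduction reorders sums like this on a massive scale.

left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right_to_left = 0.1 + (0.2 + 0.3)   # 0.6

differ = left_to_right != right_to_left   # True
```

Whether these last-bit differences can compound into vastly different test performance after millions of SGD steps is exactly the disputed part of the observation.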
Phenomenon: Shallow local minima generalize better than sharp ones.
This claim is fashionable. Some people insist it’s true (3, 4, 5, 6), and others, including myself, believe that it can’t be true on logical grounds. Others have refuted the claim on empirical grounds (7). Yet others have offered refined variants of this claim (8). Confusion remains (9).
I’ll note that this phenomenon might be disputed, but record it nevertheless.
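The claim does admit at least one quantitative reading. Here is a hedged toy version (both loss functions are made up for illustration): read "sharpness" as curvature, perturb the weights by the same amount at a flat minimum and at a sharp one, and compare how much the loss rises.

```python
# Two toy losses, each minimized at w = 0, differing only in curvature.
# "Sharpness" here is read as sensitivity of the loss to a small weight
# perturbation -- one common formalization used in the flat/sharp debate.

def flat_loss(w):
    return 0.01 * w ** 2    # low curvature: a wide, flat valley

def sharp_loss(w):
    return 100.0 * w ** 2   # high curvature: a narrow, sharp valley

eps = 0.1                   # the same perturbation applied at each minimum
rise_flat = flat_loss(0.0 + eps) - flat_loss(0.0)     # ~0.0001
rise_sharp = sharp_loss(0.0 + eps) - sharp_loss(0.0)  # ~1.0

# The sharp minimum is ten thousand times more sensitive to the same
# perturbation. Whether that sensitivity translates into worse test
# error is the disputed part of the claim.
```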
Phenomenon: Inserting Batch Norm layers speeds up SGD.
That batchnorm works is mostly undisputed. I offer only one redoubtable exception (10), and record this as a phenomenon with no reservation.
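For reference, the layer itself is small. A minimal sketch of the batchnorm forward pass over one feature (pure Python, training-mode statistics only, ignoring the running averages and backward pass a real implementation keeps):

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations for one feature to zero mean and
    unit variance, then apply the learned scale (gamma) and shift (beta)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

activations = [2.0, 4.0, 6.0, 8.0]
normed = batch_norm(activations)
# normed has mean ~0 and variance ~1 regardless of the input's scale.
# That conditioning effect is what's usually credited with speeding up
# SGD -- though *why* it helps is precisely what's contested in (10).
```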
Phenomenon: SGD is successful despite many local optima and saddle points.
There are several claims packed in this one. It is often claimed that the training loss surface of deep learning is rife with saddle points and local minima (11). In addition, it is variously claimed either that gradient descent can traverse these hazards (12), or that it need not traverse these hazards to produce a solution that generalizes well (13). It is also just as often claimed that the loss surface of deep models is altogether benign (14).
I’ll record this as a phenomenon, begrudgingly.
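One of those claims is easy to poke at in a toy setting (a hand-picked saddle, not a deep net): plain gradient descent started near the saddle point of f(x, y) = x² − y² drifts away from it, as long as it isn't placed exactly on the attracting axis.

```python
# Gradient descent near the saddle point of f(x, y) = x**2 - y**2.
# The saddle sits at (0, 0): the x-direction attracts the iterate
# toward it, the y-direction repels. Any tiny off-axis component grows.

def grad(x, y):
    return 2.0 * x, -2.0 * y   # gradient of x**2 - y**2

x, y = 1.0, 1e-6               # start almost exactly on the x-axis
lr = 0.1
for _ in range(100):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy

# x has collapsed toward 0 while the tiny y-component has blown up:
# the iterate escaped the saddle instead of converging to it.
```

That a generic starting point escapes this one saddle says nothing, of course, about the loss surfaces of real deep nets, which is why the claims above remain claims.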
Phenomenon: Dropout works better than other randomization strategies.
I don’t know how to categorize the class of dropout-like things, so I’m calling them “randomization strategies”.
Recorded with apologies and without discussion.
Phenomenon: Deep nets can memorize random labels, and yet, they generalize.
The evidence is plain (15), witnessed and supported by dear friends.
Recorded as a phenomenon despite the conflict of interest.
We’ve gathered some phenomena. From the papers I cited above, I’ve also gathered theories that I think have the best chance of explaining these phenomena.
Let’s see how we fare:
There are a few reasons to not get excited.
First, we couldn’t agree on some of the observations we were trying to explain in the first place.
Second, I couldn’t organize the explanations into a hierarchy of abstractions. What made optics easy to learn isn’t at all manifest here.
Third, I suspect some of the theories I’m quoting from the papers aren’t true.
There’s a mass influx of newcomers to our field and we’re equipping them with little more than folklore and pre-trained deep nets, then asking them to innovate. We can barely agree on the phenomena that we should be explaining away. I think we’re far from teaching this stuff in high schools.
How do we get there?
It would be nice if we could provide mental models, at various layers of abstraction, of the action of the layers of a deep net. What could be our equivalent of refraction, dispersion, and diffraction? Maybe you already think in terms of these actions, but we just haven’t standardized our language around these concepts?
Let’s gather a set of phenomena that we all agree on. We could then try to explain them away. What is our equivalent of Newton’s rings, the Kerr effect, or the Faraday effect?
A small group of colleagues and I have started a massive empirical effort to catalogue mental models that are ambient in our field, to formalize them, and to then validate them with experiments. It’s a lot of work. I think it’s the first step to developing a layered mental model of deep learning that you could teach in high schools. If you want to chat about this project or to help out, please tweet me.