No one knows how the brain works.
Is that why we believed it was hard to train neural networks?
When I was in grad school, I was told countless times that it was hard to fit neural nets to data. I tried for myself in Matlab 20 years ago and convinced myself that everyone was right. But why were we all wrong?
It was true that coding up neural nets from scratch was super annoying. I could write a regularized least-squares classifier in three lines of Matlab. Writing my own automatic differentiation package for a convolutional neural network? Ugh. No thanks.
Children today ask, wasn’t there free software? Hah, to be so young and naive. GitHub only started in 2008. Before 2010, there were some free packages for SVMs or what have you, but you were never sure if they’d actually run. And if you wanted to see someone’s research code, forget about it. My success rate for getting people to share code from their papers was approximately 0%. Dark times, I yell at the clouds.
But it also wasn’t clear that getting people's research code was worth the headache. There were few problems where local search on huge models made much of a difference for the data sets of the time. For the most part, this remains true today. You don’t need to use a neural network for most “tabular data,” whatever “tabular” means. Gradient boosting, logistic regression, or whatever you try will probably work just fine.
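To make that concrete, here’s a minimal sketch of what I mean, assuming scikit-learn and a synthetic dataset standing in for “tabular” data (everything here is illustrative, not a benchmark):

```python
# Boring baselines on a made-up "tabular" classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A synthetic stand-in for whatever "tabular" means: 20 numeric features.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), GradientBoostingClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```

Either one fits in seconds on a laptop, and no automatic differentiation is required.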
But I’m annoyed that a strike against neural nets was our belief that optimization was computationally intractable. This was a dumb story. I can certainly make neural nets hard to optimize by hiding the standard design tricks. Fitting small neural nets to data can be a real pain in the butt, especially with poor weight initialization. But once you make a neural net large enough, it’s pretty effortless to fit. Why was this not conventional wisdom?
Partially, the answer is that we trusted complexity theory too much. In the 90s, theorists showed that most of what we do in machine learning is “computationally intractable.” They even convinced themselves that training perceptrons was hard. Technically speaking, if you have a data set that is not perfectly classified by a linear rule, then finding the linear classifier that makes the fewest errors is NP-hard. But who are you going to believe? Some learning theorists or the OpenML repository? These hardness results didn’t reflect the reality of practical pattern recognition. So why did we believe the hardness results about non-convex optimization?
The thing is, the machine learning community didn’t believe these hardness results. Yann LeCun loves to paint this picture that before ImageNet everyone at N(eur)IPS was doing SVMs. But you just have to look at the conference proceedings to realize that couldn’t be farther from the truth. People have always done weird, computationally bizarre things at NeurIPS. When I first started doing machine learning, everyone was applying Expectation Maximization to everything. I guess that’s an ethos? Every one of these EM papers was trying to solve NP-hard problems. The reason EM was popular was… well, I never really figured that out. But I can say that coding up EM was always pretty trivial. And it does stuff sometimes.
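For what it’s worth, here’s roughly what “trivial to code up” means. A minimal sketch of EM for a two-component Gaussian mixture in one dimension, assuming NumPy; the data and the initialization are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake data: two well-separated Gaussian clumps.
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

# Arbitrary initial guesses for weights, means, and standard deviations.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: posterior responsibility of each component for each point.
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture weights, means, and standard deviations.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)
```

Fifteen lines, no guarantees, and it does stuff sometimes.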
The same is true of neural nets, which are today very easy to code up. Anyone can download a repo and see for themselves that it’s trivial to find a neural net with zero training error. You will find a globally optimal solution of a nonconvex optimization problem.
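You don’t even need to download a repo. Here’s a minimal sketch of the experiment, assuming PyTorch (the architecture and hyperparameters are placeholders, not anyone’s recipe): an overparameterized two-layer net memorizing completely random labels.

```python
import torch

torch.manual_seed(0)
X = torch.randn(200, 20)         # 200 random inputs
y = torch.randint(0, 2, (200,))  # random binary labels: no signal at all

# Wide two-layer net: far more parameters than data points.
model = torch.nn.Sequential(
    torch.nn.Linear(20, 1000), torch.nn.ReLU(), torch.nn.Linear(1000, 2)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(X), y)
    loss.backward()
    opt.step()

# Training error should reach 0.0 (or very close) after a couple thousand steps.
train_error = (model(X).argmax(dim=1) != y).float().mean().item()
print(loss.item(), train_error)
```

Nothing is being “learned” here; the point is only that plain gradient machinery finds a zero-error solution of a nonconvex problem without any fuss.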
Of course, theorists then want to know why these nonconvex optimization problems are easy. I think this is a perfectly valid line of inquiry, and it’s always worth a shot to investigate “surprising” phenomena. Sometimes, careful studies of complicated messes can guide us to simpler solutions. We’re not there yet for neural net optimization. There have been hundreds of papers, but no clear answer has emerged. I know for sure that the conventional wisdom about what makes neural net optimization hard is wrong. I was told “you will converge to local minima.” This is just not true. And anyone who talks about sharp local minima is selling you a bill of goods. Nothing about the contours of the “optimization landscape” tells you anything about neural nets. But as to why optimization is easy, we’re still just guessing. I have some intuitions that I’ll describe in class today (and blog about tomorrow). I don’t know if these intuitions are right. And unfortunately, there’s not going to be a way to “test” if they are right because current models are too complicated.
But I’ve stopped worrying about this. People have been using the Nelder-Mead method or genetic algorithms since the 1960s. Did you know that computer chips are designed using simulated annealing? They seem to be doing just fine! Local search works when it works. And if it doesn’t work, you can mess around with your cost function until it works. That is, you can be an engineer. In the case of neural networks, you can add batch norm, dropout, adaptive stepsizes, residual connections. You can do whatever you want to make the local search easier. Can you guarantee these will always work? No. But why do you want a guarantee?
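As a concrete illustration of that grab bag, here’s a minimal sketch (assuming PyTorch; the sizes are arbitrary) that bolts batch norm, dropout, a residual connection, and an adaptive-stepsize optimizer onto an otherwise plain model:

```python
import torch

class ResidualBlock(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim),
            torch.nn.BatchNorm1d(dim),  # batch norm
            torch.nn.ReLU(),
            torch.nn.Dropout(p=0.1),    # dropout
            torch.nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)          # residual connection

model = torch.nn.Sequential(ResidualBlock(64), ResidualBlock(64), torch.nn.Linear(64, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # adaptive stepsizes

x = torch.randn(32, 64)
print(model(x).shape)  # torch.Size([32, 10])
```

None of these pieces comes with a guarantee; they’re knobs that tend to make the local search better behaved.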
Honestly, I think Yann (who believes in nothing) is right about why neural nets work. Fully connected deep neural networks aren’t better than the alternatives on any task. But conv nets, LSTMs, and now transformers are. Why? These mimic the structure of signal processing systems that work well on the domains of interest. Any image processing algorithm, whether for compression, denoising, or filtering, will look like a convolutional neural net. It will be a composition of convolutions, simple nonlinearities, up-sampling or down-sampling, etc. Yann’s idea is that you take primitives from signal processing and globally search for the best end-to-end solution by making sure you can (sort of) take the derivative of every component in the model. That’s certainly an engineering paradigm. And it seems to work for problems people care about.
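To see the point about composition, here’s a minimal sketch (assuming PyTorch; channel counts and kernel sizes are arbitrary) of a generic image-processing pipeline: convolutions, simple pointwise nonlinearities, down-sampling, and up-sampling. Whether you call it a denoiser or a conv net is mostly a matter of branding.

```python
import torch

# Convolutions, pointwise nonlinearities, down-sampling, up-sampling, composed.
pipeline = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(2),              # down-sample
    torch.nn.Conv2d(16, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Upsample(scale_factor=2),  # up-sample
    torch.nn.Conv2d(16, 3, kernel_size=3, padding=1),
)

x = torch.randn(1, 3, 32, 32)           # a fake 32x32 RGB image
print(pipeline(x).shape)                # torch.Size([1, 3, 32, 32])
```

Every block here is differentiable, so you can search end to end over the whole composition.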
Note that none of this explanation has anything to do with “neural” anything. The neural net people did themselves a disservice with their branding. All of the pseudo-neuro-scientific blathering didn’t get us anywhere. At the end of the day “artificial neural nets” are just a bunch of computational signal processing primitives chained together and jointly optimized with stochastic gradient methods. Miss me with the brain stuff.
We now have highly tuned automatic differentiation software, fine-tuned optimizers, model architectures tuned to these optimizers, and infinite local search by a community backed by giant piles of money. With enough money, computing, and time, anyone can fit neural nets to data. I suppose that’s a win for humanity.
Totally agree with the point about "neural" -- I did a high school research project involving dragonfly nerve signals, and the metaphor of neuron spiking and nonlinearities seems tenuous at best. I usually favor "deep" since this sort of gets at the idea of composing together useful primitives. But recently I decided to look up the etymology (https://www.etymonline.com/word/neural) and it turns out that "neural" derives from Greek words meaning tendon, sinew, string. So in the spirit of "connecting up" simple functions into an "end to end" network, maybe neural was right all along.
My very SVM (and RLSC) heavy PhD, done next to you while you were battling Matlab in 2005, relied mostly on a popular open source package called ... "Torch" http://torch.ch/torch3/ . Note the 2004 release's class list: no matches for "neural" anything, although you certainly could do them with it. http://torch.ch/torch3/manual/index.html . Pre-Torch we got a lot of use out of SVMlight (2002) and NODElib (1999!) Things are faster now, and you can certainly train more on a laptop, but honestly it's not that different.