25 Comments

Totally agree with the point about "neural" -- I did a high school research project involving dragonfly nerve signals, and the metaphor of neuron spiking and nonlinearities seems tenuous at best. I usually favor "deep" since this sort of gets at the idea of composing together useful primitives. But recently I decided to look up the etymology (https://www.etymonline.com/word/neural) and it turns out that "neural" derives from Greek words meaning tendon, sinew, string. So in the spirit of "connecting up" simple functions into an "end to end" network, maybe neural was right all along.

The etymology may work out by coincidence, but that doesn't alleviate the neuro-babble and god complexes that come with associating a mess of linear algebra modules with "the brain."

Hi Sarah, you are right that it is a simplification at best... there is more going on. But some similarities: real neurons require that their input (usually summarized as an electrical current) exceed a threshold to produce a spike. As the input gets larger, they produce more spikes per second. So there is truly a threshold nonlinearity going on. Beyond that, real neurons have more complicated dynamics that start to matter for things like synchrony and the timing of concurrent inputs. People argue about when and where those details matter in different neural systems. Neuroscientists tend to regard the artificial neuron models that we use in deep learning as equivalent to "rate coding", where all that matters for a biological computation is the firing rate of a neuron.
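
To make the analogy concrete, here is a minimal sketch (my own illustration, not from the comment above) of a rate-coded unit: the output plays the role of a firing rate, zero below a threshold and increasing with the input current, which is essentially the shape of the ReLU-style nonlinearities used in deep learning.

```python
import numpy as np

# Hypothetical threshold-linear rate model: firing rate (spikes/second)
# as a function of input current. Zero below threshold, then increasing,
# i.e. a shifted and scaled ReLU.
def firing_rate(current, threshold=1.0, gain=5.0):
    return gain * np.maximum(0.0, current - threshold)

currents = np.linspace(0.0, 3.0, 7)
for i, r in zip(currents, firing_rate(currents)):
    print(f"input current {i:.1f} -> rate {r:.1f} spikes/s")
```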

My very SVM- (and RLSC-) heavy PhD, done next to you while you were battling Matlab in 2005, relied mostly on a popular open source package called ... "Torch" http://torch.ch/torch3/ . Note the 2004 release's class list: no matches for "neural" anything, although you certainly could do them with it. http://torch.ch/torch3/manual/index.html . Pre-Torch we got a lot of use out of SVMlight (2002) and NODElib (1999!). Things are faster now, and you certainly can train more on a laptop, but honestly it's not that different.

Yes, 100%. But I stand by what I wrote here:

"there were some free packages for SVMs or what have you, but you were never sure if they’d actually run."

It took some care to get SVM Light to do what you wanted! It wasn't as simple as "data in, support vectors out."

The big difference is that today people post on GitHub the exact formula they used to go from dataset download to model output. They tell you the exact libraries they used. And sometimes they even give you a Docker image.

In any event, amazing that Torch didn't have neural nets in 2004. A bizarre contradiction to the current revisionist history.

Yep, it was infinitely more annoying not having Stack Overflow or Discord. I learned far more C++ than I ever needed just getting that stuff going. You could definitely train NNs in Torch3, but you had to understand what they were before you used them. These days I'm so happy to have such a strong community, but you do still find yourself out on the old ledge alone once you start trying something unique or new. Being able to (sometimes) boot 16 GPUs on demand feels like a much bigger step change for this research to me. In conclusion: get off my lawn.

My Ph.D. predates yours by about 20 years, so perhaps I can give some additional context on why training was seen as difficult. To my mind, our bias-variance-induced obsession with "the smallest possible network", along with computational constraints, forced us into the regime where local minima were a real problem. With those smaller networks, various random initializations would result in significant variability in validation-set accuracy, so we were also concerned about the generalization gap, which muddies the waters regarding the training task.
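
As a rough illustration of that variability (an assumed modern setup, not the experiments from back then): train the same tiny network from several random initializations and watch the validation accuracy move around.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical small-data, small-network setting.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Same architecture, different random initializations.
for seed in range(5):
    net = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=seed)
    net.fit(X_tr, y_tr)
    print(f"seed {seed}: validation accuracy {net.score(X_val, y_val):.3f}")
```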

I definitely agree that many things are only obvious in hindsight. I also remember trying to train small neural networks and finding it perplexing.

Yep, and all of this dovetails nicely with what I posted last night on Twitter:

"Philip Agre makes an interesting distinction between two approaches to AI (and it applies to systems in general): generative and architectural. Generative approach (Chomsky, Simon and Newell) favors systems that can produce infinite variety of formal structures by repeated application of finitely many basic rules (think free groups). Architectural approach (Minsky’s Society of Mind) takes the limitations imposed by physical realizability (locality, causality, etc.) seriously and thus ends up favoring distributed systems consisting of multiple communicating modules whose operation may lead to inconsistencies, etc. Brooks’ behavioral robotics is a good example. Ironically, so is connectionism."

Add differentiability to each module, hack together some objective, and let gradient descent with backprop rip, and there you have it!
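
In case that quip is too compressed, here is a hypothetical toy version in PyTorch (my sketch, not anything from Agre or the thread): two differentiable modules composed end to end, a hacked-together squared-error objective, and gradient descent via backprop.

```python
import torch

# Two differentiable "modules" hacked together end to end (toy example).
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

x = torch.randn(64, 10)   # some inputs
y = torch.randn(64, 1)    # some targets

opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)  # hacked-together objective
    loss.backward()   # let backprop rip
    opt.step()        # gradient descent step
print(f"final loss: {loss.item():.4f}")
```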

Hah, ironically indeed. Minsky would have lost his shit if you made that (cough) connection to his face.

I'm all about ruffling feathers, man.

Computational neuroscientist coming in to defend my field here... OK, so first, you are right that signal processing "explains" why convolution makes sense. I would argue that the brain uses the representations that it does for the same signal processing reasons. It is true, historically, that when Fukushima implemented the first CNN he was thinking of features in visual cortex, and Rosenblatt was thinking of a retina when developing the (multilayer, locally-connected) perceptron, and McCulloch & Pitts were also thinking of how a neuron responds to a stimulus when they built the threshold unit. These folks were also versed in signal processing but were inspired by what, at that time, was cutting-edge neuroanatomy and physiology. There is still pretty good cross-pollination between theoretical neuroscience and computer science.

While nobody understands "how the brain works" in the same way no physicist understands "how the universe works", we do have some pretty good knowledge of subsystems of the brain.

On the other hand, you are totally right that it is a sketchy claim at best when people say that the things artificial neural networks accomplish these days happen because they "are built like brains." It isn't obvious at all whether "attention" as implemented in transformers works anything like attention in the brain. Your post is mostly about the challenge of optimizing an ANN. Most of the evidence is clear that synaptic plasticity in brains is largely different from backpropagation.

So while I agree with you that current deep learning is very un-neuronal, I would argue that there is something to be learned from studying the brain, and learning in the brain, that can be applied to computational methods built from artificial neuron units.

"Fully connected deep neural networks are not better than anything on any task."

This is not quite right. Fully connected neural networks can learn, e.g., a single index model much better than a radial kernel.

What is a single index model?

Consider the following setting: let x be sampled from a normal distribution in R^100.

y = f(x_1), where x_1 is the first coordinate of x (you can take f(<x,v>), where v is some unit vector, if you prefer). f can be some simple nonlinear function; it does not matter too much which. Then a FCNN will do a much better job at prediction, in terms of sample complexity, than a radial kernel.
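
A few lines of code along these lines (my own sketch of the setting described above, with arbitrary choices of f, sample size, and kernel parameters) would look something like this; the claim is that the fully connected network reaches much lower test error than the radial (RBF) kernel at moderate sample sizes.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
d, n_train, n_test = 100, 2000, 5000

# x ~ N(0, I) in R^100; y depends only on the first coordinate.
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
f = lambda t: np.sin(2 * t)  # some simple nonlinearity; the choice is arbitrary
y_train, y_test = f(X_train[:, 0]), f(X_test[:, 0])

# Fully connected network: free to find the one relevant direction.
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)

# Radial (RBF) kernel regression: treats all 100 coordinates symmetrically.
krr = KernelRidge(kernel="rbf", alpha=1e-3, gamma=1.0 / d)
krr.fit(X_train, y_train)

print("FCNN test MSE:      ", mean_squared_error(y_test, mlp.predict(X_test)))
print("RBF kernel test MSE:", mean_squared_error(y_test, krr.predict(X_test)))
```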

I don't consider this to be evidence. And it's sort of indicative of the problem with a lot of machine learning and computer science theory. Inventing an instance where a method is bad doesn't tell you much about what a method actually does.

Sure, the simplex method is exponential time in the worst case. No practical instances look like this.

Sure, SAT solving is NP-complete. But we can solve SAT problems with millions of variables with ease.

Sure, a three-layer neural network can approximate functions that two-layer neural networks can't. But three-layer neural networks are usually worse than running XGBoost on "tabular" data.

Worst-case complexity analyses can be simultaneously correct and harmful.

Ben, this is not a theoretical example. This is a very practical setting that arises whenever your target function depends on just a few coordinates. You can try it yourself in a few lines of code.

Show me a practical example where 99% of the coordinates are completely useless for prediction, but the person who wants to make predictions doesn't know which ones are which.

You can take a look at our paper: https://arxiv.org/abs/2212.13881

There are quite a few counterintuitive examples; e.g., a lipstick predictor for images actually looks at the eye area.
