This is a live blog of Lecture 3 (part 2 of 2) of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.
At a workshop in 1980 that would become the annual ICML conference, Herbert Simon delivered a plenary about the past, present, and future of machine learning1. Computer power had exponentially exploded in the 22 years since Rosenblatt’s perceptron. Why would we build machine learning systems at all?
“Who-what madman-would put a computer through twenty years of hard labor to make a cognitive scientist or a computer scientist out of it? Let's forget this nonsense-just program it." It would appear that, now that we have computers, the whole topic of learning has become just one grand irrelevancy for computer science.
Simon went on to survey some of the successes of classical machine learning and why they were disappointing. All of his examples were from the 1950s, when we were stuck programming on vacuum-tube mainframes: Samuel’s checkers player, EPAM trees, and the Perceptron. With regard to the Perceptron, Simon didn’t mince words:
A final "classical" example (this is a negative example to prove my point) is the whole line of Perceptron research and nerve net learning [Rosenblatt, 1958]. A Perceptron is a system for classifying objects (that is, a discovery and learning system) that computes features of the stimulus display, then attempts to discriminate among different classes of displays by computing linear additive functions of these features. Functions producing correct choices are reinforced (receive increased weight), those producing incorrect choices have their weights reduced. I have to conclude (and here I don't think I am in the minority) that this line of research didn't get anywhere. The discovery task was just so horrendous for those systems that they never learned anything that people didn't already know. So they should again strengthen our skepticism that the problems of AI are to be solved solely by building learning systems.
I wonder what he’d think of ICML 2023.
There are two knee-jerk reactions to Simon’s quote. The first is “What a fool! Perceptrons are everywhere, and it was only after embracing the Perceptron that machine learning took over all of computer science and engineering.” The other reaction is “What a genius! Perceptrons are everywhere, and machine learning systems are inefficient monsters that don’t really work even though we’ve thrown billions of dollars at them.”
Choose your fighter!
It is undeniable that Simon’s hyperbole was fighting fire with fire. Artificial Intelligence has always been plagued by arrogance, hyperbole, and myopia. Even the name is a marketing term. It’s sort of hard to blame Simon for this dismissal when, only 20 years earlier, this was the lede on the front page of the New York Times:
That’s the 1958 article about the Perceptron. Artificial Intelligence is the only field that perpetually fails to deliver on the same promise yet still gets overfunded. Please pillory me in the comments with claims that even though they’ve been saying the same bullshit for 70 years, this time they’re right.
Rosenblatt thought that his simulated neuron would become conscious of its existence. Simon thought that machines were decidedly different from people. How do we square these two visions? To find that middle ground, let me bring up Minsky and Papert’s attack on the Perceptron.
In their 1969 book Perceptrons, Minsky and Papert famously showed that the Perceptron couldn’t learn the parity function. With rigorous mathematics, they showed a Perceptron couldn’t learn to predict whether the number of 1s in a bit string was odd or even. But let me tell you why this is a positive case for machine learning.
Think about the sorts of pattern classification tasks where machine learning works well. For example, part of the ImageNet classification task is distinguishing between 118 dog breeds. A human trained on these pictures can distinguish the breeds with over 80% accuracy. Neural net models can do slightly better than this. But we don’t know how to write a computer program with competitive dog breed accuracy that doesn’t use machine learning.
The standard resolution of a picture on an iPhone 14 is 4 megapixels, about 100 megabits when uncompressed. A human looking at an image of this size can classify a dog. But imagine showing someone a bit string with 100 million ones and zeros and asking whether the number of ones was even or odd. Not a single person could do this. And yet evaluating
X.sum() % 2
takes 20ms on my laptop. I can write a trivial program to compute the parity of the bits representing a file, but I can’t see the parity.
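If you want to see this for yourself, here’s a minimal timing sketch, assuming numpy and a random array of 100 million bits standing in for the uncompressed image (the 20ms figure will of course vary by machine):

import numpy as np
import time

# A random 100-million-bit string, a stand-in for the uncompressed image.
X = np.random.randint(0, 2, size=100_000_000, dtype=np.uint8)

start = time.perf_counter()
parity = int(X.sum()) % 2          # 0 if the count of ones is even, 1 if odd
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"parity = {parity}, computed in {elapsed_ms:.1f} ms")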
People can learn to make highly accurate predictions for dog breeds, but writing a syntactically correct computer program to complete this task feels impossible. No human can see parity, but writing a computer program to compute parity is beyond trivial. This paradox drives AI researchers insane.
People read Minsky and Papert’s book as claiming that Perceptrons should be abandoned. But in their revised edition that appeared 20 years later, they argued this wasn’t what they were after at all. They wanted an understanding of knowledge representation. What was the right way to arrange the bits so that computers could be guaranteed to recognize patterns? They wanted a mathematical definition of what makes a pattern recognizable.
Unfortunately, the answer seems to be “If I can get my pytorch to work, then the pattern is predictable.” And, conversely, if I can prove a pattern is not predictable, I can write a short computer program that gets 100% accuracy. I can understand why this is disappointing to the symbolic logic folks. The Perceptron analysis shows that if a data set admits a large margin classifier, then the Perceptron converges quickly. But which problems have large margin? I’m sorry to say, I have no idea.
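For the record, here’s a minimal sketch of the Perceptron update rule Simon was describing, assuming numpy, a feature matrix X, and labels y in {-1, +1} (the names and defaults here are mine, not from any particular library). If the data admit a linear separator with positive margin, the classic mistake-bound argument says this loop stops after finitely many updates.

import numpy as np

def perceptron(X, y, max_epochs=100):
    # Classic Perceptron: on each mistake, push the weights toward the true label.
    # Converges when the data admit a large-margin linear separator.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # reinforce the correct direction
                b += yi
                mistakes += 1
        if mistakes == 0:                # perfectly separated: done
            break
    return w, b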
Here’s my quiz to determine whether a pattern classification problem is solvable.
Does conventional wisdom say that the classifier is straightforward?
Can we not write a computer program that does it?
Can we collect a lot of data?
Are the stakes low if you make an error?
Do you have access to a ton of computing power?
If you answered yes to all of these questions, then go machine learn away! Machine learning is for the problems where we are convinced there is a good classification rule, but we can’t articulate how to write this rule in computer code. (And yeah, I think protein structure prediction falls into this category). It’s for problems where it’s OK if we make a few mistakes as long as we can reasonably capture the gestalt.
Make a model large enough. Make your data large enough. Add enough layers. You’ll find a perfect classification rule eventually. Or you won’t. Now go engineer stuff for me, bucko.
Machine learning is what we do when we don’t understand. When we do understand, we just write the damned code.
The plenary is reprinted in Chapter 3 of Machine Learning: An Artificial Intelligence Approach, eds. R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, 1983.
"Machine learning is what we do when we don’t understand. When we do understand, we just write the damned code."