One of the more irksome parts of Frictionless Reproducibility is how people dismiss it by saying it’s obvious. It’s obvious that you should split your data into training and testing. It’s obvious that shared data sets are the best way to evaluate machine learning methods. It’s obvious that code should be widely shared and open.
None of this was “obvious” even in 2012. Highleyman’s data is an anachronism decades ahead of its time. Indeed, despite the formation of journals and conferences, there was a great stagnation in pattern recognition methods after 1970. Papers from the 1970s in Pattern Recognition operated on tiny datasets that were seldom shared. By the time people started embracing the term “machine learning” in the 1980s, pattern recognition wasn’t even part of the story.
I keep coming across characters who don’t get fancy awards but who are arguably more important for driving the field forward. Why these people don’t get credit is worth pondering. Reading the papers and talking with the people active in the 1980s, it seems like the "connectionists" were only right by accident.
At the first ICML workshop, Herb Simon declared the Perceptron “didn’t get anywhere.” The papers at this meeting are also unreadable by today’s ML researchers. They are based in old-fashioned AI, and even figuring out the “results” is challenging for the modern reader.
The first N[eur]IPS 0 proceedings from 1987 are, on the other hand, written in the same lofty, romantic language as today. The ideas and aspirations at N[eur]IPS haven’t changed much in the three and a half decades of the conference. I always love to post this selection from [the first official proceedings](https://proceedings.neurips.cc/paper_files/paper/1987) as the titles are timeless:
- MURPHY: A Robot that Learns by Doing 
- How Neural Nets Work 
- Encoding Geometric Invariances in Higher-Order Neural Networks 
- Performance Measures for Associative Memories that Learn and Forget 
- An Optimization Network for Matrix Inversion 
- Constrained Differential Optimization 
- Introduction to a System for Implementing Neural Net Connections on SIMD Architectures 
If you had told me these were from 2017, I’d have believed you.
But it also wasn’t clear at N[eur]IPS that Pattern Recognition would end up consuming this conference. John Platt (who wrote that excellent Constrained Differential Optimization paper) and I discussed this a few years ago. He recalled confusion and excitement:
“Remember that in that era, we were deeply deeply confused about what ML was about. We didn't even realize it was a branch of statistics until Baum and Wilczek published their paper in N[eur]IPS 0.”
“Pre-1987, the neural network field was deeply confused, but in a very hopeful way (it wasn't called ML). There was hope that neural nets would displace all computation. That it would be a new way to program, with Brain-like software and hardware. There was tremendous excitement in the air, even if we were all deeply confused. Everyone was defining their own problem and pulled in different directions.”
Data-set benchmarking and competitive testing found their way into machine learning in the late 1980s through a combination of factors. Wikipedia has a funny page listing datasets for machine learning research. With the lone exception of Fisher’s Iris dataset, all of these datasets are from 1986 or later. And Iris itself was not a machine learning benchmark until 1987.
What was special about the late 1980s? First, email and file transfer were becoming more accessible. The current specification of FTP was finalized in 1985. In 1987, a PhD student at UC Irvine named David Aha put up an FTP server to host data sets for empirically testing machine learning methods. Aha was motivated by service to the community, but he also wanted to show his nearest neighbor methods would outperform Ross Quinlan’s decision tree induction algorithms. He formatted his datasets using the “attribute-value” representation that Quinlan had adopted with his TDIDT algorithm. And so the UC Irvine Machine Learning Repository was born.
The other notable shift in machine learning was a demand from funding agencies for more quantitative metrics. AI had found itself in one of its perennial funding winters, and program managers demanded more “results” before they’d be willing to write grant checks. In 1986, DARPA PM Charles Wayne proposed a speech recognition challenge where teams would receive a training set of spoken sentences and be evaluated by the word error rate their methods achieved on a hidden test set.
Wayne worked with the National Institute of Standards and Technology to create and curate this data set, which we now know as TIMIT. TIMIT was still a bit too large to share via file transfer, so it was released in December of 1988 on a CD-ROM, the punch cards of the 1990s.
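For readers who haven’t computed one, here is a rough sketch of the word error rate metric mentioned above. It assumes the textbook definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a system’s hypothesis, divided by the number of words in the reference. The function name and the whitespace tokenization are my own illustrative choices, and this is not a reconstruction of the scoring software used in the DARPA evaluations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length.

    Assumption: words are whitespace-separated tokens compared exactly.
    Real scoring tools typically normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that the metric can exceed 1 when the hypothesis contains many inserted words. What makes it a benchmark-friendly number is that it is a single scalar anyone can compute on a held-out test set and then try to push down.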
Improvements in computing greased the wheels, giving us faster computers, faster data transfer, and smaller storage footprints. Demand for quantitative metrics forced us to find consistent, reliable quantities for comparison. But no matter how hard I try to come up with an explanation, I can’t understand how the only sticky part of machine learning is this core of mindless pattern recognition. How is it that the shift in the 1980s ends up looking exactly like the paradigm Highleyman kickstarted in 1959? How is it that early ICML and N[eur]IPS had no idea that Pattern Recognition would be the only thing that would work? How is it that for every “benchmark” task that comes out, we end up with a train-test split and some simple, climbable metric? And why is it that certain benchmarks carry more weight than others? These are the questions that most interest me in machine learning. Tomorrow, I’ll look at some of the blockbuster benchmarks to see if we can extract any insights.


Inspired by all your digging, I started looking into the history of cross-validation. Via Cosma Shalizi's excellent notes https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ I discovered Stone (1974) "Cross-Validatory Choice and Assessment of Statistical Predictions". It contains a brief discussion of pre-1959 sample-splitting work (from the 30s, 40s, and 50s), including a series of papers published in a 1951 "symposium" on "The need and means of cross-validation". I'd love to hear your thoughts on this line of historical work, as it relates to Highleyman’s contributions!
Platt scaling always seemed like some kind of duct tape (in a respectful way).