The Department of Frictionless Reproducibility
In which I feel very seen by David Donoho
David Donoho dropped a characteristically provocative essay yesterday about the rise of “Frictionless Reproducibility.” Everyone in the class now has assigned reading. Serendipitously, Dave’s manifesto is fully aligned with where we’re going in this course on machine learning, and I wanted to use his language to frame the next two lectures on the competitive testing paradigm that has driven machine learning research since its inception.
Not surprisingly coined by the man who invented the term “Compressed Sensing,” “Frictionless Reproducibility” is a two-word encapsulation of all of the contributions and aspirations of “Data Science,” a boring catch-all that is a borrowed buzzword from antiquated industrial hype. I’d much rather be in a College of Frictionless Reproducibility. Someone write the Dean to request a name change.
Too often, “Data Science” is cast as a watered-down hybrid of freshman statistics and Python. This couldn’t be farther from the intellectual foundation that has revolutionized quantitative research over the past two decades. Instead, Donoho nails the three pillars: sharable data, re-executable code, and competitive testing.
As he always does, Dave pithily captures what I’ve been fumbling to articulate for the entire semester. Let me explain each of these components in the context of machine learning.
Time and time again, I have cast machine learning as driven by a train-test evaluation paradigm. This is the only component we can justify as fundamental. Hence, Competitive Testing is at the very core of machine learning. We first decide that we only care about “lowest prediction error.” Once we’ve settled on this goal, we choose the method that works best in the lab on our train-test split.
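To make the train-test ritual concrete, here is a minimal sketch of a competition between two methods. Everything in it is made up for illustration: the synthetic one-dimensional data, the 300/100 split, and the two hypothetical contestants (`fit_threshold` and `fit_majority`). The only part that reflects the paradigm itself is the last step, where we score each method on the held-out test set and crown whichever has the lowest error.

```python
import random

def make_data(n, seed):
    """Synthetic 1-D data (an assumption): class 0 centered at -1, class 1 at +1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2 * label - 1, 1.0)
        data.append((x, label))
    return data

def fit_threshold(train):
    """Contestant 1: split at the midpoint between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    midpoint = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > midpoint)

def fit_majority(train):
    """Contestant 2: always predict the most common training label."""
    ones = sum(y for _, y in train)
    guess = int(2 * ones >= len(train))
    return lambda x: guess

def error_rate(predict, data):
    """Fraction of examples the predictor gets wrong."""
    return sum(predict(x) != y for x, y in data) / len(data)

def run_competition(methods, train, test):
    """Fit every method on train, score on test, and pick the lowest test error."""
    scores = {name: error_rate(fit(train), test) for name, fit in methods.items()}
    winner = min(scores, key=scores.get)
    return winner, scores

data = make_data(400, seed=0)
train, test = data[:300], data[300:]
methods = {"threshold": fit_threshold, "majority": fit_majority}
winner, scores = run_competition(methods, train, test)
print(winner, scores)
```

Note what the sketch leaves out: nothing in `run_competition` asks why a method works or whether the test set resembles any application we care about. The leaderboard is the whole argument.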
But more is needed to drive progress. As we’ll discuss in today’s lecture, progress in machine learning stalled for two decades when people only did pattern recognition on their own private data sets. Machine learning has only advanced through dumb fights over whose dumb method was better on particular data sets. Sharable Data brings real competition to competitive testing. You’re not getting the test error low because you have an application you really care about engineering. You’re getting it low so you can gloat to your friends and crush your enemies. Sharable Data enables braggadocio. Private data does not.
For better or for worse, competition is the core of all of science. Scientists love to claim they are lofty idealists in search of fundamental truths. But in reality, they are ruthlessly competitive megalomaniacs who want accolades for their brilliance to ring out for centuries. I don’t think you can understand the scientific method without understanding this core human element of savage competition. And the competitive testing paradigm, embraced by machine learning since its dawn in playing checkers, distills competition into the cleanest, most unambiguous terms.
Finally, the most significant change in machine learning practice over the last decade has been re-executable code. I discussed this before, but getting people’s code before 2010 was incredibly difficult. Those who spent time writing good software packages (like SVMLight or Torch) saw their methods receive more citations. But it took a while for the field to catch on that good software was a faster way to a higher h-index than almost any other path. It’s easier to try to beat someone in competition if you can take exactly what they did and only change a couple of parts. David Aha, the creator of the UCI repository, explicitly says he initially built the repository because he thought his methods were better than Ross Quinlan's. (See Chapter 8 in PPA.) But when Aha put his repository together in 1987, we had barely invented FTP. In 2008, GitHub launched, and the world was never the same. The GitHub revolution is why we get 13,000 submissions to NeurIPS. All you have to do is download someone else’s code, make a couple of changes until you think reviewers will agree it’s different, and upload a PDF. Frictionless paper writing!
Conceptualizing Frictionless Reproducibility empowers us to reflect. We have made undeniable progress in this research paradigm. Beyond machine learning, this framework promises to revolutionize biology and drug discovery and maybe even medicine. And AI for Science initiatives proclaim they can transform all of the sciences. But what are its limits? What are its flaws? We know blind optimization is grotesque and deeply fragile. Every system that hyper-optimizes one metric necessarily neglects other concerns and is correspondingly vulnerable to hidden, catastrophic fragilities. What are these in the context of machine learning or Frictionless Reproducibility more broadly?
Over the next few blogs, I will review the history of Frictionless Reproducibility in machine learning. I’ll start in 1959 with the first machine learning data set, a collection of 50 handwritten alphabets gathered by Bill Highleyman at Bell Labs to test early OCR systems. And I’ll end with our massive crawled data sets now used to train enormous generative models for bullshit generation. I’ll try to tease out more reasons why it has succeeded in driving progress in machine learning. But I also want to highlight the warts of Frictionless Reproducibility. Does it entrench power in big companies? Does it harm people through oversight and blindered progress? Donoho has done us all a service by articulating what Data Science is. But now that this is out there, it’s up to us to ensure its future impacts are good.