When I was a graduate student there in the aughties, the motto of the MIT Media Lab was “Demo or Die.” We had biannual sponsor meetings where we were compelled to show off embodiments of our research to secure an extra year’s research budget. At the lab, demos were more important than publications. This hyperfocus on demonstration over rigor brought derision from my friends in more respectable labs like CSAIL and LIDS. And even we Media Lab grad students knew this demo culture was tacky. Technology shouldn’t be based on demonstration, should it?
I’m not so sure. Demonstration is inarguably core to engineering success. Showing you can do something by doing it is more impressive than any plan. And it’s been the demos, not rigorous evaluations, that have driven computer science research over the last decade.
Demonstrations in many ways are their own form of benchmark. If we go back to the original definition, an evaluation measures the difference between our articulated expectations of a system and its actual performance. The best demos are like magic tricks: the observer doesn’t believe what you are about to do is possible, and then you demonstrate that it is. Each demo is a baby step along the way to a fully working system.
In Artificial Intelligence, the demonstration-as-benchmark has been core from the start. In the paper that coined the term “machine learning,” Arthur Samuel made the case for his twenty-year obsession with writing a computer program that could play checkers. He wanted to use checkers as a benchmark for more complex decision making. He argued:
“A game provides a convenient vehicle for such study as contrasted with a problem taken from life, since many of the complications of detail are removed…. Checkers contains all of the basic characteristics of an intellectual activity in which heuristic procedures and learning processes can play a major role and in which these processes can be evaluated.”
Indeed, one of the appealing aspects of using games as demonstrations is that people have gut feelings about how challenging the task is, and thus would be surprised if a computer were competent.
“The activity should be one that is familiar to a substantial body of people so that the behavior of the program can be made understandable to them. The ability to have the program play against human opponents (or antagonists) adds spice to the study and, incidentally, provides a convincing demonstration for those who do not believe that machines can learn.”
Following Samuel, games remain a passion of AI enthusiasts. Backgammon, Chess, Go, StarCraft, and Diplomacy all demonstrate something recognizably difficult to people, and make it undeniable that something interesting is happening.
DeepMind and OpenAI, two of the most successful AI companies, understand the value of demos. Putting on shows builds excitement. Even if what you did isn’t exactly what you wanted, showing progress gets you there. There would be no AlphaGo without the Atari demos. The Atari demos, in retrospect, have barely anything to do with AlphaGo, and they are pretty thin as scientific artifacts and benchmarks. But it was the initial demos on Atari games that secured DeepMind the funding commitments needed to do AlphaGo. In the intellectual tradition of Samuel, even beating humans at Go was a demo. The demonstration of dominance in such a complex game let people imagine that the same ideas could solve other hard problems. Similarly, without big parties for AIs playing Dota 2 and snazzy videos of robot hands solving Rubik’s cubes, a lone engineer doesn’t get to scale up an old-fashioned language model to test when its performance caps out…
I don’t know what the right balance of demo to rigor is. I don’t think anyone does. A demo-only culture creates a nebulous sea of frustration for academic research. Robotics researchers struggle with how much to lean on the demo as a principal form of evaluation. With such diverse platforms and tasks, communal “baselines” and “benchmarks” are hard to pin down. And hence, a video literally becomes critical to every submission. You could argue that academic robotics has embraced the Media Lab ethos that demos are more important than papers. However, this isn’t necessarily helpful. Without a robust common task, qualitative estimates of impressiveness leave everyone a bit stymied about progress. It makes it harder to build upon each other’s work.
Moreover, in industry, the demo eventually has to become a product. And that product has to be safe and reliable. We can’t build safety-critical systems by demonstration alone. Since demonstrations are usually also vehicles for fundraising, the question becomes: how much does it cost to go from the initial demonstration to a viable, safe product?
The forgotten world of High Availability Computing invented the reliability concept of “five nines.” A computer system had five nines if it was available 99.999% of the time. In terms of actual time, such systems were down for about five minutes a year. It’s hard to appreciate now, but in 1980, mainframe systems typically offered only two nines of availability, meaning they could be down for a hundred minutes every week.
Abusing the terminology, demos are systems that typically have one eight. They work 80% of the time. It is remarkable that for almost any problem you can imagine, machine learning can today get you a demo with one eight. This wasn’t true when I was in grad school. It would have made Media Lab demo day a lot easier for me!
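To make that gap concrete, here is a quick back-of-the-envelope calculation, sketched in Python. The availability levels are the ones quoted above, with “one eight” used in the same loose sense; the function and variable names are just mine.

```python
# Back-of-the-envelope: convert an availability level into allowed
# downtime per year. Figures match the ones quoted in the text.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes per year a system at this availability may be down."""
    return (1.0 - availability) * MINUTES_PER_YEAR

levels = [
    ("one eight (80%)", 0.80),
    ("two nines (99%)", 0.99),
    ("five nines (99.999%)", 0.99999),
]

for label, availability in levels:
    print(f"{label:>22}: {downtime_minutes_per_year(availability):>10,.1f} min/year")

# one eight  (80%)      -> 105,120.0 min/year (roughly 73 days)
# two nines  (99%)      ->   5,256.0 min/year (about 100 minutes a week)
# five nines (99.999%)  ->       5.3 min/year
```

Going from one eight to five nines means shrinking the failure budget by a factor of about twenty thousand.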
And yet, despite this amazing progress in machine learning, the gap between one eight and five nines is often insurmountable. We got lucky with airplanes and computers, where we discovered design principles that gave paths from prototypes to scalable, robust, resilient systems. This is not common, I’m afraid.
We don’t have such design principles for machine learning and AI systems yet. Scaling law enthusiasts claim that you conquer the exponential difficulty of eking out more nines by building exponentially more data centers (a sketch of that arithmetic is below). We’ll have to see how much money people are willing to throw at this to find out (so far, a lot). If they’re right, they’ll vindicate the Media Lab’s ethos: demonstrations will be all you need.
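For what that “exponential difficulty” amounts to, here is a toy calculation. It assumes test error falls off as a power law in compute; both the power-law form and the exponents below are illustrative assumptions, not fitted scaling laws.

```python
# Toy arithmetic, not a measured scaling law: suppose error ~ compute**(-alpha).
# Then each additional "nine" of reliability (a 10x cut in the error rate)
# costs a fixed multiplicative factor of 10**(1/alpha) in compute --
# exponentially more compute for linearly more nines.

for alpha in (1.0, 0.5, 0.25):  # hypothetical exponents, for illustration only
    factor = 10 ** (1 / alpha)
    print(f"alpha = {alpha:.2f}: {factor:,.0f}x more compute per extra nine")

# alpha = 1.00:     10x more compute per extra nine
# alpha = 0.50:    100x more compute per extra nine
# alpha = 0.25: 10,000x more compute per extra nine
```

The smaller the exponent, the steeper the price of each additional nine.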