Why is reinforcement learning so hard to benchmark?
A new paper provides more evidence that we probably haven't learned much about reinforcement learning from competitive testing.
I've been obsessed with machine learning evaluation for a decade. The community’s research strategy has always been “Run a Kaggle competition and whatever gets the highest score is both the solution to all future problems and how the brain works.” And somehow, this leads to progress! We’ve come up with architectures, algorithms, and analyses through this weird paradigm of competitive testing.
But competitive testing is not a panacea. The current mess around LLM evaluation shows that “just make better benchmarks” can just lead to confusion rather than clarity. Another place where competitive testing has been a woeful mess is reinforcement learning.
Lots of folks have written about the problematic state of reinforcement learning benchmarks. The “robotics” benchmarks in MuJoCo are pretty laughably bad. Their scores are wildly sensitive to random seeds, and dumb algorithms can perform exceptionally well on them.
Another famous suite of benchmarks in RL is the “Arcade Learning Environment (ALE),” assembled by Michael Bowling’s group at the University of Alberta. ALE is a bunch of Atari games like Pong and Q-Bert. One of the first splashy demos from DeepMind was training deep reinforcement learning agents to play these Atari games. But then the RL community decided that these were the benchmarks to use in their papers, and thousands of algorithms now “learn” to play Atari.
But is this a helpful exercise? A new paper by Cassidy Laidlaw and his advisors shows that the answer is a convincing no. Cassidy wrote code to transform the ALE games into a tabular representation. The tabular representation gives a map from (state, action) to next state. I was somewhat surprised that all of these games had fairly small state spaces (no more than 10 million states).
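To make this concrete, here’s a minimal sketch of what such a tabular extraction might look like. Everything here is hypothetical: `env_step` stands in for a deterministic emulator step function, the names are mine, and I’m assuming states are hashable and the reachable set is finite (as the paper’s state counts suggest).

```python
# Sketch of building a tabular MDP representation by exhaustive enumeration.
# `env_step` is a hypothetical function: given a hashable state and an action,
# it returns (next_state, reward). Deterministic dynamics are assumed.

def build_table(env_step, initial_state, actions):
    """Enumerate all reachable (state, action) -> (next_state, reward) pairs."""
    table = {}
    frontier = [initial_state]
    seen = {initial_state}
    while frontier:
        state = frontier.pop()
        for action in actions:
            next_state, reward = env_step(state, action)
            table[(state, action)] = (next_state, reward)
            if next_state not in seen:
                seen.add(next_state)
                frontier.append(next_state)
    return table
```

With at most ten million states and a handful of actions per game, a dictionary like this fits comfortably in memory, which is what makes the exact analysis below possible at all.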
Once we have tabular representations, we can run boring dynamic programming algorithms to understand the complexity of the associated Markov Decision Problems. Cassidy found that a single step of policy iteration solved about 70% of the benchmarks. Moreover, he found a strong correlation between the benchmarks that one step of policy iteration solved and the ones where Deep Reinforcement Learning succeeded. Deep RL could only solve the easy problems.
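For intuition, here is a bare-bones tabular policy iteration. This is a sketch of the standard textbook algorithm, not the paper’s code; it assumes a deterministic MDP stored as integer array `P` (next states) and float array `R` (rewards), both indexed by state and action.

```python
import numpy as np

# Tabular policy iteration for a deterministic MDP.
# P[s, a] = next state after taking action a in state s.
# R[s, a] = immediate reward. gamma = discount factor.

def policy_iteration(P, R, gamma=0.99, max_iters=100):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)  # start from an arbitrary policy
    for i in range(max_iters):
        # Policy evaluation: solve the linear system (I - gamma * P_pi) V = R_pi.
        P_pi = np.zeros((n_states, n_states))
        P_pi[np.arange(n_states), P[np.arange(n_states), policy]] = 1.0
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * V[P]
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V, i  # i = improvement steps before convergence
        policy = new_policy
    return policy, V, max_iters
```

The striking empirical claim is that for most ALE games, this loop stops after a single improvement step.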
What does it mean that policy iteration converges in one step? I still don’t have much intuition for it beyond “the MDPs are really simple.” Concretely, it means that taking a greedy action now, assuming you’ll act randomly in the future, is about as good as acting optimally. Neither planning nor exploration is needed to solve these “benchmarks.” And hence, the benchmarks aren’t hard. We all suspected this already, but this is the most compelling case I’ve seen against this set of demos so far.
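That statement can be written as a short sketch: evaluate the uniform random policy exactly, then act greedily against its value function. One-step convergence means the resulting policy is already (near-)optimal. As before, the names and the deterministic-MDP encoding (`P` for next states, `R` for rewards) are my assumptions, not the paper’s code.

```python
import numpy as np

# One step of policy iteration starting from the uniform random policy.
# P[s, a] = deterministic next state, R[s, a] = reward, gamma = discount.

def one_step_greedy(P, R, gamma=0.99):
    n_states, n_actions = R.shape
    # Exact value of the uniform random policy: V = E_a[R + gamma * V(next)].
    P_rand = np.zeros((n_states, n_states))
    for a in range(n_actions):
        P_rand[np.arange(n_states), P[:, a]] += 1.0 / n_actions
    R_rand = R.mean(axis=1)
    V_rand = np.linalg.solve(np.eye(n_states) - gamma * P_rand, R_rand)
    # Greedy improvement against the random policy's value function.
    return (R + gamma * V_rand[P]).argmax(axis=1)
```

If this one greedy step already yields the optimal policy, no lookahead deeper than “what’s best if I behave randomly afterward” was ever required.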
I had a conversation on Twitter about this, and it was illuminating. Csaba Szepesvari said that approximate dynamic programming was known to behave poorly on long time horizons, and there are theoretical bounds proving this is the case. This means that if approximate dynamic programming works, it is necessarily solving an easy problem. Nan Jiang also suggested that since the state spaces are so small and so little exploration is being used, neural nets trained on pixels might just be memorizing frames and devising simple greedy policies. Considering that these algorithms run on millions of game rollouts, I wouldn’t be surprised if they were indeed memorizing almost everything.
Is the conclusion that we need better RL benchmarks? I’m not sure. My takeaway is dimmer: Perhaps the only RL problems we have any hope of solving are the easy ones.
It won’t be the same as throwing vitriol on Twitter, but if you have thoughts, please post a comment!
Thanks for reading arg min substack! If you dig it, please subscribe.