# Why is reinforcement learning so hard to benchmark?

### A new paper provides more evidence that we probably haven't learned much about reinforcement learning from competitive testing.

I've been obsessed with machine learning evaluation for a decade. The community’s research strategy has always been “Run a Kaggle competition and whatever gets the highest score is both the solution to all future problems and how the brain works.” And somehow, this leads to progress! We’ve come up with architectures, algorithms, and analyses through this weird paradigm of competitive testing.

But competitive testing is not a panacea. The current mess around LLM evaluation shows that “just make better benchmarks” can just lead to confusion rather than clarity. Another place where competitive testing has been a woeful mess is reinforcement learning.

Lots of folks have written about the problematic state of reinforcement learning benchmarks. The “robotics” benchmarks in MuJoCo are pretty laughably bad. They have all sorts of sensitivities to random seeding, and dumb algorithms can perform exceptionally well on these benchmarks.

Another famous suite of benchmarks in RL is the “Arcade Learning Environment (ALE),” assembled by Michael Bowling’s group at the University of Alberta. ALE is a bunch of Atari games like Pong and Q-Bert. One of the first splashy demos from Deepmind was training deep reinforcement learning to play these Atari games. But then the RL community decided that these were benchmarks to use in their papers, and thousands of algorithms now “learn” to play Atari.

But is this a helpful exercise? A new paper by Cassidy Laidlaw and advisors shows that the answer is a convincing no. Cassidy wrote code to transform the ALE games into a tabular representation. The tabular representation gives a map from (state,action) to next state. I was somewhat surprised that all of these games had fairly small state spaces (no more than 10 million states).

Once we have tabular representations, we can run boring dynamic programming algorithms to understand the complexity of these associated Markov Decision Problems. Cassidy found that one step of policy iteration solved about 70% of the benchmarks. Moreover, he found a strong correlation between the benchmarks where policy optimization succeeded and the ones where Deep Reinforcement Learning succeeded. Deep RL could only solve easy problems.

What does it mean that policy iteration converges in one step? I still don’t intuitively understand what this means other than the MDPs are really simple. It means that if you consider a greedy action now and then random actions in the future, this is about as good as the optimal action. This means that planning and exploration are not needed to solve these “benchmarks.” And hence, the benchmarks aren’t hard. We all suspected this already, but this case is the most compelling I’ve seen for this set of demos so far.

I had a conversation on Twitter about this, and it was illuminating. Csaba Szepesvari said that approximate dynamic programming was known to have poor behavior on long time horizons, and there were theoretical bounds proving this was the case. This means that if approximate dynamic programming works, it is necessarily solving an easy problem. Nan Jiang also suggested that since the state spaces were so small and such little exploration was being used, neural nets trained on pixels might just be memorizing frames and devising simple greedy policies. Considering that these algorithms run on millions of game rollouts, I wouldn’t be surprised if they were indeed memorizing almost everything.

Is the conclusion that we need better RL benchmarks? I’m not sure. My takeaway is dimmer: Perhaps the only RL problems we have any hope of solving are the easy ones.

It won’t be the same as throwing vitriol on Twitter, but if you have thoughts, please post a comment!

This is super interesting, thanks for sharing that paper and your thoughts.

It reminds me of two things that may [or may not] be somehow related:

1. There were these works on how in some problems choosing the optimal action is easy, even if the full evaluation is hard. In particular I'm reminded of this ICML 2016 paper by Şimşek, Algorta, and Kothiyal (https://proceedings.mlr.press/v48/simsek16.html) regarding Tetris. There is some theory in that paper explaining why and when this might be the case. I'm not sure how well it generalizes to domains such as ALE or MuJoCo, but at least on the level of the basic observation, things seem related maybe.

2. Even more speculative and hand-wavy, regarding the correlation of PI performance w/ PPO performance, maybe this is another hint that all these methods basically do the same thing. In particular, there is a close connection between TRPO and Natural Policy Gradient, and there is a (sort of) equivalence between NPG and PI, at least for tabular softmax policies (Kakade 2001). If the "deep" models rely heavily on memorization of individual state-actions then perhaps it's not entirely unreasonable that most grad steps go into "building the lookup table" and the actual PG part is smaller (hence the success where a single PI step would suffice).

I haven't read the paper yet so not sure about the amount of memory involved, but just curious to see why do you consider a state space of 10 mil to be small? As far as ease is concerned, do you think it is due to the fact that most of these environments are deterministic? (Maybe that is why policy iteration converges quickly?)