There's got to be a better way!
From Reformist RL to the principle of certainty equivalence.
You might come away from Tuesday and Wednesday’s posts thinking I’m a fan of Reformist RL. I am decidedly not, so let me clarify my position. I am a fan of the clarity the Reformist perspective brings. I like that it removes the magical and mystical storytelling from the field. I like that we can cleanly specify the problem without mathematics. I like that I can teach everything needed to spin up reinforcement learning in a single lecture of an undergraduate machine learning course. I like that we can now clearly talk about what RL is, without making tenuous cognitive analogies. And though I have admittedly not engaged with the data enough, reformist RL does seem to have found a strong niche in fine-tuning language models. It seems to work there in a way far more convincing than I’ve seen in any other context.
But there’s an elephant in the room here that I have not discussed this week. In my experience, the techniques you get from reinforcement learning are almost always… bad. In both practice and theory, RL is never what you want. Let me describe what I mean, propose an alternative, and ask whether this alternative can be more broadly applied.
As a computational paradigm, reinforcement learning is brutally inefficient. Policy gradient, the core meta-algorithm of Reformist RL, requires near infinite iterations back and forth with the environment to find solutions. You can even prove this. Almost all of the theoretical results for reinforcement learning are negative! No matter how much mystical “variance reduction” or “advantage estimation” you implement, the rules of reinforcement learning doom your methods to be inefficient. For example, on Tuesday, I described how to use policy gradient to maximize arbitrary functions. The only information the algorithm has access to is noisy evaluations of the objective function. The RL interaction scheme has a technical name: stochastic derivative-free optimization. In this model of optimization, the best algorithms require a number of samples cubic in the dimension of the search space. It is hard to find slower algorithms for minimizing differentiable functions. Similarly, if you believe in Markov Decision Processes, optimal algorithms using the RL interaction scheme require a number of interactions proportional to the number of (state, action) pairs in the system. To find an optimal strategy, you need to observe the impact of every action in every conceivable configuration of the system multiple times. This is also a negative result. How many “states” does a video game have?
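To make that interaction model concrete, here is a tiny sketch of policy gradient as stochastic derivative-free optimization on a toy quadratic. The objective, the dimension, the step size, and the noise level are all made-up placeholders; the point is only that the algorithm sees nothing but noisy scores of its guesses.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 25                                   # dimension of the search space

def noisy_eval(x):
    """The only thing the algorithm sees: a noisy score of a candidate point."""
    return -np.sum((x - 1.0) ** 2) + rng.normal(scale=0.1)

# Policy gradient with a Gaussian "policy" N(theta, sigma^2 I):
# guess a perturbation, score it, and nudge theta along the score-function estimate.
theta, sigma, lr = np.zeros(d), 0.3, 5e-4
baseline = noisy_eval(theta)             # a crude baseline: running average of rewards
for t in range(100_000):                 # count the environment interactions
    eps = rng.normal(size=d)
    reward = noisy_eval(theta + sigma * eps)
    theta += lr * (reward - baseline) * eps / sigma
    baseline = 0.99 * baseline + 0.01 * reward

print("distance to optimum after ~100,000 evaluations:", np.linalg.norm(theta - 1.0))
```

Even with the baseline doing its variance-reduction job, this sketch burns through a hundred thousand noisy evaluations to get within spitting distance of the optimum of a 25-dimensional quadratic.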
Moreover, even within these spaces where all algorithms are inefficient, naive implementations of policy gradient are particularly bad. The glacier-melting slowness of policy gradient is why everyone spends so much time inventing new baseline strategies. Unfortunately, implementing these baselines and accelerations correctly is nontrivial. Even the most ardent RL believers will tell you, “Reinforcement learning results are tricky to reproduce: performance is very noisy, algorithms have many moving parts which allow for subtle bugs, and many papers don’t report all the required tricks.” In the same post, they write, “RL algorithms are challenging to implement correctly; good results typically only come after fixing many seemingly-trivial bugs.” You can try to tell me things are better 8 years later, but I’ve run PPO before, folks.
But there is hope. Every time I have looked at a reinforcement learning problem, I’ve found an alternative implementation that is orders of magnitude more efficient. Every. Time. Notably, in most reinforcement learning settings, there is a reasonable alternative to policy gradient that requires vastly fewer samples from the associated environment: the principle of certainty equivalence.
Build a model of the environment using standard predictive tools.
Optimize as if the model were true.
In the reinforcement learning model I put forward on Wednesday, you can prove that certainty equivalence is optimal. I proved this was optimal for the multi-armed bandit in my live blog of Lecture 19. We spend a lot of time in Chapter 12 of Patterns, Predictions, and Actions explaining how certainty equivalence is optimal in other RL settings, such as contextual bandits, MDPs, and optimal control.
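For the multi-armed bandit, the recipe is short enough to write down. Here is a minimal explore-then-commit sketch of certainty equivalence; the arm means and the pull counts are toy numbers I made up, not the setup from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])        # unknown to the algorithm

def pull(arm):
    """One interaction with the environment: a noisy reward for the chosen arm."""
    return rng.binomial(1, true_means[arm])

# Step 1: build a model with standard predictive tools (here, sample means).
n_explore = 50
estimates = np.array([np.mean([pull(a) for _ in range(n_explore)])
                      for a in range(len(true_means))])

# Step 2: optimize as if the model were true (commit to the best estimated arm).
best_arm = int(np.argmax(estimates))
print("estimated means:", estimates, "-> play arm", best_arm)
```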
Certainty equivalence also reveals that there are other signals you can take advantage of beyond “rewards” the environment provides. In the optimization example from above, an agent can run gradient descent instead of policy gradient, dramatically accelerating convergence. In control settings, state observations can be used to build models more quickly. And autonomous systems can be designed to seek more information if it’s helpful for performance. You can even make systems robust to modeling errors in the certainty equivalent paradigm.
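To see the size of the gap, compare against the policy gradient sketch above: same toy quadratic, but now the agent gets to use gradients instead of noisy scores.

```python
import numpy as np

# Same toy objective as the sketch above, but now the agent can see gradients.
d = 25
theta = np.zeros(d)
for t in range(100):                       # versus ~100,000 noisy evaluations above
    grad = -2.0 * (theta - 1.0)            # exact gradient of -||x - 1||^2
    theta += 0.1 * grad                    # plain gradient ascent

print("distance to optimum after 100 gradient steps:", np.linalg.norm(theta - 1.0))
```

A hundred exact gradient steps land essentially on the optimum that the score-function estimator was still circling after a hundred thousand noisy evaluations.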
Moreover, and more convincingly, the principle of certainty equivalence is how most control design is actually done. Spend time in a lab building a model you believe in. Execute as if your model is true.
Alright, so let’s go back to why I’m even bothering to talk about RL in the first place. It’s not video games or robotics. It’s reasoning models.1 I awoke from my RL slumber because Dimitris Papailiopoulos kept trolling me about how RL worked now, but only in LLMs.
I started poking around, and everyone was speaking in RL tongues again. They were saying value, advantage, actor, critic, GFYPO. But when I looked at the papers, all I saw was “guess and check.” Guess a bunch of answers to math questions, and fine-tune the model on the answers that get scored as correct. Damek Davis and I spent a couple of weeks reading ten thousand arXiv papers, and we determined that all of the new fancy reasoning methods were solving microvariations of the same problem: maximize the probability that the LLM would correctly answer questions from the given benchmark.
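In miniature, with everything about the real pipelines stripped away, the update looks something like this. The candidate answers, the verifier, and the learning rate are all invented for illustration; real systems sample chains of thought from an LLM and grade them, but the shape is the same: sample, check, push up what checked out.

```python
import numpy as np

rng = np.random.default_rng(0)
answers = ["11", "42", "7", "13"]            # candidate answers to one benchmark question
correct = "42"                               # the verifier's answer key

logits = np.zeros(len(answers))              # stand-in for the model's tunable weights

def sample_answer(logits):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(answers), p=p), p

# Guess and check: sample answers, score them 0/1, and push up the log-probability
# of the guesses that were scored correct.
lr = 0.5
for step in range(200):
    idx, p = sample_answer(logits)
    reward = 1.0 if answers[idx] == correct else 0.0
    grad = np.zeros(len(answers))
    grad[idx] = 1.0                          # gradient of log p(sampled answer) is onehot - p
    logits += lr * reward * (grad - p)       # REINFORCE on the verifier's 0/1 score

p_final = np.exp(logits - logits.max()); p_final /= p_final.sum()
print("probability of the correct answer:", p_final[answers.index(correct)])
```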
It was this success in reasoning models that made me realize that guess-and-check is what RL really is. All of the MDP mumbo jumbo is secondary.
So let’s close the loop. In the past, reinforcement learning has always been incredibly hard to tune, inefficient, and slow. Is this time in LLMs different? It could be! RL in reasoning models looks different from RL applied to robotics or even the mess people call “RLHF”. Reasoning models could be the precise niche where it’s really the best thing to do, and we need to cover Kansas with data centers to solve the Putnam Exam. I understand there are capital interests that want us all to believe this is the only way.
But what if they are wrong? What if some means of certainty equivalence is optimal here, too? The arXiv certainly has no shortage of proposed alternatives to RL for reasoning models. Some people show that better prompts give better answers. Some people say you just need better synthetic data. Some people claim you just need to modify the sampler. For instance, Aayush Karan and Yilun Du show that simply sampling from the square of the LLM probability distribution matches the scores of reinforcement learning on many benchmarks. That’s weird! The fact that slightly different sampling gets you most of the RL bang for your buck is certainly suggestive that there’s plenty of room for improvement here. I would not be at all surprised if we could accelerate training reasoning models by a factor of 100. I would not be surprised if someone found a path to a factor of 1,000. That seems to be worth looking into.
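For what it’s worth, the basic operation is easy to see on a made-up categorical distribution over candidate answers: squaring and renormalizing concentrates mass on the answers the model already favors. (Actually sampling from the squared distribution of an autoregressive model takes real work; their paper has the algorithm. This is just the arithmetic.)

```python
import numpy as np

# "Sample from the square of the distribution": squaring and renormalizing
# sharpens a distribution toward its mode.
p = np.array([0.40, 0.25, 0.20, 0.15])   # made-up probabilities over candidate answers
q = p ** 2 / np.sum(p ** 2)

print("original:", p)                     # mode has probability 0.40
print("squared :", np.round(q, 3))        # mode now has probability ~0.56
```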
Did you see what I did there?

