23 Comments
Alexandre Passos's avatar

Fun thing: I think RLHF as done before this whole math craze is certainty equivalence, where you first train a model of what a human rater will think of an answer, and then do policy gradient to make your policy model spit out more of the answers that human raters like.
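[A minimal sketch of the two-stage recipe being described, the certainty-equivalence reading of RLHF: fit a reward model to human ratings, then treat it as the true objective and run policy gradient. The `reward_model`, `policy.sample_with_logprobs`, and data formats are illustrative placeholders, not any lab's actual API.]

```python
import torch
import torch.nn.functional as F

def train_reward_model(reward_model, rated_pairs, optimizer):
    """Stage 1: supervised fit of human preferences (Bradley-Terry style)."""
    for prompt, chosen, rejected in rated_pairs:
        margin = reward_model(prompt, chosen) - reward_model(prompt, rejected)
        loss = -F.logsigmoid(margin).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def policy_gradient_step(policy, reward_model, prompts, optimizer):
    """Stage 2: certainty equivalence -- score sampled answers with the learned
    reward model and push up the log-probability of high-reward samples (REINFORCE)."""
    responses, logprobs = policy.sample_with_logprobs(prompts)  # hypothetical API
    with torch.no_grad():
        rewards = reward_model(prompts, responses)
        advantages = rewards - rewards.mean()                   # simple mean baseline
    loss = -(advantages * logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```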

To me the main point you're missing about reform RL (love the term btw) is that it's really a latent variable optimization problem. The way you get to it is by (1) observing that LLMs sometimes give better answers when you let them stick a bunch of random tokens in between the problem and the solution, and (2) guessing and checking which of these tokens work. If we had better ways of guessing which values of these latent tokens to set to get the answers we want, we could bypass all this guess-and-check sadness. But latent variable optimization is still hard.

Maybe it's because I grew up in old-school NLP, but to me this is really EM.

Ben Recht's avatar

1. Great point about RLHF. I've never understood why they don't backprop through their model once they have built their predictor.

2. I am not sure what to make of the token babbling in language models. I have heard the idea that it's trying to coax out latent state, but I've never seen anyone make that actionable. Do you have any reading you'd recommend for me?

Alexandre Passos's avatar

Re (1): I have often wondered the same thing, and I think the reason trivial implementations of it don't work is that sampling tokens is non-differentiable. Given a token sequence sampled from your policy, your RM score is in principle differentiable, but you can't backprop through the model weights that produced the sampled tokens. You can try to make an RM that takes activations as inputs instead of tokens, but then you still stumble over the fact that once your policy output is more than one token, you also cannot easily backprop through the sampling process (since later tokens condition on earlier ones). It's a subtle point that's never explicitly laid out in the literature, and I think if we frame it this way, we should be able to come up with a Gumbel-Softmax-like trick to make sampling mostly differentiable. I wonder why that has never been tried.
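[A toy sketch of that Gumbel-Softmax idea for a single token position, under the assumption that the RM can take embeddings rather than token ids. `lm_head_logits`, `token_embeddings`, and `reward_model_on_embeddings` are hypothetical; a real multi-token rollout would need to relax every autoregressive sampling step, which is exactly the subtle part discussed above.]

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    """Differentiable relaxation of sampling one token from `logits`."""
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits)))
    return F.softmax((logits + gumbel_noise) / tau, dim=-1)  # soft one-hot over the vocab

def relaxed_rm_loss(lm_head_logits, token_embeddings, reward_model_on_embeddings):
    soft_token = gumbel_softmax_sample(lm_head_logits)   # (vocab,) soft one-hot
    soft_embedding = soft_token @ token_embeddings       # differentiable "token" embedding
    reward = reward_model_on_embeddings(soft_embedding)  # RM consumes activations, not ids
    return -reward                                       # gradient flows back into the policy weights
```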

Ironically, though, PPO has a differentiable value model built into its math, but word on the street is that anyone still using PPO mostly turns off that part of the algorithm because it doesn't help at all for LLMs. Not sure why.

Re (2): my silly opinion is that token babbling mostly allows the model to put more flops between the question and the answer, and whatever tokens RL picks are the ones most useful for doing that (but I agree this is a recursive argument).

As someone who spent a lot of time in a big lab (OAI in my case), I got into the habit of not reading papers: when this reform RL was being developed, the papers describing it didn't really exist yet. So I don't know what reading to recommend.

My view of why RL sucked in robotics but works for LLMs is that reform RL mostly found a setting where RL is very easy to do and has very high value. In Atari games or Go or whatever, you expected the model to go from fully incompetent to fully competent by virtue of RL alone. That's hard. In the current RL for math, you use RL to move the pass rate for each individual problem from ~20% to ~90%, by sampling enough passing solutions that you still have some diversity of trajectories with positive advantage in your batches. This obviously works! It's dumb! The "magic", if any, IMO, is in coming up with enough problems that lie in the sweet spot where the model barely solves them if it tries many times, and where becoming better at them generalizes to useful tasks.
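[A minimal sketch of that verifier-reward recipe: sample a group of attempts per problem, score them 0/1, and weight log-probs by the group-relative advantage. `policy.sample_with_logprobs` and `verifier` are hypothetical helpers. Note how problems where every attempt passes or every attempt fails contribute zero signal, which is why the "barely solvable" sweet spot matters.]

```python
import torch

def group_relative_loss(policy, verifier, problem, num_samples=8):
    """GRPO-style loss for one problem: reinforce attempts with above-average pass rate."""
    attempts, logprobs = policy.sample_with_logprobs(problem, n=num_samples)  # hypothetical API
    rewards = torch.tensor([float(verifier(problem, a)) for a in attempts])   # 0/1 pass signal
    advantages = rewards - rewards.mean()
    if advantages.abs().sum() == 0:     # all pass or all fail: no learning signal from this problem
        return torch.zeros((), requires_grad=True)
    return -(advantages * logprobs).mean()
```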

Lukas Nel's avatar

Reminds me of boosting algorithms honestly - I wonder if you can do something similar to AdaBoost for LLMs

Michael Craig's avatar

On 1, isn't direct preference optimisation (https://arxiv.org/pdf/2305.18290) doing this?

Crazy that this wasn't the default approach, but maybe it tells us something about the domains where reformist RL will keep popping up: new, poorly understood ones.
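[For reference, the DPO objective from the linked paper reduces to a single preference loss that backprops directly through the policy, with no separate reward model or sampling loop. The inputs below are the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model; the function names are illustrative.]

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO: logistic loss on the difference of policy-vs-reference log-ratios."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```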

Ben Recht's avatar

What happened with DPO? Does anyone use it?

Michael Craig's avatar

Chad at least tells me the big labs do :shrug:

Seems cited enough that it must be a relatively big deal

Lior Fox's avatar

This has been a great series of posts.

With respect to the "sharpening" results quoted towards the end --

We had a conceptually similar/related finding [1] in the prehistoric times of encoder-decoder transformer models, when it was fashionable to do RL fine-tuning for machine translation models. People nowadays seem to have re-discovered it a couple of times in the context of RL for LLMs (e.g., that even training with a random reward helps [2]).

[1] https://arxiv.org/abs/1907.01752

[2] https://arxiv.org/abs/2506.10947

Ben Recht's avatar

Yes, very interesting. Alexandre Passos points out something I've seen observed in many places: reasoning in LLMs only works when the probability that the model might spit out the right answer is already high. That might be part of what makes it such a sweet spot for RL: going from 20% accuracy to 90% accuracy is way easier than going from 0% to 20%. It also suggests that there is probably an even more efficient way to get there if you're already close to the optimum.

Jordan Ellenberg's avatar

I like that Karan-Du paper you linked because my immediate question was "isn't this just lowering the temperature?" and then right there on p.4 they're like "you're probably thinking 'isn't this just lowering the temperature?', here's why it's not just lowering the temperature."

Jordan Ellenberg's avatar

Oh and better yet, once I understood this I was like "OK, fine, but then it seems like this power distribution would be impossible to sample from while changing the temp is pretty easy" and on the very next page they're like "of course this distribution is impossible to sample from so it's Metropolis-Hastings time"
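[A toy sketch of that Metropolis-Hastings move, using the base model itself as an independence proposal and a target proportional to p(y|x)^alpha (alpha = 2 for the "p squared" case). `sample_from_model` and `logprob_under_model` are hypothetical helpers, and a practical implementation would of course differ from the paper's details.]

```python
import math
import random

def mh_sample_power_distribution(prompt, alpha=2.0, num_steps=50):
    """Independence sampler targeting a sharpened distribution proportional to p^alpha."""
    current = sample_from_model(prompt)
    current_lp = logprob_under_model(prompt, current)
    for _ in range(num_steps):
        proposal = sample_from_model(prompt)               # propose from the base model p
        proposal_lp = logprob_under_model(prompt, proposal)
        # With proposal q = p and target proportional to p^alpha, the MH acceptance
        # ratio simplifies to (p(proposal) / p(current))^(alpha - 1).
        log_accept = (alpha - 1.0) * (proposal_lp - current_lp)
        if math.log(random.random()) < log_accept:
            current, current_lp = proposal, proposal_lp
    return current
```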

Ben Recht's avatar

I also think the p^2 part is probably not the end of the story, and there are lots of ways to squash out the tails of the distribution that all result in similar performance bumps.

Seth's avatar

I'm slightly baffled by the combination of claims that a) you can just "[b]uild a model of the environment using standard predictive tools", and also that b) "[a]ll of the MDP mumbo jumbo is secondary". Aren't MDPs a standard predictive tool for modeling a wide range of environments?

Ben Recht's avatar

MDPs are secondary, not primary. MDPs are certainly one model of the environment, but there are plenty of others you can use. You don't need to know what an MDP is to apply reinforcement learning. On the other hand, if you are solving a dynamic programming problem, you need to know what an MDP is.

Seth's avatar

Fair, and uh I guess obvious in retrospect! I suppose I placed undue weight on the word "mumbo jumbo" over the word "secondary".

Over in neuroscience land we are often interested in planning in addition to/in contrast with pure experiential learning, so we are quite fond of our (po)MDPs. (Sometimes to our detriment, I admit grudgingly as a [po]MDP enthusiast.)

Allen Schmaltz's avatar

Regarding modifying the sharpness of the distribution, this is relevant: Similarity-Distance-Magnitude Activations (https://arxiv.org/abs/2509.12760)

Regarding how to apply that to LM post-training without RL (i.e., post-training without calibration collapse), this is relevant: Similarity-Distance-Magnitude Language Models (https://arxiv.org/abs/2510.26183)

Tom Dietterich's avatar

Regarding certainty equivalence, that is basically what Yann LeCun is exploring with his JEPA architectures. But building the forward dynamical model is very challenging, especially when there are many aspects of the world that are irrelevant to the task at hand. I think there is still some magical thinking about how to decide what aspects of the world to model and what to ignore. The dynamical model is also often a latent variable model, and one where state estimation may be quite difficult. A healthy aspect of the certainty equivalence perspective is that it makes all of these issues clear.

Tom Dietterich's avatar

From a discussion in the poster session: "RL for LLMs is just another way to do synthetic data generation". I think this is exactly right. You generate a bunch of alternatives, score them and take gradient steps to increase the probability of the best ones.
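[Read literally, that amounts to best-of-n rejection sampling followed by ordinary supervised steps on the winners. A minimal sketch, with `policy.sample`, `policy.nll`, and `scorer` as illustrative placeholders:]

```python
def best_of_n_step(policy, scorer, prompt, optimizer, n=8):
    """Generate candidates, keep the highest-scoring one, and take a plain
    cross-entropy gradient step to increase its probability."""
    candidates = [policy.sample(prompt) for _ in range(n)]
    best = max(candidates, key=lambda c: scorer(prompt, c))
    loss = policy.nll(prompt, best)   # negative log-likelihood of the best candidate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```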

Aditya's avatar

Which theorem shows that the known bound is "cubic in the dimension of the search space"? The paper's a bit technical, but from what I saw the results look pretty optimistic (I'm interpreting them as saying that even two samples per point gets you close to gradient descent).

Ben Recht's avatar

I was referring to: "For smooth and strongly convex loss functions, the regret bound of Flaxman et al. (2005) can be strengthened to O(T^{2/3})," which, with respect to dimension, ends up giving you a cubic dependence.

More details are in this paper: https://arxiv.org/pdf/1209.2434

Reviewing this, it seems like the lower bound is technically quadratic in dimension, but all of the algorithms are cubic.
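[For concreteness, a sketch of the kind of zeroth-order estimator these bounds analyze: perturb the point along a random direction, query function values, and form a finite-difference gradient estimate whose variance grows with the dimension d of the search space. This is the generic two-point construction, not the specific algorithm from the linked paper.]

```python
import numpy as np

def two_point_gradient_estimate(loss_fn, x, delta=1e-3):
    """Two-point zeroth-order gradient estimate along a random unit direction.
    Unbiased for the gradient of a smoothed version of loss_fn; its variance
    scales with the dimension d, which drives the dimension dependence in the
    regret bounds discussed above."""
    d = x.shape[0]
    u = np.random.randn(d)
    u /= np.linalg.norm(u)                                   # random unit direction
    diff = loss_fn(x + delta * u) - loss_fn(x - delta * u)   # two function queries
    return (d / (2.0 * delta)) * diff * u
```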

Aditya's avatar

Thanks that's helpful!

Aditya's avatar

To understand more precisely: would this be cubic in the context length, or in the number of parameters in the model? I'll guess the context length.

KM's avatar

Maybe a bit off topic, but for the paper you wrote with Damek, is there a link between the actual variance reduction obtained and the induced cost functions? Since the functions are all similar, does that mean the variances of the gradient estimators are all the same as well?