Discussion about this post

Alexandre Passos

Fun thing: I think RLHF as done before this whole math craze is certainty equivalence, where first you train a model of what a human rater will think of an answer, and then you do policy gradient to make your policy model spit out more of the answers that human raters like.
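To make the certainty-equivalence reading concrete, here is a toy sketch (entirely made up for illustration; the answer space, the noisy ratings, and the update rule are my stand-ins, not anything from the post): step one fits a reward model to noisy human ratings, and step two runs REINFORCE against that learned reward, never consulting the true rater again.

```python
# Toy certainty-equivalence RLHF: fit a reward model, then do policy
# gradient against it. All quantities here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_answers = 5                                  # toy "answer space" for one prompt
true_reward = rng.normal(size=n_answers)       # how raters actually feel

# --- Step 1: certainty equivalence -- fit a reward model from noisy ratings ---
ratings = true_reward[None, :] + rng.normal(scale=0.5, size=(200, n_answers))
reward_model = ratings.mean(axis=0)            # point estimate, uncertainty ignored

# --- Step 2: policy gradient against the *learned* reward ---
logits = np.zeros(n_answers)                   # softmax policy over answers
lr = 0.1
for _ in range(2000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(n_answers, p=probs)         # sample an answer
    r = reward_model[a]                        # score it with the reward model
    baseline = probs @ reward_model            # variance-reduction baseline
    grad = -probs.copy()
    grad[a] += 1.0                             # d log pi(a) / d logits
    logits += lr * (r - baseline) * grad       # REINFORCE update

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("policy mode:", int(probs.argmax()), "| true best answer:", int(true_reward.argmax()))
```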

To me the main point you're missing about reform RL (love the term btw) is that it's really a latent variable optimization problem. The way you get to it is by (1) observing that LLMs sometimes give better answers when you let them stick a bunch of random tokens in between the problem and the solution, and (2) guessing and checking which of those tokens work. If we had better ways of guessing which values of these latent tokens get us the answers we want, we could bypass all this guess-and-check sadness. But latent variable optimization is still hard.
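Something like the following toy loop, where `toy_llm`, `verify`, and the scratchpad vocabulary are hypothetical stand-ins for a real model, a real checker, and real tokens: sample latent tokens z between the problem and the answer, keep the samples whose answer verifies, and treat those as the ones worth training on.

```python
# Guess-and-check over latent scratchpad tokens, in miniature.
# Nothing here is from the post; it is a stub to show the loop shape.
import random

random.seed(0)
PROBLEMS = [(3, 4), (7, 9), (12, 5)]              # toy task: compute a + b
TOKENS = ["add_a", "add_b", "add_one", "double"]  # latent scratchpad vocabulary

def toy_llm(problem, z):
    """Stand-in for an LLM conditioned on latent tokens z: it just
    executes the scratchpad and reports the result as its answer."""
    a, b = problem
    acc = 0
    for tok in z:
        if tok == "add_a":
            acc += a
        elif tok == "add_b":
            acc += b
        elif tok == "add_one":
            acc += 1
        elif tok == "double":
            acc *= 2
    return acc

def verify(problem, answer):
    a, b = problem
    return answer == a + b                        # verifiable reward

kept = []                                         # latents worth training on
for problem in PROBLEMS:
    for _ in range(200):                          # guess: sample latent tokens
        z = [random.choice(TOKENS) for _ in range(3)]
        if verify(problem, toy_llm(problem, z)):  # check: keep what works
            kept.append((problem, z))
            break

print(kept)
# A smarter proposal distribution over z (instead of blind sampling) is
# exactly the latent-variable optimization being described; guess-and-check
# is the brute-force fallback.
```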

Maybe it's because I grew up in old-school NLP, but to me this is really EM.
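Spelled out (this is just textbook EM applied to latent reasoning tokens, not a formulation taken from the post): with x the problem, y* the verified answer, and z the latent tokens in between,

```latex
% Latent-variable objective
\log p_\theta(y^\star \mid x) \;=\; \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y^\star \mid x, z)

% E-step: posterior over latent chains given the verified answer
q(z) \;\propto\; p_\theta(z \mid x)\, p_\theta(y^\star \mid x, z)

% M-step: refit the model on those chains
\theta \;\leftarrow\; \arg\max_\theta \; \mathbb{E}_{q(z)}\!\big[\log p_\theta(z, y^\star \mid x)\big]
```

Sampling chains and keeping only the ones that reach y* is a Monte Carlo approximation of the E-step; fine-tuning on the kept chains is the M-step.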

Lior Fox

This has been a great series of posts.

With respect to the "sharpening" results quoted towards the end --

We had a conceptually similar/related finding [1] back in the prehistoric times of encoder-decoder Transformer models, when it was fashionable to do RL fine-tuning for machine translation. People nowadays seem to have rediscovered it a couple of times in the context of RL for LLMs (e.g., that even training with a random reward helps [2]).

[1] https://arxiv.org/abs/1907.01752

[2] https://arxiv.org/abs/2506.10947
