Yes do please
I don’t mind being forced to think using equations, but the benefit of natural language is that I can use it to explain difficult ideas to those who resist it.
Oh boy. This one is hitting close to home...
I totally get the appeal of a "reformist" view. I'm also sympathetic to the idea that all these RL methods basically do "the same" thing, and that this "thing" is a rather simple random-search-style procedure [for those who remember, we discussed this here: https://open.substack.com/pub/argmin/p/why-is-reinforcement-learning-so?utm_campaign=comment-list-share-cta&utm_medium=web&comments=true&commentId=18055037]
Having said all that, I'm also not sure that the "pure PG" view totally saves you here.
To explain why, I'm going to take the liberty of writing my version of your promised blog post:
So here's RL without equations. In order to learn-by-doing, you iterate the following two steps:
1. Try to estimate how good you're currently doing
2. Adjust what you're currently doing so that you are [slightly] better the next time around.
This is basically what Sutton and Barto termed "Generalized Policy Iteration" in their book. The two steps don't have to be that distinct; they can be partial, interleaved in parallel, etc.
Now,
- In a "vanilla" REINFORCE, you do step 1 in the most straightforward / naive way possible, by just collecting a handful of samples and relying on Monte-Carlo, and you do step 2 in a "sophisticated" way of taking gradients of your policy directly.
- In TD, you do step 1 in a slightly more sophisticated way (which is possible because you bring in more assumptions about the structure of the task), using approximate dynamic programming. You do step 2, on the other hand, in a kind-of stupid way, by acting greedily with respect to your current estimate from step 1.
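To make the REINFORCE bullet concrete, here is a minimal numpy sketch of the two steps. Everything in it is my own toy (a 3-armed bandit with made-up reward means, a softmax policy, and a batch-mean baseline), not anything from the post:

```python
# Toy sketch of the two-step loop, REINFORCE-style.
# Step 1 = naive Monte Carlo estimate of how well we're doing;
# step 2 = a gradient step on the policy parameters.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])   # hidden average reward of 3 actions (made up)
theta = np.zeros(3)                      # softmax policy parameters

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for it in range(500):
    p = policy(theta)

    # Step 1: estimate how good we're currently doing, the naive Monte-Carlo way:
    # sample a handful of actions and average their noisy rewards.
    actions = rng.choice(3, size=16, p=p)
    rewards = true_means[actions] + 0.1 * rng.standard_normal(16)
    baseline = rewards.mean()

    # Step 2: adjust the policy so it's slightly better next time, the
    # "sophisticated" way: a gradient step on log pi(a) weighted by (r - baseline).
    grad = np.zeros(3)
    for a, r in zip(actions, rewards):
        grad_log_pi = -p
        grad_log_pi[a] += 1.0            # d/dtheta of log softmax(theta)[a]
        grad += (r - baseline) * grad_log_pi
    theta += 0.05 * grad / len(actions)

print(policy(theta))   # should concentrate on the best action (index 2)
```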
But **crucially**, you still need both steps either way you're going about it. The reason you need (1) is that there's no teacher (which, for me, *is* the fundamental defining property of "what is RL").
Shameless plug: I wrote a short paper about this earlier this year, from the perspective of RL in CogSci/Neuro (I know, I know -- we aren't allowed to refer to these fields on this blog...).
It turns out that quite a few people have advocated for a "reformist" view there as well, promising that if we start and end with PG we will resolve a lot of conceptual issues with our models (of animal/human learning and behavior). I disagree with this claim, and the main argument of that short paper is the one I outlined above (a few more points are discussed in the full text).
I would very much appreciate your thoughts. If you want to read it (short, I promise), it's available here: https://arxiv.org/abs/2505.04822
I will read it!
I like your definition a lot, BUT there's a caveat I've seen in the community. Not every algorithm of the form:
1. Try to estimate how good you're currently doing
2. Adjust what you're currently doing so that you are [slightly] better the next time around.
gets to be RL! For instance, imitation learning works this way, but it's something else. As another example, in language models, you can do math reasoning by generating a bunch of answers to questions, finding the ones that pass a verifier, and then fine-tuning on these. If you do this the way I propose, it's called "supervised fine-tuning" and is also not counted as RL.
This is what makes RL so perplexing. It's everything and also a very specific thing at the same time...
Anyway, let me read your note first before I write a more detailed reply.
(trying to post again under the correct thread:)
With respect to the LLM scenario,
My understanding is that people in fact *do* call the "generate-verify-improve" thing "RL" in this context. In particular, this is different from supervised fine-tuning, where you train in the usual MLE sense on hand-crafted examples (say, a prompt + answer that some freelancer wrote for OpenAI). Here the model generates lots of outputs and you try to reinforce (ha) particular outputs, where the evaluation function is some black box (a model of "human preference", a formal verifier, etc.). So you do some form of PG (and get to reinvent a lot of new acronyms). This is a limited notion of RL ("RLHF" almost feels to me like no RL at all, the formal verifiers maybe more so, but these are just vibes).
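To make the contrast concrete, here's a toy sketch (entirely mine: a categorical "model" over four canned answers and a 0/1 verifier, nothing like a real LLM) of the two updates being compared -- filter-and-fine-tune on whatever passes the verifier, versus the REINFORCE-style update that weights log-probabilities by the black-box score:

```python
# Toy contrast between "generate, verify, fine-tune on the winners" (SFT-style)
# and the REINFORCE-style update, for a categorical "model" over 4 canned answers.
# Entirely a made-up illustration; the real thing operates on token sequences.
import numpy as np

rng = np.random.default_rng(1)
logits = np.zeros(4)
verifier = np.array([0.0, 0.0, 1.0, 0.0])   # black box: only answer 2 "passes"

def probs(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def generate(logits, n):
    return rng.choice(4, size=n, p=probs(logits))

p = probs(logits)

# (a) Filter + fine-tune: keep only passing samples, take an MLE gradient step on them.
samples = generate(logits, 64)
winners = samples[verifier[samples] > 0]
grad_sft = np.bincount(winners, minlength=4) / max(len(winners), 1) - p
logits_sft = logits + 0.5 * grad_sft

# (b) REINFORCE: weight every sample's log-prob gradient by its (centered) verifier score.
samples = generate(logits, 64)
scores = verifier[samples]
grad_pg = np.zeros(4)
for a, s in zip(samples, scores):
    g = -p
    g[a] += 1.0
    grad_pg += (s - scores.mean()) * g
logits_pg = logits + 0.5 * grad_pg / len(samples)

print(probs(logits_sft), probs(logits_pg))   # both shift mass toward answer 2
```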
Regarding imitation learning, that's a good point. It perhaps depends on the exact form/variation. For straight behavioral cloning, I'd say you don't really have to do step 1, because the teacher is doing it for you (you have reference actions that define your target; you don't have to evaluate how good your policy is independently).
I like your paper! However, I think I'm after something slightly different here. I actually think the dynamic programming stuff is a total red herring when it comes to the appeal of "RL" methods. I like your definition of RL as
1. Try some stuff with your model
2. Score the outcomes
3. Update your model based on those scores
That's not a bad scope! But I'd argue that in reformist RL, you don't need to have an MDP to do 1-3. It applies to any problem where you can sample from a model and score the outputs. And this is why I think you can just start with policy gradient and see where you end up.
I tried to pull this out in the next post, highlighting that "finite differencing in random directions" is a policy gradient method even though there are no MDPs or online models in sight.
https://www.argmin.net/p/random-search-for-random-search
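For concreteness, here's a toy version of that point (the quadratic objective and every constant in it are made up): put a Gaussian search distribution N(w, sigma^2 I) over the inputs, apply the score-function estimator with antithetic pairs, and what you end up computing is literally finite differences in random directions:

```python
# "Finite differencing in random directions" as a policy gradient: the score-function
# gradient of a Gaussian search distribution, with antithetic pairs (u, -u), is a
# central finite difference along each random direction. Toy objective, made-up constants.
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return -np.sum((x - 3.0) ** 2)   # "score the output"; optimum at x = 3

w = np.zeros(5)
sigma = 0.1
for it in range(2000):
    u = rng.standard_normal((8, 5))   # 8 random directions
    fp = np.array([f(w + sigma * ui) for ui in u])
    fm = np.array([f(w - sigma * ui) for ui in u])
    # Gradient estimate of E_{x ~ N(w, sigma^2 I)}[f(x)]:
    grad = ((fp - fm)[:, None] * u).mean(axis=0) / (2 * sigma)
    w += 0.01 * grad

print(w)   # should end up near [3, 3, 3, 3, 3]
```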
What do you think?
Thanks for reading!
I do agree that the focus of that paper is somewhat different (again, it's written largely from the perspective of an internal neuro/CogSci debate).
I think the dynamic programming is a consequence of the specific realization chosen for step 2. Even if you "start with PG and see where you end up", if you take your objective to be "maximize expected cumulative discounted reward" you very quickly "end up" with the notion of value functions (e.g., the policy gradient theorem in Sutton et al., 1999).
And even within the framework of your next post, the fact that you know something extra about the objective is useful, for example, precisely for designing good baseline functions for your PG algorithm. (The something extra here is the dynamic programming _structure_ of your objective, or perhaps better to say: the optimal substructure / optimality principle.)
Having said all that, I think it's true that RL people often fixate too much on the specific MDP formalism, even in situations where it doesn't make much sense, or doesn't "buy" you that much. And that, in essence, you don't need MDPs for RL. The MDP framework is a particular _model_ for sequential decision-making tasks (and a rather restrictive one), but not the only one possible.
I think it's worth noting that there are some "reformist"-like voices making similar or related claims from the very centre of the Albertan high church... e.g.: https://arxiv.org/pdf/2504.08161, and also in talk format from the last RLDM here: https://www.youtube.com/watch?v=Vjx7r9mo0H0 . I thought it was a very nice talk and would recommend watching it.
(Finally -- calling it "my" definition of RL is a bit of a stretch! The intuition that this is the core of RL is at least as old as Thorndike's "Law of Effect")
I think Ben's point about not needing value functions in general is that they only arise if the cost functional has additional internal structure to it (e.g., if, to use Peter Whittle's apt term, it itself represents some sort of an "optimization over time" situation). Then, in addition to the time in the outer loop in Ben's RL meta-algorithm, we end up with "virtual" time in the inner loop where the objective is computed.
And, to add to this: Sutton-Barto style RL erases the difference between the "real" (outer-loop) time and "virtual" (inner-loop) time by collapsing both loops into one. This is unnecessarily confusing because it obscures the essence of the basic meta-algorithm, where only the "real" time is the one that matters because it governs the timescale on which the learning actually takes place.
Thanks again for this Ben! I'll always think that in another life you could have been a great comedian (professionally)
This reminds me of two points:
- in undergrad, we had a math professor who told us "every piece of applied math that's ever been useful is just integration by parts in a trenchcoat" - this somewhat applies here
- I've always found it strange that while 0-th order optimization is a tool in RL (with environments, MDPs, and all that), people use "RL" as shorthand for 0-th order optimization
I would love an explanation without the equations. I've been thinking about RL, at least in terms of the current slate of LLMs, as "optimization for a values system wherein nobody except machine learning engineers is able to select the values or has any say in why they are prioritized." Would love to know if that's an accurate description.
I think the heart of RL is that you only obtain feedback on the action (or action sequence) that you select. So bandit problems (esp. contextual bandits) are the purest form of RL (and they do not involve MDPs or dynamic programming etc.). We adopt the MDP formalism when the reward is (a) delayed and (b) additive because those allow us to do something more than just policy gradient. However, policy gradient has been winning in practice. Among the advantages of policy gradient are that it works for POMDPs and it works even when there is a major time scale mismatch between the individual actions (time steps) and the trajectory length. For long trajectories with "tiny" actions, the Q function (learned by DP) needs to make very fine distinctions, and even with tricks like Advantage Updating, these become absurdly sensitive to noise and roundoff error. Representing Q with a neural network is usually a disaster.
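To make the "feedback only on what you select" point concrete, here's a toy contextual-bandit sketch (my own made-up two-context, three-arm reward table): only the pulled arm's reward is ever observed, yet a plain policy-gradient update with a running-average baseline is enough. No MDP, no dynamic programming anywhere:

```python
# Toy contextual bandit: feedback arrives only for the chosen arm, and a plain
# policy-gradient update is enough. (Made-up reward table: 2 contexts, 3 arms.)
import numpy as np

rng = np.random.default_rng(3)
reward_table = np.array([[1.0, 0.2, 0.1],    # best arm depends on the context
                         [0.1, 0.2, 1.0]])
W = np.zeros((2, 3))                          # one row of softmax logits per context
baseline = np.zeros(2)                        # running average reward per context

def probs(row):
    z = np.exp(row - row.max())
    return z / z.sum()

for t in range(5000):
    ctx = rng.integers(2)
    p = probs(W[ctx])
    arm = rng.choice(3, p=p)
    r = reward_table[ctx, arm] + 0.1 * rng.standard_normal()   # only this arm's reward is seen

    grad_log = -p
    grad_log[arm] += 1.0
    W[ctx] += 0.1 * (r - baseline[ctx]) * grad_log             # REINFORCE with a crude baseline
    baseline[ctx] += 0.05 * (r - baseline[ctx])

print(probs(W[0]), probs(W[1]))   # each context should favor its own best arm
```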
The post paints RL as more of a “culture/quasi-religion” than a technical area, and I’m not sure that generalization is fair. There’s definitely hype and some inflated claims -- particularly around commercialization -- but there’s also a wide spectrum of serious work and healthy internal skepticism.
"Policy gradient as the core" is a nice pedagogical simplification. But MDPs and value functions are not just liturgy: the MDP formalism is a minimal setting where temporal credit assignment is well-posed, and it’s a convenient sandbox for stress-testing ideas. Yes, it has limitations and is sometimes misapplied -- but so are most abstractions.
On the "RL is optimization" point: as you wrote, changing the weights changes the data distribution. But that is exactly why on-policy policy gradient is not "just optimization" in the same sense as vanilla stochastic gradient: the parameters determine what data you get to see next. Because updates change what gets sampled next, you can get genuine signal-to-noise / exploration issues. For example, if the updates drive the probability of a good action very low, one may almost never sample it again, which weakens the "learning signal". This coupling is what makes the problem interesting. While at the moment, most contributions to this problem come from the RL community, it would be great to see the control and OR folks also joining the fun (as sometimes they do).
Relatedly, these issues are where the on- vs. off-policy distinctions also earn their keep. Off-policy decouples data collection from the current policy, so "collapse of the current policy" is not the same bottleneck -- but one pays for it via coverage/support requirements and distribution-shift corrections (often via importance weighting), which can introduce their own variance and instability. On-policy avoids that particular mismatch, but then one must understand when the coupled learning-sampling dynamics do (or do not) self-sabotage.
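A minimal sketch of that importance-weighting correction and its price (again a made-up toy: categorical "policies" over three actions and an arbitrary payoff vector): the estimator stays unbiased, but its variance grows as the behavior policy stops covering the actions the target policy likes:

```python
# Off-policy evaluation by importance weighting: unbiased, but the variance blows up
# when the behavior policy barely covers the target's favorite actions. Made-up toy.
import numpy as np

rng = np.random.default_rng(5)
payoff = np.array([1.0, 0.0, 5.0])            # per-action payoff
target = np.array([0.1, 0.1, 0.8])            # policy we want to evaluate
true_value = float(target @ payoff)

def off_policy_estimate(behavior, n=1000):
    a = rng.choice(3, size=n, p=behavior)
    w = target[a] / behavior[a]               # importance weights
    return float(np.mean(w * payoff[a]))

good_cover = np.array([0.2, 0.2, 0.6])        # behavior close to the target
poor_cover = np.array([0.65, 0.3, 0.05])      # target's favorite action barely covered

for name, b in [("good coverage", good_cover), ("poor coverage", poor_cover)]:
    est = [off_policy_estimate(b) for _ in range(200)]
    print(name, "mean", round(np.mean(est), 2), "std", round(np.std(est), 2),
          "(true value", round(true_value, 2), ")")
```

The clipping and truncation tricks one sees in practice are ways of paying a little bias to tame exactly this variance.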
At the risk of a bit of self-promotion, I’d be remiss not to mention that it was only very recently that we managed to show (in tabular/softmax settings) that constant-step-size on-policy policy gradient provably converges globally. The argument is (to me) quite fun, and credit is due especially to the student authors (Sam and Thang), who developed much of it; it relies on Borel-Cantelli-style reasoning together with delicate martingale machinery. The paper is here: https://openreview.net/pdf?id=YzriuQGaNX
Finally, a strong counterpoint to some of what is written here is what I frequently hear from practitioners: even when they start from PG, they often end up needing value estimates (baselines/critics) to make things work reliably. That looks less like ideology and more like empiricism.
Greetings from the frozen land of Alberta ❄️☃️❄️⛄️.
“This means that if you can sample from p(x; w) for any setting of the weights, you can run stochastic gradient ascent on the objective you care about.”
Did you mean “if you can evaluate f(x) at any sampled x”? Because you still need the gradient of log p(x; w).
And in the interesting cases p(x;w) is not available. But fancy math can tell us that we shouldn't be bothered by this: we can use the gradient of the log probabilities of the action sequence we generated. Sadly, to get to this conclusion one needs to assume more structure. Or we can call it a day, skip PG, and go with whatever zero-order optimizer we have at hand 😄
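Spelling out the two facts being juggled here, as I understand them -- the score-function identity behind the quoted sentence, plus the extra structure that makes it usable for action sequences (the unknown dynamics drop out of the gradient of the trajectory's log-probability):

```latex
% Score-function ("log-derivative") identity: f need not be differentiated,
% but you do need the gradient of log p(x; w):
\nabla_w \,\mathbb{E}_{x \sim p(\cdot\,;w)}\big[f(x)\big]
  \;=\; \mathbb{E}_{x \sim p(\cdot\,;w)}\big[f(x)\,\nabla_w \log p(x;w)\big].

% For a trajectory tau = (s_0, a_0, s_1, a_1, ...), the log-probability factors as
\log p(\tau;w) \;=\; \sum_t \log \pi_w(a_t \mid s_t) \;+\; \sum_t \log P(s_{t+1} \mid s_t, a_t),

% and the dynamics term does not depend on w, so
\nabla_w \log p(\tau;w) \;=\; \sum_t \nabla_w \log \pi_w(a_t \mid s_t).
```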
And then you can always argue that your zero-order optimizer is PG under appropriate problem reformulation...
For a more general definition of RL, why not just pi* = argmax_pi E_pi[R], where pi is a policy and R a reward? Why introduce the idea of a gradient? You can optimize in other ways. Also, I disagree that it's optimization "of" probability distributions; rather, I think it's optimization over probability distributions.
This is the documentary on Demis Hassabis's life and interests (mostly RL), recently released (6 days ago) and watched 14 million times. He also received the Nobel Prize in Chemistry a few years ago, as we all know. A surprisingly rewarding end to his explore-exploit cycle.
https://youtu.be/d95J8yzvjbQ?si=lKnGaqFZSvR7q1LH | The thinking game | Tribeca festival official selection
But... "all of the applications of RL in language models* and 95% of applications of RL outside language models" do use a value function baseline, which comes from thinking about MDPs, dynamic programming, reward-to-go, and "all that fancy math".
*Yes, I conveniently omitted GRPO, but one could argue that this is less egregious than you conveniently omitting the idea of value function baselines entirely :)
Define "value function."
In this case I'm referring to something quite specific, which is the value function baseline learned via GAE that's used in the vast majority of policy gradient methods in practice (including LLM RL and sim-to-real robot locomotion).
Define "value function."
Now I feel like I'm feeding the troll here, but I defined it in the above comment: it's "the thing that is learned via GAE and subtracted from the reward-to-go in most real-world implementations of policy gradient methods".
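For readers who haven't met it, a minimal sketch of that quantity, with the caveat that the value estimates below are just a made-up array standing in for the learned critic:

```python
# Minimal sketch of GAE (generalized advantage estimation) for one trajectory,
# assuming we already have value estimates V(s_t). In practice V is a learned
# critic; here it's a made-up array so the arithmetic is runnable.
import numpy as np

gamma, lam = 0.99, 0.95
rewards = np.array([0.0, 0.0, 1.0, 0.0, 2.0])       # made-up trajectory
values  = np.array([0.5, 0.6, 0.9, 0.4, 1.0, 0.0])  # V(s_0..s_T), terminal value 0

# TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
deltas = rewards + gamma * values[1:] - values[:-1]

# GAE: A_t = sum_k (gamma * lam)^k * delta_{t+k}, via a backward recursion
advantages = np.zeros_like(rewards)
gae = 0.0
for t in reversed(range(len(rewards))):
    gae = deltas[t] + gamma * lam * gae
    advantages[t] = gae

print(advantages)  # this multiplies grad log pi(a_t | s_t) instead of the raw reward-to-go
```

Setting lam = 1 recovers the plain discounted reward-to-go minus the value baseline; lam < 1 trades a bit of bias for lower variance.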
Sutton recently mentioned TD-Gammon as a turning-point success of RL. Its basic form is in your reductionist RL domain. How it evolved into current backgammon products -- with optimized rollouts, separate game-stage networks, competing NNs used to abandon NNs that fell into ruts, transitions to deterministic near-end-of-game evaluators, and end-user tuning of rollout parameters and related position setups -- i.e., the engineering refinements, can be applied to other RL systems. I'd just focus on that mature example as a description of RL.