Discussion about this post

reed hundt:

Yes do please

Lior Fox:

Oh boy. This one is hitting close to home...

I totally get the appeal of a "reformist" view. I'm also sympathetic to the idea that all these RL methods basically do "the same" thing, and that this "thing" is a rather simple random-search-style procedure [for those who remember, we discussed this here: https://open.substack.com/pub/argmin/p/why-is-reinforcement-learning-so?utm_campaign=comment-list-share-cta&utm_medium=web&comments=true&commentId=18055037]

Having said all that, I'm also not sure that the "pure PG" view totally saves you here.

To explain why, I'm going to take the liberty of writing my version of your promised blog post:

So here's RL without equations. In order to learn by doing, you iterate the following two steps:

1. Try to estimate how well you're currently doing.

2. Adjust what you're currently doing so that you are [slightly] better the next time around.

This is basically what Sutton and Barto termed "Generalized Policy Iteration" in their book. The two steps don't have to be fully distinct; they can be partial, interleaved, run in parallel, etc.
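To make that loop concrete, here's a minimal, self-contained Python sketch on a toy two-armed bandit. Everything in it (the bandit, the sampling-based estimate for step 1, the nudge-the-policy rule for step 2) is my own illustration, not something from the post or the book:

```python
import random

# A toy instance of the two-step loop on a 2-armed bandit.
# All of this is illustrative: the bandit, the sampling-based
# "evaluation", and the nudge-the-policy "improvement".

ARM_MEANS = [0.3, 0.7]   # hidden reward probabilities of the two arms

def pull(arm):
    """Sample a 0/1 reward from the chosen arm."""
    return 1.0 if random.random() < ARM_MEANS[arm] else 0.0

def estimate_values(prob_arm1, n_samples=200):
    """Step 1: estimate how well each arm is doing under the current
    policy, just by sampling (no teacher tells us the answer)."""
    totals, counts = [0.0, 0.0], [0, 0]
    for _ in range(n_samples):
        arm = 1 if random.random() < prob_arm1 else 0
        totals[arm] += pull(arm)
        counts[arm] += 1
    return [totals[a] / counts[a] if counts[a] else 0.0 for a in (0, 1)]

def improve(prob_arm1, values, step=0.1):
    """Step 2: shift probability mass slightly toward whichever arm
    currently looks better."""
    direction = 1.0 if values[1] > values[0] else -1.0
    return min(1.0, max(0.0, prob_arm1 + step * direction))

prob_arm1 = 0.5               # start indifferent between the two arms
for _ in range(50):           # iterate: evaluate, then improve
    prob_arm1 = improve(prob_arm1, estimate_values(prob_arm1))

print(f"final P(choosing the better arm) ~ {prob_arm1:.2f}")
```

The point is just the shape of the loop: evaluate, then improve, then repeat.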

Now,

- In a "vanilla" REINFORCE, you do step 1 in the most straightforward / naive way possible, by just collecting a handful of samples and relying on Monte-Carlo, and you do step 2 in a "sophisticated" way of taking gradients of your policy directly.

- In TD, you do step 1 in a slightly more sophisticated way (which is possible because you bring in more assumptions about the structure of the task), using approximate dynamic programming. Step 2, on the other hand, you do in a kind-of-stupid way, by acting greedily with respect to your current estimate from step 1. (Both styles are sketched in code right after this list.)
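Here's a rough code sketch of that contrast, on the same toy two-armed bandit as above. Neither snippet is the canonical algorithm from any textbook; the parameterization, step sizes, and epsilon-greedy choice are my own simplifying assumptions (in a bandit, episodes are one step long, so returns are just rewards):

```python
import math
import random

# Same toy bandit, two different ways of running the evaluate/improve loop.

ARM_MEANS = [0.3, 0.7]

def pull(arm):
    return 1.0 if random.random() < ARM_MEANS[arm] else 0.0

# --- REINFORCE-style: naive Monte-Carlo evaluation (just the sampled
# --- reward), "sophisticated" improvement via the policy gradient.
theta, lr = 0.0, 0.2                              # P(arm 1) = sigmoid(theta)
for _ in range(2000):
    p1 = 1.0 / (1.0 + math.exp(-theta))
    arm = 1 if random.random() < p1 else 0
    reward = pull(arm)                            # step 1: a one-sample estimate
    grad_log_pi = (1.0 - p1) if arm == 1 else -p1 # d/dtheta log pi(arm)
    theta += lr * reward * grad_log_pi            # step 2: follow the gradient
print("PG : P(arm 1) ~", round(1.0 / (1.0 + math.exp(-theta)), 2))

# --- TD / value-based style: more structured evaluation (incremental value
# --- estimates), "stupid" improvement by acting greedily on those estimates.
Q, alpha, eps = [0.0, 0.0], 0.1, 0.1
for _ in range(2000):
    if random.random() < eps:                     # a little exploration
        arm = random.randrange(2)
    else:
        arm = 0 if Q[0] > Q[1] else 1             # step 2: greedy w.r.t. Q
    reward = pull(arm)
    Q[arm] += alpha * (reward - Q[arm])           # step 1: incremental update
print("TD : greedy arm =", 0 if Q[0] > Q[1] else 1, ", Q ~", [round(q, 2) for q in Q])
```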

But **crucially**, you still need both steps, whichever way you go about it. The reason you need step 1 is that there's no teacher (which, for me, *is* the fundamental defining property of what RL is).

Shameless plug: I wrote a short paper about this earlier this year, from the perspective of RL in CogSci/Neuro (I know, I know -- we aren't allowed to refer to these fields on this blog...).

It turns out that quite a few people have advocated for a "reformist" view there as well, promising that if we start and end with PG we will resolve a lot of conceptual issues with our models (of animal/human learning and behavior). I disagree with this claim, and the main argument of the short paper is the one I outlined above (a few more points are discussed in the full text).

I would very much appreciate your thoughts. If you want to read it (it's short, I promise), it's available here: https://arxiv.org/abs/2505.04822


