Discussion about this post

Lior Fox

(Seriously thrilled to be cited on an argmin post...! I'm also very much looking forward to the rest of the RL week.)

There's a subtle but important point of difference:

You quote my version of step 1 as "receive external validation". The way I wrote (and understand) it, though, "evaluate how well you're doing" is an internal task of the agent, because there's no teacher to do it for you. From that perspective, "reward" is simply yet another kind of observation, or even an internal interpretation of some observations. In fact, Andrew Barto has written quite a lot on this point.

It is precisely at this stage (of "evaluate") that the construct of Value appears in the "standard"/orthodox model of MDPs (which I absolutely agree is restrictive). That is why I think, again in the context of that RLDM short paper, that the move to PG models *by itself* doesn't solve the fundamental conceptual issues associated with Value on the neuroscience side of things.
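
To make that concrete with the standard textbook statement (my illustration here, not anything taken from the paper): even after the move to policy gradients, a Value term reappears inside the gradient itself,

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi_\theta},\ a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\right],$$

and that $Q^{\pi_\theta}$ is exactly the kind of Value construct whose status is at issue.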

(Thanks to Yoav Goldberg for initiating the short discussion with me about this point on Twitter. I'm posting this here for the wider audience, and I'm curious how, and whether, you see the difference!)

Michael Wick

Long-time listener, first-time caller. Great blog, BTW.

At a high level, could we consider Lagrangian relaxation or dual decomposition a form of RL? Say we're constraining the output of a generative model: the model generates something, gets feedback via constraint violations, and we update the dual variables so it does better in the next round.
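
A minimal sketch of that loop, written as projected dual ascent (my own toy illustration; the generator, the quadratic "task score", and the constraints are made-up placeholders, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

def constraint_violations(x):
    # Constraints written as g(x) <= 0; positive entries mean a violation.
    return np.array([x[0] + x[1] - 1.0,   # want x0 + x1 <= 1
                     -x[2]])              # want x2 >= 0

def generate(lam, n_samples=64):
    # Stand-in "generative model": sample candidates and return the one with the
    # best Lagrangian-penalized score. The task score here is just -||x||^2.
    candidates = rng.normal(size=(n_samples, 3))
    scores = [-(x @ x) - lam @ constraint_violations(x) for x in candidates]
    return candidates[int(np.argmax(scores))]

lam = np.zeros(2)   # one dual variable per constraint
eta = 0.5           # dual step size

for t in range(100):
    x = generate(lam)                      # model generates under the current penalties
    g = constraint_violations(x)           # "feedback" = how badly each constraint is violated
    lam = np.maximum(0.0, lam + eta * g)   # projected dual ascent: penalties grow where violated

print("final duals:", lam, "final violations:", constraint_violations(x))
```

The RL flavor is that the only signal flowing back to the generator is the scalar penalty λᵀg(x), not gradients through the constraints themselves.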

