16 Comments
Lior Fox:

(Seriously thrilled to be cited on an argmin post...! I'm also very much looking forward to the rest of RL week.)

There's a subtle but important point of difference:

You quote my version of step 1 as "receive external validation," whereas the way I wrote (and understand) it, "evaluate how well you're doing" is an internal task of the agent, because there's no teacher to do it for you. From that perspective, "reward" is simply yet another kind of observation, or even an internal interpretation of some observations. In fact, Andrew Barto has written quite a lot on this point.

It is precisely in this "evaluate" stage that the construct of Value appears in the "standard"/orthodox model (of MDPs, which I absolutely agree are restrictive), and why I think, again in the context of that RLDM short paper, that the move to PG models *by itself* doesn't solve the fundamental conceptual issues associated with Value on the neuroscience side of things.

(Thanks to Yoav Goldberg for initiating the short discussion with me about this point on Twitter. I'm posting this here for the wider audience, and I'm curious how, and whether, you see the difference!)

Ben Recht:

With respect to the idea that an agent can self-evaluate, I can see how this is useful in neuroscience, but it is a pipe dream in engineering practice. Cost function design is part of engineering. No RL system actually does this, which is why I changed your definition.

I also really don't like how Reformist RL people throw around the terms "value function" and "advantage function." These have precise definitions in the context of MDPs/dynamic programming, but people writing RL papers seldom even know what they are. And when you press them for definitions, you get weird stuff back!

(For example, I have one person in the comments trying to scold me about how DeepSeek uses value functions in training LLMs, and that person definitely doesn't know the definition of value functions.)

Yoav Goldberg:

With respect to self-evaluation being a pipe dream in engineering: while I agree it certainly isn't how things *are* currently done, I do think this particular aspect is actually easy to fix (and I'm quite perplexed as to why it isn't done this way already).

Even if the reward function is not learned, it can still be considered part of the agent rather than part of the environment. The agent performs an action and sees an observation. It (the agent, not the environment) then translates this observation into a reward; in practice, this can be done using the same fixed, engineered function that currently computes the reward from the environment.

While this may seem like (and to a large extent is) purely a notational difference, I do think it is cleaner and more aligned with Lior's definition. It also allows more flexibility: you can have two agents in the same environment but with different "goals" (which translate into two different reward functions), who will end up learning different policies.
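
A minimal sketch of what this could look like, assuming some policy object with an `act` method (the class and names here are purely illustrative, not from any existing library):

```python
class Agent:
    """An agent that computes its own reward from raw observations.

    reward_fn is the same fixed, engineered function that would otherwise
    live inside the environment; moving it here is exactly the notational
    shift described above.
    """

    def __init__(self, policy, reward_fn):
        self.policy = policy
        self.reward_fn = reward_fn  # the agent's "goal"

    def step(self, observation):
        reward = self.reward_fn(observation)        # internal evaluation, not an environment signal
        return self.policy.act(observation, reward)

# Two agents sharing one environment, with different goals (hypothetical reward functions):
# agent_a = Agent(policy_a, reward_fn=lambda obs: obs["score"])
# agent_b = Agent(policy_b, reward_fn=lambda obs: -obs["energy_used"])
```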

Ben Recht:

Yeah, this makes a lot of sense. But I do think that it's important to emphasize that this is *not* how computational RL is done or conceived right now. Let me give a simple example to show why the distinction is important: on Tuesday, I showed you could use RL to maximize functions. The resulting algorithm was a finite-difference approximation to gradient descent along randomly chosen lines. This algorithm only makes sense if the optimizer doesn't have access to the reward function. If you had the reward function, normal gradient descent would be far, far faster than RL.
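
(For readers who missed Tuesday's post, here is a rough sketch of the kind of algorithm I mean; the function name, step sizes, and iteration count are placeholders, not the exact code from that post.)

```python
import numpy as np

def random_direction_ascent(f, x0, iters=1000, step=0.1, delta=0.01, seed=0):
    """Maximize f using only function evaluations (no gradient access).

    Each iteration picks a random direction, estimates the slope of f along
    that line with a central finite difference, and steps along the line.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)                                   # random unit direction
        slope = (f(x + delta * u) - f(x - delta * u)) / (2 * delta)
        x += step * slope * u                                    # move along the chosen line
    return x

# Example: maximizing a concave quadratic drives x toward the point (3, ..., 3).
x_hat = random_direction_ascent(lambda z: -np.sum((z - 3.0) ** 2), np.zeros(5))
```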

I know there are some cases where you have the reward function but it's still hard to compute a good gradient or to estimate an environment model cleanly. But just because it is hard doesn't mean it's impossible. And when you do this, you'll always be more efficient and reliable. I tried to write about this in more detail in today's post.

Yoav Goldberg:

Wait... you have the reward function, but you still don't control its inputs. So how can you do normal GD? What am I missing?

(Also, I agree this is not how things are done. But I do think it's how they should be.)

Ben Recht:

I liked your gist:

https://gist.github.com/yoavg/3eb3e722d38e887a0a8ac151c62d9617

(I'm linking it here so other argmin readers will see it.)

My only disagreement is that you are describing something that is too big for me to call reinforcement learning. I don't want to let RL have this giant umbrella where it covers all of autonomous systems theory. The field hasn't earned that.

Your broad definition applies perfectly well to a thermostat. I refuse to call a thermostat a reinforcement learning system. Thermostats don't run policy gradient! What they do is much simpler and much more effective.

I want the narrowest possible definition of reinforcement learning. That field has been overly technical, misleading, and misguided, and should have as small a surface area as possible. If anything, I want to narrow my definition from Wednesday's post: you could argue that my definition would include the perceptron. The perceptron, in the language of cogsci, is punishment learning, not reinforcement learning.

Eli:

Lior, are you signing up for my Abolish the Value Function movement?

Lior Fox:

I'm not so sure we should.

The value function is a direct consequence of a particular style of sequential decision-making process with a specific objective. While restrictive and not fully general, it's a model that might still capture a lot.

But even more generally, the idea of "breaking down" the problem by estimating how "good" individual states are is probably a very useful abstraction for learning and behavior.

I think this abstraction might actually go beyond the standard MDP case. For that, we definitely need to _revise_ the concept of what exactly a value function is, because the standard one ceases to exist very quickly once you start relaxing the assumptions of the "standard" model.

Michael Wick:

Long-time listener, first-time caller. Great blog, BTW.

At a high level, could we consider Lagrangian relaxation or dual decomposition a form of RL? Say we're constraining the output of a generative model: the model generates something, gets feedback via constraint violations, and we update the dual variables so it does better in the next round.

Ben Recht:

I'm intrigued by this connection. Do you have a specific generative model and constraint set in mind?

Michael Wick:

This may not be the best example, but what comes to mind is that many summers ago we wanted to put hard constraints on the output of a seq2seq model. The model was a "grammar as a foreign language"-style RNN that translated between natural language sentences and a linearization of their parse trees (S-expressions). The constraints were things like: every word in the input must appear in the output S-expression, no extra words can appear, the S-expression must be well-formed, and so on. These are almost like "verifiable rewards" in a sense, but they could also be transcribed into linear constraints.

If we consider the model and dual variables collectively, we could imagine solving this with REINFORCE on the dual variables, which I think would be equivalent, or very similar, to Lagrangian relaxation. It also potentially opens the door to approximating the dual variables with the model itself and updating the model weights instead, using backprop. Now it looks even more like RL, though the nonlinearity of the model probably destroys any guarantees Lagrangian relaxation normally affords.
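
Schematically, the loop I have in mind looks something like this (a hedged sketch: `decode` and `violations` are placeholders for the constrained seq2seq decoder and its constraint checks, and the step size is arbitrary):

```python
import numpy as np

def dual_ascent(decode, violations, num_constraints, rounds=50, step=0.1):
    """Lagrangian-relaxation-style loop for a constrained generative model.

    decode(lam)     -> an output produced while penalizing constraints with weights lam
    violations(out) -> vector g(out); entry i is positive when constraint i is violated
    """
    lam = np.zeros(num_constraints)              # one dual variable per constraint
    for _ in range(rounds):
        out = decode(lam)                        # the model "acts" under the current penalties
        g = violations(out)                      # feedback: which constraints were violated
        lam = np.maximum(lam + step * g, 0.0)    # projected subgradient ascent on the dual
    return lam
```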

Maxim Raginsky:

I have a 1969 book called _Random Processes and Learning_ by two Romanian mathematicians, Iosifescu and Theodorescu. Part of their project was to develop a mathematical theory to go with behaviorist psychology, and they specifically emphasized the importance of going beyond Markov models to what they called "random systems with complete connections." None of this is referenced in the Sutton & Barto book (which I recently bought used for $20 because Lana said I should have a copy "to yell at").

Eli:

So... semi-Markov models? Or potentially infinitely autoregressive processes?

Maxim Raginsky:

The latter. The basic model is like this: You have two coupled processes x_k and w_k, where x_{k+1} is sampled conditionally on x_k and w_k, but then w_{k+1} is updated depending on w_k and x_{k+1}. In this way, x_k can be made to depend on the entire history w_0, ..., w_{k-1}.
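
In symbols (my transcription of the recursion just described, with f an arbitrary update map):

$$
x_{k+1} \sim P(\,\cdot \mid x_k, w_k\,), \qquad w_{k+1} = f(w_k, x_{k+1}),
$$

so unrolling the second recursion shows that x_{k+1} depends on the entire history w_0, ..., w_k (equivalently, on w_0 and x_1, ..., x_k).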

Alex Tolley:

How effective is RL compared with other optimization methods, for example, simulated annealing?

In the grander scheme of things, Darwinian evolution is a different mechanism from learning (i.e., an individual improving its responses): its criterion is reproductive success. While truly evolutionary genetic algorithms would drive the cost of the needed data centers to astronomical levels, they do allow many goals to be optimized through reproduction. In practice, genetic algorithms typically have a single optimization goal to reduce computational effort, although multiple goals could be allowed with more computation.

Ben Recht:

As with everything in optimization, it depends on the specifics of the problem. Sometimes simulated annealing is better. Sometimes RL is better. Sometimes RL and SA give you the same algorithm.

In both cases, SA and RL are best thought of as meta-algorithms: scaffolding for search into which you place problem-specific details when solving your particular problem.
