16 Comments
Lior Fox:

(Seriously thrilled to be cited on an argmin post...! I'm also very much looking forward to the rest of RL week.)

There's a subtle but important point of difference:

You quote my version of step 1 as "receive external validation," whereas the way I wrote (and understand) it, "evaluate how well you're doing" is an internal task of the agent, because there's no teacher to do it for you. From that perspective, "reward" is simply yet another kind of observation, or even an internal interpretation of some observations. In fact, Andrew Barto has written quite a lot on this point.

It is precisely in this "evaluate" stage that the construct of Value appears in the "standard"/orthodox model (of MDPs, which I absolutely agree are restrictive), and why I think, again in the context of that RLDM short paper, that the move to PG models *by itself* doesn't solve the fundamental conceptual issues associated with Value on the neuroscience side of things.

(Thanks to Yoav Goldberg for initiating the short discussion with me about this point on Twitter. I'm posting this here for the wider audience, and I'm curious how, and whether, you see the difference!)

Ben Recht:

With respect to the idea that an agent can self-evaluate, I can see how this is useful in neuroscience, but it is a pipe dream in engineering practice. Cost function design is part of engineering. No RL system actually does this, which is why I changed your definition.

I also really don't like how Reformist RL people throw around the terms "value function" and "advantage function." These have precise definitions in the context of MDPs/dynamic programming, but people writing RL papers seldom even know what they are. And when you press them for definitions, you get weird stuff back!

(For example, I have one person in the comments trying to scold me about how DeepSeek uses value functions in training LLMs, and that person definitely doesn't know the definition of value functions.)

Yoav Goldberg:

With respect to self-evaluation being a pipe dream in engineering: while I agree it certainly isn't how things *are* currently done, I do think this particular aspect is actually easy to fix (and I'm quite perplexed as to why it isn't done this way already).

Even if the reward function is not learned, it can still be considered part of the agent rather than part of the environment. The agent performs an action and sees an observation. It (the agent, not the environment) then translates this observation into a reward; in practice, this can be done using the same fixed, engineered function that currently computes the reward from the environment.

While this may seem like (and to a large extent is) purely a notational difference, I do think it is cleaner and more aligned with Lior's definition. It also allows more flexibility: you can have two agents in the same environment but with different "goals" (which translate into two different reward functions), who will end up learning different policies.
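
A minimal sketch of what this could look like, assuming some policy object with an `act` method (the class and names here are purely illustrative, not from any existing library):

```python
class Agent:
    """An agent that computes its own reward from raw observations.

    reward_fn is the same fixed, engineered function that would otherwise
    live inside the environment; moving it here is exactly the notational
    shift described above.
    """

    def __init__(self, policy, reward_fn):
        self.policy = policy
        self.reward_fn = reward_fn  # the agent's "goal"

    def step(self, observation):
        reward = self.reward_fn(observation)        # internal evaluation, not an environment signal
        return self.policy.act(observation, reward)

# Two agents sharing one environment, with different goals (hypothetical reward functions):
# agent_a = Agent(policy_a, reward_fn=lambda obs: obs["score"])
# agent_b = Agent(policy_b, reward_fn=lambda obs: -obs["energy_used"])
```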

Ben Recht:

Yeah, this makes a lot of sense. But I do think that it's important to emphasize that this is *not* how computational RL is done or conceived right now. Let me give a simple example to show why the distinction is important: on Tuesday, I showed you could use RL to maximize functions. The resulting algorithm was a finite-difference approximation to gradient descent along randomly chosen lines. This algorithm only makes sense if the optimizer doesn't have access to the reward function. If you had the reward function, normal gradient descent would be far, far faster than RL.
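
(For readers who missed Tuesday's post, here is a rough sketch of the kind of algorithm I mean; the function name, step sizes, and iteration count are placeholders, not the exact code from that post.)

```python
import numpy as np

def random_direction_ascent(f, x0, iters=1000, step=0.1, delta=0.01, seed=0):
    """Maximize f using only function evaluations (no gradient access).

    Each iteration picks a random direction, estimates the slope of f along
    that line with a central finite difference, and steps along the line.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)                                   # random unit direction
        slope = (f(x + delta * u) - f(x - delta * u)) / (2 * delta)
        x += step * slope * u                                    # move along the chosen line
    return x

# Example: maximizing a concave quadratic drives x toward the point (3, ..., 3).
x_hat = random_direction_ascent(lambda z: -np.sum((z - 3.0) ** 2), np.zeros(5))
```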

I know there are some cases where you have the reward function but it's still hard to compute a good gradient or to estimate an environment model cleanly. But just because it is hard doesn't mean it's impossible. And when you do this, you'll always be more efficient and reliable. I tried to write about this in more detail in today's post.

Yoav Goldberg:

Wait... you have the reward function, but you still don't control its inputs. So how can you do normal GD? What am I missing?

(Also, I agree this is not how things are done. But I do think it's how they should be.)

Ben Recht:

I liked your gist:

https://gist.github.com/yoavg/3eb3e722d38e887a0a8ac151c62d9617

(I'm linking it here so other argmin readers will see it.)

My only disagreement is that you are describing something that is too big for me to call reinforcement learning. I don't want to let RL have this giant umbrella where it covers all of autonomous systems theory. The field hasn't earned that.

Your broad definition applies perfectly well to a thermostat. I refuse to call a thermostat a reinforcement learning system. Thermostats don't run policy gradient! What they do is much simpler and much more effective.

I want the narrowest possible definition of reinforcement learning. That field has been overly technical, misleading, and misguided, and should have as small a surface area as possible. If anything, I want to narrow my definition from Wednesday's post: you could argue that my definition would include the perceptron. The perceptron, in the language of cogsci, is punishment learning, not reinforcement learning.

Eli:

Lior, are you signing up for my Abolish the Value Function movement?

Lior Fox:

I'm not so sure we should.

The value function is a direct consequence of a particular style of sequential decision-making process with a specific objective. While restrictive and not fully general, it's a model that might still capture a lot.

But even more generally, the idea of "breaking down" the problem by estimating how "good" individual states are is probably a very useful abstraction for learning and behavior.

I think this abstraction might actually go beyond the standard MDP case. For that, we definitely need to _revise_ the concept of what exactly a value function is, because the standard one ceases to exist very quickly once you start relaxing the assumptions of the "standard" model.

Michael Wick:

Long-time listener, first-time caller. Great blog, BTW.

At a high level, could we consider Lagrangian relaxation or dual decomposition a form of RL? Say we're constraining the output of a generative model: the model generates something, gets feedback via constraint violations, and we update the dual variables so it does better in the next round.

Ben Recht:

I'm intrigued by this connection. Do you have a specific generative model and constraint set in mind?

Michael Wick:

This may not be the best example, but what comes to mind is that many summers ago we wanted to put hard constraints on the output of a seq2seq model. The model was a "grammar as a foreign language"-style RNN that translated between natural language sentences and a linearization of their parse trees (S-expressions). The constraints were things like: every word in the input must appear in the output S-expression, no extra words can appear, the S-expression must be well-formed, and so on. These are almost like "verifiable rewards" in a sense, but they could also be transcribed into linear constraints.

If we consider the model and dual variables collectively, we could imagine solving this with REINFORCE on the dual variables, which I think would be equivalent, or very similar, to Lagrangian relaxation. It also potentially opens the door to approximating the dual variables with the model itself and updating the model weights instead, using backprop. Now it looks even more like RL, though the nonlinearity of the model probably destroys any guarantees Lagrangian relaxation normally affords.
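
Schematically, the loop I have in mind looks something like this (a hedged sketch: `decode` and `violations` are placeholders for the constrained seq2seq decoder and its constraint checks, and the step size is arbitrary):

```python
import numpy as np

def dual_ascent(decode, violations, num_constraints, rounds=50, step=0.1):
    """Lagrangian-relaxation-style loop for a constrained generative model.

    decode(lam)     -> an output produced while penalizing constraints with weights lam
    violations(out) -> vector g(out); entry i is positive when constraint i is violated
    """
    lam = np.zeros(num_constraints)              # one dual variable per constraint
    for _ in range(rounds):
        out = decode(lam)                        # the model "acts" under the current penalties
        g = violations(out)                      # feedback: which constraints were violated
        lam = np.maximum(lam + step * g, 0.0)    # projected subgradient ascent on the dual
    return lam
```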

Maxim Raginsky:

I have a 1969 book called _Random Processes and Learning_ by two Romanian mathematicians, Iosifescu and Theodorescu. Part of their project was to develop a mathematical theory to go with behaviorist psychology, and they specifically emphasized the importance of going beyond Markov models to what they called "random systems with complete connections." None of this is referenced in the Sutton & Barto book (which I recently bought used for $20 because Lana said I should have a copy "to yell at").

Eli:

So... semi-Markov models? Or potentially infinitely autoregressive processes?

Maxim Raginsky:

The latter. The basic model is like this: You have two coupled processes x_k and w_k, where x_{k+1} is sampled conditionally on x_k and w_k, but then w_{k+1} is updated depending on w_k and x_{k+1}. In this way, x_k can be made to depend on the entire history w_0, ..., w_{k-1}.
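
In symbols (my transcription of the recursion just described, with f an arbitrary update map):

$$
x_{k+1} \sim P(\,\cdot \mid x_k, w_k\,), \qquad w_{k+1} = f(w_k, x_{k+1}),
$$

so unrolling the second recursion shows that x_{k+1} depends on the entire history w_0, ..., w_k (equivalently, on w_0 and x_1, ..., x_k).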

Alex Tolley:

How effective is RL compared with other optimization methods, for example, simulated annealing?

In the grander scheme of things, Darwinian evolution is a different mechanism from learning (i.e., an individual improving its responses): its criterion is reproductive success. While truly evolutionary genetic algorithms would drive the cost of the needed data centers to astronomical levels, they do allow many goals to be optimized through reproduction. In practice, genetic algorithms typically have a single optimization goal to reduce computational effort, although multiple goals could be allowed with more computation.

Ben Recht:

As with everything in optimization, it depends on the specifics of the problem. Sometimes simulated annealing is better. Sometimes RL is better. Sometimes RL and SA give you the same algorithm.

In both cases, SA and RL are best thought of as meta-algorithms: scaffolding for search into which you place problem-specific details when solving your particular problem.
