Defining Reinforcement Learning Down

It's a lot simpler than I realized.

Dec 03, 2025

On Monday, I described reformist reinforcement learning in a paragraph. Today I’ll do it in about 800 words with no equations. I’m indebted to Lior Fox, who, in a back-and-forth in the comment section, helped me synthesize a far better definition than I had.

Lior, a cognitive scientist, has a broad definition of reinforcement learning rooted in the psychological concept of reinforcement learning. A century before computers were playing backgammon with TD Learning, psychologists were building theories of human learning based on feedback. Paraphrasing Thorndike’s Law of Effect, Lior defines reinforcement learning as the iterative process:

Receive external validation on how good you’re currently doing
Adjust what you’re currently doing so that you are better the next time around.

Whether or not this is how humans or animals learn, this is a spot-on definition of computer scientific reinforcement learning. Let me make it more precise.

We have a computer program in feedback with an evaluation environment. The computer produces responses to a series of tests. An external agent then numerically scores these responses. The computer program is fed these scores and internally updates its software for the subsequent evaluation. The process repeats. The goal of the iteration is to produce a program that achieves the highest possible average score when interacting with the evaluation environment. In bullets, computational reinforcement learning is the iterative process:

Produce a collection of responses to evaluation scenarios.
Receive scores on these responses.
Update the computer code based on these scores.

Reinforcement learning is thus a branch of optimization. The objective is to maximize the average score if the program were to be evaluated an infinite number of times. The optimization process is iterative, with feedback based solely on the scores.

You can cook up a lot of examples where you might be able to use a method like this. You could have a game-playing agent play a video game a bunch of times, adapting its strategy based on the score of each round. You could have an autonomous racing quadrotor that iteratively tries new maneuvers to improve its time around a course. You could have a language model whose responses are scored by remote low-wage labor, fine-tuned to match the workers’ preferences. All of these would count as examples of reinforcement learning.

Reformist RL uses a very particular implementation of the update step 3. First, the computer code is a generative model. This model interactively takes input from the evaluation environment and returns a sequence of random samples. In our examples above, the video-game-player randomly chooses its next moves, the quadrotor randomly perturbs its previous headings, and the language model randomly generates its next tokens.

When building generative models from data, the goal is to maximize the probability of some given dataset. In Reformist RL, the generative model generates the data itself. The training data are the records from the evaluation in step 1 and the received scores in step 2. In step 3, you update the generative model by training only on data with positive scores.1 That is, whenever the computer receives high scores, it increases the probability of responding the same way the next time it sees the same test in the evaluation environment.

If you already have code to build generative models, then you can easily turn this code into a reinforcement learning agent. Simply add weights to the updates proportional to the scores received in the evaluation phase. That’s it. This is called policy gradient. The ease of implementation is why Reformist RL is so enticing in large language models. Any code for pretraining can be quickly modified for posttraining. The models are pretrained on sequences of text from the internet. They are postrained on sequences of text generated in various evaluation environments.

Now you might ask, what about all of those Markov Decision Processes that professors torture you with in AI classes? Though it has historically been the other way around in classical artificial intelligence, Reformist RL, like behaviorist psychology, views MDPs as secondary rather than primary. MDPs arise in a very particular evaluation environment where

The computer is scored on a sequence of tests.
The evaluation environment chooses its next test in the sequence only as a function of the current test and the computer’s current answer.
Each test receives progressively less weight in the total score of the evaluation (i.e., discounting).

You can phrase a lot of problems this way, but this comprises a subset of reinforcement learning problems, not a superset.

This characterization of reinforcement learning and Reformist Reinforcement Learning captures every example I know of. It connects nicely to the rest of machine learning. You can teach it in a single class. My survey of reinforcement learning goes from 27 pages to 809 words. I have learned from experience.

Or, more precisely, the higher the score, the more likely you make your response there.

Lior Fox

Dec 3

(Seriously thrilled for being cited on an argmin post...! I'm also very much looking forward for the rest of the RL week)

There's a subtle but important point of difference:

You quote my version of step 1 as: "receive external validation". While the way I wrote (and understand) it is that "evaluate how well you're doing" is an internal task of the agent (because there's no teacher to do it for you. From that perspective, "reward" is simply yet another kind of observation, or even an internal interpretation of some observations. In fact Andrew Barto has written quite a lot on this point.).

It is precisely in this stage (of "evaluate") that the construct of Value appears, in the "standard"/Orthodox model (of MDPs, which I absolutely agree are restrictive), and why I think, again in the context of that RLDM short paper, that the move for PG models *by itself* doesn't solve the fundamental conceptual issues associated with Value, on the neuroscience side of things.

(Thanks to Yoav Goldberg for initiating the short discussion with me about this point on Twitter. I'm posting this here for the wider audience and curious as for how and if you see the difference!)

7 replies by Ben Recht and others

Michael Wick

Dec 4

Long time listener first time caller. Great blog BTW.

At a high level could we consider Lagrangian relaxation or dual decomposition a form of RL? Say we're constraining the output of a generative model — the model generates something, gets feedback via constraint violations and we update the dual variables so it does better in the next round.

2 replies by Ben Recht and others

15 more comments...

arg min

Discussion about this post

Ready for more?