Nov 8, 2023 · Liked by Ben Recht

One axis that isn't clearly included here is online vs. offline data collection. In many cases this might not be particularly important, but I think it's crucial for distinguishing bandits from "full" RL. I wouldn't identify few vs. many actions as the difference between bandits and RL -- there are bandit problems with complex action spaces, and RL problems with small action spaces.

I think "sequential" is hiding two distinct phenomena: data collection vs. dynamics/impacts. In bandit settings, the only long-term effects of actions come through online data collection, while the underlying environment remains unaffected. Something I struggle with is succinctly naming this second sense of "sequential," in which actions impact the decision environment. "Dynamics" isn't quite right, since an environment can be non-stationary but uncontrollable. I often use the word "impact," but I'm not sure it gets the point across. It basically boils down to whether the optimal (model-based) policy is greedy (H=1) or not.
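To make the H=1 point concrete, here is a minimal sketch with a hypothetical two-state deterministic MDP (the states, actions, and rewards are all illustrative): when actions impact the environment, the one-step-greedy policy can be arbitrarily worse than a policy that plans ahead.

```python
# Two-state deterministic MDP: reward[state][action], next_state[state][action].
# In s0, "stay" pays 1 now; "move" pays 0 but reaches s1, where "stay" pays 10.
reward = {"s0": {"stay": 1.0, "move": 0.0}, "s1": {"stay": 10.0, "move": 0.0}}
nxt = {"s0": {"stay": "s0", "move": "s1"}, "s1": {"stay": "s1", "move": "s0"}}

def rollout(policy, state="s0", horizon=2):
    total = 0.0
    for _ in range(horizon):
        a = policy(state)
        total += reward[state][a]
        state = nxt[state][a]
    return total

greedy = lambda s: max(reward[s], key=reward[s].get)  # H=1 lookahead
planned = lambda s: "move" if s == "s0" else "stay"   # 2-step lookahead

print(rollout(greedy))   # 2.0: greedy never leaves s0
print(rollout(planned))  # 10.0: forgoing reward now unlocks more later
```

If the transition table were the identity (actions never change the state), the greedy policy would be optimal for any horizon — which is exactly the bandit case.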


Excellent point. I love "impact" as the key distinction. In some sense, for bandit problems, the data collection can be "online," but, as you note, it's impossible to tell the difference between online and offline. The arms wouldn't change either way. Similarly, if you solve some stochastic optimization problem with online gradient descent, you'll get the same answer as if you solved it with any other global method.
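A quick numerical sketch of that last point (the least-squares problem and all constants here are illustrative): running online gradient descent over individual samples lands at essentially the same minimizer as the closed-form global solution, because the objective itself doesn't depend on how you collected or ordered the data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)

# "Any other global method": the closed-form least-squares solution
w_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# Online gradient descent, one sample at a time, several passes
w = np.zeros(3)
step = 0.01
for _ in range(200):
    for i in rng.permutation(len(b)):
        grad = (A[i] @ w - b[i]) * A[i]
        w -= step * grad

print(np.linalg.norm(w - w_star))  # tiny: both reach essentially the same minimizer
```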

But then there are problems where your decision at stage 1 impacts the outcomes at stage 2. This is where you have to grapple with the key issues of multi-stage planning, feedback, etc.

I even like the term "impact." Why don't you think it gets across what you are trying to explain?


Well, "impact" is the best thing I've come up with so far. And I also like it! But sometimes I feel like it doesn't get across the idea of sequential decision making. Without further explanation, I think people often understand "impact" as distinct from "control" or even from MDPs.


Nice post! I had a thought / reaction which is very similar to Sarah's. I usually distinguish between "information state" MDPs and "normal" MDPs. In the former, the environment is unchanged by the agent's actions, but the agent's belief state about the reward function changes. (For a finite number of actions, this gives contextual bandits; for continuous action spaces, this gives something related to BayesOpt — see this nice tutorial by Marc Toussaint for the connections, https://www.youtube.com/watch?v=5rev-zVx1Ps). In the "normal" MDP case, you need to model both the dynamics of the environment and the dynamics of the belief state.

I think you can unify both if you define a joint model like this:

p(bel(t), env(t), act(t), obs(t) | bel(t-1), env(t-1))

= p(act(t) | bel(t-1)) // policy

* p(obs(t) | env(t-1), act(t)) // nature's "reward" model

* p(bel(t) | bel(t-1), act(t), obs(t)) // deterministic Bayes rule (or some other estimator)

* p(env(t) | env(t-1), act(t)) // identity for bandits: actions have no impact
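A minimal sketch of one step of this factored model, specialized to a two-armed Bernoulli bandit (the arm means, the Thompson-sampling policy, and the Beta-Bernoulli belief are all illustrative choices, not part of the comment above):

```python
import numpy as np

rng = np.random.default_rng(1)

env = np.array([0.3, 0.7])  # true arm means; unchanged by actions
bel = np.ones((2, 2))       # Beta(alpha, beta) belief state per arm

def step(bel, env):
    # p(act | bel): Thompson sampling from the belief state
    act = int(np.argmax(rng.beta(bel[:, 0], bel[:, 1])))
    # p(obs | env, act): nature's "reward" model
    obs = int(rng.random() < env[act])
    # p(bel' | bel, act, obs): deterministic Bayes rule (Beta-Bernoulli update)
    bel = bel.copy()
    bel[act] += (obs, 1 - obs)
    # p(env' | env, act): identity -- for bandits, actions have no impact
    return bel, env

for _ in range(1000):
    bel, env = step(bel, env)

print(bel[:, 0] / bel.sum(axis=1))  # posterior means; the better arm is identified
```

Swapping the identity env transition for a real dynamics model is exactly what turns this into the "normal" MDP case.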
