Discussion about this post

Sarah Dean:

One axis that isn't clearly included here is online vs. offline data collection. In many cases this might not be particularly important, but I think it's crucial for distinguishing bandits from "full" RL. I wouldn't identify few vs. many actions as the difference between bandits and RL -- there are bandit problems with complex action spaces, and RL problems with small action spaces.

I think "sequential" is hiding two distinct phenomena: data collection vs. dynamics/impacts. In bandit settings, the only long term effects of actions are due to online data collection, while the underlying environment remains unaffected. Something I struggle with is succinctly naming this second meaning of sequential, in which actions impact the decision environment. "Dynamics" isn't quite right, since an environment can be non-stationary but uncontrollable. I often use the word "impact" but I'm not sure it gets the point across. It basically boils down to whether the optimal (model-based) policy is greedy (H=1) or not.

Kevin:

Nice post! I had a thought / reaction which is very similar to Sarah's. I usually distinguish between "information state" MDPs and "normal" MDPs. In the former, the environment is unchanged by the agent's actions, but the agent's belief state about the reward function changes. (For a finite number of actions, this gives contextual bandits; for continuous action spaces, this gives something related to BayesOpt - see this nice tutorial by Marc Toussaint for the connections, https://www.youtube.com/watch?v=5rev-zVx1Ps). In the "normal" MDP case, you need to model both the dynamics of the environment and the dynamics of the belief state.

I think you can unify both if you define a joint model like this:

p(bel(t), env(t), act(t), obs(t) | bel(t-1), env(t-1))
  = p(act(t) | bel(t-1))                        // policy
  * p(obs(t) | env(t-1), act(t))                // nature's "reward" model
  * p(bel(t) | bel(t-1), act(t), obs(t))        // deterministic Bayes rule (or some other estimator)
  * p(env(t) | env(t-1), act(t))                // for bandits, actions have no impact
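
Read as a generative process, the bandit special case of this factorization can be sketched in a few lines of Python; the Beta-Bernoulli belief and the Thompson-sampling policy below are assumptions chosen for the sketch, not something the factorization itself commits to.

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.7])        # env(t): fixed, since actions have no impact
alpha = np.ones(2)                       # bel(t): Beta posterior over each arm's mean
beta = np.ones(2)

for t in range(1000):
    # p(act(t) | bel(t-1)): Thompson sampling from the current posterior
    act = int(np.argmax(rng.beta(alpha, beta)))
    # p(obs(t) | env(t-1), act(t)): nature's "reward" model
    obs = float(rng.random() < true_means[act])
    # p(bel(t) | bel(t-1), act(t), obs(t)): deterministic Bayes rule
    alpha[act] += obs
    beta[act] += 1.0 - obs
    # p(env(t) | env(t-1), act(t)): identity for bandits -- actions have no impact

print(alpha / (alpha + beta))            # posterior means concentrate near true_means

For a "normal" MDP, the last step would be an actual transition model p(env(t) | env(t-1), act(t)), and the policy would need to look ahead rather than act greedily on the current belief.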

