This is super interesting, thanks for sharing that paper and your thoughts.

It reminds me of two things that may [or may not] be somehow related:

1. There is some prior work on how, in some problems, choosing the optimal action is easy even when fully evaluating the actions is hard. In particular I'm reminded of this ICML 2016 paper by Şimşek, Algorta, and Kothiyal (https://proceedings.mlr.press/v48/simsek16.html) regarding Tetris. There is some theory in that paper explaining why and when this might be the case. I'm not sure how well it generalizes to domains such as ALE or MuJoCo, but at least at the level of the basic observation, the two seem related.

2. Even more speculative and hand-wavy, regarding the correlation of policy iteration (PI) performance w/ PPO performance: maybe this is another hint that all these methods basically do the same thing. In particular, there is a close connection between TRPO and Natural Policy Gradient, and there is a (sort of) equivalence between NPG and PI, at least for tabular softmax policies (Kakade 2001). If the "deep" models rely heavily on memorization of individual state–action pairs, then perhaps it's not entirely unreasonable that most gradient steps go into "building the lookup table" and the actual policy-gradient part is smaller (hence the success where a single PI step would suffice).
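On the NPG–PI connection, a quick numpy sketch may make this concrete: for a tabular softmax policy, the Fisher preconditioning collapses the natural gradient update to θ(s,a) ← θ(s,a) + η·A^π(s,a), so a large step size η recovers the greedy (policy iteration) update. The 2-state MDP below is made up purely for illustration, not taken from the paper:

```python
import numpy as np

# Made-up 2-state, 2-action deterministic MDP:
# P[s, a] = next state, R[s, a] = reward.
P = np.array([[0, 1], [0, 1]])
R = np.array([[0.0, 1.0], [0.0, 2.0]])
gamma = 0.9

def q_values(theta):
    """Exact Q^pi (and pi) for the tabular softmax policy given by theta."""
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    # Solve the Bellman system V(s) = sum_a pi(a|s) * (R[s,a] + gamma * V[P[s,a]]).
    n = len(P)
    M, b = np.eye(n), np.zeros(n)
    for s in range(n):
        for a in range(2):
            b[s] += pi[s, a] * R[s, a]
            M[s, P[s, a]] -= gamma * pi[s, a]
    V = np.linalg.solve(M, b)
    return R + gamma * V[P], pi

theta = np.zeros((2, 2))                       # uniform initial policy
Q, pi = q_values(theta)
adv = Q - (pi * Q).sum(axis=1, keepdims=True)  # advantages A^pi(s, a)
theta = theta + 100.0 * adv                    # one NPG step with a large step size
# The updated policy is (near-)greedy w.r.t. Q, i.e., a policy iteration step.
```

With a small step size the same update is an incremental interpolation between the current and greedy policies, which is the usual way to read the Kakade result.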

1. One of the funny things about studying games is we always forget games are designed so people will play them. It cannot be the case that Tetris requires deep planning or else it would never have been successful. I'm looking forward to reading this paper that proves this to be the case.

2. As an optimization guy, I 100% agree with you that these methods are all very similar and none are particularly effective. When you dig into the details, they are all fairly naive derivative-free optimization methods. But they have so much crazy notation, math, and terminology that it's often hard to figure out what the methods are doing.

I haven't read the paper yet, so I'm not sure about the amount of memory involved, but I'm curious: why do you consider a state space of 10 million to be small? As far as ease is concerned, do you think it's due to the fact that most of these environments are deterministic? (Maybe that's why policy iteration converges quickly?)

I suppose 10M could be large if you had to specify all state transitions, but most of these demos are deterministic, so you just need a lookup table of size S x A.
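To put numbers on this (a sketch with made-up sizes, nothing from the paper): a deterministic MDP is fully specified by two S × A lookup tables, and even value iteration over them is just array indexing, so millions of states fit comfortably on one machine:

```python
import numpy as np

# A deterministic MDP is just two S x A lookup tables (toy sizes, made up).
S, num_actions, gamma = 1000, 2, 0.95
rng = np.random.default_rng(0)
next_state = rng.integers(0, S, size=(S, num_actions))  # next_state[s, a]
reward = rng.random((S, num_actions))                   # reward[s, a]

V = np.zeros(S)
for _ in range(1000):                                   # value iteration
    V = (reward + gamma * V[next_state]).max(axis=1)

greedy_policy = (reward + gamma * V[next_state]).argmax(axis=1)
# At 10M states and ~20 actions, these tables are ~2e8 entries --
# a few gigabytes, but nothing a single machine can't hold.
```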

I don't think determinism is the issue here. You can cook up some very hard deterministic search problems.

This paper does expose that DQN/PPO etc. are not all that much. But your takeaway is too bleak: RL has solved *some* hard problems (e.g., Go), but the key missing ingredients for doing this more generally are exploration, planning, and maybe generalization, whatever that means. These are problems we can, and should, work on!

My first substack comment! I appreciate you.

You are 100% right: RL is *great* at board games. If I had written a more careful post I would have mentioned this. But I think there is something decidedly special about board games. These might be the only "real" problems that are solvable using reinforcement learning.

But also, no one uses Go as a benchmark in their AI/ML papers. Why is that?

Long time reader (and appreciator) of your blog :)

I agree that board games are special: their model is either known or easily specifiable. This enables long-horizon planning and sidesteps exploration. Unfortunately for RL researchers, the model of the real world is not easily specifiable, and there is no way around figuring out the hard problems of exploration and model-based RL. Would love to hear your take on what makes board games especially suitable for RL =)

I agree with you. Your comment reminds me of this quote from Arthur Samuel (1959) about why games are ideal for machine learning: "A game provides a convenient vehicle for such study as contrasted with a problem taken from life, since many of the complications of detail are removed."

I also think tree search carries a lot of the work in board games. Board games (and to some extent poker) are easier to specify as game trees than many other RL problems.
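As a toy illustration of how little machinery a fully specified game tree needs (this Nim variant is my own example, unrelated to Go or poker): exhaustive negamax search gives the exact game value in a few lines.

```python
def negamax(sticks):
    """Exact value for the player to move in a tiny Nim variant:
    take 1 or 2 sticks per turn; whoever takes the last stick wins."""
    if sticks == 0:
        return -1                     # the previous player took the last stick
    return max(-negamax(sticks - m) for m in (1, 2) if m <= sticks)

# Losing positions turn out to be exactly the multiples of 3.
```

When the simulator is this cheap and exact, the "model" problem disappears, which is a big part of what makes board games so amenable to search.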

Thank you for sharing our work!

Just a (late) clarification on the state space size: we limited the horizon of the Atari games in this paper to make it tractable to enumerate the entire state space. On full-horizon Atari games, the state spaces are exponentially larger.

However, we did find that our RL algorithm, GORP, which implements approximate one-step lookahead—i.e., one step of approximate policy iteration—does just as well as PPO and DQN on the full-horizon Atari environments too (Figure 3 in the paper). So even though we can't exactly verify that one step of policy iteration solves them, this appears to be the case in practice.
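For concreteness, here is my reading of "approximate one-step lookahead" as a toy sketch — the environment and names (ChainEnv, greedy_one_step) are mine, not the paper's actual GORP implementation: estimate each action's return with Monte Carlo rollouts of the current policy, then act greedily.

```python
import random

class ChainEnv:
    """Toy deterministic chain: reach the right end for a reward of 1."""
    def __init__(self, pos=0, length=5):
        self.pos, self.length = pos, length
    def step(self, a):                          # a: 0 = left, 1 = right
        self.pos = max(0, min(self.length, self.pos + (1 if a == 1 else -1)))
        done = self.pos == self.length
        return self.pos, (1.0 if done else 0.0), done

def mc_return(state, first_action, policy, horizon=20, gamma=0.99):
    """Return of taking first_action at `state`, then following `policy`."""
    env = ChainEnv(pos=state)
    s, r, done = env.step(first_action)
    ret, disc = r, gamma
    while not done and horizon > 0:
        s, r, done = env.step(policy(s))
        ret += disc * r
        disc *= gamma
        horizon -= 1
    return ret

def greedy_one_step(state, policy, actions=(0, 1), n_rollouts=100):
    """One step of approximate policy iteration: pick the action whose
    average Monte Carlo return under the current policy is highest."""
    avg = lambda a: sum(mc_return(state, a, policy) for _ in range(n_rollouts)) / n_rollouts
    return max(actions, key=avg)

random.seed(0)
weak_policy = lambda s: random.choice((0, 1))   # the "current" policy
best = greedy_one_step(4, weak_policy)          # one step short of the goal
```

Even with a random base policy, the lookahead picks the move that reaches the goal immediately, which is the sense in which one greedy step over rollouts already improves the policy.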
