Discussion about this post

Lior Fox

This is super interesting, thanks for sharing that paper and your thoughts.

It reminds me of two things that may [or may not] be somehow related:

1. There has been work showing that, in some problems, choosing the optimal action is easy even when fully evaluating the actions is hard. In particular I'm reminded of the ICML 2016 paper by Şimşek, Algorta, and Kothiyal (https://proceedings.mlr.press/v48/simsek16.html) on Tetris, which includes some theory explaining why and when this can happen. I'm not sure how well it generalizes to domains such as ALE or MuJoCo, but at least at the level of the basic observation, the two seem related.

2. More speculative and hand-wavy: regarding the correlation of PI performance with PPO performance, maybe this is another hint that all of these methods are basically doing the same thing. In particular, TRPO is closely connected to Natural Policy Gradient, and there is a (sort of) equivalence between NPG and policy iteration, at least for tabular softmax policies (Kakade 2001); a rough sketch is below. If the "deep" models rely heavily on memorizing individual state-action pairs, then it's perhaps not unreasonable that most gradient steps go into "building the lookup table" while the actual policy-gradient part is smaller (hence the success in settings where a single PI step suffices).
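
To make the hand-waving slightly more concrete, the NPG/PI connection I have in mind is the following (my own notation, sketched from memory of the Kakade-style tabular setting, so take the constants with a grain of salt). For a tabular softmax policy \pi_\theta(a|s) \propto \exp(\theta_{s,a}), one natural policy gradient step with step size \eta shifts each logit by the advantage:

    \theta_{s,a} \leftarrow \theta_{s,a} + \eta \, A^{\pi_\theta}(s,a)

which, back in policy space, is the multiplicative update

    \pi^{+}(a|s) \propto \pi_\theta(a|s) \exp\!\big(\eta \, A^{\pi_\theta}(s,a)\big)

i.e. a soft policy-improvement step; as \eta \to \infty the new policy becomes greedy with respect to A^{\pi_\theta}, which is exactly one step of policy iteration.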

Palash Chatterjee

I haven't read the paper yet, so I'm not sure about the amount of memory involved, but I'm curious why you consider a state space of 10 million states to be small. As far as ease is concerned, do you think it is due to the fact that most of these environments are deterministic? (Maybe that is why policy iteration converges quickly?)
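
For context on the memory side, here is my own back-of-envelope (nothing from the paper; I'm assuming float64 entries and an Atari-sized action set of 18, which may not match their setup):

    # Rough memory footprint of tabular value/Q tables for 10 million states.
    # Assumptions (mine, not the paper's): 18 actions, float64 entries.
    n_states = 10_000_000
    n_actions = 18
    bytes_per_entry = 8

    v_table_gb = n_states * bytes_per_entry / 1e9              # V(s): ~0.08 GB
    q_table_gb = n_states * n_actions * bytes_per_entry / 1e9  # Q(s, a): ~1.44 GB
    print(f"V table: {v_table_gb:.2f} GB, Q table: {q_table_gb:.2f} GB")

so the raw tables themselves fit comfortably in memory (ignoring whatever is needed to store the transition model), which is part of why I'm wondering what "small" refers to here.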
