7 Comments
Nov 16, 2023 · Liked by Ben Recht

If you repeatedly solve short-horizon planning problems via value iteration... is it MPC or DP? To me DP is an algorithm for finding globally optimal plans/trajectories via brute force. MPC is a choice about which problems your system is going to solve in order to do something useful in the world. I think that most people implement MPC using local optimization. I like to think about optimal control in terms of Maximum Principle vs Dynamic programming, aka local optimization vs global optimization, aka Soviets vs RAND corp :)

author

> If you repeatedly solve short-horizon planning problems via value iteration... is it MPC or DP?

What kind of maniac would do that? ;)

But more seriously, I see what you're saying and I agree. Perhaps what I meant to distinguish was the choice:

(a) solve the long time horizon problem approximately using techniques we commonly know as ADP. Take the policy from this and run with it.

(b) solve a short time horizon problem to high accuracy, take one step, and re-solve.

But you are proposing a third plan:

(b2) solve a short time horizon problem to low accuracy, take one step, and re-solve.

Perhaps not surprisingly, (b2) can be pretty good in practice. Feedback overcomes a lot. But I think (b2) is a second design decision after the branch between (a) and (b).
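
For concreteness, here's a bare-bones sketch of the loop in (b)/(b2). The names `solve_short_horizon` and `step` are hypothetical stand-ins for whatever trajectory optimizer and plant you have, not anything specific:

```python
def receding_horizon(x0, horizon, num_steps, solve_short_horizon, step):
    """Option (b): plan over a short horizon, apply one control, re-solve."""
    x, trajectory = x0, [x0]
    for _ in range(num_steps):
        # Solve the short-horizon problem from the current state
        # (to high accuracy for (b), or only roughly for (b2)).
        u_plan = solve_short_horizon(x, horizon)
        # Apply only the first control, then re-plan from wherever we land.
        x = step(x, u_plan[0])
        trajectory.append(x)
    return trajectory
```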

What do you think?

I like your maximum principle vs dynamic programming axis too. That's a good one. As always, control design is multifaceted. Although by the 1960s, I don't think this local vs global split was cleanly delineated by the Iron Curtain (see, for example, Bryson).

Nov 17, 2023 · Liked by Ben Recht

I like your distinction between long and short horizon problems. I would argue, though, that MPC is not necessarily motivated by long-horizon problems being intractable while short-horizon problems are not. Both DP and local optimization scale linearly with the time horizon, so it's hard to delineate a point at which the horizon is too distant. I guess the point I was trying to make is that (IMO) DP is just kind of a dumb approach to solving planning problems. Even for short horizons it is usually intractable for systems of moderate state dimension. The whole point of DP is to find an optimal plan from anywhere/everywhere, which is almost never what you want.

All of the RL methods based on (A)DP get around this, I guess, by only performing DP updates on states sampled from experience, or in other words by finding plans that are good only where it matters. But that is the entire philosophy behind local optimization / the maximum principle. I think that is why it is so maddening to me when people use RL in situations where derivative information is available and a proper trajectory optimization algorithm could be used to find plans. RL is definitely useful for amortizing computation, which can still be too slow with local optimization for online decision making. But I still believe that RL policy learning should be based on the maximum principle rather than DP when possible.
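
To make the contrast concrete, here is a toy single-shooting sketch in the local-optimization spirit: it optimizes one open-loop plan from one initial state, rather than a value function over all states. The dynamics `f` and `cost` are hypothetical placeholders, and in practice you'd get the gradient from the adjoint equations or autodiff rather than finite differences; this only shows the structure.

```python
import numpy as np

def rollout_cost(u_seq, x0, f, cost):
    """Total cost of applying the open-loop control sequence u_seq from x0."""
    x, total = x0, 0.0
    for u in u_seq:
        total += cost(x, u)
        x = f(x, u)
    return total

def shoot(x0, f, cost, horizon, n_u, lr=1e-2, iters=500, eps=1e-5):
    """Single shooting: gradient descent on the controls for one initial state."""
    u_seq = np.zeros((horizon, n_u))
    for _ in range(iters):
        base = rollout_cost(u_seq, x0, f, cost)
        grad = np.zeros_like(u_seq)
        # Finite-difference gradient, one control entry at a time.
        for idx in np.ndindex(*u_seq.shape):
            u_pert = u_seq.copy()
            u_pert[idx] += eps
            grad[idx] = (rollout_cost(u_pert, x0, f, cost) - base) / eps
        u_seq -= lr * grad
    return u_seq
```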

Here's a cool article with some history about the two camps of optimal control in the 50s:

https://www.math.uni-bielefeld.de/documenta/vol-ismp/48_pesch-hans-josef-cold-war.pdf

author

Great points. Do you have thoughts on what RL based on the maximum principle would look like?

I'm going to blog about RL next week (gulp), and I'm sure I'll get plenty of hate mail. But the term "RL" seems to have lost all of its meaning. In the 90s, we agreed it was approximate dynamic programming. But now, it seems to be any time machine learning is used in something that might look like a control setting.

Honestly, sometimes it's even beyond this: RLHF is supervised learning. It's called "RLHF" because they use a garbage algorithm stolen from RL to badly solve their optimization problem. I have no patience for this RL imperialism.


A maniac who works on resource allocation for the electric grid, Ben! We need to continuously allocate DERs, for example, to manage ancillary services and real-time energy demands :).

Nov 16, 2023 · Liked by Ben Recht

But how do you "approximate the terminal cost in some reasonable way" for MPC? I can see how heuristics can be found in fairly easy problems. As soon as the problem becomes complex or we approach large T or infinite horizon problems, I guess you want to have heard of approximate dynamic programming.

Or am I missing something?

author
Nov 16, 2023 · edited Nov 16, 2023

The key thing I came to appreciate is that in optimal control, the cost is how you design behavior. The dynamics are constraints of reality, but there's a ton of flexibility in the cost.

In this regard, assuming an infinite time horizon with unchanging costs is itself a model! That's the one approximate dynamic programming assumes (often with the weird addition of a discount factor, which is also very much a parameter chosen by the designer).

But there are other options. Heuristics about board strength in chess show us that you can just aim for an advantageous but not optimal final state. Or you can design the terminal cost to enforce safety, so you try to garner as much reward over the time horizon while ending in a decent final condition. There are tons of options at your disposal as a designer, and which you choose will depend on your specifications.
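
For concreteness, the finite-horizon problem solved at each step looks something like the standard formulation below, where the terminal cost $V_f$ and the terminal set $\mathcal{X}_f$ are exactly the designer-chosen pieces I'm talking about (my own notation, just a sketch):

$$
\min_{u_0,\dots,u_{T-1}} \; \sum_{t=0}^{T-1} c(x_t, u_t) + V_f(x_T)
\quad \text{subject to } x_{t+1} = f(x_t, u_t),\; x_T \in \mathcal{X}_f .
$$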
