*This is the twelfth part of “An Outsider’s Tour of Reinforcement Learning.” Part 13 is here. Part 11 is here. Part 1 is here.*

This series began by describing a view of reinforcement learning as optimal control with unknown costs and state transitions. In the case where everything is known, we know that dynamic programming generically provides an optimal solution. However, when the models and costs are unknown, or when the full dynamic program is intractable, we must rely on approximation techniques to solve RL problems.

How you approximate the dynamic program is, of course, the hard part. Bertsekas recently released a revised version of his seminal book on dynamic programming and optimal control, and Chapter 6 of Volume 2 has a comprehensive survey of data-driven methods to approximate dynamic programming. Though I don’t want to repeat everything Bertsekas covers here, I think describing his view of the problem builds a clean connection to receding horizon control, and bridges the complementary perspectives of classical controls and contemporary reinforcement learning.

## Approximate Dynamic Programming

While I don’t want to belabor a full introduction to dynamic programming, let me try, in as short a space as possible, to review the basics.

Let’s return to our classic optimal control problem:

\[\begin{array}{ll} \mbox{maximize}_{u_t} & \mathbb{E}_{e_t}[ \sum_{t=0}^N R[x_t,u_t] ]\\ \mbox{subject to} & x_{t+1} = f(x_t, u_t, e_t)\\ & \mbox{($x_0$ given).} \end{array}\]Though we can solve this directly on finite time horizons using some sort of batch solver, there is an often a simpler strategy based on *dynamic programming* and the *principle of optimality*: If you’ve found an optimal control policy for a time horizon of length $N$, $\pi_1,\ldots, \pi_N$, and you want to know the optimal strategy starting at state $x$ at time $t$, then you just have to take the optimal policy starting at time $t$, $\pi_t,\ldots,\pi_N$. Dynamic programming then let’s us recursively find a control policy by starting at the final time and recursively solving for policies at earlier times.

On the infinite time horizon, letting $N$ go to infinity, we get a clean statement of the principle of optimality. If we define $V(x)$ to be the value obtained from solving the optimal control problem with initial condition $x$, then we have

\[V(x) = \max_u \mathbb{E}_{e}\left[R[x,u] + V(f(x,u,e))\right]\,.\]This equation, known as Bellman’s equation, is almost obvious given the structure of the optimal control problem. But it defines a powerful recursive formula for $V$ and forms the basis for many important algorithms in dynamic programming. Also note that if we have a convenient way to optimize the right hand side of this expression, then we can find the optimal action by finding the $u$ that minimizes the right hand side.

Classic reinforcement learning algorithms like TD and Q-learning take the Bellman equation as a starting point, and try to iteratively solve for the value function using data. These ideas also form the underpinnings of now-popular methods like DQN. I’d again highly recommend Bertsekas’ survey describing the many different approaches one can take to approximately solve this Bellman equation. Rather than covering this, I’d like to use this as jumping off point to compare this viewpoint to that of receding horizon control.

## Receding Horizon Control

As we discussed in the previous posts, 95% of controllers are PID control. Of the remaining 5%, 95% of those are probably based on receding horizon control (RHC). RHC, also known as *model predictive control* (MPC), is an incredibly powerful approach to controls that marries simulation and feedback.

In RHC an agent makes a plan based on a simulation from the present until a short time into the future. The agent then executes one step of this plan, and then, based on what it observes after taking this action, returns to short-time simulation to plan the next action. This feedback loop allows the agent to link the actual impact of its choice of action with what was simulated, and hence can correct for model mismatch, noise realizations, and other unexpected errors.

Though I have heard MPC referred to as “classical control” whereas techniques like LSTD and Q-learning are more in the camp of “postmodern reinforcement learning,” I’d like to argue that these are just different variants of approximate dynamic programming.

Note that a perfectly valid expression for the value function $V(x_0)$ is the maximal value of the optimization problem

\[\begin{array}{ll} \max_{u_t} & \mathbb{E}_{e_t}[ \sum_{t=0}^N R[x_t,u_t] + V(x_{N+1})]\\ \mbox{subject to} & x_{t+1} = f(x_t, u_t, e_t)\\ & \mbox{($x_0$ given).} \end{array}\]Here we have just unrolled the cost beyond one step, but still collect the cost-to-go $N$ steps in the future. Though this is trivial, it is again incredibly powerful: the longer we make the time horizon, the less we have to worry about the value function $V$ being accurate. Of course, now we have to worry about the accuracy of the state-transition map, $f$. But, especially in problems with continuous variables, it is not at all obvious which accuracy is more important in terms of finding algorithms with fast learning rates and short computation times. There is a tradeoff between learning models and learning value functions, and this is a tradeoff that needs to be better understood.

Though RHC methods appear fragile to model mismatch, because they are only as good as the model, the repeated feedback inside RHC can correct for many modeling errors. As an example, it’s very much worth revisiting the robotic locomotion tasks inside the MuJoCo framework. These tasks actually were designed to test the power of a nonlinear RHC algorithm developed by Tassa, Erez, and Todorov.

Here’s a video of such a controller in action from the 2012:

Fast forward to 2:50 to see the humanoid model we discussed in the random search post. Note that the controller works to keep the robot upright, even when the model is poorly specified. Hence, the feedback inside the RHC loop is providing a considerable amount of robustness to modeling errors. Also note that this demo does not estimate the value function at all. Instead, they simply truncate the infinite time-horizon problem. The receding horizon approximation is already quite good for the purpose of control.

Moreover, the video linked above solves for the controller in 7x real time in 2012. Which is really not bad, and probably with a dedicated engineer, this could be made into real time using up-to-date hardware. However, note that in 2013, the same research group published a cruder version of their controller that they used during the DARPA robotics challenge. The video here is just as impressive:

All these behaviors were generated by MPC in real-time. The walking is not as what can be obtained from computationally intensive long-horizon trajectory optimization, but it looks considerably better than the sort of direct policy search gaits we discussed a previous post.

## Learning in RHC

Is there a middle ground between expensive offline trajectory optimization and real time model-predictive control? I think the answer is yes in the very same way that there is middle ground between learning dynamical models and learning value functions. Performance of a receding control system can be improved by better modeling of the value function which defines the terminal cost. The better a model you make of the value function, the shorter a time horizon you need for simulation, and the closer you get to real-time operation. Of course, if you had a perfect model of the value function, you could just solve the Bellman equation and you would have the optimal control policy. But by having an approximation to the value function, high performance can still be extracted in real-time.

So what if we *learn* to iteratively improve the value function while running RHC? This idea has been explored in a project by my Berkeley colleagues Rosolia, Carvalho, and Borrelli. In their “Learning MPC” approach, the terminal cost is learned by nearest neighbors. The terminal cost of a state is the value obtained last time you tried that state. If you haven’t visited that state, the cost is infinite. This formulation constrains the terminal condition to be in a state observed before. You can explore new ways to decrease your cost on the finite time horizon as long as you reach a state that you have already demonstrated is safe.

This nearest-neighbors approach to control works really well in practice. Here’s an demo of the method on an RC car:

After only a few laps, the learned controller works better than a human operator. Simple nearest-neighbors suffices to learn rather complex autonomous actions. And, if you’re into that sort of thing, you can even prove monotonic increase in control performance. Quantifying the actual learning rate remains open and would be a great problem for RL theorists out there to study. But I think this example cleanly shows how the gap between RHC methods and Q-learning methods is much smaller than it first appears.

## Safety While Learning

Another reason to like this blended RHC approach to learning to control is that one can hard code in constraints on controls, states, and easily incorporate models of disturbance directly into the optimization problem. Some of the most challenging problems in control are how to execute safely while continuing to learn more about a system’s capability, and an RHC approach provides a direct route towards balancing safety and performance. In the next post, I’ll describe an optimization-based approach to directly estimate and incorporate modeling errors into control design.