*This is the fifth part of “An Outsider’s Tour of Reinforcement Learning.” Part 6 is here. Part 4 is here. Part 1 is here.*

The first two parts of this series highlighted two parallel *aspirations* of the current research in RL: part 1 described reinforcement learning as prescriptive analytics and part 2 as optimal control. This post, by contrast, is going to focus on how people typically *use* RL, both in practice and in papers. The reality of RL is often quite different than the rhetoric, and I want to spend time here to separate the two so it will be easier to understand the limits of different methodologies and algorithms.

There are a set of rules that are abided by, and the rules are agreed upon by loose precedent. I want to delineate these rules and then describe the connections to established inquiries in control system design and analysis.

## Trajectories and Policies

Let’s begin by revisiting our abstract dynamical system model

\[x_{t+1} = f( x_t, u_t, e_t)\,.\]Again, $x_t$ is the *state* of the system, $u_t$ is the control action, and $e_t$ is a random disturbance. We’re going to assume that $f$ is fixed, but unknown.

I will refer to a *trajectory* as a sequence of states and control actions generated by this dynamical system.

A *control policy* (or simply “a policy”) is a function, $\pi$, that takes a trajectory from a dynamical system and outputs a new control action. Note that $\pi$ only gets access to previous states and control actions.

For example, in LQR on long time horizons we know that the policy

\[\pi(\tau_t) = K x_t\]will be nearly optimal for a fixed matrix $K$. But arbitrarily complicated policies are possible for general RL problems.

Optimal control and reinforcement learning can be equivalently posed as trying to find the policy that maximizes an expected reward.

## Go ask the oracle

Recall that our main goal in reinforcement learning is to solve the optimal control problem

\[\begin{array}{ll} \mbox{maximize}_{u_t} & \mathbb{E}_{e_t}[ \sum_{t=0}^N R_t[x_t,u_t] ]\\ \mbox{subject to} & x_{t+1} = f(x_t, u_t, e_t)\\ & \mbox{($x_0$ given).} \end{array}\]But we assume that we don’t know the function $f$. Some lines of work even assume that we don’t know the reward function $R$. I personally feel that not knowing $R$ is unrealistic for any practical problem, and, moreover, that $R$ is actually a design parameter in engineering applications. However, for the purpose of this post, it will make no difference as to whether $R$ is known or unknown.

The important point is that since we can’t solve this optimization problem using standard optimization methods unless we know the function $f$ governing the dynamics. We must learn something about the dynamical system and subsequently choose the best policy based on our knowledge. How do we measure success? We need to balance both the final expected reward of our policy and the number of times we need to interrogate the system to find this policy.

The main paradigm in contemporary RL is to play the following game. We decide on a policy $\pi$ and horizon length L. Then we either pass this policy to a simulation engine or to a real robotic system and are returned a trajectory

\[\tau_L = (u_1,…,u_{L-1},x_0,…,x_L)\,,\]where $u_t = \pi(\tau_t)$. This is our *oracle model*. We typically want to minimize the total number of samples computed by the oracle. So if we were to run $m$ queries with horizon length $L$, we would pay a total cost of $mL$. However, we are free to vary our horizon length for each experiment.

So let’s denote by $n$ the total number of oracle accesses. At the end of the day we want the expected reward to be high for our derived policy, but we also need the number of oracle queries to be small.

Phew. This is already complicated! Note that this framing of the problem makes it very hard to decide on a “best” algorithm. Do we decide an algorithm is best if it achieves some reward in the fewest number of samples? Or is an algorithm best if it achieves the highest rewards given a fixed budget of samples? Or maybe there’s a middle ground? I’ll return to such issues about measuring the relative abilities of different RL methods later in this series.

## Iterative Learning Control

Control theorists have a different name for this RL game. They call it *iterative learning control* (ILC). In ILC, the focus is on designing control systems that perform a repetitive task, and the design is refined by leveraging repetition. A common example is learning to track a trajectory, and the input control is improved by adjustment with respect to the deviation from the desired trajectory in previous iterations. ILC is a useful and mature sub-discipline of control theory and has achieved many industrial success stories. Also, it is not an exaggeration to say that the embodiments of iterative learning control in actual physical systems blow any RL demo out of the water. Here are some insane youtubes. Hopefully you’ll come back and read the rest of this post after you get stuck in a rabbit hole of watching mind boggling quadrotor acrobatics.

Iterative learning control and RL merely differ insofar as what information they provide to the control design engineer. In RL, the problems are constructed to hide as much information about the dynamical system as possible. Even though RL practice uses physics simulators that are generated from well-specified differential equations, we have to tie our hands behind our back pretending like we don’t know basic mechanics and that we don’t understand the desired goals of our control system. As a result, RL schemes require millions of training examples to achieve reasonable performance. ILC on the other hand typically never requires more than a few dozen iterations to exceed human performance. But ILC typically requires reasonable models about the underlying system dynamics, and often assumes fairly well specified dynamics. Is there a middle ground here where we can specify a coarse model but still learn on actual physical systems in a short amount of time?

Understanding this tradeoff between modeling and number of required iterations is a fascinating practical and theoretical challenge. What new insights can be gleaned from comparing and contrasting classical control approaches with reinforcement learning techniques? How well do we need to understand a system in order to control it? In the next few posts, I’ll describe a variety of different approaches to the RL game and how different techniques choose to optimize inside of these rules. We’ll dive in with my least favorite of the bunch Policy Gradient.