arg min blog: Musings on systems, information, learning, and optimization.
http://benjamin-recht.github.io/
An Outsider's Tour of Reinforcement Learning
<p>Part 11/x. I’ll keep logging parts here as they come. I am hoping that x ends up being less than 100…</p>
<h2 id="table-of-contents">Table of Contents.</h2>
<ol>
<li><a href="http://www.argmin.net/2018/01/29/taxonomy/">Make It Happen.</a> Reinforcement Learning as prescriptive analytics.</li>
<li><a href="http://www.argmin.net/2018/02/01/control-tour/">Total Control.</a> Reinforcement Learning as Optimal Control.</li>
<li><a href="http://www.argmin.net/2018/02/05/linearization/">The Linearization Principle.</a> If a machine learning algorithm does crazy things when restricted to linear models, it’s going to do crazy things on complex nonlinear models too.</li>
<li><a href="http://www.argmin.net/2018/02/08/lqr/">The Linear Quadratic Regulator.</a> A quick intro to LQR and why it is a great baseline for benchmarking Reinforcement Learning.</li>
<li><a href="http://www.argmin.net/2018/02/14/rl-game/">A Game of Chance to You to Him Is One of Real Skill.</a> Laying out the rules of the RL Game and comparing to Iterative Learning Control.</li>
<li><a href="http://www.argmin.net/2018/02/20/reinforce/">The Policy of Truth.</a> Policy Gradient is a Gradient Free Optimization Method.</li>
<li><a href="http://www.argmin.net/2018/02/26/nominal/">A Model, You Know What I Mean?</a> Nominal control and the power of models.</li>
<li><a href="http://www.argmin.net/2018/03/13/pg-saga/">Updates on Policy Gradients.</a> Can we fix policy gradient with algorithmic enhancements?</li>
<li><a href="http://www.argmin.net/2018/03/20/mujocoloco/">Clues for Which I Search and Choose.</a> Simple methods solve apparently complex RL benchmarks.</li>
<li><a href="http://www.argmin.net/2018/04/19/pid/">The Best Things in Life Are Model Free.</a> PID control and its connection to optimization methods popular in machine learning.</li>
<li><a href="http://www.argmin.net/2018/04/23/ilc/">Catching Signals That Sound in the Dark.</a> PID for iterative learning control.</li>
</ol>
<p><strong>Bonus Post:</strong> <a href="http://www.argmin.net/2018/03/26/performance-profiles">Benchmarking Machine Learning with Performance Profiles</a>. The Five Percent Nation of Atari Champions.</p>
Tue, 24 Apr 2018 00:00:01 +0000
http://benjamin-recht.github.io/2018/04/24/outsider-rl/
http://benjamin-recht.github.io/2018/04/24/outsider-rl/
Catching Signals That Sound in the Dark
<p>The essence of reinforcement learning is using past data to enhance the future manipulation of a system that dynamically evolves over time. The most common practice of reinforcement learning follows the <em>episodic</em> model: a set of actions is proposed and tested on a system, a series of rewards and states is observed, and the previous actions, rewards, and states are combined to improve the action policy. This is a rich and complex model for interacting with a system, and it brings considerably more complexity than standard stochastic optimization settings. What’s the right way to use all of the data that’s collected in order to improve future performance?</p>
<p>Methods like policy gradient, random search, nominal control, and Q-learning each transform the reinforcement learning problem into a specific oracle model and then derive their analyses using this model. In policy gradient and random search, we transform the problem into a zeroth-order optimization problem and use this formulation to improve the cost. Nominal control turns the problem into a model estimation problem. But are any of these methods more or less efficient than each other in terms of extracting the most information per sample?</p>
<p>In this post, I’m going to describe an iterative learning control (ILC) scheme that uses past data in an interesting way. And its roots go back to the simple PID controller we discussed in the last post.</p>
<h2 id="pid-control-for-iterative-learning-control">PID control for iterative learning control</h2>
<p>Consider the problem of getting a dynamical system to track a fixed time series. That is, we’d like to construct some control input $\mathbf{u} = [u_1,\ldots,u_N]$ so that the output of the system is as close to $\mathbf{v} = [v_1,\ldots,v_N]$ as possible (I’ll use bold letters to describe sequences). Here’s an approach that looks a lot like reinforcement learning: let’s feed back the error in our tracker to build the next control. We can define the error signal to be the difference $\mathbf{e} = [v_1-y_1, \ldots,v_N-y_N]$. Then let’s denote the discrete integral (cumulative sum) of $\mathbf{e}$ as $\mathcal{S} \mathbf{e}$. And let’s denote the discrete derivative as $\mathcal{D}\mathbf{e}$. Then we can define a PID controller over trajectories as</p>
<script type="math/tex; mode=display">\mathbf{u}_{\mathrm{new}} =
\mathbf{u}_{\mathrm{old}} + k_P \mathbf{e} + k_I \mathcal{S} \mathbf{e} + k_D \mathcal{D} \mathbf{e}\,.</script>
<p>Note that these derivatives and integrals are computed on the error sequence $\mathbf{e}$ from the current repetition, not as functions of older iterations. In this sense, this particular scheme for ILC is different from classical PID, but it is building upon the same primitives.</p>
<p>This scheme is what most controls researchers think of when they hear the term “iterative learning control.” I like to take a more encompassing view of ILC, <a href="http://www.argmin.net/2018/02/14/rl-game">as I described in a previous post</a>: ILC is any control design scheme where a controller is improved by repeating a task multiple times, and using previous repetitions to improve control performance. In that sense, ILC and episodic reinforcement learning are two different terms for the same problem. But the most classical example of this scheme in controls is the PID-type method I described above.</p>
<p>Note that this is using a ton of information about the previous trajectory to shape the next trajectory. Even though I am designing an open loop policy, I am using far more than reward information alone in constructing the policy.</p>
<p>How well does this work? Let’s use the simple quadrotor model we’ve been using, this time with some added friction to make it a bit more realistic. So the true dynamics will be two independent systems of the form</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
x_{t+1} &= Ax_t + Bu_t\\
y_t &= Cx_t
\end{aligned} %]]></script>
<p>with</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{bmatrix}
1 & 1 \\ 0 & 0.9
\end{bmatrix}\,,~~ B=\begin{bmatrix} 0\\1\end{bmatrix}\,,~~\mbox{and}~~C=\begin{bmatrix} 1 & 0 \end{bmatrix} %]]></script>
<p>Let’s get this system to track a trajectory <em>without using the model</em>. That is, let’s use iterative learning control to learn to track some curve in space without ever knowing what the true model of the system is. To get a target trajectory, I made the following path with my mouse:</p>
<p class="center"><img src="/assets/rl/ilc/target.png" alt="target trajectory" width="240px" /></p>
<p>For ILC, let’s use the PID controller setup above. I’m actually only going to use the derivative term, setting $k_D = 0.1$ and the rest of the gains to $0$. Then I get the following performance for the first 8 iterations.</p>
<p class="center"><img src="/assets/rl/ilc/8_iter.png" alt="8 iterations" width="560" /></p>
<p>And this is what the trajectory looks like after 20 repetitions:</p>
<p class="center"><img src="/assets/rl/ilc/20_iter.png" alt="20 iterations" width="240px" /></p>
<p>Not bad! This converges really quickly: by using all of the observed output information, it finds a control policy in very few iterations without ever positing a model. Again, the update is the “D”-control update above, and it never uses any knowledge of the true dynamics that govern the system. Amazingly, there is no need for 100K episodes to get this completely model-free method to converge to a quality solution. For the curious, <a href="https://nbviewer.jupyter.org/url/argmin.net/code/ILC_tracker.ipynb">here’s the code to generate these plots in a python notebook</a>.</p>
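<p>For readers who want to poke at this without opening the notebook, here is a minimal sketch of the D-only ILC loop on the model above. The target curve and the forward-shifted difference (my way of accounting for the two-step delay between an input and its effect on the output) are illustrative choices of mine; the linked notebook is the authoritative implementation.</p>

```python
import numpy as np

# True dynamics: only the simulator sees these. The ILC update never does.
A = np.array([[1.0, 1.0], [0.0, 0.9]])
B = np.array([0.0, 1.0])
C = np.array([1.0, 0.0])

N = 60
v = np.sin(2 * np.pi * np.arange(N) / N)  # stand-in target (the post used a mouse-drawn path)

def rollout(u):
    """Simulate the linear system and return the output sequence y."""
    x = np.zeros(2)
    y = np.zeros(N)
    for t in range(N):
        y[t] = C @ x
        x = A @ x + B * u[t]
    return y

u = np.zeros(N)
k_D = 0.1
for _ in range(20):
    e = v - rollout(u)
    # D-only update: u_t += k_D * (e_{t+2} - e_{t+1}). The forward shift
    # compensates for the delay between u_t and its effect on the output.
    shifted = np.zeros(N)
    shifted[: N - 2] = np.diff(e)[1:]
    u = u + k_D * shifted

print(np.max(np.abs(v - rollout(u))))  # tracking error shrinks substantially
```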
<h2 id="stochastic-approximation-in-sheeps-clothing">Stochastic approximation in sheep’s clothing</h2>
<p>Why does this work? Because everything is linear, we can actually analyze the ILC scheme in a simple way. Since the dynamics are linear, there is some matrix $\mathcal{F}$ that maps the input sequence to the output sequence: $\mathbf{y} = \mathcal{F}\mathbf{u}$. That’s what “linear” dynamics means, right?</p>
<p>Also, note that both $\mathcal{S}$ and $\mathcal{D}$ are linear maps, so we can think of them as matrices as well. So suppose we knew in advance the optimal control input $\mathbf{u}_\star$ such that $\mathbf{v}=\mathcal{F} \mathbf{u}_\star$. Then, with a little bit of algebra, we can rewrite the PID iteration as</p>
<script type="math/tex; mode=display">\mathbf{u}_{\mathrm{new}} -\mathbf{u}_\star= \left\{I -(k_P I + k_I \mathcal{S} + k_D \mathcal{D}) \mathcal{F}\right\} (\mathbf{u}_{\mathrm{old}} -\mathbf{u}_\star)\,.</script>
<p>If the matrix in curly brackets has all of its eigenvalues with magnitude less than $1$, then this iteration converges linearly to the optimal control input. Indeed, with the choice of parameters I used in my examples, I actually made the update map into a contraction mapping, and this explains why the performance looks so good after 8 iterations.</p>
<p>This is a cute instance of <em>stochastic approximation</em> that does not arise from following the gradient of any cost function: we are trying to find a solution of the equation $\mathbf{v} = \mathcal{F} \mathbf{u}$, and our iterative algorithm for doing so uses the classic <a href="https://en.wikipedia.org/wiki/Stochastic_approximation">Robbins-Monro method</a>. But it has a very different flavor from what we typically encounter with stochastic gradients. For the experts out there, the matrix $(k_P I + k_I \mathcal{S} + k_D \mathcal{D})\mathcal{F}$ multiplying the error is lower triangular, and hence is never positive definite.</p>
<p>I actually think there are a lot of great questions to answer even for this simple linear case: Which dynamics admit efficient ILC schemes? How robust is this method to noise? Can we use this method to solve problems more complex than trajectory tracking? It also shows that there are lots of ways to use your data in reinforcement learning, and there are far more options out there than might first appear.</p>
Tue, 24 Apr 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/04/24/ilc/
http://benjamin-recht.github.io/2018/04/24/ilc/
The Best Things in Life Are Model Free
<p><em>This is the tenth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 11 is <a href="http://www.argmin.net/2018/04/24/ilc/">here</a>. Part 9 is <a href="http://www.argmin.net/2018/03/20/mujocoloco/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>Though I’ve spent the last few posts casting shade at model-free methods for reinforcement learning, I am not blindly against the model-free paradigm. In fact, the most popular methods in core control systems are model free! The most ubiquitous control scheme out there is PID control, and PID has only three parameters. I’d like to use this post to briefly describe PID control, explain how it is closely connected to many of the most popular methods in machine learning, and then turn to explain what PID brings to the table over the model-free methods that drive contemporary RL research.</p>
<h2 id="pid-in-a-nutshell">PID in a nutshell</h2>
<p>PID stands for “proportional integral derivative” control. The idea behind PID control is pretty simple: suppose you have some dynamical system with a single input that produces a single output. In controls, we call the system we’d like to control <em>the plant</em>, a term that comes from chemical process engineering. Let’s say you’d like the output of your plant to read some constant value $y_t = v$. For instance, you’d like to keep the water temperature in your espresso machine at precisely <a href="http://espressovivace.com/education/espresso-tips/">203 degrees Fahrenheit</a>, but you don’t have a precise differential equation modeling your entire kitchen. PID control works by creating a control signal based on the error $e_t=v-y_t$. As the name implies, the control signal is a combination of error, its derivative, and its integral:</p>
<script type="math/tex; mode=display">u_t = k_P e_t + k_I \int_0^t e_s ds + k_D \dot{e}_t\,.</script>
<p>I’ve heard differing accounts, but somewhere in the neighborhood of <a href="https://pdfs.semanticscholar.org/5d1a/2f4b06bc4e5714be1948099c2cb7b3236d42.pdf#page=177">95 percent</a> of all control systems are PID. And some suggest that the number of people using the “D” term is negligible. Something like 95 percent of the myriad collection of control processes that keep our modern society running are configured by setting <em>two</em> parameters. This includes those <a href="https://home.lamarzoccousa.com/history-of-the-pid/">third wave espresso machines</a> that fuel so much great research.</p>
<p class="center"><img src="/assets/rl/pid/silvia-pid.jpg" alt="get that temp stable" height="240px" />
<img src="/assets/rl/pid/PIDGraph.png" alt="oscillating" height="240px" /></p>
<p>In some sense, PID control is the “gradient descent” of control: it solves most problems and fancier methods are only needed for special cases. The odd thing about statistics and ML research these days is that everyone knows about gradient descent, but almost none of the ML researchers I’ve spoken to know anything about PID control. So perhaps to explain the ubiquity of PID control to the ML crowd, it might be useful to establish some connections to gradient descent.</p>
<h2 id="pid-in-discrete-time">PID in discrete time</h2>
<p>Before we proceed, let’s first make the PID controller digital. We all design our controllers in discrete time rather than continuous time since we do things on computers. How can we discretize the PID controller? First, we can compute the integral term with a running sum:</p>
<script type="math/tex; mode=display">w_{t+1} = w_t + e_t</script>
<p>If $w_0=0$, then $w_t$ is the sum of the errors $e_s$ for $s<t$.</p>
<p>The derivative term can be approximated by finite differences. But since taking derivatives can amplify noise, most practical PID controllers actually filter the derivative term to damp this noise. A simple way to filter the noise is to let the derivative term be a running average:</p>
<script type="math/tex; mode=display">v_{t} = \beta v_{t-1} + (1-\beta)(e_t-e_{t-1})\,.</script>
<p>Putting everything together, a PID controller in discrete time will take the form</p>
<script type="math/tex; mode=display">u_t = k_P e_t + k_I w_t + k_D v_t</script>
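<p>The three discrete-time pieces fit together in a few lines of code. Here is a sketch (the function names and the default filter constant are mine):</p>

```python
def make_pid(k_P, k_I, k_D, beta=0.7):
    """Discrete-time PID: a running sum for the I term and a filtered
    difference for the D term. Returns a function mapping e_t to u_t."""
    state = {"w": 0.0, "v": 0.0, "e_prev": 0.0}

    def step(e):
        # v_t = beta * v_{t-1} + (1 - beta) * (e_t - e_{t-1})
        state["v"] = beta * state["v"] + (1 - beta) * (e - state["e_prev"])
        # u_t = k_P e_t + k_I w_t + k_D v_t
        u = k_P * e + k_I * state["w"] + k_D * state["v"]
        state["w"] += e  # w_{t+1} = w_t + e_t, used at the next step
        state["e_prev"] = e
        return u

    return step
```

Feeding a constant error into this controller shows the expected behavior: the proportional term stays constant while the integral term grows linearly.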
<h2 id="integral-control">Integral Control</h2>
<p>Let’s now look at pure integral control. We can simplify the controller in this case to one simple update formula:</p>
<script type="math/tex; mode=display">u_t = u_{t-1}+k_I e_t\,.</script>
<p>This should look very familiar to all of my ML friends out there as it looks an <em>awful lot</em> like gradient descent. To make the connection crisp, suppose that the plant we’re trying to control takes an input $u$ and then spits out the output $y= f’(u)$ for some fixed function $f$. If we want to drive $y_t$ to zero, then the error signal $e$ takes the form $e = -f’(u)$. With this model of the plant, integral control <em>is</em> gradient descent. Just like gradient descent, integral control can never converge to the wrong answer: if the control signal converges to a constant value, then the error must be zero.</p>
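<p>To see the correspondence concretely, here is a toy instance, assuming the plant outputs $y = f’(u)$ for the quadratic $f(u) = \tfrac{1}{2}(u-3)^2$ (a test function of mine):</p>

```python
# "Plant": outputs y = f'(u) for f(u) = 0.5 * (u - 3)**2
fprime = lambda u: u - 3.0

u, k_I = 0.0, 0.1
for _ in range(200):
    e = -fprime(u)   # error signal for driving y = f'(u) to zero
    u = u + k_I * e  # pure integral control = a gradient descent step
# u converges to 3, the minimizer of f (equivalently, the zero of f')
```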
<h2 id="proportional-integral-control">Proportional Integral Control</h2>
<p>As discussed above, PI control is the most ubiquitous form of control. For optimization, it is less common, but it still yields a valid algorithm when $e = -f’(u)$.</p>
<p>After a variable substitution, the PI controller takes the form</p>
<script type="math/tex; mode=display">u_{t+1} = u_t + (k_I-k_P) e_t + k_P e_{t+1}</script>
<p>If $e_t = -f’(u_t)$, then we get the algorithm:</p>
<script type="math/tex; mode=display">u_{t+1} + k_P f'(u_{t+1}) = u_t - (k_I-k_P) f'(u_t)</script>
<p>This looks a bit tricky, as somehow we need the gradient of $f$ at the next iterate $u_{t+1}$, which we haven’t computed yet. However, optimization friends out there will note that this equation is the optimality condition for the update</p>
<script type="math/tex; mode=display">u_{t+1} = \mathrm{prox}_{k_P f} ( u_t - (k_I-k_P) f'(u_t) )\,.</script>
<p>Hence, PI control combines a gradient step with a proximal step. The algorithm is a hybrid between the classical proximal point method and gradient descent. Note that if this method converges, it will again converge to a point where $f’(u)=0$.</p>
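<p>Here is a small numerical check of the prox form, assuming the quadratic $f(u) = \tfrac{c}{2}(u-3)^2$, for which the proximal map has a closed form (the gains below are arbitrary illustrative values):</p>

```python
c = 2.0
fprime = lambda u: c * (u - 3.0)  # gradient of f(u) = 0.5 * c * (u - 3)**2

def prox(z, s):
    # prox_{s f}(z) = argmin_u  s * f(u) + 0.5 * (u - z)**2, in closed form
    return (z + s * c * 3.0) / (1.0 + s * c)

k_P, k_I = 0.05, 0.2
u = 0.0
for _ in range(100):
    # PI control as a hybrid proximal-point / gradient step
    u = prox(u - (k_I - k_P) * fprime(u), k_P)
# at the fixed point, f'(u) = 0, i.e. u = 3
```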
<h2 id="proportional-integral-derivative-control">Proportional Integral Derivative Control</h2>
<p>The master algorithm is PID. What happens here? Allow me to do a clever change of variables that <a href="http://www.laurentlessard.com/">Laurent Lessard</a> showed to me. Define the auxiliary variable</p>
<script type="math/tex; mode=display">x_t = \frac{1}{1-\beta}w_t+\frac{\beta}{(1-\beta)^3}v_t-\frac{\beta}{(1-\beta)^2}e_t\,.</script>
<p>In terms of this new hidden state, $x_t$, the PID controller reduces to the tidy set of equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
x_{t+1} &= (1+\beta)x_t -\beta x_{t-1} + e_t\\
u_t &= C_1 x_t + C_2 x_{t-1}+ C_3 e_t\,,
\end{aligned} %]]></script>
<p>and the coefficients $C_i$ are given by the formulae:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
C_1 &= -(1-\beta)^2 k_D+k_I\\
C_2 &= (1-\beta)^2 k_D-\beta k_I\\
C_3 &= k_P + (1-\beta) k_D
\end{aligned} %]]></script>
<p>The $x_{t}$ sequence looks like a <em>momentum</em> sequence used in optimization. Indeed, with proper settings of the gains, we can recover a variety of algorithms that we commonly use in machine learning. <em>Gradient descent with momentum</em> with learning rate $\alpha$ and momentum parameter $\beta$, also known as the <em>Heavy Ball method</em>, is realized with the settings</p>
<script type="math/tex; mode=display">k_I = \frac{\alpha}{1-\beta}\,, ~~~~
k_D=\frac{\alpha \beta}{(1-\beta)^3} \,, ~~~~
k_P = \frac{-\alpha \beta}{(1-\beta)^2}</script>
<p>Nesterov’s accelerated method pops out when we set</p>
<script type="math/tex; mode=display">k_I = \frac{\alpha}{1-\beta} ~~~~ k_D=\frac{\alpha \beta^2}{(1-\beta)^3}~~~~ k_P = \frac{-\alpha \beta^2}{(1-\beta)^2}</script>
<p>These are remarkably similar, differing only in the power of $\beta$ in the numerator of the proportional and derivative terms.</p>
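<p>These correspondences can be checked numerically. For the Heavy Ball gains, the coefficients work out to $C_1 = \alpha$ and $C_2 = C_3 = 0$, and the hidden-state recursion reproduces the Heavy Ball iterates exactly. A quick sketch, assuming the test quadratic $f(u) = (u-1)^2$ (my choice for illustration):</p>

```python
import numpy as np

alpha, beta = 0.1, 0.5
# Heavy Ball gain settings from above
k_I = alpha / (1 - beta)
k_D = alpha * beta / (1 - beta) ** 3
k_P = -alpha * beta / (1 - beta) ** 2

C1 = -(1 - beta) ** 2 * k_D + k_I        # works out to alpha
C2 = (1 - beta) ** 2 * k_D - beta * k_I  # works out to 0
C3 = k_P + (1 - beta) * k_D              # works out to 0

fprime = lambda u: 2.0 * (u - 1.0)  # gradient of f(u) = (u - 1)**2

# PID hidden-state recursion (C3 = 0 here, so u_t depends only on x_t, x_{t-1})
x_prev, x, u_pid = 0.0, 0.0, []
for _ in range(50):
    u = C1 * x + C2 * x_prev
    x_prev, x = x, (1 + beta) * x - beta * x_prev - fprime(u)  # e_t = -f'(u_t)
    u_pid.append(u)

# direct Heavy Ball: u_{t+1} = u_t - alpha * f'(u_t) + beta * (u_t - u_{t-1})
u_prev, u, u_hb = 0.0, 0.0, []
for _ in range(50):
    u_hb.append(u)
    u_prev, u = u, u - alpha * fprime(u) + beta * (u - u_prev)
# the two iterate sequences coincide, and both converge to the minimizer u = 1
```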
<h2 id="the-lure-problem">The Lur’e Problem</h2>
<p>Laurent blew my mind when he showed me the connection between PID control and optimization algorithms. How crazy is it that most of the popular algorithms in ML end up being special cases of PID control? And I imagine that if we went out and did surveys of industrial machine learning, we’d find that 95% of the machine learning models in production were trained using some sort of gradient descent. Hence, there’s yet another feather in the cap for PID.</p>
<p>It turns out that the problem of feedback with a static, nonlinear map has a long history in controls, and this problem even has a special name: <a href="https://en.wikipedia.org/wiki/Nonlinear_control#Nonlinear_feedback_analysis_%E2%80%93_The_Lur'e_problem">the Lur’e problem</a>. Finding a controller to push a static nonlinear system to a fixed point turns out to be identical to designing an optimization algorithm to set a gradient to zero.</p>
<p class="center"><img src="/assets/rl/pid/lureloop.png" alt="parallels between optimization and control" width="560px" /></p>
<p>Laurent Lessard, Andy Packard, and I made these connections in <a href="https://arxiv.org/abs/1408.3595">our paper</a>, showing that many of the rates of convergence for optimization algorithms could be derived using stability techniques from controls. We also used this approach to show that the Heavy Ball method might not always converge at an accelerated rate, justifying why we need the slightly more complicated Nesterov accelerated method for reliable performance. Indeed, we found settings where the Heavy Ball method for quadratics converged linearly, but on general convex functions didn’t converge at all. Even though these methods barely differ from each other in terms of how you set the parameters, this subtle change is the difference between convergence and oscillation!</p>
<p class="center"><img src="/assets/rl/pid/hbcycle.png" alt="Heavy Ball isn’t stable" width="560px" /></p>
<p>With Robert Nishihara and Mike Jordan, we followed up this work showing that you could even use this to <a href="https://arxiv.org/abs/1502.02009">study ADMM using the connections between prox-methods and proportional integral control</a>. Bin Hu, Pete Seiler, and Anders Rantzer <a href="https://arxiv.org/abs/1706.08141">generalized this technique to better understand stochastic optimization methods</a>. And Laurent and Bin <a href="https://arxiv.org/abs/1703.01670">made the formal connections to PID control</a> that I discuss in this post.</p>
<h2 id="learning-to-learn">Learning to learn</h2>
<p>With the connection to PID control in mind, we can think of learning rate tuning as controller tuning. The Ziegler-Nichols rules (developed in the forties) simply find the largest gain $k_P$ such that the system oscillates, and set the PID parameters based on this gain and the frequency of the oscillations. A common trick for gradient descent tuning is similar: find the largest learning rate such that gradient descent does not diverge, and then set the momentum and learning rate accordingly from this starting point.</p>
<p>Similarly, we can think of the “learning to learn” paradigm in machine learning as a special case of controller design. Though PID works for most applications, it’s possible that a more complex controller will work for a particular application. In the same vein, it’s always possible that there’s something better than Nesterov’s method if you restrict your set of instances. And maybe you can even find this controller by gradient descent. But it’s always good to remember, 95% is still PID.</p>
<p>I make these connections for the following reason: both in the case of gradient descent and PID control, we can only prove reasonable behavior in rather constrained settings: in PID we understand how to analyze certain nonlinear control systems, but not all of them. In optimization, we understand the behavior on convex functions and problems that are “nearly” convex. Obviously, we can’t hope to have simple methods to stabilize <em>all</em> possible plants/functions (or else we’re violating some serious conjectures in complexity theory), but we can show that our methods work on simple cases, and performance degrades gracefully as we add complexity to the problem.</p>
<p>Moreover, the simple cases give us a body of techniques for general design: by developing theory on specific cases, we can develop intuition and probe fundamental limits. I think the same thing needs to be established for general reinforcement learning, and it’s why I’ve been spending so much time on LQR and nearby generalizations.</p>
<p>Let’s take this perspective for PID. Though PID is a powerful workhorse, it is typically thought to be useful only for simple low-level control loops that maintain some static equilibrium, and not particularly useful for more complex tasks like robotic acrobatics. However, <a href="http://www.argmin.net/2018/04/24/ilc/">in the next post, I will describe a more complex control task that can also be solved by PID-type techniques.</a></p>
Thu, 19 Apr 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/04/19/pid/
http://benjamin-recht.github.io/2018/04/19/pid/The Ethics of Reward Shaping<p>I read three great articles over the weekend by <a href="http://twitter.com/noUpside">Renee DiResta</a>, <a href="http://www.columbia.edu/~chw2/">Chris Wiggins</a>, and <a href="https://twitter.com/janellecshane">Janelle Shane</a> that touched on a topic that’s been troubling me: In machine learning, we take our cost functions for granted, amplifying feedback loops with horrible unintended consequences.</p>
<p>First, Renee DiResta makes a great case <a href="https://www.wired.com/story/creating-ethical-recommendation-engines/">for a complete reinvention of how we design and deploy recommendation engines</a>. Recommender systems always seemed like an innocuous and low-stakes ML application. What harm could come from a music service suggesting artists you might like beyond the Beatles, or from improving the suggestions on a streaming service like Netflix? They might improve the user experience a little bit, but probably would never amount to much. This assessment couldn’t have been more wrong. As Zeynep Tufekci summarizes, recommendation systems have become the internet’s <a href="https://www.nytimes.com/2018/03/10/opinion/sunday/youtube-politics-radical.html">“Great Radicalizer”</a>, focusing minds on ever more extreme content to keep them hooked on websites.</p>
<p>DiResta argues that we have to change the cost function we optimize to bring recommender systems in line with ethical guidelines. Optimizing time spent is clearly the wrong objective. I know that engineers are not deliberately trying to incite rage and panic in their user base, but the signals they use to evaluate user happiness are completely broken. “Time on the website” is not the right performance indicator. But what exactly is the right way to quantify “user happiness?” This is super hard to make into a cost function for an optimization problem, as Chris Wiggins lays out in his <a href="http://datascience.columbia.edu/ethical-principles-okrs-and-kpis-what-youtube-and-facebook-could-learn-tukey">thoughtful blog post</a>. Wiggins argues that we can never construct the correct cost function, but we can iteratively design the cost to match ethical concerns. Wiggins suggests that industrial applications that face humans should consider the same principles as academic researchers working with human subjects, laid out in the famous <a href="https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html">Belmont Report</a>. Once we set these guidelines as gold standards, engineers can treat these standards as design principles for shippable code. We can constantly refine and improve our models to make sure they adhere to these principles.</p>
<h2 id="shaping-rewards-is-hard">Shaping rewards is hard</h2>
<p>I can’t emphasize enough that even in “hard engineering” that doesn’t involve people, designing cost functions is a major challenge and tends to be an art form in engineering. Janelle Shane wrote a <a href="http://aiweirdness.com/post/172894792687/when-algorithms-surprise-us">creative and illuminating blog</a> on how “AI systems” that are designed to optimize cost functions often surprise us with unexpected behavior that we didn’t think to discount. Shane highlights several particularly bizarre examples of systems that fall over rather than walk, or force adversaries into segmentation faults. The underlying issue in all of these problems is that if we define the reward function too loosely and don’t add the correct safety constraints, optimized systems will frequently take surprising and unwanted paths to optimality.</p>
<p>This is indeed a question that underlies my series on reinforcement learning. We saw this phenomenon in the <a href="http://www.argmin.net/2018/03/20/mujocoloco/">post about locomotion in MuJoCo</a>. In the OpenAI Gym, humanoid walking is declared “solved” if the reward value exceeds 6000. This lets you just look at scores (as if you’re a gamer or a day trader on Wall Street) and completely ignore anything you might know about robotics. If the number is high enough, you win. But I showed a bunch of gaits that achieve the target reward, and none of them look like plausible actions that could happen in the physical world. All of them have overfit to unrealistic defects in the simulation engine.</p>
<p>It’s also rather unclear what the right reward function is for walking. There are so many things that we value in a walking robot. But these values are modeling assumptions and are often not correct in retrospect. In order to get any optimization-based framework to output realistic locomotion, cost functions have to be defined iteratively until the behavior matches as many of our expectations as possible.</p>
<h2 id="ml-systems-are-now-rl-systems">ML systems are now RL systems</h2>
<p>Though it’s not obvious, Shane’s surprising optimizers are closely connected to the bad behavior of recommender systems highlighted by DiResta and Wiggins. <strong>As soon as a machine learning system is unleashed in feedback with humans, that system is a reinforcement learning system, not a machine learning system.</strong></p>
<p>This poses a major challenge to the ML community, and it’s why I’ve shifted my academic focus so strongly to RL. Supervised learning tells us essentially nothing about how to deal with changing distributions, gaming, adversarial behavior, and unexpected amplification. We’re at the point now where all machine learning is reinforcement learning, and yet we don’t understand reinforcement learning at all! This is a huge issue that we all have to tackle if we want our learning systems to be trustable, predictable, and safe.</p>
<h2 id="reward-shaping-is-not-a-dirty-word">Reward shaping is not a dirty word</h2>
<p>Cost function design is a major challenge throughout engineering. And it’s a major challenge when establishing laws and policy as well. Across a variety of disciplines, performance indicators must be refined iteratively until the behavior matches our desiderata.</p>
<p>And ethical standards can be part of these desiderata. James Grimmelmann put it well: <a href="https://www.washingtonpost.com/news/the-switch/wp/2018/04/11/ai-will-solve-facebooks-most-vexing-problems-mark-zuckerberg-says-just-dont-ask-when-or-how/">“Kicking the question over to AI just means hiding value judgments behind the AI.”</a> ML engineers have to accept that their engineering has moral and ethical outcomes, and hence they must design with these outcomes in mind. Algorithms can be tuned to match our societal values, and it’s time for our community to achieve a consensus on how.</p>
Mon, 16 Apr 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/04/16/ethical-rewards/
http://benjamin-recht.github.io/2018/04/16/ethical-rewards/
Benchmarking Machine Learning with Performance Profiles
<p>A common sticking point in contemporary reinforcement learning is how to evaluate performance on benchmarks. For a general purpose method, we’d like to demonstrate aptitude on a wide selection of test problems with minimal special-case tuning. A great example of such a suite of test problems is the Arcade Learning Environment (ALE) of Atari benchmarks. How can we tell when an algorithm is “state-of-the-art” on Atari? Clearly, we can’t just excel on one game. There are 60 games, and even careful comparisons end in impenetrable tables with 60 rows and multiple columns. Moreover, the performance is a random variable as the methods are evaluated over many random seeds, so there are inherent uncertainties in the reported numbers. How can we summarize performance over such a large number of noisy benchmarks?</p>
<h2 id="performance-profiles">Performance Profiles</h2>
<p>My favorite way to aggregate benchmarks was proposed by <a href="https://arxiv.org/abs/cs/0102001">Dolan and Moré</a> and is called <em>performance profiles</em>. The idea here is very simple: we want a way of depicting how frequently a particular method is within some distance of the best method on a problem instance. To do so, we compute some simple statistics. Let’s suppose we have a suite of $n_p$ problem instances and we want to find the best performing method across all of these instances.</p>
<p>For each problem instance, we compute the best method, and then for every other method, we determine how far it is from optimal. This requires some notion of “far from optimality.” Let’s denote by $d[m,p]$ the distance from optimality of method m on problem p.
We then count on how many problem instances a particular method is within a factor of $\tau$ of the optimal. That is, we compute</p>
<script type="math/tex; mode=display">% <![CDATA[
\rho_m(\tau) = \frac{1}{n_p} \left| \{p~:~d[m,p] < \tau \}\right|\,. %]]></script>
<p>That is, we compute the fraction of problems where method m has distance from optimality less than $\tau$.</p>
<p>A performance profile plots $\rho_m(\tau)$ for each method m. Performance profiles provide a visually striking way to immediately eyeball differences in performance between a set of candidate methods over a suite of benchmarks. They let you easily read off the percentage of times a method is within some set range of optimal across the suite of benchmarks. Moreover, they have several nice properties: performance profiles are robust to outlier problems. They are also robust to small changes in performance across all problems. Performance profiles allow a holistic view of performance without having to single out the idiosyncrasies of particular instances.</p>
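<p>To make this concrete, here is a minimal sketch of how one might compute $\rho_m(\tau)$ from a table of distances (the function and variable names are mine, for illustration only):</p>

```python
import numpy as np

def performance_profile(d, taus):
    """rho_m(tau) for each method: the fraction of problems on which
    method m has distance from optimality strictly less than tau.

    d    : (n_methods, n_problems) array, d[m, p] = distance from optimality
    taus : 1-D array of thresholds
    """
    return np.array([[np.mean(d[m] < tau) for tau in taus]
                     for m in range(d.shape[0])])

# toy example: two methods on four problem instances
d = np.array([[0.0, 0.1, 2.0, 0.0],
              [0.5, 0.0, 0.0, 3.0]])
taus = np.array([0.25, 1.0, 4.0])
profile = performance_profile(d, taus)  # one row per method
```

<p>Plotting each row of <code>profile</code> against <code>taus</code> gives the profile curves.</p>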
<p>The canonical application for performance profiles is comparing solve times of different optimization methods. In this case, distance from optimality is the ratio of the time a solver takes to the time taken by the fastest solver on a particular instance. The original Dolan and Moré paper has several examples showing that performance profiles cleanly delineate aggregate differences in run times for different solvers. They are now a widely adopted convention for comparing optimization methods. As we will now see, performance profiles also provide a straightforward way to compare relative rewards in reinforcement learning problems.</p>
<h2 id="is-deep-rl-better-than-handcrafted-representations-on-atari">Is Deep RL better than handcrafted representations on Atari?</h2>
<p>Let’s apply performance profiles to understand the power of deep reinforcement learning on Atari games. One of my favorite deep reinforcement learning papers is <a href="https://arxiv.org/abs/1709.06009">“Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents”</a> by Machado et al. which proposes several guidelines for conducting careful evaluations of methods on the ALE benchmark suite. When put on the same footing under their evaluation framework, DQN doesn’t look to be that much better than SARSA (a simple method for Q-learning with function approximation) and hand crafted features.</p>
<p>Nonetheless, the authors concede that “Despite this high sample complexity, DQN and DQN-like approaches remain the best performing methods overall when compared to simple, hand-coded representations.” But it’s hard to tell how much better DQN is. The evaluations are stochastic, and since DQN is costly, they only evaluate its performance on 5 random seeds and report the mean and standard deviation.</p>
<p>I downloaded the source of the Machado paper and parsed the results tables into a CSV file<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. This table lists the mean reward and standard deviation for each game evaluated. Not only are the rewards here random variables, but directly comparing the means is difficult because the rewards are all on completely different scales.</p>
<p>To attempt to address both the stochasticity and the varied scaling of the rewards, I decided to use p-values from the Welch t-test. That is, $d[m,p]$ is the negative log probability that method $m$ has a higher score than the best method on problem $p$ under the assumptions of the Welch t-test. For the best performing method, I assign $d[m,p]=0$.</p>
<p>Now, this is a <em>very</em> imperfect measure. T-tests assume Gaussian distributions, and that assumption is clearly not legitimate here. But it’s not a terrible comparison when we are only provided means and variances. And, frankly, the community might want to consider releasing more finely detailed reports of their experiments if they would like better evaluation of the relative merits of methods. For example, if researchers simply released the raw scores for all runs, we could try more sophisticated nonparametric rank tests.</p>
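<p>For the curious, here is a sketch of this distance computation from reported means and standard deviations (the function name and the default of 5 seeds are my assumptions; the Welch statistic is computed by hand to keep dependencies minimal):</p>

```python
import numpy as np
from scipy import stats

def neg_log_pvalue(mean_m, std_m, mean_best, std_best, n=5):
    """d[m, p]: negative log of the one-sided Welch t-test probability
    that method m scores at least as high as the best method, given
    means and standard deviations computed over n random seeds."""
    if mean_m >= mean_best:
        return 0.0  # the best method on this problem gets distance zero
    se2_m, se2_b = std_m**2 / n, std_best**2 / n
    t = (mean_best - mean_m) / np.sqrt(se2_m + se2_b)
    # Welch-Satterthwaite degrees of freedom
    df = (se2_m + se2_b)**2 / (se2_m**2 / (n - 1) + se2_b**2 / (n - 1))
    p = stats.t.sf(t, df)  # one-sided p-value
    return -np.log(p)
```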
<p>Let’s leave the imperfections aside for a moment and plot a performance profile based on these likelihoods. I computed a standard performance profile for the ALE benchmark suite, plotting the fraction of the time that the p-values are greater than some threshold $\tau$. The results are here:</p>
<p class="center"><img src="/assets/rl/perfprof/perf_prof.png" alt="you are all crazy, shallow learning is as good as deep learning for atari" width="480px" /></p>
<p>For any $x$ value, the $y$-value is the fraction of instances where a method either has the highest mean or where we cannot reject the null hypothesis that the method has the highest mean with confidence $\tau$. You might look at this plot and think “that’s completely unreadable as the curves are on top of each other.” But when performance profiles intersect each other multiple times, it means the algorithms are effectively equivalent to each other: there is no value of $\tau$ where DQN or Blob-PROST more frequently scores higher than the other. To see an example of curves where things are way off, consider Blob-PROST with 200M simulations vs DQN with 10M simulations:</p>
<p class="center"><img src="/assets/rl/perfprof/perf_prof2.png" alt="these two algorithms are not the same" width="480px" /></p>
<p>Now there is a clear separation in the performance profiles, and it’s clear that Blob-PROST 200M is much better than DQN 10M. This shouldn’t be surprising, as I’m letting Blob-PROST see 20x as many samples. But it does suggest that DQN and Blob-PROST, when given the same sample allocation, are essentially indistinguishable methods. My takeaway from this plot is that Machado et al. concede too much in their discussion: <strong>simple methods and hand crafted features match the performance of DQN on the ALE.</strong></p>
<h2 id="to-establish-dominance-provide-more-evidence">To establish dominance, provide more evidence.</h2>
<p><a href="https://twitter.com/Miles_Brundage/status/977512294824341504">Miles Brundage</a> suggests that there are far better baselines now (from the DeepMind folks). I’d like to make the modest suggestion that someone at DeepMind adopt the Machado et al. evaluation protocol for these new, more sophisticated methods, and then report means and standard deviations on all of the games. Even better, why not report the actual values over the runs so we could use non-parametric test statistics? Or even better, why not release the code? I’d be happy to make a performance profile again so we can see how much we’re improving.</p>
<p>If you are interested in changing the performance metric or running performance profiles on your own data, here’s a <a href="https://nbviewer.jupyter.org/url/argmin.net/code/atari_performance_profiles.ipynb">Jupyter notebook</a> that lets you recreate the above plots.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>There was no data for Blob-PROST on Journey Escape with 200M samples, so I used the values listed for 100M samples. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Mon, 26 Mar 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/03/26/performance-profiles/
http://benjamin-recht.github.io/2018/03/26/performance-profiles/Clues for Which I Search and Choose<p><em>This is the ninth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 10 is <a href="http://www.argmin.net/2018/04/19/pid/">here</a>. Part 8 is <a href="http://www.argmin.net/2018/03/13/pg-saga/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>Before we leave these model-free chronicles behind, let me turn to the converse of the Linearization Principle. We have seen that random search works well on simple linear problems and appears better than some RL methods like policy gradient. Does random search break down as we move to harder problems? <strong>Spoiler Alert: No.</strong> But keep reading!</p>
<p>Let’s apply random search to problems that are of interest to the RL community. The deep RL community has been spending a lot of time and energy on a suite of benchmarks, maintained by <a href="https://gym.openai.com/envs/#mujoco">OpenAI</a> and based on the <a href="http://www.mujoco.org/">MuJoCo</a> simulator. Here, the optimal control problem is to get the simulation of a legged robot to walk as far and quickly as possible in one direction. Some of the tasks are very simple, but some are quite difficult, like the complicated humanoid models with 22 degrees of freedom. The dynamics of legged robots are well-specified by Hamiltonian equations, but planning locomotion from these models is challenging because it is not clear how to best design the objective function and because the model is piecewise linear. The model changes whenever part of the robot comes into contact with a solid object, and hence a normal force is introduced that was not previously acting upon the robot. Getting robots to work without having to deal with complicated nonconvex nonlinear models thus seems like a solid and interesting challenge for the RL paradigm.</p>
<p>Recently, <a href="https://arxiv.org/abs/1703.03864">Salimans and his collaborators at OpenAI</a> showed that random search worked quite well on these benchmarks. In particular, they fit neural network controllers using random search with a few algorithmic enhancements (they call their version of random search “Evolution Strategies,” but I’m sticking with my naming convention). In another piece of great work, <a href="https://arxiv.org/abs/1703.02660">Rajeswaran et al.</a> showed that Natural Policy Gradient could learn <em>linear</em> policies that could complete these benchmarks. That is, they showed that static linear state feedback, like the kind we use in LQR, was also sufficient to control these complex robotic simulators. This of course left an open question: can simple random search find linear controllers for these MuJoCo tasks?</p>
<p>My students Aurelia Guy and Horia Mania tested this out, coding up a rather simple version of random search (the one from lqrpols.py in my previous posts). Surprisingly (or not surprisingly), this simple algorithm learns linear policies for the Swimmer-v1, Hopper-v1, HalfCheetah-v1, Walker2d-v1, and Ant-v1 tasks that achieve the reward thresholds previously proposed in the literature. Not bad!</p>
<p class="center"><img src="/assets/rl/mujoco/ars_v1.png" alt="random search attempt 1" width="560px" /></p>
<p>But random search alone isn’t perfect. Aurelia and Horia couldn’t get the humanoid model to do anything interesting at all. Having tried a lot of parameter settings, they decided to try to enhance random search to get it to train faster. Horia noticed that a lot of the RL papers were using statistics of the states and whitening the states before passing them into the neural net that defined the mapping from state to action. So he started to keep online estimates of the state statistics and whiten the states before passing them to the linear controller. And voila! With this simple trick, Aurelia and Horia now get state-of-the-art performance on Humanoid. Indeed, they can reach rewards over 11000, which is higher than anything I’ve seen reported. It is indeed almost twice the “success threshold” that was used for benchmarking by Salimans et al. Linear controller. Random search. One simple trick.</p>
<p class="center"><img src="/assets/rl/mujoco/ars_v1_v2.png" alt="random search attempt 2" width="560px" /></p>
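<p>The whitening trick is easy to sketch. Something along the following lines (my own minimal reconstruction using Welford’s online algorithm, not the actual ARS code) keeps running estimates of the state mean and variance and normalizes each state before applying the linear gain:</p>

```python
import numpy as np

class WhitenedLinearPolicy:
    """Static linear policy u = M z, applied to whitened states z."""

    def __init__(self, n_actions, n_states):
        self.M = np.zeros((n_actions, n_states))  # linear gain to be trained
        self.mean = np.zeros(n_states)            # running state mean
        self.sq_diff = np.zeros(n_states)         # running sum of squared deviations
        self.count = 0

    def act(self, x):
        # Welford's online update of the mean and variance estimates
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.sq_diff += delta * (x - self.mean)
        std = np.sqrt(self.sq_diff / max(self.count - 1, 1))
        # whiten, guarding against zero variance, then apply the gain
        return self.M @ ((x - self.mean) / np.where(std > 0, std, 1.0))
```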
<p>What’s nice about having something this simple is that the code is 15x faster than what is reported in the OpenAI Evolution Strategies paper. We can obtain higher rewards <em>with less computation.</em> One can train a high performing humanoid model in under an hour on a standard EC2 instance with 18 cores.</p>
<p>Now, with the online state updating, random search not only exceeds state-of-the-art on Humanoid, but also on Swimmer-v1, Hopper-v1, and HalfCheetah-v1. It isn’t yet as good on Walker2d-v1 and Ant-v1, but we can add one more trick to the mix. We can drop the sampled directions that don’t yield good rewards. This adds a hyperparameter (which fraction of directions to keep), but with this one additional tweak, random search can actually match or exceed the state-of-the-art performance on all of the MuJoCo baselines in the OpenAI Gym. Note here, I am not restricting comparisons to policy gradient. As far as I know from our literature search, these policies are better than any results that apply model-free RL to the problem, whether it be an actor-critic method, a value function estimation method, or something even more esoteric. It does seem like pure random search is better than deep RL and neural nets for these MuJoCo problems.</p>
<p class="center"><img src="/assets/rl/mujoco/ars_v1_v2_v2t.png" alt="random search final attempt" width="560px" /></p>
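<p>The direction-dropping tweak is also simple to sketch. A single update might look like the following (a rough reconstruction under my own hyperparameter names; see the paper and repo for the real algorithm):</p>

```python
import numpy as np

def ars_step(M, rollout, step_size=0.02, n_dirs=8, noise=0.03, top_frac=0.5):
    """One random-search update on a policy matrix M.

    rollout(M) returns the total reward of the policy M. We probe
    n_dirs random directions, keep only the top fraction by reward,
    and take a reward-weighted step normalized by the reward spread."""
    deltas = [np.random.randn(*M.shape) for _ in range(n_dirs)]
    evals = [(rollout(M + noise * d), rollout(M - noise * d), d) for d in deltas]
    # drop the sampled directions that don't yield good rewards
    evals.sort(key=lambda e: max(e[0], e[1]), reverse=True)
    kept = evals[:max(1, int(top_frac * n_dirs))]
    sigma = np.std([r for rp, rm, _ in kept for r in (rp, rm)]) + 1e-8
    step = sum((rp - rm) * d for rp, rm, d in kept) / len(kept)
    return M + step_size / sigma * step
```

<p>On a toy quadratic “reward” this climbs steadily; on MuJoCo, <code>rollout</code> would run the simulator for one episode.</p>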
<p>Random search with a few minor tweaks outperforms all other methods on these MuJoCo tasks and is significantly faster. We have a full paper with these results and more <a href="https://arxiv.org/abs/1803.07055">here</a>. And our code is <a href="https://github.com/modestyachts/ARS">in this repo</a>, though it is certainly easy enough to code up for yourself.</p>
<h2 id="what-can-reinforcement-learning-learn-from-random-search">What can reinforcement learning learn from random search?</h2>
<p>There are a few important takeaways here.</p>
<h4 id="benchmarks-are-hard">Benchmarks are hard.</h4>
<p>I think the only reasonable conclusion from all of this is that these MuJoCo demos are easy. There is nothing wrong with that. But it’s probably not worth deciding NIPS, ICML, <em>or</em> ICLR papers over performance on these benchmarks anymore. This does leave open a very important question: <em>what makes a good benchmark for RL?</em> Obviously, we need more than the Mountain Car. I’d argue that <a href="http://www.argmin.net/2018/02/26/nominal/">LQR with unknown dynamics</a> is a reasonable task to master, as it is easy to specify new instances and easy to understand the limits of achievable performance. But the community should devote more time to understanding how to establish baselines and benchmarks that are not easily gamed.</p>
<h4 id="never-put-too-much-faith-in-your-simulators">Never put too much faith in your simulators.</h4>
<p>Part of the reason why these benchmarks are easy is that MuJoCo is not a perfect simulator. MuJoCo is blazingly fast, and is great for proofs of concept. But in order to be fast, it has to do some smoothing around the contacts (remember, discontinuity at contacts is what makes legged locomotion hard). Hence, just because you can get one of these simulators to walk, doesn’t mean that you can get an actual robot to walk. Indeed, here are four gaits that achieve the magic 6000 threshold. None of these look particularly realistic:</p>
<p class="center"><img src="/assets/rl/mujoco/pegleg.gif" alt="watch me hop" width="250px" />
<img src="/assets/rl/mujoco/ice.gif" alt="triple axel" width="250px" /></p>
<p class="center"><img src="/assets/rl/mujoco/backwards.gif" alt="moon walk" width="250px" />
<img src="/assets/rl/mujoco/cancan.gif" alt="on broadway" width="250px" /></p>
<p>Even the top performing model (reward 11,600) has a very goofy gait that might not work in reality:</p>
<p class="center"><img src="/assets/rl/mujoco/reward_11600.gif" alt="run away" width="250px" /></p>
<h4 id="strive-for-algorithmic-simplicity">Strive for algorithmic simplicity.</h4>
<p>Adding hyperparameters and algorithmic widgets to simple algorithms can always improve their performance on a small enough set of benchmarks. I don’t know if keeping only the top-performing directions or state normalization will work on a new random search problem, but it worked for these MuJoCo benchmarks. Higher rewards might even be achieved by adding more tunable parameters. If you add enough bells and whistles, you can probably convince yourself that any algorithm works for a small enough set of benchmarks.</p>
<h4 id="explore-before-you-exploit">Explore before you exploit.</h4>
<p>Note that since our random search method is fast, we can evaluate its performance on many random seeds. These model-free methods all exhibit alarmingly high variance on these benchmarks. For instance, on the humanoid task, the model is slow to train almost a quarter of the time, even when supplied with what we thought were good parameters. And for those random seeds it finds rather peculiar gaits. It’s often very misleading to restrict one’s attention to 3 random seeds, because you may be tuning your performance to the peculiarities of the random number generator.</p>
<p class="center"><img src="/assets/rl/mujoco/humanoid_100seeds_med.png" alt="such variance" width="560px" /></p>
<p>This sort of behavior arose in LQR as well. We can tune our algorithm for a few random seeds, and then see completely different behavior on new random seeds. <a href="https://arxiv.org/abs/1709.06560">Henderson et al.</a> observed this phenomenon already with Deep RL methods, but I think that such high variability will be a symptom of all model-free methods. There are simply too many edge cases to account for through simulation alone. As I said in <a href="http://www.argmin.net/2018/03/13/pg-saga/">the last post</a>:
“<em>By throwing away models and knowledge, it is never clear if we can learn enough from a few instances and random seeds to generalize.</em>”</p>
<h2 id="i-cant-quit-model-free-rl">I can’t quit model-free RL.</h2>
<p>In a future post, I’ll have one more nit to pick with model-free RL. This is actually a nit I’d like to pick with all of reinforcement learning and iterative learning control: what exactly do we mean by “sample complexity?” What are we learning as a community from this line of research of trying to minimize sample complexity on a small number of benchmarks? And where do we, as a research community, go from here?</p>
<p>Before we get there though, let me take a step back to <a href="http://www.argmin.net/2018/04/19/pid/">assess some variants of model-free RL that work well in both theory and practice</a> and see if these can be extended to the more challenging problems currently of interest to the machine learning community.</p>
Tue, 20 Mar 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/03/20/mujocoloco/
http://benjamin-recht.github.io/2018/03/20/mujocoloco/Updates on Policy Gradients<p><em>This is the eighth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 9 is <a href="http://www.argmin.net/2018/03/20/mujocoloco/">here</a>. Part 7 is <a href="http://www.argmin.net/2018/02/26/nominal/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>I’ve been swamped with a bit of a travel binge and am hopelessly behind on blogging. But I have updates! This should tide us over until next week.</p>
<p>After my last post on nominal control, I received an email from Pavel Christof pointing out that if we switch from stochastic gradient descent to Adam, policy gradient works <em>much</em> better. Indeed, I implemented this myself, and he’s totally right. Let’s revisit the last post with a revised <a href="https://nbviewer.jupyter.org/url/argmin.net/code/lqr_policy_comparisons.ipynb">Jupyter notebook</a>.</p>
<p>First, I coded up Adam in pure python to avoid introducing any deep learning package dependencies (it’s only 4 lines of python, after all). Second, I also fixed a minor bug in my random search code which improperly scaled the search direction. Now if we look again at the median performance, we see this:</p>
<p class="center"><img src="/assets/rl/policies/cost_finite_err_bars_update.png" alt="adam roolz" width="410px" /></p>
<p>Policy gradient looks <em>a lot</em> better! It’s still not as good as pure random search, but it’s close. And, as Pavel pointed out, we can remove the annoying “clipping” to the $[-2,2]$ hypercube I needed to get Policy Gradient to appear to converge. Of course, both are still worse than uniform sampling and orders of magnitude worse than nominal control. On the infinite time horizon, the picture is similar:</p>
<p class="center"><img src="/assets/rl/policies/cost_infinite_update.png" alt="sgd droolz" width="560px" /></p>
<p>Policy gradient still hiccups with some probability, but is on average only a bit worse than random search at extrapolation.</p>
<p>This is great. Policy gradient can be fixed on this instance of LQR by using Adam, and it’s not quite as egregious as my notebook made it look. Though it’s still not competitive with a model-based method for this simple problem.</p>
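<p>For reference, an Adam step in pure python really is tiny. A generic sketch (not necessarily the exact code in the notebook) looks like:</p>

```python
import numpy as np

def adam_step(g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Compute the Adam update for gradient g at iteration t (1-indexed).
    Returns the step to subtract and the updated moment estimates."""
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (scaling)
    mhat = m / (1 - beta1 ** t)              # bias corrections
    vhat = v / (1 - beta2 ** t)
    return lr * mhat / (np.sqrt(vhat) + eps), m, v
```

<p>The caller keeps the state <code>(m, v, t)</code> and applies <code>x -= step</code> at each iteration.</p>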
<p>For what it’s worth, neither Pavel nor I could get standard gradient descent to converge for this problem. If any of you can get ordinary SGD to work, please let me know!</p>
<p>Despite this positive development, I have to say I remain discomforted. I have been told by multiple people that Policy Gradient is a strawman, and we need to add heuristics for baselines and special solvers on top of the original estimator to make it work. But if that is true, why do we still do worse than pure random search? Perhaps adding more to the problem can improve performance: maybe a trust-region with inexact CG solve, or value function estimation (we’ll explore this in a future post). But the more parameters we add, the more we can just overfit to this simple example.</p>
<p>I do worry a lot that in ML in general we deceive ourselves when we add algorithmic complexity on a small set of benchmark instances. As an illustrative example, let’s now move to a considerably harder instance of LQR. Let’s go from idealized quadrotor models to idealized datacenter models, as everyone knows that RL is the <a href="https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/">go-to approach for datacenter cooling</a>. Here’s a very rough linear model of a collection of three server racks, each with their own cooling devices:</p>
<p class="center"><img src="/assets/rl/policies/fake_datacenter.png" alt="fake datacenter" width="560px" /></p>
<p>Each component of the state $x$ is the internal temperature of one of the racks, and their traffic causes them to heat up with a constant load. They also shed heat to their neighbors. The control enables local cooling. This gives a linear model</p>
<script type="math/tex; mode=display">x_{t+1} = Ax_t + Bu_t+w_t</script>
<p>Where</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{bmatrix} 1.01 & 0.01 & 0\\ 0.01 & 1.01 & 0.01 \\ 0 & 0.01 & 1.01 \end{bmatrix}
\qquad \qquad B = I %]]></script>
<p>This is a toy, but it’s instructive. Let’s try to solve the <a href="http://www.argmin.net/2018/02/08/lqr">LQR problem</a> with the settings $Q = I$ and $R= 1000 I$. This models trying hard to minimize the power consumption of the fans but still keeping the data center relatively cool. Now what happens for our RL methods on this instance?</p>
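<p>Before the model-free methods, it’s worth sketching the model-based baseline. With $(A,B)$ known, the optimal LQR gain comes straight out of a discrete algebraic Riccati equation (a sketch using scipy; note that the open loop is unstable):</p>

```python
import numpy as np
import scipy.linalg

# the three-rack model and LQR weights from above
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B = np.eye(3)
Q = np.eye(3)
R = 1000 * np.eye(3)

# optimal infinite-horizon gain: u_t = -K x_t
P = scipy.linalg.solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

print(np.abs(np.linalg.eigvals(A)).max())          # > 1: open loop is unstable
print(np.abs(np.linalg.eigvals(A - B @ K)).max())  # < 1: closed loop is stable
```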
<p>I tuned the parameters of Adam (even though I am told that Adam never needs tuning), and this is the best I can get:</p>
<p class="center"><img src="/assets/rl/policies/cost_infinite_datacenter.png" alt="hard instance inf" width="250px" />
<img src="/assets/rl/policies/stabilizing_datacenter.png" alt="hard instance stabilizing" width="250px" /></p>
<p>You might be able to tune this better than me, and I’d encourage you to try (<a href="https://nbviewer.jupyter.org/url/argmin.net/code/lqr_fake_datacenter_demo.ipynb">python notebook</a> for the intrepid). Again, I would love feedback here as I’m trying to learn the ins and outs of the space as much as everyone reading this.</p>
<p>What is more worrying to me is that if I change the random seed to 1336 but keep the parameters the same, the performance degrades for PG:</p>
<p class="center"><img src="/assets/rl/policies/cost_infinite_datacenter_1336.png" alt="don't tune on random seeds" width="250px" />
<img src="/assets/rl/policies/stabilizing_datacenter_1336.png" alt="this doesn't look good" width="250px" /></p>
<p>That means that we’re still very much in a high variance regime for Policy Gradient.</p>
<p>Now note, even though random search is better than policy gradient here, random search is still really bad. It is still finding many unstable solutions on this harder instance. That’s certainly less than ideal. Even if random search is better than deep RL, I probably wouldn’t use it in my datacenter. This for me is the main point. We can tune model-free methods all we want, but I think there are fundamental limitations to this methodology. <strong>By throwing away models and knowledge, it is never clear if we can learn enough from a few instances and random seeds to generalize.</strong> I revisit this on considerably more challenging problems in the <a href="http://www.argmin.net/2018/03/20/mujocoloco/">next post</a>.</p>
<p>What makes this example hard? In order to understand the hardness, we have to understand the instance. The underlying dynamics are <em>unstable</em>. This means that unless a proper control is applied, the system will blow up (and the servers will catch fire). If you look at the last line of the notebook, you’ll see that even the nominal controller is producing an unstable solution with one rollout. This makes sense: if we estimate one of the diagonal entries of $A$ to be less than $1$, we might guess that this mode is stable and put less effort into cooling that rack. So it’s imperative to get a high quality estimate of the system’s true behavior for near optimal control. Or rather, we have to be able to ascertain whether or not our current policy is safe; otherwise the consequences can be disastrous. Though this series seems to be ever-expanding, an important topic of a future post is how to tightly integrate safety and robustness concerns when learning to control.</p>
Tue, 13 Mar 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/03/13/pg-saga/
http://benjamin-recht.github.io/2018/03/13/pg-saga/A Model, You Know What I Mean?<p><em>This is the seventh part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 8 is <a href="http://www.argmin.net/2018/03/13/pg-saga/">here</a>. Part 6 is <a href="http://www.argmin.net/2018/02/20/reinforce/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>The role of models in reinforcement learning remains hotly debated. <em>Model-free</em> methods, like policy gradient, aim to solve optimal control problems only by probing the system and improving strategies based on past rewards and states. Many researchers argue for systems that can innately learn without the complicated details required to simulate a physical system. They argue that it is often easier to find a policy for a task than it is to fit a general purpose model of the system dynamics.</p>
<p>On the other hand, in continuous control problems <em>we always</em> have models. The idea that we are going to build a self-driving car from trial and error is ludicrous. Fitting models, while laborious, is not out of the realm of possibilities for most systems of interest. Moreover, oftentimes a coarse model suffices to plan a nearly optimal control strategy. How much can a model improve performance even when the parameters are unknown or the model doesn’t fully capture all of the system’s behavior?</p>
<p>In this post, I’m going to look at one of the simplest uses of a model in reinforcement learning. The strategy will be to estimate a predictive model for the dynamical process and then to use it in a dynamic programming solution to the prescribed control problem. Building a control system as if this estimated model were true is called <em>nominal control</em>, and the estimated model is called the <em>nominal model</em>. Nominal control will serve as a useful baseline algorithm for the rest of this series. In this post, let’s unpack how nominal control might work for the simple LQR problem.</p>
<h2 id="system-identification">System identification</h2>
<p>Estimation of dynamical systems is called <em>system identification</em> in the controls community. System identification differs from conventional estimation because one needs to carefully choose the right inputs to excite the various degrees of freedom and because dynamical outputs are correlated over time with the parameters we hope to estimate. Once data is collected, however, conventional machine learning tools are used to find the system that best agrees with the data.</p>
<p>Let’s return to our abstract dynamical system model</p>
<script type="math/tex; mode=display">x_{t+1} = f(x_t,u_t,e_t)</script>
<p>We want to build a predictor of $x_{t+1}$ from $(x_t,u_t,e_t)$. The question is <em>how much do we need to model</em>? Do we use a complicated physical model that is given by physics? Or do we approximate $f$ non-parametrically, say using a neural network? How do we fit the model to guarantee good out of sample prediction?</p>
<p>This question remains an issue even for linear systems. Let’s go back to the toy example I’ve been using throughout this series: <a href="http://www.argmin.net/2018/02/01/control-tour/">the quadrotor dynamics</a>. In the <a href="http://www.argmin.net/2018/02/08/lqr/">LQR post</a>, we modeled control of a quadrotor as an LQR problem</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{minimize}_{u_t,x_t} \, & \frac{1}{2}\sum_{t=0}^{N-1} x_{t+1}^TQ x_{t+1} + u_t^T R u_t \\
\mbox{subject to} & x_{t+1} = A x_t+ B u_t, \\
& \qquad \mbox{for}~t=0,1,\dotsc,N,\\
& \mbox{($x_0$ given).}
\end{array} %]]></script>
<p>Where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A &= \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}\,
&
\qquad B &= \begin{bmatrix} 0\\ 1 \end{bmatrix}\,
\\
Q &= \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\,
& \qquad
R &= 1
\end{aligned} %]]></script>
<p>and we assume that $x_0 = [-1,0]$.</p>
<p>Given such a system, what’s the right way to identify it? A simple, classic strategy is simply to inject a random probing sequence $u_t$ for control and then measure how the state responds. A model can be fit by solving the least-squares problem</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{minimize}_{A,B} & \sum_{t=0}^{N-1} ||x_{t+1} - A x_t - B u_t||^2\,.
\end{array} %]]></script>
<p>Let’s label the minimizers $\hat{A}$ and $\hat{B}$. These are our point estimates for the model. With such point estimates, we can solve the LQR problem</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{minimize}_{u_t,x_t} \, & \frac{1}{2}\sum_{t=0}^{N-1} x_{t+1}^TQ x_{t+1} + u_t^T R u_t \\
\mbox{subject to} & x_{t+1} = \hat{A} x_t+ \hat{B} u_t, \\
& \qquad \mbox{for}~t=0,1,\dotsc,N,\\
& \mbox{($x_0$ given).}
\end{array} %]]></script>
<p>In this case, we are solving the wrong problem to get our trajectory $u_t$. But you could imagine that if $(\hat{A},\hat{B})$ and $(A,B)$ are close, this should work pretty well.</p>
<p>Given that there will be errors in the estimation, and that we don’t yet know what the right signal is to excite the different system modes, it is not at all clear how well this will work in practice. And our understanding of the optimal probing strategy and optimal estimation rates remains incomplete. But it does seem like a sensible approach, and lack of theory should never stop us from trying something out…</p>
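<p>For concreteness, here is a sketch of that identification step on the quadrotor model (the probing and regression follow the least-squares formulation above; the actual notebook code may differ):</p>

```python
import numpy as np

# true model, unknown to the identification procedure
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])

# probe with white noise and record one trajectory of length 10
np.random.seed(0)
x = np.array([-1.0, 0.0])
X, U, Xnext = [], [], []
for _ in range(10):
    u = np.random.randn(1)
    e = 1e-2 * np.random.randn(2)      # process noise, covariance 1e-4 I
    xn = A @ x + B @ u + e
    X.append(x); U.append(u); Xnext.append(xn)
    x = xn

# least squares: regress x_{t+1} on the stacked regressor [x_t; u_t]
Z = np.hstack([np.array(X), np.array(U)])                 # 10 x 3
Theta, *_ = np.linalg.lstsq(Z, np.array(Xnext), rcond=None)
A_hat, B_hat = Theta.T[:, :2], Theta.T[:, 2:]             # point estimates
```

<p>With this little noise, $\hat{A}$ and $\hat{B}$ land close to the truth, and they can be dropped directly into the LQR solver in place of $(A,B)$.</p>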
<h2 id="comparing-policies">Comparing Policies</h2>
<p>I ran experiments for the quadrotor problem with a little bit of noise ($e_t$ zero-mean with covariance $10^{-4} I$), and with time horizon $T=10$. The input sequence I chose was Gaussian white noise with unit variance. Code for all of these experiments is <a href="https://nbviewer.jupyter.org/url/argmin.net/code/lqr_policy_comparisons_original.ipynb">in this python notebook</a>.</p>
<p>With one iteration (10 samples), the nominal estimate is correct to 3 digits of precision. Not bad! And, not surprisingly, this returns a nearly optimal control policy. Right out of the box, this nominal control strategy worked pretty well. Note that I even made the nominal control’s life hard here. I used none of the sparsity information or domain knowledge. In reality, I should only have to estimate one parameter: the (2,1) entry in $B$ which governs how much force is put out by the actuator and how much mass the system has.</p>
<p>Now, here’s where I get to pick on model-free methods again. How do they fare on this problem? The first thing I did was restrict my attention to policies that used a static, linear gain. I did not want to wade into neural networks. This is <em>helping</em> the model free methods, as a static linear policy works almost as well as a time varying policy for this simple 2-state LQR problem. Moreover, there are only <em>2 decision variables</em>. Just two numbers to identify. Should be a piece of cake, right?</p>
<p>I compared three model-free strategies:</p>
<ul>
<li>
<p><strong>Policy Gradient.</strong> I coded this up myself, so I’m sure it’s wrong. But I used a simple baseline subtraction heuristic to reduce variance, and also added bound constraints to ensure that PG didn’t diverge.</p>
</li>
<li>
<p><strong>Random search.</strong> A simple <a href="http://www.argmin.net/2017/04/03/evolution/">random search heuristic</a> that uses finite difference approximations across random axes.</p>
</li>
<li>
<p><strong>Uniform sampling.</strong> I picked a bunch of random controllers from a bounded cube in $\mathbb{R}^2$ and returned the one that yielded the lowest LQR cost.</p>
</li>
</ul>
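As a rough sketch of the random search variant (my own minimal implementation, not the notebook's; the cost function is a noise-free finite-horizon stand-in for the rollouts, and the gradient clipping is my own safeguard):

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.diag([1.0, 0.0])

def cost(K, T=10):
    """Finite-horizon LQR cost of the static gain u_t = K x_t from x_0 = (-1, 0)."""
    x = np.array([-1.0, 0.0])
    c = 0.0
    for _ in range(T):
        u = (K @ x).item()
        x = A @ x + (B * u).ravel()
        c += 0.5 * (x @ Q @ x + u * u)
    return c

def random_search(iters=500, step=0.01, delta=0.05, seed=0):
    """Finite-difference approximations along random directions."""
    rng = np.random.default_rng(seed)
    K = np.zeros((1, 2))
    for _ in range(iters):
        d = rng.standard_normal((1, 2))
        g = (cost(K + delta * d) - cost(K - delta * d)) / (2 * delta)
        K = K - step * np.clip(g, -5.0, 5.0) * d   # clip to avoid wild steps
    return K
```

Each iteration of `random_search` spends two rollouts, which is how the sample counts in the plots below add up so quickly.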
<p>How do these fare? I ran each of these methods with 10 different random seeds and plot the best of the 10 results here:</p>
<p class="center"><img src="/assets/rl/policies/cost_finite.png" alt="position" width="409px" /></p>
<p>After about 500 rollouts (that is, 500 trials of length 10 with different control settings), all of the methods seem equally good. Though, again, this is comparing to <em>one rollout</em> for nominal control. That’s a pretty big hit to be taking in sample complexity in 2D. Random search seems to be a bit better than policy gradient, but, perhaps unsurprisingly, uniform sampling is better than both of them. It’s 2D after all.</p>
<p>The story changes if I include error bars. Now, rather than plotting the best performance, I plot the median performance:</p>
<p class="center"><img src="/assets/rl/policies/cost_finite_err_bars.png" alt="position" width="560px" /></p>
<p>The error bars here encompass the max and min over all trials. Policy gradient looks much worse than it did before. And indeed, its variance is rather high. After 4000 rollouts, its median performance is on par with the nominal control with a single rollout. But note that the worst case performance is still unstable even with 5000 simulations.</p>
<p>My problem with policy gradient is: how do I debug it? I probably have a bug! But how can I tell? What’s a unit test? What’s a simple diagnostic that shows it is working, other than that the cost tends to improve with more samples?</p>
<p>Of course, it’s probable that I didn’t tune the random seed properly to get this to work. I used 1337, as suggested by Moritz Hardt, but other values surely work better. Perhaps a better baseline would improve things? Or maybe I could add a critic? Or I could use something more sophisticated like a Trust Region method?</p>
<p>All of these questions are asking for more algorithmic complexity and are missing the forest for the trees. The major issue is that model-free methods are several orders of magnitude less sample-efficient than a parameter-free, model-based method. If you have models, you really should use them!</p>
<h2 id="metalearning">MetaLearning</h2>
<p>Another claim frequently made in RL is that policies learned in one task may generalize to other tasks. A rather simple form of generalization would be to be able to achieve high performance on the LQR problem as we change the time horizon. That is, what if we make the length of the horizon arbitrarily long, will the policies still achieve high performance?</p>
<p>One way to check this would be to see what the cost looks like on an <em>infinite</em> time horizon. If we do nominal control, we can plug in our point estimate, solve a Riccati equation, and produce a controller for an infinite time horizon. As expected, this is nearly optimal for the quadrotor model.</p>
<p>But what about for the model-free approaches? Do the learned controllers generalize to arbitrarily long horizons? With model-free methods, we are stuck with a fixed controller, but can test it on the infinite time horizon regardless. The results look like this:</p>
<p class="center"><img src="/assets/rl/policies/cost_infinite.png" alt="position" width="560px" /></p>
<p>The error bars for Policy Gradient are all over the place here, and the median is indeed infinite up to 2000 rollouts. What is happening is that on an infinite time horizon, the controller must be <em>stabilizing</em> so that the trajectories don’t blow up. A necessary and sufficient condition for stabilization is that all of the eigenvalues of the matrix $A+BK$ have magnitude less than 1. This makes sense, as the closed-loop system takes the form</p>
<script type="math/tex; mode=display">x_{t+1} = (A+BK)x_t + e_t</script>
<p>If $A+BK$ has an eigenvalue with magnitude greater than 1, then the state component along the corresponding eigenvector will be amplified exponentially quickly. We can plot how frequently the various search methods find stabilizing control policies when looking at a finite horizon. Recall that nominal control finds such a policy with one simulation.</p>
<p class="center"><img src="/assets/rl/policies/stabilizing.png" alt="position" width="412px" /></p>
<p>Uniform sampling and random search do eventually tend to find only stabilizing policies, but they still require a few hundred simulations to ensure stability. Policy gradient, on the other hand, never returns a stabilizing policy more than 90 percent of the time, even after thousands of simulations.</p>
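This stabilization check is a one-liner (a sketch, assuming numpy and the quadrotor model from earlier in the post):

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])

def is_stabilizing(K):
    """True iff the spectral radius of the closed-loop matrix A + B K is below 1."""
    return bool(np.max(np.abs(np.linalg.eigvals(A + B @ K))) < 1.0)
```

The zero gain fails the check: $A$ itself has both eigenvalues equal to 1, so the open-loop double integrator is not stable.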
<h2 id="a-coarse-model-cant-do-everything">A Coarse Model Can’t Do Everything</h2>
<p>Though I expect that a well-calibrated coarse model will outperform a model-free method on almost any task, I want to close by emphasizing that we do not know the limits of model-based RL any more than we know the limits of model-free. Even understanding the complexity of estimating linear, time-invariant systems remains an open theoretical challenge. In the nonlinear case, affairs are only harder. We don’t fully understand the limits and fragilities of nominal control for LQR, and we don’t know just how coarse a model can be and still attain satisfactory control performance. In future posts, I will address some of these limits and the open problems we’ll need to solve in order to make learning a first-class citizen in contemporary control systems.</p>
Mon, 26 Feb 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/02/26/nominal/
http://benjamin-recht.github.io/2018/02/26/nominal/The Policy of Truth<p><em>This is the sixth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 7 is <a href="http://www.argmin.net/2018/02/26/nominal/">here</a>. Part 5 is <a href="http://www.argmin.net/2018/02/14/rl-game/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>Our first generic candidate for solving reinforcement learning is <em>Policy Gradient</em>. I find it shocking that Policy Gradient wasn’t ruled out as a bad idea in 1993. Policy gradient is seductive as it apparently lets one fine tune a program to solve any problem without any domain knowledge. Of course, anything that makes such a claim must be too general for its own good. Indeed, if you dive into it, <strong>policy gradient is nothing more than random search dressed up in mathematical symbols and lingo</strong>.</p>
<p>I apologize in advance that this is one of the more notationally heavy posts. Policy Gradient makes excessive use of notation to fool us into thinking there is something deep going on. My guess is that part of the reason Policy Gradient remained a research topic was because people didn’t implement it and the mathematics <a href="http://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf">looked so appealing on its own</a>. This makes it easy to lose sight of what would happen if the method actually got coded up. See if you can find the places where leaps of faith occur.</p>
<h2 id="adding-abstraction-until-the-problem-is-solved">Adding abstraction until the problem is solved</h2>
<p>Let’s start with the super general problem that people solve with policy gradient. Recall that a <em>trajectory</em> is a sequence of states $x_k$ and control actions $u_k$ generated by a dynamical system,</p>
<script type="math/tex; mode=display">\tau_t = (u_1,\ldots,u_{t-1},x_0,\ldots,x_t) \,,</script>
<p>and a <em>policy</em> is a function, $\pi$, that takes a trajectory and outputs a new control action. Our goal remains to find a policy that maximizes the total reward after $L$ time steps.</p>
<p>In policy gradient, we fix our attention on <em>parametric, randomized policies</em>. The policy $\pi$ has a list of parameters to tune, $\vartheta$. And rather than returning a specific control action, we assume that $\pi$ is a probability distribution over actions. An action is chosen in practice in each step by <em>sampling</em> from this distribution $\pi$. You might ask, why are we sampling? That’s a great question! But let’s not get bogged down by reasonable questions and press on.</p>
<p>Let’s write $\pi_\vartheta$ to make the dependence on the parameters $\vartheta$ explicit. Since $\pi_\vartheta$ is a probability distribution, using $\pi_\vartheta$ as a policy induces a probability distribution over trajectories:</p>
<script type="math/tex; mode=display">p(\tau;\vartheta) = \prod_{t=0}^{L-1} p(x_{t+1} \vert x_{t},u_{t}) \pi_\vartheta(u_t\vert \tau_t)\,.</script>
<p>Moreover, we can overload notation and define the reward of a trajectory to be</p>
<script type="math/tex; mode=display">R(\tau) = \sum_{t=0}^{L} R_t(x_t,u_t)</script>
<p>Then our optimization problem for reinforcement learning tidily becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{maximize}_{\vartheta} & \mathbb{E}_{p(\tau \vert \vartheta)}[ R(\tau)]
\end{array} %]]></script>
<p>We can make this even cleaner by defining</p>
<script type="math/tex; mode=display">J(\vartheta) := \mathbb{E}_{p(\tau \vert \vartheta)}[ R(\tau) ]\,.</script>
<p>Our goal in reinforcement learning can now be even more compactly written as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{maximize}_{\vartheta} & J(\vartheta)\,.
\end{array} %]]></script>
<h2 id="policy-gradient">Policy Gradient</h2>
<p>Having set up the problem in tidy notation, Policy Gradient can now be derived by the following clever trick:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\nabla_{\vartheta} J(\vartheta) &= \int R(\tau) \nabla_{\vartheta} p(\tau;\vartheta) d\tau\\
&= \int R(\tau) \left(\frac{\nabla_{\vartheta} p(\tau;\vartheta)}{p(\tau;\vartheta)}\right) p(\tau;\vartheta) d\tau\\
&= \int \left( R(\tau) \nabla_{\vartheta} \log p(\tau;\vartheta) \right) p(\tau;\vartheta)d\tau \\
&= \mathbb{E}_{p(\tau;\vartheta)}\left[ R(\tau) \nabla_{\vartheta} \log p(\tau;\vartheta) \right]\,.
\end{align*} %]]></script>
<p>This calculation reveals that the gradient of $J$ with respect to $\vartheta$ is the expected value of the function</p>
<script type="math/tex; mode=display">G(\tau,\vartheta) = R(\tau) \nabla_{\vartheta} \log p(\tau;\vartheta)</script>
<p>Hence, if we sample a trajectory $\tau$ by running policy $\pi_\vartheta$, computing $G(\tau,\vartheta)$ gives an unbiased estimate of the gradient of $J$. Following this direction amounts to running stochastic gradient ascent on $J$.</p>
<p>What is even more magical is that the function $G(\tau,\vartheta)$ can be computed without knowing the equations that govern the dynamical system. To see this, note that</p>
<script type="math/tex; mode=display">p(x_{t+1}|x_{t},u_{t})</script>
<p>is <em>not</em> a function of the parameter $\vartheta$. Hence,</p>
<script type="math/tex; mode=display">\nabla_\vartheta \log p(\tau;\vartheta) = \sum_{t=0}^{L-1} \nabla_\vartheta \log \pi_\vartheta(u_t|\tau_t)\,.</script>
<p>These derivatives can be computed provided that $\pi_\vartheta$ is differentiable and you have the latest version of <a href="https://github.com/HIPS/autograd">autograd</a> installed.</p>
<p>To sum up, we have a fairly miraculous method that lets us optimize an optimal control problem without knowing anything about the dynamics of the system.</p>
<ol>
<li>Choose some initial guess $\vartheta_0$ and stepsize sequence ${\alpha_k}$. Set $k=0$.</li>
<li>Sample $\tau_k$ by running the simulator with policy $\pi_{\vartheta_k}$.</li>
<li>Set $\vartheta_{k+1} = \vartheta_k + \alpha_k R(\tau_k) \sum_{t=0}^{L-1} \nabla_\vartheta \log \pi_{\vartheta_k}(u_t\vert \tau_t)$, where the actions $u_t$ and partial trajectories $\tau_t$ are taken from the sampled trajectory $\tau_k$.</li>
<li>Increment $k=k+1$ and go to step 2.</li>
</ol>
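To see where the leaps of faith land, here is a minimal sketch of this loop on a two-state LQR instance with a static Gaussian policy $u_t \sim \mathcal{N}(Kx_t, \sigma^2)$ (my own toy implementation; the bound constraints are a safeguard against divergence, and the step sizes are illustrative):

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.diag([1.0, 0.0])

def pg_rollout(K, sigma, L, rng):
    """One trajectory under u_t = K x_t + g_t with g_t ~ N(0, sigma^2).
    Returns the total reward and sum_t grad_K log pi_K(u_t | x_t)."""
    x = np.array([-1.0, 0.0])
    reward, score = 0.0, np.zeros((1, 2))
    for _ in range(L):
        g = sigma * rng.standard_normal()
        u = (K @ x).item() + g
        reward -= 0.5 * (x @ Q @ x + u * u)
        score += (g / sigma**2) * x          # (u - K x) / sigma^2 times x^T
        x = A @ x + (B * u).ravel()
    return reward, score

def policy_gradient(iters=500, alpha=1e-4, sigma=0.2, L=10, seed=0):
    rng = np.random.default_rng(seed)
    K = np.zeros((1, 2))
    for _ in range(iters):
        r, s = pg_rollout(K, sigma, L, rng)
        # REINFORCE step with bound constraints so the iterates cannot diverge
        K = np.clip(K + alpha * r * s, -2.0, 2.0)
    return K
```

Note that nothing in the update touches $A$ or $B$: the dynamics appear only through the sampled trajectory, exactly as promised, and that is also why each update is so noisy.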
<p>The main appeal of policy gradient is that it is this easy. If you can efficiently sample from $\pi_\vartheta$, you can run this algorithm on essentially any problem. You can fly quadcopters, you can cool data centers, you can teach robots to open doors. The question becomes, of course, can you do this well? I think that a simple appeal to the Linearization Principle will make it clear that Policy Gradient is likely never the algorithm that you’d want to use.</p>
<h2 id="why-are-we-using-probabilistic-policies-again">Why are we using probabilistic policies again?</h2>
<p>Before talking about linear models, let’s step back and consider a pure optimization setup. We added a bunch of notation to reinforcement learning so that at the end, it seemed like we were only aiming to maximize an unconstrained function. Let’s remove all of the dynamics and consider the <em>one step</em> optimal control problem. Given a function $R(u)$, I want to find the $u$ that makes this as large as possible. That is, I’d like to solve the optimization problem</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{maximize}_u & R(u) \,.
\end{array} %]]></script>
<p>Now, bear with me for a second into a digression that might seem tangential. Any optimization problem like this is equivalent to an optimization over probability distributions on $u$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{maximize}_{p(u)} & \mathbb{E}_p[R(u)]
\end{array} %]]></script>
<p>The equivalence goes like this: if $u_\star$ is the optimal solution, then we get the same reward by putting a Delta function around $u_\star$. Moreover, if $p$ is a probability distribution, it’s clear that the <em>expected reward</em> can never be larger than the maximal reward achievable by a fixed $u$. So we can either optimize over $u$ or over <em>distributions</em> over $u$.</p>
<p>Now here is the first logical jump in Policy Gradient. Rather than optimizing over the space of all probability distributions, we optimize over a parametric family $p(u;\vartheta)$. If this family contains all of the Delta functions, then the optimal value will coincide with that of the non-random optimization problem. But if the family does not contain the Delta functions, we get only a lower bound on the optimal reward, no matter how good a distribution we find. In that case, if you sample $u$ from the policy, the expected reward will necessarily be suboptimal.</p>
<p>A major problem with this paradigm of optimization over distributions is that we have to balance many requirements for our family of distributions. We need probability distributions that are</p>
<ol>
<li>rich enough to approximate delta functions</li>
<li>easy to search by gradient methods</li>
<li>easy to sample from</li>
</ol>
<p>That’s a lot of demands to place upon distributions, <em>especially when your control actions take continuous values</em>. For continuous actions, more often than not people will choose a family of Gaussian distributions so that</p>
<script type="math/tex; mode=display">u_t = f(\tau_t) + g_t</script>
<p>Here, $f$ is some nonlinear function and $g_t$ is a Gaussian random vector. No parameterization like this contains the Delta functions. And it is not clear how much we lose by making such a parameterization <strong>because we’re not allowed to model anything in reinforcement learning</strong>.</p>
<p>It’s important at this point to reemphasize that <em>there is no need for a randomized policy in the basic optimal control problem we have been studying.</em> And there is certainly no need for one in the simple LQR problem. The probabilistic policy is a modeling choice, and one that is never better than a deterministic policy.</p>
<h2 id="the-super-general-reinforce-algorithm">The super general REINFORCE algorithm</h2>
<p>So it turns out that this Policy Gradient algorithm is in fact a general purpose method for finding stochastic gradients of rewards of the form</p>
<script type="math/tex; mode=display">J(\vartheta):=\mathbb{E}_{p(u;\vartheta)}[R(u)]</script>
<p>The log-likelihood trick works in full generality here:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\nabla_{\vartheta} J(\vartheta) &= \int R(u) \nabla_{\vartheta} p(u;\vartheta) du\\
&= \int R(u) \left(\frac{\nabla_{\vartheta} p(u;\vartheta)}{p(u;\vartheta)}\right) p(u;\vartheta) du\\
&= \int \left( R(u) \nabla_{\vartheta} \log p(u;\vartheta) \right) p(u;\vartheta)du \\
&= \mathbb{E}_{p(u;\vartheta)}\left[ R(u) \nabla_{\vartheta} \log p(u;\vartheta) \right]\,.
\end{align*} %]]></script>
<p>And hence the following is a general purpose algorithm for maximizing rewards with respect to parametric distributions:</p>
<ol>
<li>Choose some initial guess $\vartheta_0$ and stepsize sequence ${\alpha_k}$. Set $k=0$.</li>
<li>Sample $u_k$ i.i.d., from $p(u;\vartheta_k)$.</li>
<li>Set $\vartheta_{k+1} = \vartheta_k + \alpha_k R(u_k) \nabla_{\vartheta} \log p(u_k;\vartheta_k)$.</li>
<li>Increment $k=k+1$ and go to step 2.</li>
</ol>
<p>The algorithm in this form is called REINFORCE. It seems weird: we get a stochastic gradient, but the function we cared about optimizing—$R$—is only accessed through function evaluations. We never compute gradients of $R$ itself. So is this algorithm any good?</p>
<p>It depends on what you are looking for. If you’re looking for something to compete with gradients, no. It’s a terrible algorithm. If you’re looking for an algorithm to compete with a finite difference approximation to $R$ then… it’s still a terrible algorithm. But the math is cute.</p>
<p>The thing is, the Linearization Principle suggests this algorithm should be discarded almost immediately. Let’s consider the most trivial instance of LQR:
<script type="math/tex">R(u) = -||u-z||^2</script>
Let $p(u;\vartheta)$ be a multivariate Gaussian with mean $\vartheta$ and variance $\sigma^2 I$. What does policy gradient do? First, note that</p>
<script type="math/tex; mode=display">\mathbb{E}_{p(u;\vartheta)} [R(u)]= -\|\vartheta-z\|^2 - \sigma^2 d</script>
<p>Obviously, the best thing to do would be to set $\vartheta=z$. Note that the expected reward is off by $\sigma^2 d$ at this point, but at least this would be finding a good guess for $u$. Also, as a function of $\vartheta$, $J$ is <em>strongly convex</em>, and the most important thing to know is the expected norm of the gradient as this will control the number of iterations. Now, if you start at $\vartheta=0$, then the gradient is</p>
<script type="math/tex; mode=display">g=-\frac{\|\omega-z\|^2\, \omega}{\sigma^2}\,,</script>
<p>where $\omega$ is a normally distributed random vector with mean zero and covariance $\sigma^2 I$.
The expected norm of this stochastic gradient is… gross. You need to compute 6th order moments, and that’s never fun. But if you grind through the details, you’ll see the expected norm is on the order of</p>
<script type="math/tex; mode=display">O\left(\sigma d^{1.5} + \sigma^{-1} d^{0.5} \|z\|\right)\,.</script>
<p>That’s quite large! The scaling with dimension is rather troubling.</p>
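Running REINFORCE on this toy quadratic bears the analysis out; here is a minimal sketch (assuming numpy; averaging the later iterates is my own addition to tame the noise a bit):

```python
import numpy as np

def reinforce(z, sigma=0.25, alpha=5e-3, iters=3000, seed=0):
    """REINFORCE on R(u) = -||u - z||^2 with p(u; theta) = N(theta, sigma^2 I),
    so grad_theta log p(u; theta) = (u - theta) / sigma^2."""
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(z)
    avg = np.zeros_like(z)
    for k in range(iters):
        u = theta + sigma * rng.standard_normal(z.shape)
        reward = -np.sum((u - z) ** 2)
        theta = theta + alpha * reward * (u - theta) / sigma**2
        if k >= iters // 2:                  # average the second half of the iterates
            avg += theta / (iters - iters // 2)
    return avg
```

Even in two dimensions with thousands of samples, the raw iterates jitter noticeably around $z$; the averaging is doing real work here, which is exactly the high-variance behavior the norm calculation above predicts.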
<p>Many people have analyzed the complexity of this method, and <a href="http://alekhagarwal.net/bandits-colt.pdf">it is indeed not great</a>: it depends strongly on the dimension of the search space and on the largest magnitude reward $B$. If the function values are noisy, even for convex functions the error after $T$ function evaluations decays only as $O((d^2B^2/T)^{1/3})$, and this assumes you get the algorithm parameters exactly right. For strongly convex functions, you can possibly eke out a rate of $O((d^2B^2/T)^{1/2})$, but this result is also rather fragile to the choice of parameters. Finally, note that just adding a constant offset to the reward dramatically slows down the algorithm. If you start with a reward function whose values are in $[0,1]$ and you subtract one million from each reward, this will increase the running time of the algorithm by a factor of a million, even though the ordering of the rewards amongst parameter values remains the same.</p>
<p>Note that matters only get worse as we bring in dynamics. The policy gradient update for LQR is very noisy, and its variance grows with the simulation length $L$. Moreover, the search for $\vartheta$ is necessarily nonconvex if one is searching for a simple static policy. While this could work in practice, we already face so many hurdles that it suggests we should look for an alternative.</p>
<h2 id="how-can-people-be-claiming-such-success-in-rl">How can people be claiming such success in RL?</h2>
<p>Lots of papers have been applying policy gradient to all sorts of different settings, and claiming crazy results, but I hope that it is now clear that they are just dressing up <a href="https://en.wikipedia.org/wiki/Random_search">random search</a> in a clever outfit. When you end up with a bunch of papers showing that <a href="https://twitter.com/OriolVinyalsML/status/960927537005322243">genetic algorithms are competitive with your methods</a>, this does not mean that we’ve made an advance in genetic algorithms. It is far more likely that this means that your method is a lousy implementation of random search.</p>
<p>Regardless, both genetic algorithms and policy gradient require an absurd number of samples. This is OK <a href="https://twitter.com/beenwrekt/status/961263599674150912">if you are willing to spend millions of dollars on AWS</a> and never actually want to tune a physical system. But there must be a better way.</p>
<p>I don’t think I can overemphasize the point that policy gradient and RL are not magic. I’d go as far as to say that policy gradient and its derivatives are legitimately bad algorithms. In order to make them work well, you need lots of tricks. <a href="https://blog.openai.com/openai-baselines-dqn/">Algorithms which are hard to tune, hard to reproduce</a>, and don’t outperform off the shelf genetic algorithms are bad algorithms.</p>
<p>We’ll come back to this many times in this series: for any application where policy gradient is successful, a dramatically simpler and more robust algorithm exists that will match or outperform it. It’s never a good idea, and I cannot for the life of me figure out why it is so popular.</p>
<p>Indeed! <a href="http://www.argmin.net/2018/02/26/nominal/">In the next post</a> I’ll turn back to LQR and look at some other strategies that might be more successful than policy gradient.</p>
Tue, 20 Feb 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/02/20/reinforce/
http://benjamin-recht.github.io/2018/02/20/reinforce/A Game of Chance to You to Him Is One of Real Skill<p><em>This is the fifth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 6 is <a href="http://www.argmin.net/2018/02/20/reinforce/">here</a>. Part 4 is <a href="http://www.argmin.net/2018/02/08/lqr/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>The first two parts of this series highlighted two parallel <em>aspirations</em> of the current research in RL: <a href="http://www.argmin.net/2018/01/29/taxonomy/">part 1</a> described reinforcement learning as prescriptive analytics and <a href="http://www.argmin.net/2018/02/01/control-tour/">part 2</a> as optimal control. This post, by contrast, is going to focus on how people typically <em>use</em> RL, both in practice and in papers. The reality of RL is often quite different than the rhetoric, and I want to spend time here to separate the two so it will be easier to understand the limits of different methodologies and algorithms.</p>
<p>There is a set of rules, agreed upon by loose precedent, that everyone abides by. I want to delineate these rules and then describe the connections to established inquiries in control system design and analysis.</p>
<h2 id="trajectories-and-policies">Trajectories and Policies</h2>
<p>Let’s begin by revisiting our abstract dynamical system model</p>
<script type="math/tex; mode=display">x_{t+1} = f( x_t, u_t, e_t)\,.</script>
<p>Again, $x_t$ is the <em>state</em> of the system, $u_t$ is the control action, and $e_t$ is a random disturbance. We’re going to assume that $f$ is fixed, but unknown.</p>
<p>I will refer to a <em>trajectory</em> as a sequence of states and control actions generated by this dynamical system.</p>
<script type="math/tex; mode=display">\tau_t = (u_1,\ldots,u_{t-1},x_0,\ldots,x_t) \,.</script>
<p>A <em>control policy</em> (or simply “a policy”) is a function, $\pi$, that takes a trajectory from a dynamical system and outputs a new control action. Note that $\pi$ only gets access to previous states and control actions.</p>
<p>For example, in LQR on long time horizons we know that the policy</p>
<script type="math/tex; mode=display">\pi(\tau_t) = K x_t</script>
<p>will be nearly optimal for a fixed matrix $K$. But arbitrarily complicated policies are possible for general RL problems.</p>
<p>Optimal control and reinforcement learning can be equivalently posed as trying to find the policy that maximizes an expected reward.</p>
<h2 id="go-ask-the-oracle">Go ask the oracle</h2>
<p>Recall that our main goal in reinforcement learning is to solve the optimal control problem</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{maximize}_{u_t} & \mathbb{E}_{e_t}[ \sum_{t=0}^N R_t[x_t,u_t] ]\\
\mbox{subject to} & x_{t+1} = f(x_t, u_t, e_t)\\
& \mbox{($x_0$ given).}
\end{array} %]]></script>
<p>But we assume that we don’t know the function $f$. Some lines of work even assume that we don’t know the reward function $R$. I personally feel that not knowing $R$ is unrealistic for any practical problem, and, moreover, that $R$ is actually a design parameter in engineering applications. However, for the purpose of this post, it will make no difference as to whether $R$ is known or unknown.</p>
<p>The important point is that we can’t solve this optimization problem using standard optimization methods unless we know the function $f$ governing the dynamics. We must learn something about the dynamical system and subsequently choose the best policy based on our knowledge. How do we measure success? We need to balance both the final expected reward of our policy and the number of times we need to interrogate the system to find this policy.</p>
<p>The main paradigm in contemporary RL is to play the following game. We decide on a policy $\pi$ and horizon length L. Then we either pass this policy to a simulation engine or to a real robotic system and are returned a trajectory</p>
<script type="math/tex; mode=display">\tau_L = (u_1,\ldots,u_{L-1},x_0,\ldots,x_L)\,,</script>
<p>where $u_t = \pi(\tau_t)$. This is our <em>oracle model</em>. We typically want to minimize the total number of samples computed by the oracle. So if we were to run $m$ queries with horizon length $L$, we would pay a total cost of $mL$. However, we are free to vary our horizon length for each experiment.</p>
<p>So let’s denote by $n$ the total number of oracle accesses. At the end of the day we want the expected reward to be high for our derived policy, but we also need the number of oracle queries to be small.</p>
<p>Phew. This is already complicated! Note that this framing of the problem makes it very hard to decide on a “best” algorithm. Do we decide an algorithm is best if it achieves some reward in the fewest number of samples? Or is an algorithm best if it achieves the highest rewards given a fixed budget of samples? Or maybe there’s a middle ground? I’ll return to such issues about measuring the relative abilities of different RL methods later in this series.</p>
<h2 id="iterative-learning-control">Iterative Learning Control</h2>
<p>Control theorists have a different name for this RL game. They call it <em>iterative learning control</em> (ILC). In ILC, the focus is on designing control systems that perform a repetitive task, and the design is refined by leveraging repetition. A common example is <a href="http://www.dynsyslab.org/wp-content/papercite-data/pdf/schoellig-ecc09.pdf">learning to track a trajectory</a>, and the input control is improved by adjustment with respect to the deviation from the desired trajectory in previous iterations. ILC is a useful and mature sub-discipline of control theory and has achieved many industrial success stories. Also, it is not an exaggeration to say that the embodiments of iterative learning control in actual physical systems blow any RL demo out of the water. Here are <a href="https://www.youtube.com/watch?v=4kHDv9senpE">some</a> <a href="https://youtu.be/goVuP5TJIUU">insane</a> <a href="https://www.youtube.com/watch?v=IZTP7h5cfqg">youtubes</a>. Hopefully you’ll come back and read the rest of this post after you get stuck in a rabbit hole of watching mind boggling quadrotor acrobatics.</p>
<p>Iterative learning control and RL differ mainly in what information they provide to the control design engineer. In RL, the problems are constructed to hide as much information about the dynamical system as possible. Even though RL practice uses physics simulators generated from well-specified differential equations, we tie our hands behind our backs, pretending we don’t know basic mechanics or the desired goals of our control system. As a result, RL schemes require millions of training examples to achieve reasonable performance. ILC, on the other hand, typically requires no more than a few dozen iterations to exceed human performance. But ILC typically requires reasonable models of the underlying system dynamics, and often assumes fairly well-specified dynamics. Is there a middle ground where we can specify a coarse model but still learn on actual physical systems in a short amount of time?</p>
<p>Understanding this tradeoff between modeling and number of required iterations is a fascinating practical and theoretical challenge. What new insights can be gleaned from comparing and contrasting classical control approaches with reinforcement learning techniques? How well do we need to understand a system in order to control it? In the next few posts, I’ll describe a variety of different approaches to the RL game and how different techniques choose to optimize inside of these rules. We’ll dive in with my least favorite of the bunch <a href="http://www.argmin.net/2018/02/20/reinforce/">Policy Gradient</a>.</p>
Wed, 14 Feb 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/02/14/rl-game/
http://benjamin-recht.github.io/2018/02/14/rl-game/