arg min blog: Musings on systems, information, learning, and optimization.
http://benjamin-recht.github.io/
You Cannot Serve Two Masters: The Harms of Dual Affiliation<p>Facebook would like to have computer science faculty in AI committed to work 80% of their time in industrial jobs and 20% of their time at their university. They call this scheme “<a href="https://newsroom.fb.com/news/2018/07/facebook-ai-research-expands/">co-employment</a>” or “<a href="https://www.facebook.com/schrep/posts/10156638732909443">dual</a> <a href="https://www.businessinsider.com/facebook-yann-lecun-dual-affiliation-model-ai-experts-2018-8">affiliation</a>.” This model assumes people can slice their time and attention like a computer, but people can’t do this. Universities and companies are communities, each with their particular missions and values. The values of these communities are often at odds, and researchers must choose where their main commitment lies. By committing researchers to a particular company’s interests, this new model of employment will harm our colleagues, our discipline, and everyone’s future. Like many harms, it comes with benefits for some. But the harm in this proposal outweighs the benefits. If industry wants to support and grow academic computer science, there are much better ways to achieve this.</p>
<p>The proposal will harm our discipline, because it will distract established talent from the special role of academics: curiosity-driven research. Academic scholarship has an excellent record of pursuing ideas into places that are exciting and productive, even if they don’t result in immediate, tangible benefits and especially if they ruffle the feathers of established, powerful institutions. You can’t do that if 80% of your time is spent not annoying a big company. Though big companies belabor promises of complete intellectual freedom for faculty, that can’t and won’t happen, because the purpose of companies is to make money for shareholders.</p>
<p>The proposal harms our students directly. Our faculty at their best secure everyone’s future by teaching talented students how to understand the challenges facing the broader world. Such mentorship is enriched by the courage, independence, security, and trained judgement of senior scholars to guide students’ perspectives on what is worth doing, what is likely irrelevant, and what is wrong. Engaging with a student body requires an all-in commitment, both in teaching and advising roles. Faculty primarily working elsewhere means cancelled classes. Faculty wedded to a company means advice that’s colored by the interest of the company.</p>
<p>The proposal harms our future because it will stifle innovation. University researchers have a great historical record of disruptive entrepreneurism — for example, Google dates back to a paper from the Stanford digital library project. Smooth transitions from academic research to industrial practice are widely encouraged: most universities allow faculty to consult at 20% time, do year-long sabbaticals in industry, or take leave to start companies in order to promote such transitions. But there’s a big difference between an industrial leave and a long-term commitment. You can’t do disruptive entrepreneurism if 80% of what you do is owned by a big company. Part of the point of being a big company is to control your environment by crushing, containing, or co-opting inconvenient innovations. Faculty who sign on are subject to a huge gravitational force and are <a href="https://newsroom.fb.com/news/2017/12/hard-questions-is-spending-time-on-social-media-bad-for-us/">hard pressed not to annoy the big company they work for</a>.</p>
<p>Like many really dangerous bargains, the harms are diffuse, and the benefits are focused. One kind of benefit is for faculty who sign on: in addition to the higher industrial salaries, working at a big company provides a chance to lead a team of research engineers to execute large-scale projects that may be used by millions. But another, more alarming, benefit is for big companies: all those potentially disruptive or potentially annoying ideas are now owned or controlled by the big company. Perhaps that’s
<del>the point of</del> why management supports the proposal.</p>
<p>If industry really wants to help scale and advance computer science research, it’s easy to do. Do what many companies are already doing, but do much more of it. Give fellowships to graduate students and scholarships to undergraduate students. Employ students as interns. Pay for named chairs and new buildings. Give lots of faculty small amounts of research money. Make and publish open datasets. Give us easy access to industrial scale computing resources. But don’t raid our faculty and tell us it’s good for us.</p>
<p><em>We have made a small edit to clear up a misunderstanding raised by a colleague. We have noted this change with strikethrough. Though comments are closed, you can follow the discussion on <a href="https://twitter.com/beenwrekt/status/1027915117076336640">Twitter</a>, <a href="https://www.reddit.com/r/MachineLearning/comments/963pek/r_you_cannot_serve_two_masters_the_harms_of_dual/">Reddit</a> and <a href="https://news.ycombinator.com/item?id=17734877">Hacker News</a>.</em></p>
Thu, 09 Aug 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/08/09/co-employment/
http://benjamin-recht.github.io/2018/08/09/co-employment/An Outsider's Tour of Reinforcement Learning<h2 id="table-of-contents">Table of Contents.</h2>
<ol>
<li><a href="http://www.argmin.net/2018/01/29/taxonomy/">Make It Happen.</a> Reinforcement Learning as prescriptive analytics.</li>
<li><a href="http://www.argmin.net/2018/02/01/control-tour/">Total Control.</a> Reinforcement Learning as Optimal Control.</li>
<li><a href="http://www.argmin.net/2018/02/05/linearization/">The Linearization Principle.</a> If a machine learning algorithm does crazy things when restricted to linear models, it’s going to do crazy things on complex nonlinear models too.</li>
<li><a href="http://www.argmin.net/2018/02/08/lqr/">The Linear Quadratic Regulator.</a> A quick intro to LQR and why it is a great baseline for benchmarking Reinforcement Learning.</li>
<li><a href="http://www.argmin.net/2018/02/14/rl-game/">A Game of Chance to You to Him Is One of Real Skill.</a> Laying out the rules of the RL Game and comparing to Iterative Learning Control.</li>
<li><a href="http://www.argmin.net/2018/02/20/reinforce/">The Policy of Truth.</a> Policy Gradient is a Gradient Free Optimization Method.</li>
<li><a href="http://www.argmin.net/2018/02/26/nominal/">A Model, You Know What I Mean?</a> Nominal control and the power of models.</li>
<li><a href="http://www.argmin.net/2018/03/13/pg-saga/">Updates on Policy Gradients.</a> Can we fix policy gradient with algorithmic enhancements?</li>
<li><a href="http://www.argmin.net/2018/03/20/mujocoloco/">Clues for Which I Search and Choose.</a> Simple methods solve apparently complex RL benchmarks.</li>
<li><a href="http://www.argmin.net/2018/04/19/pid/">The Best Things in Life Are Model Free.</a> PID control and its connection to optimization methods popular in machine learning.</li>
<li><a href="http://www.argmin.net/2018/04/24/ilc/">Catching Signals That Sound in the Dark.</a> PID for iterative learning control.</li>
<li><a href="http://www.argmin.net/2018/05/02/adp/">Lost Horizons.</a> Relating popular techniques from RL to methods from Model Predictive Control.</li>
<li><a href="http://www.argmin.net/2018/05/11/coarse-id-control/">Coarse-ID Control.</a> Combining high-dimensional statistics and robust optimization for the data-driven control of uncertain systems.</li>
<li><a href="http://www.argmin.net/2018/06/25/rl-tour-fin/">Towards Actionable Intelligence.</a></li>
</ol>
<p><strong>Bonus Post:</strong> <a href="http://www.argmin.net/2018/03/26/performance-profiles">Benchmarking Machine Learning with Performance Profiles</a>. The Five Percent Nation of Atari Champions.</p>
Mon, 25 Jun 2018 00:00:01 +0000
http://benjamin-recht.github.io/2018/06/25/outsider-rl/
http://benjamin-recht.github.io/2018/06/25/outsider-rl/Towards Actionable Intelligence<p>I’m going to close my outsider’s tour of Reinforcement Learning by announcing the release of a <a href="https://arxiv.org/abs/1806.09460">short survey of RL</a> that coalesces my views from the perspective of continuous control.
Though the RL and controls communities remain practically disjoint, I’ve learned from writing this series that the two have much more to learn from each other than either cares to admit. I think that some of the most pressing and exciting open problems in machine learning lie at the intersection of these fields. How do we damp dangerous feedback loops in machine learning systems? How do we build safe autonomous systems that reliably improve human conditions? How do we design systems that automatically adapt to changing environments and tasks? These are all challenges that will only be solved with novel innovations in machine learning <em>and</em> controls.</p>
<p>Perhaps the intersection of machine learning and controls needs a new name so that researchers can stop arguing about territory. I personally am fond of <em>Actionable Intelligence</em>, as it sums up not only robotics but also smarter, safer analytics. But at the end of the day, I don’t really care what we call the new area: the important part is that there is a large community spanning multiple disciplines that is invested in making progress on these problems. Hopefully this tour has set the stage for a lot of great research at the intersection of machine learning and controls, and I’m excited to see what progress the communities can make working together.</p>
<h2 id="unbounded-acknowledgements">Unbounded Acknowledgements</h2>
<p>There are countless individuals who helped to shape the contents of this blog series and the accompanying survey. I greatly appreciated the lively debates started on this blog and continued on Twitter. I hope that even those who disagree with my perspectives find their input incorporated into follow-ups here and in the survey. Indeed, though most of the material in the survey first appeared on this blog, I’ve dropped the “outsider” bit for the survey. Through writing this blog and through many lively discussions with people inside and outside RL, I feel like I finally understand the nuances of the area and the challenges the field faces moving forward.</p>
<p>I’d like to thank Chris Wiggins for sharing his taxonomy of machine learning, Roy Frostig for shaping my views on direct policy search, Pavel Pravdin for consulting on how to get policy gradient methods up and running, and Max Raginsky for perspectives on adaptive control and for translations from Russian. I’d like to thank Moritz Hardt, Eric Jonas, and Ali Rahimi for helping to shape the language, rhetoric, and focus of the blog series. I’d also like to thank Nevena Lazic, Gergely Neu, and Stephen Wright for many helpful suggestions for improving the readability and accuracy of the survey. This work was generously supported in part by two forward-looking programs at the DOD, namely the Mathematical Data Science program at ONR and the Foundations and Limits of Learning program at DARPA.</p>
<p>Additionally, I’d like to thank my other colleagues in machine learning and control for many helpful conversations and pointers: Murat Arcak, Karl Astrom, Francesco Borrelli, John Doyle, Andy Packard, Anders Rantzer, Lorenzo Rosasco, Shankar Sastry, Yoram Singer, Csaba Szepesvari, and Claire Tomlin. I’d also like to thank my colleagues in robotics, Anca Dragan, Leslie Kaebling, Sergey Levine, Pierre-Yves Oudeyer, Olivier Sigaud, Russ Tedrake, and Emo Todorov for sharing their perspectives on what sorts of RL and optimization technology works for them and what challenges they face in their research. Hopefully this survey provides a blueprint for all of these folks and more to begin further collaborations.</p>
<p>I’d like to thank everyone who took CS281B with me in the Spring of 2017, where I first tried to make sense of the problems in learning to control. And most importantly, a big thanks to everyone in my research group who has been wrestling with these ideas with me for the past several years. They have done much of the research highlighted here, have provided invaluable criticism of my writing, and have shaped my views on this space more than anyone else. In particular, Ross Boczar, Nick Boyd, Sarah Dean, Animesh Garg, Aurelia Guy, Qingqing Huang, Kevin Jamieson, Sanjay Krishnan, Laurent Lessard, Horia Mania, Nik Matni, Becca Roelofs, Ugo Rosolia, Ludwig Schmidt, Max Simchowitz, Stephen Tu, and Ashia Wilson.</p>
<p>Finally, a very special thanks to <a href="http://www.camoncoffee.de/">Camon Coffee</a> in Berlin for letting me haunt their shop while writing. Be sure to stop by next time you’re in Berlin.</p>
Mon, 25 Jun 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/06/25/rl-tour-fin/
http://benjamin-recht.github.io/2018/06/25/rl-tour-fin/Coarse-ID Control<p><em>This is the thirteenth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 14 is <a href="http://www.argmin.net/2018/06/25/rl-tour-fin">here</a>. Part 12 is <a href="http://www.argmin.net/2018/05/02/coarse-id-control/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>Can poor models be used in control loops and still achieve near-optimal performance? In recent posts, we’ve seen the answer is certainly “maybe.” <a href="http://www.argmin.net/2018/02/26/nominal">Nominal control</a> could learn a poor model of the double-integrator with 10 samples and still achieve high performance. Is this optimal for the LQR problem? Is it really just as simple as fitting parameters and treating your estimates as true?</p>
<p>The answer is not entirely clear. To see why, let’s revisit my very fake datacenter model: a three-state system where the state $x$ represents the internal temperature of the racks and the control $u$ provides local cooling of each rack. We modeled this dynamical system with a linear model</p>
<script type="math/tex; mode=display">x_{t+1} = Ax_t + Bu_t+w_t</script>
<p>where
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{bmatrix} 1.01 & 0.01 & 0\\ 0.01 & 1.01 & 0.01 \\ 0 & 0.01 & 1.01 \end{bmatrix}
\qquad \qquad B = I %]]></script>
<p>For $Q$ and $R$, I set $Q = I$ and $R= 1000 I$, modeling that the operator wanted to really reduce the electricity bill.</p>
<p>This example seems to pose a problem for nominal control: note that all of the diagonal entries of the true model are greater than $1$. If we drive the system with noise, the states will grow exponentially, and consequently, you’ll get a fire in your data center. So active cooling must certainly be applied. However, a naive least-squares solution might fit one of the diagonal entries to be less than $1$. Then, since we are placing such high cost on the controls, we might not try to cool that mode too much, and this would lead to a catastrophe.</p>
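To make the failure mode concrete, here is a small simulation sketch (the noise level and 10-sample count are my own illustrative choices, using numpy): excite the system with random inputs, then fit $(A, B)$ by least squares. With this little data, the fitted diagonal of $A$ can dip below $1$ even though the true system is unstable.

```python
import numpy as np

# True datacenter model from the post: every eigenvalue of A exceeds 1,
# so without cooling the rack temperatures grow exponentially.
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B = np.eye(3)
rho = max(abs(np.linalg.eigvals(A)))  # spectral radius, about 1.024 > 1

# Excite the system with random inputs and collect a short trajectory.
rng = np.random.default_rng(0)
T = 10
X = np.zeros((T + 1, 3))
U = rng.standard_normal((T, 3))
for t in range(T):
    X[t + 1] = A @ X[t] + B @ U[t] + 0.1 * rng.standard_normal(3)

# Naive least squares: regress x_{t+1} on [x_t, u_t]. With few samples,
# the estimated diagonal can land below 1, hiding the instability.
Z = np.hstack([X[:-1], U])
theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
A_hat, B_hat = theta[:3].T, theta[3:].T
```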
<p>So how can we include the knowledge that our model is just an estimate and not accurate with a small sample count? My group has been considering an approach to this problem called “Coarse-ID Control,” that tries to incorporate such uncertainty.</p>
<h2 id="coarse-id-ingredients">Coarse-ID Ingredients</h2>
<p>The general framework of Coarse-ID Control consists of the following three steps:</p>
<ol>
<li>Use supervised learning to learn a coarse model of the dynamical system to be controlled. I’ll refer to the system estimate as the <em>nominal system</em>.</li>
<li>Using either prior knowledge or statistical tools like the bootstrap, build probabilistic guarantees about the distance between the nominal system and the true, unknown dynamics.</li>
<li>Solve a <em>robust optimization</em> problem that optimizes control of the nominal system while penalizing signals with respect to the estimated uncertainty, ensuring stable, robust execution.</li>
</ol>
<p>This approach is an example of <em>Robust Control</em>. In robust control, we try to find a controller that works not only for one model, but all possible models in some set. In this case, as long as the true behavior lies in this set of candidate models, we’ll be guaranteed to find a performant controller. The key here is that we are using machine learning to identify not only the plant to be controlled, <em>but the uncertainty as well</em>.</p>
<p>The coarse-ID procedure is well illustrated through the case study of LQR. First, we can estimate $A$ and $B$ by exciting the system with a little random noise, measuring the outcome, and then solving a least-squares problem. We can then guarantee how accurate these estimates are <a href="https://arxiv.org/abs/1802.08334">using some heavy-duty probabilistic analysis</a>. And for those of you out there who smartly don’t trust theory bounds, you can also use a simple bootstrap approach to estimate the uncertainty set. Once we have these two estimates, we can pose a robust variant of the standard LQR optimal control problem that computes a controller that stabilizes all of the models that would be consistent with the data we’ve observed.</p>
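The first two ingredients can be sketched in a few lines. This is a toy rendition, not our actual experimental code; the noise scale, rollout length, and bootstrap count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, T = 3, 3, 60  # state dim, input dim, rollout length (illustrative)
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B = np.eye(3)

def rollout(A_sys, B_sys):
    """Drive a system with white-noise inputs and record the trajectory."""
    X = np.zeros((T + 1, d))
    U = rng.standard_normal((T, p))
    for t in range(T):
        X[t + 1] = A_sys @ X[t] + B_sys @ U[t] + 0.1 * rng.standard_normal(d)
    return X, U

def least_squares(X, U):
    """Step 1: fit the nominal system [A_hat, B_hat] from one trajectory."""
    Z = np.hstack([X[:-1], U])
    theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
    return theta[:d].T, theta[d:].T

X, U = rollout(A, B)
A_hat, B_hat = least_squares(X, U)

# Step 2: bootstrap. Resimulate from the *estimated* model, refit, and use
# the spread of the refits as a proxy for the true error ||A_hat - A||.
boot_errors = []
for _ in range(100):
    Xb, Ub = rollout(A_hat, B_hat)
    A_b, _ = least_squares(Xb, Ub)
    boot_errors.append(np.linalg.norm(A_b - A_hat, 2))
eps_A = np.quantile(boot_errors, 0.95)  # uncertainty radius fed to step 3
```

Step 3, the robust synthesis itself, requires the semidefinite program described in the paper and is omitted here.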
<p>Putting all these pieces together, and leveraging some new results in control theory, my students Sarah Dean, Horia Mania, and Stephen Tu, post-doc Nik Matni, and I were able to combine this into the first <a href="https://arxiv.org/abs/1710.01688">end-to-end guarantee for LQR</a>. We derived non-asymptotic bounds that guaranteed finite performance on the infinite time horizon, and were able to quantitatively bound the gap between our solution and the best controller you could design if you knew the model exactly.</p>
<p>To be a bit more precise, suppose that we have state dimension $d$ and $p$ control inputs. Our analysis guarantees that after $O(d+p)$ iterations, we can design a controller that will have low cost on the infinite time horizon. That is, we can guarantee that we stabilize the system (we won’t cause fires) after seeing only a finite amount of data.</p>
<h2 id="proof-is-in-the-pudding">Proof is in the pudding</h2>
<p>Let’s return to the data center problem to see how this does on real data and not just in theory. To solve the robust LQR problem, we end up solving a small semidefinite programming problem as <a href="https://arxiv.org/abs/1710.01688">described in our paper</a>. Though I know that most people are scared to run SDPs, for the size of the problems we consider, these are solved on my laptop in well under a second.</p>
<p>In the plots below we compare nominal control to two versions of the robust LQR problem. The blue line denotes performance when we tell the robust optimization solver what the actual distance is from the nominal model to the true model. The green curve depicts what happens when we estimate this difference between the models using a bootstrap simulation. Note that the green curve is worse, but not that much worse:</p>
<p class="center"><img src="/assets/rl/coarse-id/datacenter_cost_inf_600_iter.png" alt="controller performance" width="250px" />
<img src="/assets/rl/coarse-id/datacenter_stabilizing_600_iter.png" alt="stabilizing" width="250px" /></p>
<p>Note also that the nominal controller frequently fails to stabilize the true system. The robust optimization really helps here to provide controllers that are guaranteed to find a stabilizing solution. On the other hand, in industrial practice nominal control does seem to work quite well. I think a great open problem is to find reasonable assumptions under which the nominal controller is stabilizing. This will involve some hairy perturbation analysis of Riccati equations, but it would really help to fill out the picture of when such methods are safely applicable.</p>
<p>And of course, let’s not leave out model-free RL approaches:</p>
<p class="center"><img src="/assets/rl/coarse-id/datacenter_cost_inf_5000_iter.png" alt="controller performance zoom out" width="220px" />
<img src="/assets/rl/coarse-id/datacenter_stabilizing_5000_iter.png" alt="stabilizing zoom out" width="220px" />
<img src="/assets/rl/coarse-id/legend.png" alt="legend" width="110px" /></p>
<p>Here we again see that they lag far behind their model-based counterparts. The x-axis has increased by a factor of 10, and yet even the approximate dynamic programming approach, LSPI, is not finding decent solutions. It’s worth remembering that not only are model-free methods sample hungry, but they also fail to be safe. And safety is much more critical than sample complexity.</p>
<h2 id="pushing-against-the-boundaries">Pushing against the boundaries</h2>
<p>Since Coarse-ID Control works so well on LQR, I think it’s going to be very interesting to try to push its limits. I’d like to understand how this works on <em>nonlinear</em> problems. Can we propagate parametric uncertainties into control guarantees? Can we model nonlinear problems with linear models and estimate the nonlinear uncertainties? There are a lot of great open problems following up this initial work, and I want to expand on the big set of unsolved problems in the next post.</p>
Fri, 11 May 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/05/11/coarse-id-control/
http://benjamin-recht.github.io/2018/05/11/coarse-id-control/Lost Horizons<p><em>This is the twelfth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 13 is <a href="http://www.argmin.net/2018/05/11/coarse-id-control/">here</a>. Part 11 is <a href="http://www.argmin.net/2018/04/24/ilc/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>This series began by describing a view of reinforcement learning as optimal control with unknown costs and state transitions. In the case where everything is known, we know that dynamic programming generically provides an optimal solution. However, when the models and costs are unknown, or when the full dynamic program is intractable, we must rely on approximation techniques to solve RL problems.</p>
<p>How you approximate the dynamic program is, of course, the hard part. Bertsekas recently released a revised version of his seminal book on <a href="http://web.mit.edu/dimitrib/www/dpchapter.html">dynamic programming and optimal control</a>, and Chapter 6 of Volume 2 has a comprehensive survey of data-driven methods to approximate dynamic programming. Though I don’t want to repeat everything Bertsekas covers here, I think describing his view of the problem builds a clean connection to receding horizon control, and bridges the complementary perspectives of classical controls and contemporary reinforcement learning.</p>
<h2 id="approximate-dynamic-programming">Approximate Dynamic Programming</h2>
<p>While I don’t want to belabor a full introduction to dynamic programming, let me try, in as short a space as possible, to review the basics.</p>
<p>Let’s return to our classic optimal control problem:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{maximize}_{u_t} & \mathbb{E}_{e_t}[ \sum_{t=0}^N R[x_t,u_t] ]\\
\mbox{subject to} & x_{t+1} = f(x_t, u_t, e_t)\\
& \mbox{($x_0$ given).}
\end{array} %]]></script>
<p>Though we can solve this directly on finite time horizons using some sort of batch solver, there is often a simpler strategy based on <em>dynamic programming</em> and the <em>principle of optimality</em>: if you’ve found an optimal control policy for a time horizon of length $N$, $\pi_1,\ldots, \pi_N$, and you want to know the optimal strategy starting at state $x$ at time $t$, then you just have to take the optimal policy starting at time $t$, $\pi_t,\ldots,\pi_N$. Dynamic programming then lets us recursively find a control policy by starting at the final time and recursively solving for policies at earlier times.</p>
<p>On the infinite time horizon, letting $N$ go to infinity, we get a clean statement of the principle of optimality. If we define $V(x)$ to be the value obtained from solving the optimal control problem with initial condition $x$, then we have</p>
<script type="math/tex; mode=display">V(x) = \max_u \mathbb{E}_{e}\left[R[x,u] + V(f(x,u,e))\right]\,.</script>
<p>This equation, known as Bellman’s equation, is almost obvious given the structure of the optimal control problem. But it defines a powerful recursive formula for $V$ and forms the basis for many important algorithms in dynamic programming. Also note that if we have a convenient way to optimize the right hand side of this expression, then we can find the optimal action by finding the $u$ that maximizes the right hand side.</p>
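For intuition, here is value iteration on a tiny, invented MDP: it repeatedly applies the Bellman operator until the value function stops changing. The transition probabilities and rewards are made up for illustration, and I've added a discount factor $\gamma$ (which the undiscounted formulation above elides) so that the fixed point is well defined.

```python
import numpy as np

# A tiny, invented 2-state / 2-action MDP. P[a, s, s'] is the probability
# of moving from s to s' under action a; R[s, a] is the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9  # discount factor so the infinite-horizon value is finite

# Value iteration: repeatedly apply the Bellman operator
#     V(x) <- max_u E[ R[x, u] + gamma * V(x') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V).T  # Q[s, a]; (P @ V)[a, s] = E[V(x') | s, a]
    V_new = Q.max(axis=1)      # maximize the right hand side over actions
    if np.max(np.abs(V_new - V)) < 1e-12:
        V = V_new
        break
    V = V_new

# Once V has converged, the optimal action is the argmax in Bellman's equation.
pi = Q.argmax(axis=1)
```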
<p>Classic reinforcement learning algorithms like TD and Q-learning take the Bellman equation as a starting point, and try to iteratively solve for the value function using data. These ideas also form the underpinnings of now-popular methods like DQN. I’d again highly recommend Bertsekas’ survey describing the many different approaches one can take to approximately solve this Bellman equation. Rather than covering this, I’d like to use this as jumping off point to compare this viewpoint to that of receding horizon control.</p>
<h2 id="receding-horizon-control">Receding Horizon Control</h2>
<p>As we discussed in the previous posts, 95% of controllers are PID controllers. Of the remaining 5%, 95% of those are probably based on receding horizon control (RHC). RHC, also known as <em>model predictive control</em> (MPC), is an incredibly powerful approach to controls that marries simulation and feedback.</p>
<p>In RHC an agent makes a plan based on a simulation from the present until a short time into the future. The agent then executes one step of this plan, and then, based on what it observes after taking this action, returns to short-time simulation to plan the next action. This feedback loop allows the agent to link the actual impact of its choice of action with what was simulated, and hence can correct for model mismatch, noise realizations, and other unexpected errors.</p>
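This loop can be sketched in a few lines. The sketch below is a cartoon, not a serious MPC solver: the "planner" is a brute-force grid search over constant inputs on a made-up scalar system, and the planning model is deliberately wrong, yet the replanning feedback still regulates the true system.

```python
import numpy as np

# A made-up scalar example: the planner's model misses a 5% instability,
# but replanning at every step corrects for the mismatch.
def f_true(x, u):
    return 1.05 * x + u  # the real system, mildly unstable

def f_model(x, u):
    return 1.00 * x + u  # the (deliberately wrong) model used to plan

def plan_first_action(x0, N=10, grid=np.linspace(-3, 3, 601)):
    """Crude stand-in for a trajectory optimizer: search for the constant
    input minimizing the modeled cost sum(x_t^2 + u^2) over N steps."""
    best_u, best_cost = 0.0, np.inf
    for u in grid:
        x, cost = x0, 0.0
        for _ in range(N):
            cost += x ** 2 + u ** 2
            x = f_model(x, u)
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u

# The receding-horizon loop: plan, execute ONE step on the true system,
# observe the resulting state, and replan from there.
x, traj = 5.0, [5.0]
for _ in range(30):
    u = plan_first_action(x)
    x = f_true(x, u)  # feedback: the next plan starts at the real state
    traj.append(x)
```

Despite the model mismatch, the state shrinks toward zero; an open-loop plan computed once from the wrong model would not achieve this.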
<p>Though I have heard MPC referred to as “classical control” whereas techniques like LSTD and Q-learning are more in the camp of “postmodern reinforcement learning,” I’d like to argue that these are just different variants of approximate dynamic programming.</p>
<p>Note that a perfectly valid expression for the value function $V(x_0)$ is the maximal value of the optimization problem</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\max_{u_t} & \mathbb{E}_{e_t}[ \sum_{t=0}^N R[x_t,u_t] + V(x_{N+1})]\\
\mbox{subject to} & x_{t+1} = f(x_t, u_t, e_t)\\
& \mbox{($x_0$ given).}
\end{array} %]]></script>
<p>Here we have just unrolled the cost beyond one step, but still collect the cost-to-go $N$ steps in the future. Though this is trivial, it is again incredibly powerful: the longer we make the time horizon, the less we have to worry about the value function $V$ being accurate. Of course, now we have to worry about the accuracy of the state-transition map, $f$. But, especially in problems with continuous variables, it is not at all obvious which accuracy is more important in terms of finding algorithms with fast learning rates and short computation times. There is a tradeoff between learning models and learning value functions, and this is a tradeoff that needs to be better understood.</p>
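For LQR, this unrolled problem can be solved exactly by a backward Riccati recursion, which makes the horizon/terminal-cost tradeoff easy to see. In the sketch below (reusing the three-state model and costs from earlier in this series), the terminal value function is crudely set to zero, and lengthening the horizon washes out its influence: the first-step feedback gains converge.

```python
import numpy as np

# The three-state model used elsewhere in this series, with the same costs.
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B = np.eye(3)
Q, R = np.eye(3), 1000 * np.eye(3)

def first_gain(N, P_terminal):
    """Backward Riccati recursion over a horizon of N steps, starting from
    the terminal value function x' P_terminal x; returns the feedback gain
    applied at the first step (u = -K x)."""
    P = P_terminal
    for _ in range(N):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Even with a crude terminal cost (here simply zero), lengthening the
# horizon washes out its influence: the first-step gains converge.
K_200 = first_gain(200, np.zeros((3, 3)))
K_400 = first_gain(400, np.zeros((3, 3)))
gap = np.linalg.norm(K_400 - K_200)  # small: terminal cost barely matters
```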
<p>Though RHC methods appear fragile to model mismatch, because they are only as good as the model, the repeated feedback inside RHC can correct for many modeling errors. As an example, it’s very much worth revisiting the robotic locomotion tasks inside the MuJoCo framework. These tasks actually were designed to test the power of a nonlinear RHC algorithm developed by <a href="https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf">Tassa, Erez, and Todorov</a>.</p>
<p>Here’s a video of such a controller in action from 2012:</p>
<div style="text-align: center">
<iframe width="315" height="315" src="https://homes.cs.washington.edu/~todorov/media/TassaIROS12.mp4" frameborder="0" allowfullscreen=""></iframe></div>
<p>Fast forward to 2:50 to see the humanoid model we discussed in the <a href="http://www.argmin.net/2018/03/20/mujocoloco">random search post</a>. Note that the controller works to keep the robot upright, even when the model is poorly specified. Hence, the feedback inside the RHC loop is providing a considerable amount of robustness to modeling errors. Also note that this demo does not estimate the value function at all. Instead, they simply truncate the infinite time-horizon problem. The receding horizon approximation is already quite good for the purpose of control.</p>
<p>Moreover, the video linked above solves for the controller at 7x real time, which is really not bad for 2012. With a dedicated engineer and up-to-date hardware, it could probably be made to run in real time. However, note that in 2013, the same research group published a <a href="https://homes.cs.washington.edu/~todorov/papers/ErezHumanoids13.pdf">cruder version of their controller that they used during the DARPA robotics challenge</a>. The video here is just as impressive:</p>
<div style="text-align: center">
<iframe width="420" height="315" src="https://homes.cs.washington.edu/~todorov/media/ErezHumanoids13.mp4" frameborder="0" allowfullscreen=""></iframe></div>
<p>All these behaviors were generated by MPC in real time. The walking is not as good as what can be obtained from computationally intensive long-horizon trajectory optimization, but it looks considerably better than the sort of direct policy search gaits <a href="http://www.argmin.net/2018/03/20/mujocoloco">we discussed in a previous post</a>.</p>
<h2 id="learning-in-rhc">Learning in RHC</h2>
<p>Is there a middle ground between expensive offline trajectory optimization and real time model-predictive control? I think the answer is yes in the very same way that there is middle ground between learning dynamical models and learning value functions. Performance of a receding control system can be improved by better modeling of the value function which defines the terminal cost. The better a model you make of the value function, the shorter a time horizon you need for simulation, and the closer you get to real-time operation. Of course, if you had a perfect model of the value function, you could just solve the Bellman equation and you would have the optimal control policy. But by having an approximation to the value function, high performance can still be extracted in real-time.</p>
<p>So what if we <em>learn</em> to iteratively improve the value function while running RHC? This idea has been explored in a project by my Berkeley colleagues <a href="https://arxiv.org/abs/1610.06534">Rosolia, Carvalho, and Borrelli</a>. In their “Learning MPC” approach, the terminal cost is learned by nearest neighbors. The terminal cost of a state is the value obtained last time you tried that state. If you haven’t visited that state, the cost is infinite. This formulation constrains the terminal condition to be in a state observed before. You can explore new ways to decrease your cost on the finite time horizon as long as you reach a state that you have already demonstrated is safe.</p>
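Here is a cartoon of the terminal-cost construction (my own toy rendition, not the actual algorithm or code from Rosolia et al.): store every state visited on prior trajectories together with its realized cost-to-go, and evaluate the terminal cost of a candidate state by nearest neighbors, with infinite cost far from the safe set.

```python
import numpy as np

# The "safe set": states visited on prior trajectories, each tagged with
# the cost-to-go actually realized from that state.
safe_states, safe_costs = [], []

def record_trajectory(states, stage_costs):
    """After a completed run, store each visited state with its realized
    cost-to-go (the suffix sum of the stage costs)."""
    cost_to_go = np.cumsum(stage_costs[::-1])[::-1]
    for s, c in zip(states, cost_to_go):
        safe_states.append(np.asarray(s, dtype=float))
        safe_costs.append(float(c))

def terminal_cost(x, tol=0.5):
    """Nearest-neighbor value estimate: the recorded cost-to-go at the
    closest previously visited state, or infinity when nothing is within
    tol, which forces plans to end in demonstrated-safe territory."""
    if not safe_states:
        return np.inf
    dists = [np.linalg.norm(x - s) for s in safe_states]
    i = int(np.argmin(dists))
    return safe_costs[i] if dists[i] <= tol else np.inf

# Seed the safe set with one hypothetical demonstrated trajectory:
# states [2], [1], [0] incurring stage costs 4, 1, 0.
record_trajectory([[2.0], [1.0], [0.0]], [4.0, 1.0, 0.0])
```

A planner using `terminal_cost` may only end its finite-horizon plans near states it has already seen, and each completed run enlarges the safe set.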
<p>This nearest-neighbors approach to control works really well in practice. Here’s a demo of the method on an RC car:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/4kHDv9senpE" frameborder="0" allowfullscreen="" class="center"></iframe>
<p>After only a few laps, the learned controller works better than a human operator. Simple nearest-neighbors suffices to learn rather complex autonomous actions. And, if you’re into that sort of thing, you can even prove monotonic increase in control performance. Quantifying the actual learning rate remains open and would be a great problem for RL theorists out there to study. But I think this example cleanly shows how the gap between RHC methods and Q-learning methods is much smaller than it first appears.</p>
<h2 id="safety-while-learning">Safety While Learning</h2>
<p>Another reason to like this blended RHC approach to learning to control is that one can hard-code constraints on controls and states, and easily incorporate models of disturbances directly into the optimization problem. Some of the most challenging problems in control concern how to execute safely while continuing to learn more about a system’s capability, and an RHC approach provides a direct route toward balancing safety and performance. <a href="http://www.argmin.net/2018/05/11/coarse-id-control/">In the next post</a>, I’ll describe an optimization-based approach to directly estimate and incorporate modeling errors into control design.</p>
Wed, 02 May 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/05/02/adp/
http://benjamin-recht.github.io/2018/05/02/adp/Catching Signals That Sound in the Dark<p><em>This is the eleventh part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 12 is <a href="http://www.argmin.net/2018/05/02/adp/">here</a>. Part 10 is <a href="http://www.argmin.net/2018/04/19/pid/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>The essence of reinforcement learning is using past data to enhance the future manipulation of a system that dynamically evolves over time. The most common practice of reinforcement learning follows the <em>episodic</em> model: a set of actions is proposed and tested on a system, a series of rewards and states is observed, and the previous action, reward, and state data are combined to improve the action policy. This is a rich and complex model for interacting with a system, and it brings considerably more complexity than standard stochastic optimization settings. What’s the right way to use all of the data that’s collected in order to improve future performance?</p>
<p>Methods like policy gradient, random search, nominal control, and Q-learning each transform the reinforcement learning problem into a specific oracle model and then derive their analyses using this model. In policy gradient and random search, we transform the problem into a zeroth-order optimization problem and use this formulation to improve the cost. Nominal control turns the problem into a model estimation problem. But are any of these methods more efficient than the others at extracting information from each sample?</p>
<p>In this post, I’m going to describe an iterative learning control (ILC) scheme that uses past data in an interesting way. And its roots go back to the simple PID controller we discussed in the last post.</p>
<h2 id="pid-control-for-iterative-learning-control">PID control for iterative learning control</h2>
<p>Consider the problem of getting a dynamical system to track a fixed time series. That is, we’d like to construct some control input $\mathbf{u} = [u_1,\ldots,u_N]$ so that the output of the system is as close to $\mathbf{v} = [v_1,\ldots,v_N]$ as possible (I’ll use bold letters to describe sequences). Here’s an approach that looks a lot like reinforcement learning: let’s feed back the error in our tracker to build the next control. We can define the error signal to be the difference $\mathbf{e} = [v_1-y_1, \ldots,v_N-y_N]$. Then let’s denote the discrete integral (cumulative sum) of $\mathbf{e}$ as $\mathcal{S} \mathbf{e}$. And let’s denote the discrete derivative as $\mathcal{D}\mathbf{e}$. Then we can define a PID controller over trajectories as</p>
<script type="math/tex; mode=display">\mathbf{u}_{\mathrm{new}} =
\mathbf{u}_{\mathrm{old}} + k_P \mathbf{e} + k_I \mathcal{S} \mathbf{e} + k_D \mathcal{D} \mathbf{e}\,.</script>
<p>Note that these derivatives and integrals are computed on the sequence $\mathbf{e}$, but are not a function of older iterations. In this sense, this particular scheme for ILC is different from classical PID, but it is building upon the same primitives.</p>
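<p>In code, the trajectory-level update is just a couple of lines of numpy. This is a sketch; in particular, the <code>prepend=0.0</code> convention for the discrete derivative is one of several reasonable choices.</p>

```python
import numpy as np

def ilc_pid_update(u_old, e, k_P=0.0, k_I=0.0, k_D=0.1):
    """One PID-type ILC update over a whole trajectory:
    u_new = u_old + k_P e + k_I (S e) + k_D (D e)."""
    Se = np.cumsum(e)             # discrete integral of the error sequence
    De = np.diff(e, prepend=0.0)  # discrete derivative of the error sequence
    return u_old + k_P * e + k_I * Se + k_D * De
```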
<p>This scheme is what most controls researchers think of when they hear the term “iterative learning control.” I like to take a more encompassing view of ILC, <a href="http://www.argmin.net/2018/02/14/rl-game">as I described in a previous post</a>: ILC is any control design scheme where a controller is improved by repeating a task multiple times, and using previous repetitions to improve control performance. In that sense, ILC and episodic reinforcement learning are two different terms for the same problem. But the most classical example of this scheme in controls is the PID-type method I described above.</p>
<p>Note that this is using a ton of information about the previous trajectory to shape the next trajectory. Even though I am designing an open loop policy, I am using far more than reward information alone in constructing the policy.</p>
<p>How well does this work? Let’s use the simple quadrotor model we’ve been using, this time with some added friction to make it a bit more realistic. So the true dynamics will be two independent systems of the form</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
x_{t+1} &= Ax_t + Bu_t\\
y_t &= Cx_t
\end{aligned} %]]></script>
<p>with</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{bmatrix}
1 & 1 \\ 0 & 0.9
\end{bmatrix}\,,~~ B=\begin{bmatrix} 0\\1\end{bmatrix}\,,~~\mbox{and}~~C=\begin{bmatrix} 1 & 0 \end{bmatrix} %]]></script>
<p>Let’s get this system to track a trajectory <em>without using the model</em>. That is, let’s use iterative learning control to learn to track some curve in space without ever knowing what the true model of the system is. To get a target trajectory, I made the following path with my mouse:</p>
<p class="center"><img src="/assets/rl/ilc/target.png" alt="target trajectory" width="240px" /></p>
<p>For ILC, let’s use the PID controller setup above. I’m actually only going to use the derivative term, setting $k_D = 0.1$ and the rest of the terms to $0$. Then I get the following performance for the first 8 iterations.</p>
<p class="center"><img src="/assets/rl/ilc/8_iter.png" alt="8 iterations" width="560" /></p>
<p>And this is what the trajectory looks like after 20 repetitions:</p>
<p class="center"><img src="/assets/rl/ilc/20_iter.png" alt="20 iterations" width="240px" /></p>
<p>Not bad! Using all of the state information, this converges to a control policy in very few iterations, and it never posits a model. Again, the update is the “D”-control update above, and it never uses any knowledge of the true dynamics that govern the system. Amazingly, there is no need for 100K episodes to get this completely model-free method to converge to a quality solution. For the curious, <a href="https://nbviewer.jupyter.org/url/argmin.net/code/ILC_tracker.ipynb">here’s the code to generate these plots in a Python notebook</a>.</p>
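<p>Here is a self-contained sketch in the same spirit as that notebook, for one of the two axes of the model above. The sine-wave target and the horizon length are stand-ins for the mouse-drawn path; note that the ILC update never looks at the matrices that define the dynamics.</p>

```python
import numpy as np

# true dynamics from the post (one of the two independent axes);
# the ILC update below never uses A, B, or C
A = np.array([[1.0, 1.0], [0.0, 0.9]])
B = np.array([0.0, 1.0])
C = np.array([1.0, 0.0])

def rollout(u):
    """Simulate y_t = C x_t with x_{t+1} = A x_t + B u_t, starting at x_0 = 0."""
    x = np.zeros(2)
    y = np.zeros(len(u))
    for t, ut in enumerate(u):
        y[t] = C @ x
        x = A @ x + B * ut
    return y

N, k_D = 200, 0.1
v = np.sin(np.linspace(0.0, 2.0 * np.pi, N))   # stand-in target trajectory
u = np.zeros(N)
errors = []
for _ in range(20):
    e = v - rollout(u)                      # tracking error on this repetition
    errors.append(np.linalg.norm(e))
    u = u + k_D * np.diff(e, prepend=0.0)   # "D"-only ILC update, k_D = 0.1
```

After 20 repetitions the tracking error has collapsed by orders of magnitude, with a small residual at the very start of the trajectory because the output cannot respond to the input instantaneously.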
<h2 id="stochastic-approximation-in-sheeps-clothing">Stochastic approximation in sheep’s clothing</h2>
<p>Why does this work? In this case, because everything is linear, we can actually analyze the ILC scheme in a simple way. Note that because the dynamics are linear, there is some matrix $\mathcal{F}$ that takes the input and produces the output. That’s what “linear” dynamics means, right?</p>
<p>Also, note that both $\mathcal{S}$ and $\mathcal{D}$ are linear maps, so we can think of them as matrices as well. So suppose we knew in advance the optimal control input $\mathbf{u}_\star$ such that $\mathbf{v}=\mathcal{F} \mathbf{u}_\star$. Then, with a little bit of algebra, we can rewrite the PID iteration as</p>
<script type="math/tex; mode=display">\mathbf{u}_{\mathrm{new}} -\mathbf{u}_\star= \left\{I +(k_P I + k_I \mathcal{S} + k_D \mathcal{D}) \mathcal{F}\right\} (\mathbf{u}_{\mathrm{old}} -\mathbf{u}_\star)\,.</script>
<p>If the matrix in curly brackets has eigenvalues with magnitude less than $1$, then this iteration converges linearly to the optimal control input. Indeed, with the choice of parameters I used in my examples, I actually made the update map into a contraction mapping, and this explains why the performance looks so good after 8 iterations.</p>
<p>This is a cute instance of <em>stochastic approximation</em> that does not arise from following the gradient of any cost function: we are trying to find a solution of the equation $\mathbf{v} = \mathcal{F} \mathbf{u}$, and our iterative algorithm for doing so uses the classic <a href="https://en.wikipedia.org/wiki/Stochastic_approximation">Robbins-Monro method</a>. But it has a very different flavor than what we typically encounter in stochastic gradients. For the experts out there, the matrix $\mathcal{F}$ is lower triangular, and hence is never positive definite.</p>
<p>I actually think there are a lot of great questions to answer even for this simple linear case: Which dynamics admit efficient ILC schemes? How robust is this method to noise? Can we use this method to solve problems more complex than trajectory tracking? This example also shows that there are lots of ways to use your data in reinforcement learning, and there are far more options out there than might first appear.</p>
Tue, 24 Apr 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/04/24/ilc/
http://benjamin-recht.github.io/2018/04/24/ilc/The Best Things in Life Are Model Free<p><em>This is the tenth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 11 is <a href="http://www.argmin.net/2018/04/24/ilc/">here</a>. Part 9 is <a href="http://www.argmin.net/2018/03/20/mujocoloco/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>Though I’ve spent the last few posts casting shade at model-free methods for reinforcement learning, I am not blindly against the model-free paradigm. In fact, the most popular methods in core control systems are model free! The most ubiquitous control scheme out there is PID control, and PID has only three parameters. I’d like to use this post to briefly describe PID control, explain how it is closely connected to many of the most popular methods in machine learning, and then turn to explain what PID brings to the table over the model-free methods that drive contemporary RL research.</p>
<h2 id="pid-in-a-nutshell">PID in a nutshell</h2>
<p>PID stands for “proportional integral derivative” control. The idea behind PID control is pretty simple: suppose you have some dynamical system with a single input that produces a single output. In controls, we call the system we’d like to control <em>the plant</em>, a term that comes from chemical process engineering. Let’s say you’d like the output of your plant to read some constant value $y_t = v$. For instance, you’d like to keep the water temperature in your espresso machine at precisely <a href="http://espressovivace.com/education/espresso-tips/">203 degrees Fahrenheit</a>, but you don’t have a precise differential equation modeling your entire kitchen. PID control works by creating a control signal based on the error $e_t=v-y_t$. As the name implies, the control signal is a combination of error, its derivative, and its integral:</p>
<script type="math/tex; mode=display">u_t = k_P e_t + k_I \int_0^t e_s ds + k_D \dot{e}_t\,.</script>
<p>I’ve heard differing accounts, but somewhere in the neighborhood of <a href="https://pdfs.semanticscholar.org/5d1a/2f4b06bc4e5714be1948099c2cb7b3236d42.pdf#page=177">95 percent</a> of all control systems are PID. And some suggest that the number of people using the “D” term is negligible. Something like 95 percent of the myriad collection of control processes that keep our modern society running are configured by setting <em>two</em> parameters. This includes those <a href="https://home.lamarzoccousa.com/history-of-the-pid/">third wave espresso machines</a> that fuel so much great research.</p>
<p class="center"><img src="/assets/rl/pid/silvia-pid.jpg" alt="get that temp stable" height="240px" />
<img src="/assets/rl/pid/PIDGraph.png" alt="oscillating" height="240px" /></p>
<p>In some sense, PID control is the “gradient descent” of control: it solves most problems and fancier methods are only needed for special cases. The odd thing about statistics and ML research these days is that everyone knows about gradient descent, but almost none of the ML researchers I’ve spoken to know anything about PID control. So perhaps to explain the ubiquity of PID control to the ML crowd, it might be useful to establish some connections to gradient descent.</p>
<h2 id="pid-in-discrete-time">PID in discrete time</h2>
<p>Before we proceed, let’s first make the PID controller digital. We all design our controllers in discrete time rather than continuous time since we do things on computers. How can we discretize the PID controller? First, we can compute the integral term with a running sum:</p>
<script type="math/tex; mode=display">w_{t+1} = w_t + e_t</script>
<p>If $w_0=0$, then $w_t$ is the sum of the errors $e_s$ for $s<t$.</p>
<p>The derivative term can be approximated by finite differences. But since taking derivatives can amplify noise, most practical PID controllers actually filter the derivative term to damp this noise. A simple way to filter the noise is to let the derivative term be a running average:</p>
<script type="math/tex; mode=display">v_{t} = \beta v_{t-1} + (1-\beta)(e_t-e_{t-1})\,.</script>
<p>Putting everything together, a PID controller in discrete time will take the form</p>
<script type="math/tex; mode=display">u_t = k_P e_t + k_I w_t + k_D v_t</script>
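<p>Putting the three discrete-time pieces together in code (a sketch; the gains and the filter coefficient $\beta$ below are illustrative choices, not recommendations):</p>

```python
class DiscretePID:
    """Discrete-time PID with a running-sum integral and a filtered derivative,
    following the update rules above."""

    def __init__(self, k_P, k_I, k_D, beta=0.7):
        self.k_P, self.k_I, self.k_D, self.beta = k_P, k_I, k_D, beta
        self.w = 0.0       # running sum of the error (integral term)
        self.v = 0.0       # filtered difference of the error (derivative term)
        self.e_prev = 0.0

    def step(self, e):
        self.v = self.beta * self.v + (1.0 - self.beta) * (e - self.e_prev)
        u = self.k_P * e + self.k_I * self.w + self.k_D * self.v
        self.w += e        # w_{t+1} = w_t + e_t
        self.e_prev = e
        return u

# close the loop on a made-up first-order plant x' = 0.9 x + 0.1 u
pid = DiscretePID(k_P=1.0, k_I=0.1, k_D=0.0)
x = 0.0
for _ in range(400):
    u = pid.step(1.0 - x)  # error relative to the setpoint v = 1
    x = 0.9 * x + 0.1 * u
```

With only the P and I terms active, the loop settles the plant at the setpoint.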
<h2 id="integral-control">Integral Control</h2>
<p>Let’s now look at pure integral control. We can simplify the controller in this case to one simple update formula:</p>
<script type="math/tex; mode=display">u_t = u_{t-1}+k_I e_t\,.</script>
<p>This should look very familiar to all of my ML friends out there as it looks an <em>awful lot</em> like gradient descent. To make the connection crisp, suppose that the plant we’re trying to control takes an input $u$ and then spits out the output $y= f’(u)$ for some fixed function $f$. If we want to drive $y_t$ to zero, then the error signal $e$ takes the form $e = -f’(u)$. With this model of the plant, integral control <em>is</em> gradient descent. Just like in gradient descent, integral control can never give you the wrong answer. If you converge to a constant value of the control parameter, then the error must be zero.</p>
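<p>To see the correspondence numerically, here is a toy example with a made-up objective $f(u) = (u-3)^2$, so the plant outputs $f’(u)$ and the error is $e = -f’(u)$:</p>

```python
def f_prime(u):
    # gradient of the toy objective f(u) = (u - 3)**2
    return 2.0 * (u - 3.0)

u, k_I = 0.0, 0.1
for _ in range(100):
    e = -f_prime(u)  # the plant outputs f'(u), and we want to drive it to zero
    u = u + k_I * e  # pure integral control = gradient descent with step size k_I
```

The iterate converges to the minimizer $u = 3$, where the error (the gradient) vanishes.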
<h2 id="proportional-integral-control">Proportional Integral Control</h2>
<p>As discussed above, PI control is the most ubiquitous form of control. In optimization it is less common, but it still yields a valid algorithm when $e = -f’(u)$.</p>
<p>Doing a variable substitution, the PI controller will take the form</p>
<script type="math/tex; mode=display">u_{t+1} = u_t + (k_I-k_P) e_t + k_P e_{t+1}</script>
<p>If $e_t = -f’(u_t)$, then we get the algorithm:</p>
<script type="math/tex; mode=display">u_{t+1} + k_P f'(u_{t+1}) = u_t - (k_I-k_P) f'(u_t)</script>
<p>This looks a bit tricky, as we seemingly need the gradient of $f$ at the very iterate we are trying to compute. However, optimization friends out there will note that this equation gives the optimality conditions for the algorithm</p>
<script type="math/tex; mode=display">u_{t+1} = \mathrm{prox}_{k_P f} ( u_t - (k_I-k_P) f'(u_t) )\,.</script>
<p>Hence, PI control combines a gradient step with a proximal step. The algorithm is a hybrid between the classical proximal point method and gradient descent. Note that if this method converges, it will again converge to a point where $f’(u)=0$.</p>
<h2 id="proportional-integral-derivative-control">Proportional Integral Derivative Control</h2>
<p>The master algorithm is PID. What happens here? Allow me to do a clever change of variables that <a href="http://www.laurentlessard.com/">Laurent Lessard</a> showed to me. Define the auxiliary variable</p>
<script type="math/tex; mode=display">x_t = \frac{1}{1-\beta}w_t+\frac{\beta}{(1-\beta)^3}v_t-\frac{\beta}{(1-\beta)^2}e_t\,.</script>
<p>In terms of this new hidden state, $x_t$, the PID controller reduces to the tidy set of equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
x_{t+1} &= (1+\beta)x_t -\beta x_{t-1} + e_t\\
u_t &= C_1 x_t + C_2 x_{t-1}+ C_3 e_t\,,
\end{aligned} %]]></script>
<p>and the coefficients $C_i$ are given by the formulae:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
C_1 &= -(1-\beta)^2 k_D+k_I\\
C_2 &= (1-\beta)^2 k_D-\beta k_I\\
C_3 &= k_P + (1-\beta) k_D
\end{aligned} %]]></script>
<p>The $x_{t}$ sequence looks like a <em>momentum</em> sequence used in optimization. Indeed, with proper settings of the gains, we can recover a variety of algorithms that we commonly use in machine learning. <em>Gradient descent with momentum</em> with learning rate $\alpha$ (also known as the <em>Heavy Ball method</em>) is realized with the settings</p>
<script type="math/tex; mode=display">k_I = \frac{\alpha}{1-\beta}\,, ~~~~
k_D=\frac{\alpha \beta}{(1-\beta)^3} \,, ~~~~
k_P = \frac{-\alpha \beta}{(1-\beta)^2}</script>
<p>Nesterov’s accelerated method pops out when we set</p>
<script type="math/tex; mode=display">k_I = \frac{\alpha}{1-\beta} ~~~~ k_D=\frac{\alpha \beta^2}{(1-\beta)^3}~~~~ k_P = \frac{-\alpha \beta^2}{(1-\beta)^2}</script>
<p>These are remarkably similar, differing only in the power of $\beta$ in the numerator of the proportional and derivative terms.</p>
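<p>Written out as plain iterations (standard textbook forms with toy parameter choices, rather than via the PID gains above), the two methods differ only in where the gradient is evaluated:</p>

```python
def grad(u):
    # gradient of the toy quadratic f(u) = u**2 / 2
    return u

alpha, beta = 0.1, 0.5

# Heavy Ball: gradient step at the current point, plus momentum
u, u_prev = 1.0, 1.0
for _ in range(150):
    u, u_prev = u - alpha * grad(u) + beta * (u - u_prev), u

# Nesterov: gradient evaluated at the extrapolated point y
x, x_prev = 1.0, 1.0
for _ in range(150):
    y = x + beta * (x - x_prev)
    x, x_prev = y - alpha * grad(y), x
```

Both converge to the minimizer on this quadratic; as discussed below, their behavior can differ dramatically on general convex functions.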
<h2 id="the-lure-problem">The Lur’e Problem</h2>
<p>Laurent blew my mind when he showed me the connection between PID control and optimization algorithms. How crazy is it that most of the popular algorithms in ML end up being special cases of PID control? And I imagine that if we went out and did surveys of industrial machine learning, we’d find that 95% of the machine learning models in production were trained using some sort of gradient descent. Hence, there’s yet another feather in the cap for PID.</p>
<p>It turns out that the problem of feedback with a static, nonlinear map has a long history in controls, and this problem even has a special name: <a href="https://en.wikipedia.org/wiki/Nonlinear_control#Nonlinear_feedback_analysis_%E2%80%93_The_Lur'e_problem">the Lur’e problem</a>. Finding a controller to push a static nonlinear system to a fixed point turns out to be identical to designing an optimization algorithm to set a gradient to zero.</p>
<p class="center"><img src="/assets/rl/pid/lureloop.png" alt="parallels between optimization and control" width="560px" /></p>
<p>Laurent Lessard, Andy Packard, and I made these connections in <a href="https://arxiv.org/abs/1408.3595">our paper</a>, showing that many of the rates of convergence for optimization algorithms could be derived using stability techniques from controls. We also used this approach to show that the Heavy Ball method might not always converge at an accelerated rate, justifying why we need the slightly more complicated Nesterov accelerated method for reliable performance. Indeed, we found settings where the Heavy Ball method for quadratics converged linearly, but on general convex functions didn’t converge at all. Even though these methods barely differ from each other in terms of how you set the parameters, this subtle change is the difference between convergence and oscillation!</p>
<p class="center"><img src="/assets/rl/pid/hbcycle.png" alt="Heavy Ball isn’t stable" width="560px" /></p>
<p>With Robert Nishihara and Mike Jordan, we followed up this work showing that you could even use this to <a href="https://arxiv.org/abs/1502.02009">study ADMM using the connections between prox-methods and proportional integral control</a>. Bin Hu, Pete Seiler, and Anders Rantzer <a href="https://arxiv.org/abs/1706.08141">generalized this technique to better understand stochastic optimization methods</a>. And Laurent and Bin <a href="https://arxiv.org/abs/1703.01670">made the formal connections to PID control</a> that I discuss in this post.</p>
<h2 id="learning-to-learn">Learning to learn</h2>
<p>With the connection to PID control in mind, we can think of learning rate tuning as controller tuning. The Ziegler-Nichols rules (developed in the forties) simply find the largest gain $k_P$ such that the system oscillates, and set the PID parameters based on this gain and the frequency of the oscillations. A common trick for gradient descent tuning is similar: find the largest step size such that gradient descent does not diverge, and then set the momentum and learning rate accordingly from this starting point.</p>
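<p>That tuning trick for gradient descent can be sketched in a few lines. Everything here is illustrative: the curvature of the toy objective is hidden from the tuner, and the divergence test and thresholds are my own choices.</p>

```python
def grad(u):
    # toy quadratic with curvature L = 4; the tuner never sees this value
    return 4.0 * u

def diverges(alpha, steps=50):
    """Run a short burst of gradient descent and check whether it blows up."""
    u = 1.0
    for _ in range(steps):
        u -= alpha * grad(u)
    return abs(u) > 1.0

alpha = 1e-3
while not diverges(2.0 * alpha):  # double the step size until the iterates diverge
    alpha *= 2.0
# alpha is now the largest tested stable step size (the true boundary is 2/L = 0.5);
# momentum and the final learning rate would be set from this starting point
```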
<p>Similarly, we can think of the “learning to learn” paradigm in machine learning as a special case of controller design. Though PID works for most applications, it’s possible that a more complex controller will work better for a particular application. In the same vein, it’s always possible that there’s something better than Nesterov’s method if you restrict your set of instances. And maybe you can even find this controller by gradient descent. But it’s always good to remember: 95% is still PID.</p>
<p>I make these connections for the following reason: both in the case of gradient descent and PID control, we can only prove reasonable behavior in rather constrained settings: in PID we understand how to analyze certain nonlinear control systems, but not all of them. In optimization, we understand the behavior on convex functions and problems that are “nearly” convex. Obviously, we can’t hope to have simple methods to stabilize <em>all</em> possible plants/functions (or else we’re violating some serious conjectures in complexity theory), but we can show that our methods work on simple cases, and performance degrades gracefully as we add complexity to the problem.</p>
<p>Moreover, the simple cases give us a body of techniques for general design: by developing theory on specific cases, we can develop intuition and probe fundamental limits. I think the same thing needs to be established for general reinforcement learning, and it’s why I’ve been spending so much time on LQR and nearby generalizations.</p>
<p>Let’s take this perspective for PID. Though PID is a powerful workhorse, it is typically thought to be useful only for simple, low-level control loops that maintain some static equilibrium. It seems like it’s not particularly useful for more complex tasks like robotic acrobatics. However, <a href="http://www.argmin.net/2018/04/24/ilc/">in the next post, I will describe a more complex control task that can also be solved by PID-type techniques.</a></p>
Thu, 19 Apr 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/04/19/pid/
http://benjamin-recht.github.io/2018/04/19/pid/The Ethics of Reward Shaping<p>I read three great articles over the weekend by <a href="http://twitter.com/noUpside">Renee DiResta</a>, <a href="http://www.columbia.edu/~chw2/">Chris Wiggins</a>, and <a href="https://twitter.com/janellecshane">Janelle Shane</a> that touched on a topic that’s been troubling me: In machine learning, we take our cost functions for granted, amplifying feedback loops with horrible unintended consequences.</p>
<p>First, Renee DiResta makes a great case <a href="https://www.wired.com/story/creating-ethical-recommendation-engines/">for a complete reinvention of how we design and deploy recommendation engines</a>. Recommender systems always seemed like an innocuous and low-stakes ML application. What harm could come from music services suggesting artists people might like beyond the Beatles, or from improving the suggestions on a streaming service like Netflix? They might improve the user experience a little bit, but probably would never amount to much. This assessment couldn’t have been more wrong: as Zeynep Tufekci summarizes, recommendation systems have become the internet’s <a href="https://www.nytimes.com/2018/03/10/opinion/sunday/youtube-politics-radical.html">“Great Radicalizer”</a>, focusing minds on increasingly extreme content to keep them hooked on websites.</p>
<p>DiResta argues that we have to change the cost function we optimize to bring recommender systems in line with ethical guidelines. Optimizing time spent is clearly the wrong objective. I know that engineers are not deliberately trying to incite rage and panic in their user base, but the signals they use to evaluate user happiness are completely broken. “Time on the website” is not the right performance indicator. But what exactly is the right way to quantify “user happiness?” This is super hard to make into a cost function for an optimization problem, as Chris Wiggins lays out in his <a href="http://datascience.columbia.edu/ethical-principles-okrs-and-kpis-what-youtube-and-facebook-could-learn-tukey">thoughtful blog post</a>. Wiggins argues that we can never construct the correct cost function, but we can iteratively design the cost to match ethical concerns. Wiggins suggests that industrial applications that face humans should consider the same principles as academic researchers working with human subjects, laid out in the famous <a href="https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html">Belmont Report</a>. Once we set these guidelines as gold standards, engineers can treat these standards as design principles for shippable code. We can constantly refine and improve our models to make sure they adhere to these principles.</p>
<h2 id="shaping-rewards-is-hard">Shaping rewards is hard</h2>
<p>I can’t emphasize enough that even in “hard engineering” that doesn’t involve people, designing cost functions is a major challenge and tends to be an art form. Janelle Shane wrote a <a href="http://aiweirdness.com/post/172894792687/when-algorithms-surprise-us">creative and illuminating blog post</a> on how “AI systems” designed to optimize cost functions often surprise us with unexpected behavior that we didn’t think to discount. Shane highlights several particularly bizarre examples of systems that fall over rather than walk, or force adversaries into segmentation faults. The underlying issue in all of these problems is that if we define the reward function too loosely and don’t add the correct safety constraints, optimized systems will frequently take surprising and unwanted paths to optimality.</p>
<p>This is indeed a question that underlies my series on reinforcement learning. We saw this phenomenon in the <a href="http://www.argmin.net/2018/03/20/mujocoloco/">post about locomotion in MuJoCo</a>. In the OpenAI Gym, humanoid walking is declared “solved” if the reward value exceeds 6000. This lets you just look at scores (as if you’re a gamer or a day trader on Wall Street) and completely ignore anything you might know about robotics. If the number is high enough, you win. But I showed a bunch of gaits that achieve the target reward, and none of them look like plausible actions that could happen in the physical world. All of them have overfit to defects in the simulation engine that are unrealistic.</p>
<p>It’s also rather unclear what the right reward function is for walking. There are so many things that we value in a walking robot. But these values are modeling assumptions and are often not correct in retrospect. In order to get any optimization-based framework to output realistic locomotion, cost functions have to be defined iteratively until the behavior matches as many of our expectations as possible.</p>
<h2 id="ml-systems-are-now-rl-systems">ML systems are now RL systems</h2>
<p>Though it’s not obvious, Shane’s surprising optimizers are closely connected to the bad behavior of recommender systems highlighted by DiResta and Wiggins. <strong>As soon as a machine learning system is unleashed in feedback with humans, that system is a reinforcement learning system, not a machine learning system.</strong></p>
<p>This poses a major challenge to the ML community, and it’s why I’ve shifted my academic focus so strongly to RL. Supervised learning tells us essentially nothing about how to deal with changing distributions, gaming, adversarial behavior, and unexpected amplification. We’re at the point now where all machine learning is reinforcement learning, and yet we don’t understand reinforcement learning at all! This is a huge issue that we all have to tackle if we want our learning systems to be trustable, predictable, and safe.</p>
<h2 id="reward-shaping-is-not-a-dirty-word">Reward shaping is not a dirty word</h2>
<p>Cost function design is a major challenge throughout engineering, and it’s a major challenge when establishing laws and policy as well. Across a variety of disciplines, performance indicators must be refined iteratively until the behavior matches our desiderata.</p>
<p>And ethical standards can be part of these desiderata. James Grimmelmann put it well: <a href="https://www.washingtonpost.com/news/the-switch/wp/2018/04/11/ai-will-solve-facebooks-most-vexing-problems-mark-zuckerberg-says-just-dont-ask-when-or-how/">“Kicking the question over to AI just means hiding value judgments behind the AI.”</a> ML engineers have to accept that their engineering has moral and ethical outcomes, and hence they must design with these outcomes in mind. Algorithms can be tuned to match our societal values, and it’s time for our community to achieve a consensus on how.</p>
Mon, 16 Apr 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/04/16/ethical-rewards/
http://benjamin-recht.github.io/2018/04/16/ethical-rewards/Benchmarking Machine Learning with Performance Profiles<p>A common sticking point in contemporary reinforcement learning is how to evaluate performance on benchmarks. For a general purpose method, we’d like to demonstrate aptitude on a wide selection of test problems with minimal special-case tuning. A great example of such a suite of test problems is the Arcade Learning Environment (ALE) of Atari benchmarks. How can we tell when an algorithm is “state-of-the-art” on Atari? Clearly, we can’t just excel on one game. There are 60 games, and even careful comparisons end in impenetrable tables with 60 rows and multiple columns. Moreover, the performance is a random variable, as the methods are evaluated over many random seeds, so there are inherent uncertainties in the reported numbers. How can we summarize the performance over such a large number of noisy benchmarks?</p>
<h2 id="performance-profiles">Performance Profiles</h2>
<p>My favorite way to aggregate benchmarks was proposed by <a href="https://arxiv.org/abs/cs/0102001">Dolan and Moré</a> and is called <em>performance profiles</em>. The idea here is very simple: we want a way of depicting how frequently a particular method is within some distance of the best method on a problem instance. To do so, we compute some simple statistics. Let’s suppose we have a suite of $n_p$ problem instances and we want to find the best performing method across all of these instances.</p>
<p>For each problem instance, we compute the best method, and then for every other method, we determine how far it is from optimal. This requires some notion of “far from optimality.” Let $d[m,p]$ denote the distance from optimality of method $m$ on problem $p$.
We then count the number of problem instances on which a particular method is within a factor of $\tau$ of the optimal. That is, we compute</p>
<script type="math/tex; mode=display">% <![CDATA[
\rho_m(\tau) = \frac{1}{n_p} \left| \{p~:~d[m,p] < \tau \}\right|\,. %]]></script>
<p>That is, we compute the fraction of problems where method $m$ has distance from optimality less than $\tau$.</p>
<p>A performance profile plots $\rho_m(\tau)$ for each method $m$. Performance profiles provide a visually striking way to immediately eyeball differences in performance between a set of candidate methods over a suite of benchmarks. They let you easily read off the percentage of times a method is within some set range of optimal across the suite of benchmarks. Moreover, they have several nice properties: performance profiles are robust to outlier problems, and they are robust to small changes in performance across all problems. Performance profiles allow a holistic view of performance without having to single out the idiosyncrasies of particular instances.</p>
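<p>Computing a profile from a table of distances takes only a few lines. This is a sketch: the dictionary layout and the toy performance ratios are made up for illustration, and the profile uses the strict inequality $d[m,p] < \tau$ from the definition above.</p>

```python
import numpy as np

def performance_profile(d, taus):
    """d maps a method name to an array of distances from optimality, one per
    problem instance; returns rho_m(tau), the fraction of problems on which
    method m is within tau of optimal, for each method and each tau."""
    return {m: [float(np.mean(np.asarray(dm) < tau)) for tau in taus]
            for m, dm in d.items()}

# toy example: performance ratios for two methods on three problem instances
ratios = {"solver_a": [1.0, 1.0, 4.0], "solver_b": [2.0, 1.0, 1.0]}
profile = performance_profile(ratios, taus=[1.5, 5.0])
```

Plotting $\rho_m(\tau)$ against $\tau$ for each method gives the profile curves.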
<p>The canonical application for performance profiles is for comparing solve times of different optimization methods. In this case, distance from optimality will be the ratio of the time a solver takes to the time taken by the fastest on a particular instance. The original Dolan and More paper has several examples showing that performance profiles cleanly delineate aggregate differences in run times for different solvers. They are now a widely adopted convention for comparing optimization methods. As we will now see, performance profiles also provide a straightforward way to compare relative rewards in reinforcement learning problems.</p>
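<p>To make this concrete, here is a minimal sketch of computing a performance profile for the solve-time case in Python. The solver times below are made-up numbers for illustration, not data from any paper; distance from optimality is the ratio to the fastest solver on each instance.</p>

```python
import numpy as np

def performance_profile(perf, taus):
    """perf[m, p]: cost of method m on problem p (lower is better).
    Returns rho[m, t]: fraction of problems on which method m is
    within a factor taus[t] of the best method."""
    # distance from optimality: ratio to the best method on each problem
    d = perf / perf.min(axis=0, keepdims=True)
    return np.array([(d <= tau).mean(axis=1) for tau in taus]).T

# hypothetical solve times for 3 solvers on 5 problem instances
times = np.array([[1.0, 2.0, 4.0, 1.5, 3.0],
                  [2.0, 1.0, 5.0, 3.0, 2.5],
                  [1.1, 1.9, 4.5, 1.4, 6.0]])
rho = performance_profile(times, taus=[1.0, 1.5, 2.0])
print(rho)  # each row is one solver's profile; rows climb toward 1 as tau grows
```

<p>Plotting each row of <code>rho</code> against $\tau$ (a step plot is conventional) gives the profile; at $\tau = 1$ the curve reads off how often each solver is the outright winner.</p>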
<h2 id="is-deep-rl-better-than-handcrafted-representations-on-atari">Is Deep RL better than handcrafted representations on Atari?</h2>
<p>Let’s apply performance profiles to understand the power of deep reinforcement learning on Atari games. One of my favorite deep reinforcement learning papers is <a href="https://arxiv.org/abs/1709.06009">“Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents”</a> by Machado et al. which proposes several guidelines for conducting careful evaluations of methods on the ALE benchmark suite. When put on the same footing under their evaluation framework, DQN doesn’t look to be that much better than SARSA (a simple method for Q-learning with function approximation) and hand crafted features.</p>
<p>Nonetheless, the authors concede that “Despite this high sample complexity, DQN and DQN-like approaches remain the best performing methods overall when compared to simple, hand-coded representations.” But it’s hard to tell how much better DQN is. The evaluations are stochastic, and since DQN is costly to run, they only evaluate its performance on 5 random seeds and report the mean and standard deviation.</p>
<p>I downloaded the source of the Machado paper and parsed the results tables into a CSV file<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. This table lists the mean reward and standard deviation for each game evaluated. Not only are the rewards here random variables, but directly comparing the means is difficult because the rewards are all on completely different scales.</p>
<p>To attempt to address both the stochasticity and the varied scaling of the rewards, I decided to use p-values from the Welch t-test. That is, $d[m,p]$ is the negative log probability that method $m$ has a higher score than the best method on problem $p$ under the assumptions of the Welch t-test. For the best performing method, I assign $d[m,p]=0$.</p>
<p>Now, this is a <em>very</em> imperfect measure. T-tests assume Gaussian distributions, and that assumption is clearly not legitimate here. But it’s not a terrible comparison when we are only provided means and variances. And, frankly, the community might want to consider releasing more finely detailed reports of their experiments if they would like better evaluation of the relative merits of methods. For example, if researchers simply released the raw scores for all runs, we could try more sophisticated nonparametric rank tests.</p>
<p>Let’s leave the imperfection aside for a moment, and plot a performance profile based on these likelihoods. I computed a standard performance profile for the ALE benchmark suite, plotting the fraction of the time that the p-values are greater than some threshold $\tau$. The results are here:</p>
<p class="center"><img src="/assets/rl/perfprof/perf_prof.png" alt="you are all crazy, shallow learning is as good as deep learning for atari" width="480px" /></p>
<p>For any $x$ value, the $y$-value is the fraction of instances where a method either has the highest mean or where we cannot reject the null hypothesis that the method has the highest mean with confidence $\tau$. You might look at this plot and think “that’s completely unreadable as the curves are on top of each other.” When performance profiles intersect each other multiple times, it means the algorithms are effectively equivalent to each other: there is no value of $\tau$ where either DQN or Blob-PROST more frequently scores higher than the other. To see an example of curves where things are way off, consider Blob-PROST with 200M simulations vs DQN with 10M simulations:</p>
<p class="center"><img src="/assets/rl/perfprof/perf_prof2.png" alt="these two algorithms are not the same" width="480px" /></p>
<p>Now there is a clear separation in the performance profiles, and it’s clear that Blob-PROST 200M is much better than DQN 10M. This shouldn’t be surprising, as I’m letting Blob-PROST see 20x as many samples. But it does suggest that DQN and Blob-PROST, when given the same sample allocation, are essentially indistinguishable methods. My takeaway from this plot is that Machado et al. concede too much in their discussion: <strong>simple methods and hand-crafted features match the performance of DQN on the ALE.</strong></p>
<h2 id="to-establish-dominance-provide-more-evidence">To establish dominance, provide more evidence.</h2>
<p><a href="https://twitter.com/Miles_Brundage/status/977512294824341504">Miles Brundage</a> suggests that there are far better baselines now (from the DeepMind folks). I’d like to make the modest suggestion that someone at DeepMind adopt the Machado et al. evaluation protocol for these new, more sophisticated methods, and then report means and standard deviations on all of the games. Even better, why not report the actual values over the runs so we could use non-parametric test statistics? Or even better, why not release the code? I’d be happy to make a performance profile again so we can see how much we’re improving.</p>
<p>If you are interested in changing the performance metric or running performance profiles on your own data, here’s a <a href="https://nbviewer.jupyter.org/url/argmin.net/code/atari_performance_profiles.ipynb">Jupyter notebook</a> that lets you recreate the above plots.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>There was no data for Blob-PROST on Journey Escape with 200M samples, so I used the values listed for 100M samples. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Mon, 26 Mar 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/03/26/performance-profiles/
Clues for Which I Search and Choose<p><em>This is the ninth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 10 is <a href="http://www.argmin.net/2018/04/19/pid/">here</a>. Part 8 is <a href="http://www.argmin.net/2018/03/13/pg-saga/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>Before we leave these model-free chronicles behind, let me turn to the converse of the Linearization Principle. We have seen that random search works well on simple linear problems and appears better than some RL methods like policy gradient. Does random search break down as we move to harder problems? <strong>Spoiler Alert: No.</strong> But keep reading!</p>
<p>Let’s apply random search to problems that are of interest to the RL community. The deep RL community has been spending a lot of time and energy on a suite of benchmarks, maintained by <a href="https://gym.openai.com/envs/#mujoco">OpenAI</a> and based on the <a href="http://www.mujoco.org/">MuJoCo</a> simulator. Here, the optimal control problem is to get the simulation of a legged robot to walk as far and quickly as possible in one direction. Some of the tasks are very simple, but some are quite difficult like the complicated humanoid models with 22 degrees of freedom. The dynamics of legged robots are well-specified by Hamiltonian Equations, but planning locomotion from these models is challenging because it is not clear how to best design the objective function and because the model is piecewise linear. The model changes whenever part of the robot comes into contact with a solid object, and hence a normal force is introduced that was not previously acting upon the robot. Hence, getting robots to work without having to deal with complicated nonconvex nonlinear models seems like a solid and interesting challenge for the RL paradigm.</p>
<p>Recently, <a href="https://arxiv.org/abs/1703.03864">Salimans and his collaborators at OpenAI</a> showed that random search worked quite well on these benchmarks. In particular, they fit neural network controllers using random search with a few algorithmic enhancements (they call their version of random search “Evolution Strategies,” but I’m sticking with my naming convention). In another piece of great work, <a href="https://arxiv.org/abs/1703.02660">Rajeswaran et al.</a> showed that Natural Policy Gradient could learn <em>linear</em> policies that could complete these benchmarks. That is, they showed that static linear state feedback, like the kind we use in LQR, was also sufficient to control these complex robotic simulators. This of course left an open question: can simple random search find linear controllers for these MuJoCo tasks?</p>
<p>My students Aurelia Guy and Horia Mania tested this out, coding up a rather simple version of random search (the one from lqrpols.py in my previous posts). Surprisingly (or not surprisingly), this simple algorithm learns linear policies for the Swimmer-v1, Hopper-v1, HalfCheetah-v1, Walker2d-v1, and Ant-v1 tasks that achieve the reward thresholds previously proposed in the literature. Not bad!</p>
<p class="center"><img src="/assets/rl/mujoco/ars_v1.png" alt="random search attempt 1" width="560px" /></p>
<p>But random search alone isn’t perfect. Aurelia and Horia couldn’t get the humanoid model to do anything interesting at all. Having tried a lot of parameter settings, they decided to enhance random search to make it train faster. Horia noticed that a lot of the RL papers were using statistics of the states and whitening the states before passing them into the neural net that defined the mapping from state to action. So he started to keep online estimates of the states’ mean and variance and used these to whiten each state before passing it to the linear controller. And voila! With this simple trick, Aurelia and Horia now get state-of-the-art performance on Humanoid. Indeed, they can reach rewards over 11000, which is higher than anything I’ve seen reported and almost twice the “success threshold” that was used for benchmarking by Salimans et al. Linear controller. Random search. One simple trick.</p>
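<p>The basic update with online whitening can be sketched as follows. This is a loose sketch, not the code from our repo: <code>env_rollout</code>, the hyperparameters, and the Welford-style running statistics are all placeholder choices of mine.</p>

```python
import numpy as np

def ars_step(M, mean, var, count, env_rollout, n_dirs=8, nu=0.02, alpha=0.01):
    """One update of basic random search over linear policies.
    M: (action_dim, state_dim) policy matrix. env_rollout(policy) returns
    (total_reward, list_of_states). mean/var/count are running whitening
    statistics of the visited states."""
    deltas = [np.random.randn(*M.shape) for _ in range(n_dirs)]
    grad = np.zeros_like(M)
    for delta in deltas:
        # the policy whitens each state with the current running statistics
        # before applying the (perturbed) linear map
        def policy(x, W):
            return W @ ((x - mean) / np.sqrt(var + 1e-8))
        r_plus, states_p = env_rollout(lambda x: policy(x, M + nu * delta))
        r_minus, states_m = env_rollout(lambda x: policy(x, M - nu * delta))
        grad += (r_plus - r_minus) * delta
        # Welford-style update of the running state mean and variance
        for x in states_p + states_m:
            count += 1
            d = x - mean
            mean = mean + d / count
            var = var + (d * (x - mean) - var) / count
    M = M + (alpha / n_dirs) * grad
    return M, mean, var, count
```

<p>Each call samples fresh random directions, probes the reward along both signs of each, and nudges the policy toward the better-performing perturbations; there is no gradient, value function, or replay buffer anywhere.</p>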
<p class="center"><img src="/assets/rl/mujoco/ars_v1_v2.png" alt="random search attempt 2" width="560px" /></p>
<p>What’s nice about having something this simple is that the code is 15x faster than what is reported in the OpenAI Evolution Strategies paper. We can obtain higher rewards <em>with less computation.</em> One can train a high performing humanoid model in under an hour on a standard EC2 instance with 18 cores.</p>
<p>Now, with the online state whitening, random search exceeds state-of-the-art not only on Humanoid, but also on Swimmer-v1, Hopper-v1, and HalfCheetah-v1. It isn’t yet as good on Walker2d-v1 and Ant-v1, but we can add one more trick to the mix: drop the sampled directions that don’t yield good rewards. This adds a hyperparameter (which fraction of directions to keep), but with this one additional tweak, random search can actually match or exceed the state-of-the-art performance on all of the MuJoCo baselines in the OpenAI gym. Note here, I am not restricting comparisons to policy gradient. As far as I know from our literature search, these policies are better than any results that apply model-free RL to the problem, whether it be an actor-critic method, a value function estimation method, or something even more esoteric. It does seem like pure random search is better than deep RL and neural nets for these MuJoCo problems.</p>
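<p>The “drop the bad directions” tweak amounts to ranking the sampled directions by reward and averaging the update over only the best few. This is a hedged sketch of one way to do it; the function name, the ranking rule, and the reward-standard-deviation step scaling are my own choices for illustration.</p>

```python
import numpy as np

def ars_update_topb(M, deltas, r_plus, r_minus, b, alpha):
    """Update a linear policy M using only the b best of the sampled
    directions, ranked by the larger of each direction's two rollout rewards."""
    order = np.argsort([max(rp, rm) for rp, rm in zip(r_plus, r_minus)])[::-1]
    keep = order[:b]
    # scale the step size by the std of the rewards actually used
    used = np.concatenate([np.asarray(r_plus)[keep], np.asarray(r_minus)[keep]])
    sigma = used.std() + 1e-8
    step = sum((r_plus[k] - r_minus[k]) * deltas[k] for k in keep)
    return M + (alpha / (b * sigma)) * step
```

<p>The fraction of directions kept, $b/n_{\text{dirs}}$, is the one extra hyperparameter this tweak introduces.</p>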
<p class="center"><img src="/assets/rl/mujoco/ars_v1_v2_v2t.png" alt="random search final attempt" width="560px" /></p>
<p>Random search with a few minor tweaks outperforms all other methods on these MuJoCo tasks and is significantly faster. We have a full paper with these results and more <a href="https://arxiv.org/abs/1803.07055">here</a>. And our code is <a href="https://github.com/modestyachts/ARS">in this repo</a>, though it is certainly easy enough to code up for yourself.</p>
<h2 id="what-can-reinforcement-learning-learn-from-random-search">What can reinforcement learning learn from random search?</h2>
<p>There are a few important takeaways here.</p>
<h4 id="benchmarks-are-hard">Benchmarks are hard.</h4>
<p>I think the only reasonable conclusion from all of this is that these MuJoCo demos are easy. There is nothing wrong with that. But it’s probably not worth deciding NIPS, ICML, <em>or</em> ICLR papers over performance on these benchmarks anymore. This does leave open a very important question: <em>what makes a good benchmark for RL?</em> Obviously, we need more than the Mountain Car. I’d argue that <a href="http://www.argmin.net/02/26/nominal">LQR with unknown dynamics</a> is a reasonable task to master, as it is easy to specify new instances and easy to understand the limits of achievable performance. But the community should devote more time to understanding how to establish baselines and benchmarks that are not easily gamed.</p>
<h4 id="never-put-too-much-faith-in-your-simulators">Never put too much faith in your simulators.</h4>
<p>Part of the reason why these benchmarks are easy is that MuJoCo is not a perfect simulator. MuJoCo is blazingly fast and is great for proofs of concept. But in order to be fast, it has to do some smoothing around the contacts (remember, discontinuity at contacts is what makes legged locomotion hard). Hence, just because you can get one of these simulated robots to walk doesn’t mean that you can get an actual robot to walk. Indeed, here are four gaits that achieve the magic 6000 threshold. None of these look particularly realistic:</p>
<p class="center"><img src="/assets/rl/mujoco/pegleg.gif" alt="watch me hop" width="250px" />
<img src="/assets/rl/mujoco/ice.gif" alt="triple axel" width="250px" /></p>
<p class="center"><img src="/assets/rl/mujoco/backwards.gif" alt="moon walk" width="250px" />
<img src="/assets/rl/mujoco/cancan.gif" alt="on broadway" width="250px" /></p>
<p>Even the top-performing model (reward 11,600) has a very goofy-looking gait that might not work in reality:</p>
<p class="center"><img src="/assets/rl/mujoco/reward_11600.gif" alt="run away" width="250px" /></p>
<h4 id="strive-for-algorithmic-simplicity">Strive for algorithmic simplicity.</h4>
<p>Adding hyperparameters and algorithmic widgets to simple algorithms can always improve their performance on a small enough set of benchmarks. I don’t know if keeping only the top-performing directions or state normalization will work on a new random search problem, but it worked for these MuJoCo benchmarks. Higher rewards might even be achieved by adding more tunable parameters. If you add enough bells and whistles, you can probably convince yourself that any algorithm works for a small enough set of benchmarks.</p>
<h4 id="explore-before-you-exploit">Explore before you exploit.</h4>
<p>Note that since our random search method is fast, we can evaluate its performance on many random seeds. These model-free methods all exhibit alarmingly high variance on these benchmarks. For instance, on the humanoid task, the model is slow to train almost a quarter of the time, even when supplied with what we thought were good parameters. And for those random seeds it finds rather peculiar gaits. It’s often very misleading to restrict one’s attention to 3 random seeds of a random search method, because you may be tuning your performance to peculiarities of the random number generator.</p>
<p class="center"><img src="/assets/rl/mujoco/humanoid_100seeds_med.png" alt="such variance" width="560px" /></p>
<p>This sort of behavior arose in LQR as well. We can tune our algorithm for a few random seeds and then see completely different behavior on new random seeds. <a href="https://arxiv.org/abs/1709.06560">Henderson et al.</a> already observed this phenomenon with Deep RL methods, but I think that such high variability will be a symptom of all model-free methods. There are simply too many edge cases to account for through simulation alone. As I said in <a href="http://www.argmin.net/03/13/pg-saga">the last post</a>:
“<em>By throwing away models and knowledge, it is never clear if we can learn enough from a few instances and random seeds to generalize.</em>”</p>
<h2 id="i-cant-quit-model-free-rl">I can’t quit model-free RL.</h2>
<p>In a future post, I’ll have one more nit to pick with model-free RL. This is actually a nit I’d like to pick with all of reinforcement learning and iterative learning control: what exactly do we mean by “sample complexity?” What are we learning as a community from this line of research of trying to minimize sample complexity on a small number of benchmarks? And where do we, as a research community, go from here?</p>
<p>Before we get there though, let me take a step back to <a href="http://www.argmin.net/04/19/pid">assess some variants of model-free RL that both work well in theory and practice</a> and see if these can be extended to the more challenging problems currently of interest to the machine learning community.</p>
Tue, 20 Mar 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/03/20/mujocoloco/