arg min blog: Musings on systems, information, learning, and optimization.
http://benjamin-recht.github.io/
Digital Witnesses

<p>Doyle derived his LQG counterexample in the time before the ubiquity of numerical computing. This meant that numerical examples did not carry the rhetorical weight of closed-form algebraic instances. The need for clean, persuasive formulae also meant that controllers were idealized in continuous time. Continuous-time optimal control often produced policies that couldn’t be implemented because of the limits of physical reality: no system can act instantaneously with arbitrary power. These issues of infeasibility were <a href="https://ieeexplore.ieee.org/document/1099822/">certainly noted in the literature</a> during the heyday of optimal control, but continuous-time models still often made these issues difficult to pinpoint.</p>
<p>Discrete-time models don’t share many of these issues. In discrete time, we explicitly encode the sequential, computational nature of decision and control. Discrete-time formulae are unfortunately less elegant than their continuous-time counterparts, but, as I hope to show here, they are often more revealing. Indeed, constructing examples where discrete-time optimal control leads to fragile solutions seems to be surprisingly easy.</p>
<p>Here, I’ll highlight a few examples where relatively innocuous problem formulations lead to very fragile control policies. The examples are weirdly simple and almost comical to a point. But anyone who has played with discrete-time optimal control may have stumbled into similar control policies and had to step back and think about why.</p>
<p>Let’s revisit the discrete-time LQR problem:</p>
\[\begin{array}{ll} \text{minimize} & \sum_{t=1}^N \mathbb{E}_{w_t}\left[x_t^\top Q x_t + u_t^\top R u_t\right]\\
\text{subject to} & x_{t+1} = A x_t + B u_t + w_t
\end{array}\]
<p>We again assume $x_t$ is observed perfectly without noise. While such perfect state information is not realistic, even ideal state feedback ends up being fragile in discrete time. $w_t$ is assumed to be stochastic, but I don’t think much changes if we move to a more adversarial setting. Here, we need the decision variable $u_t$ to be <em>causal</em>. It must be a function of only the values $x_s$ and $u_s$ with $s\leq t$. For stochastic disturbances, the optimal $u$ can always be found by dynamic programming.</p>
<p>Consider the following innocuous dynamics:</p>
\[A = \begin{bmatrix} 0 & 1\\ 0 & 0\end{bmatrix} \,,~~~ B = \begin{bmatrix} 0\\1 \end{bmatrix}\,,\]
<p>This system is a simple, two-state shift register. I’ll write the state out with indexed components $x=[x^{(1)},x^{(2)}]^\top$. New states enter through the control $B$ into the second state. The first state, $x^{(1)}$, is simply whatever was in the second register at the previous time step. The open loop dynamics of this system are as stable as you could imagine: both eigenvalues of $A$ are zero.</p>
<p>Let’s say our control objective is to keep the two states equal to each other. We can model this with the quadratic cost:</p>
\[Q = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \,, ~~~ R = 0\,.\]
<p>I assume $R=0$ here for simplicity, as the formulae are particularly nice for this case. But, as I will discuss in a moment, the situation is not improved simply by having $R$ be positive. For the disturbance, assume that $w_t$ is zero mean, has bounded second moment, $\Sigma_t = \mathbb{E}[w_t w_t^\top]$, and is uncorrelated with $x_t$ and $u_t$.</p>
<p>The cost asks us to minimize</p>
\[\sum_{t=1}^N (x_t^{(1)}-x_t^{(2)})^2\]
<p>When $w_t=0$, $x_t^{(1)}-x_t^{(2)} = x_{t-1}^{(2)}-u_{t-1}$, so it seems like our best bet is to just set $u_{t}=x_t^{(2)}$. This turns out to be the optimal action, and you can prove this directly using standard dynamic programming computations. What this means is that the closed loop dynamics of the system are</p>
\[x_{t+1} = \begin{bmatrix} 0 & 1\\ 0 &1 \end{bmatrix} x_t + w_t\,.\]
<p>This closed-loop system is <em>marginally stable</em>, meaning that while signals don’t blow up, some states will persist forever and not converge to $0$. Indeed, the state-transition matrix here has eigenvalues $0$ and $1$. The $1$ corresponds to the state where the two components are equal, and such a state can persist forever.</p>
<p>If we learned an incorrect model of the dynamics, how would that influence the closed loop behavior? The simplest scenario is that we identified $B$ from some preliminary experiments. We can immediately see that if the true $B_\star=\alpha B $, then the closed loop dynamics are</p>
\[x_{t+1} = \begin{bmatrix} 0 & 1\\ 0 &\alpha \end{bmatrix} x_t + w_t\,.\]
<p>This system is unstable for any $\alpha>1$. That is, the system is arbitrarily sensitive to misidentification of the dynamics. Note that this lack of robustness has nothing to do with the noise sequence. The structure of the cost is what drives the system to fragility.</p>
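To see this fragility concretely, here’s a quick numerical sketch (using NumPy; the 1% gain error and the noise level are arbitrary choices of mine, not from any real system):

```python
import numpy as np

# Closed-loop shift register under the policy u_t = x_t^(2), but where the
# true input matrix is B_star = alpha * B. The closed-loop state transition
# matrix is then [[0, 1], [0, alpha]], with eigenvalues 0 and alpha.
def closed_loop_eigs(alpha):
    A_cl = np.array([[0.0, 1.0], [0.0, alpha]])
    return np.linalg.eigvals(A_cl)

# With a perfect model (alpha = 1) the loop is marginally stable...
print(np.max(np.abs(closed_loop_eigs(1.0))))   # spectral radius 1.0
# ...but any overestimate of the control authority destabilizes it.
print(np.max(np.abs(closed_loop_eigs(1.01))))  # spectral radius 1.01 > 1

# Simulate: even a 1% gain error makes the state grow geometrically.
rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(2000):
    u = x[1]                       # the "optimal" policy for R = 0
    x = np.array([x[1], 1.01 * u]) + 0.01 * rng.standard_normal(2)
print(abs(x[0] - x[1]))  # the cost term the policy was supposed to kill
```

The last print is the very quantity the controller was optimizing; under the 1% mismatch it diverges rather than staying near zero.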
<p>If $R>0$, you will get a slightly different policy. Again, elementary dynamic programming shows that the optimal control is $u_t=\beta_t(R) x_t^{(2)}$ for some $\beta_t(R) \in (0,1)$ that shrinks as $R$ grows. The closed loop system will be a bit more stable, and the gain margin is now $1/\beta_t(R)$, but this margin grows only because the controller exerts less and less authority: the added robustness comes at the price of reduced performance. You can also check that if you add $\epsilon$ times the identity to $Q$, you again get a control policy proportional to $x_t^{(2)}$.</p>
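If you’d rather not grind through the dynamic programming by hand, SciPy’s discrete Riccati solver lets you check the $R>0$ case directly. A small sketch, with $R=0.1$ an arbitrary choice of mine:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[1.0, -1.0], [-1.0, 1.0]])
R = np.array([[0.1]])  # a small, strictly positive control penalty

# Stationary solution of the discrete algebraic Riccati equation.
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# The optimal policy u = -Kx only touches the second state: K = [0, -beta].
beta = -K[0, 1]
print(K[0, 0])  # ~0
print(beta)     # strictly between 0 and 1

# Closed-loop eigenvalues with the true gain B_star = alpha * B:
# instability kicks in as soon as alpha * beta exceeds 1.
alpha = 1.05 / beta
eigs = np.linalg.eigvals(A - alpha * B @ K)
print(np.max(np.abs(eigs)))  # ~1.05: the gain margin is exactly 1/beta
```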
<p>This behavior can occur in even simpler systems. Consider the one-state linear system</p>
\[x_{t+1}= b u_t+w_t\,.\]
<p>The open loop system is again as stable as it gets. Now let’s aim to minimize $\Vert x-b u \Vert$. It doesn’t matter which norm you choose, or whether you treat the noise as stochastic or worst case with respect to $w$: the optimal control is going to be $u_t = x_t/b$. Once again, the closed loop system has a pole at $1$ and is arbitrarily fragile to misspecification of $b$.</p>
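A sketch of this one-state fragility, with a hypothetical 5% identification error (the numbers are mine, chosen only for illustration):

```python
import numpy as np

# One-state system x_{t+1} = b u_t + w_t, controlled with a misidentified
# model b_hat. The policy u_t = x_t / b_hat is optimal under the model, but
# the closed loop becomes x_{t+1} = (b_true / b_hat) x_t + w_t.
b_true, b_hat = 1.0, 0.95   # a 5% identification error
rho = b_true / b_hat        # closed-loop pole, slightly bigger than 1

rng = np.random.default_rng(0)
x, peak = 0.0, 0.0
for _ in range(1000):
    u = x / b_hat
    x = b_true * u + rng.standard_normal()
    peak = max(peak, abs(x))
print(peak)  # grows roughly like rho**t: the loop is unstable
```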
<p>I could continue to construct nasty examples, but I hope these two are sufficiently illustrative. They are certainly contrived and pathological, and it’s not at all clear that they reflect any optimal control problem you might have been hoping to solve. However, both examples involve systems that are robust and stable in open loop. It’s only when we close the feedback loop that we end up in a dangerous situation. That simple optimal control problems give such profoundly fragile solutions should be a clear warning: <em>You can’t just optimize and hope to be robust.</em> You have to consider uncertainty as a first class citizen when designing feedback systems.</p>
<p>In some sense, the core contribution of robust control is in raising awareness of fundamental tradeoffs in the design of feedback systems. Optimal control promises that you can roughly identify a system, model uncertainty as noise, solve an optimization problem, and then ship your policy. Hopefully, the examples in the last two posts have shown why this particular approach is fraught with danger.</p>
<p>If failure of a feedback system has any consequences, then a more holistic robust approach is <em>necessary</em>. We have to work with experts at different levels of the engineering pipeline, worry about unmodeled behaviors, and understand hard limits and practical tradeoffs. That is, engineering has to be more concerned with <em>design</em> than with <em>optimization.</em></p>
<p>There are all sorts of questions that a robust, systems level engineering effort might ask. Where should you put that extra sensor? Which parts of the system are likely to create issues? Is it possible to avoid performance disruptions when updating a single component in a legacy system? These questions are important in all aspects of system engineering, and developing accessible tools for addressing them in machine learning systems remains a daunting but essential challenge.</p>
<p>I am emphatically not saying that the design of feedback systems is hopeless. It’s easy to walk away with the impression that “Ben’s examples are pathologies and unlike what I see in practice” or the pessimistic feeling of “shoot, all of this ML stuff is hopeless, I’m going to go work on something tractable like vaccine development.” Neither is the takeaway: engineering robust machine learning systems is not hopeless. I’m just saying that our community has to get better at incorporating multiple levels of uncertainty into its thinking. What are the fundamental tradeoffs between performance and robustness in machine learning? What do we even want to be robust to? In the next post I want to describe some of these robustness tradeoffs without using the language of optimization, probing whether that provides some possible paths forward.</p>
Mon, 27 Jul 2020 00:00:00 +0000
http://benjamin-recht.github.io/2020/07/27/discrete-fragility/
There are none

<p>In the <a href="http://www.argmin.net/2020/07/08/gain-margin/">last post</a>, we showed that continuous-time LQR has “natural robustness” insofar as the optimal solution is robust to a variety of model-mismatch conditions. LQR makes the assumption that the state of the system is fully, perfectly observed. In many situations, we don’t have access to such perfect state information. What changes?</p>
<p>The generalization of LQR to the case with imperfect state observation is called “Linear Quadratic Gaussian” control (LQG). This is the simplest, special case of a Partially Observed Markov Decision Process (POMDP). We again assume linear dynamics:</p>
\[\dot{x}_t = Ax_t + B u_t + w_t\,.\]
<p>where the dynamics are now driven by zero-mean Gaussian noise, $w_t$. Instead of measuring the state $x_t$ directly, we measure a signal $y_t$ of the form</p>
\[y_t = C x_t + v_t\,.\]
<p>Here, $v_t$ is also zero-mean Gaussian noise. Suppose we’d still like to minimize a quadratic cost function</p>
\[\lim_{T\rightarrow \infty} \frac{1}{T} \int_{0}^{T} (x_t^\top Qx_t + u_t^\top Ru_t) dt\,.\]
<p>This problem is very similar to our LQR problem except for the fact that we get an indirect measurement of the state and need to apply some sort of <em>filtering</em> of the $y_t$ signal to estimate $x_t$.</p>
<p>The optimal solution for LQG is strikingly elegant. Since the observation of $x_t$ is through a Gaussian process, the maximum likelihood estimation algorithm has a clean, closed form solution, even in continuous time. Our best estimate for $x_t$, denoted $\hat{x}_t$, given all of the data observed up to time $t$ obeys a differential equation</p>
\[\frac{d\hat{x}}{dt} = A\hat{x}_t + B u_t + L(y_t-C\hat{x}_t)\,.\]
<p>The matrix $L$ can be found by solving an algebraic Riccati equation that depends on the variances of $v_t$ and $w_t$ and on the matrices $A$ and $C$. In particular, it’s the CARE with data $(A^\top,C^\top,\Sigma_w,\Sigma_v)$. This solution is called a <em>Kalman Filter</em> and is a continuous-time limit of the discrete-time Kalman Filter one might see in a course on graphical models.</p>
<p>The optimal LQG solution takes the estimate of the Kalman Filter, $\hat{x}_t$, and sets the control signal to be</p>
\[u_t = -K\hat{x}_t\,.\]
<p>Here, $K$ is the gain matrix that would be used to solve the LQR problem with data $(A,B,Q,R)$. That is, LQG performs optimal filtering to compute the best state estimate, and then computes a feedback policy as if this estimate were a noiseless measurement of the state. That this turns out to be optimal is one of the more amazing results in control theory. It decouples the process of designing an optimal filter from designing an optimal controller, enabling simplicity and modularity in control design. This decoupling, where we treat the output of our state estimator as the true state, is an example of <em>certainty equivalence</em>, the umbrella term for using point estimates of stochastic quantities as if they were the correct values. Though certainty equivalent control may be suboptimal in general, it remains ubiquitous for all of the benefits it brings as a design paradigm. Unfortunately, not only is this decoupled design of filters and controllers often suboptimal, it has many hidden fragilities. LQG highlights a particular scenario where certainty equivalent control leads to misplaced optimism about robustness.</p>
<p>We saw in the previous post that LQR had this amazing robustness property: even if you optimize with the wrong model, you’ll still probably be OK. Is the same true about LQG? What are the guaranteed stability margins for LQG regulators? The answer was succinctly summed up in the <a href="https://ieeexplore.ieee.org/document/1101812">abstract of a 1978 paper by John Doyle</a>: “There are none.”</p>
<p class="center"><img src="/assets/there_are_none.png" alt="There Are None" width="400px" /></p>
<p>What goes wrong? Doyle came up with a simple counterexample that I’m going to simplify even further to fit our modern discussion. Before presenting the example, let’s first dive into <em>why</em> LQG is likely less robust than LQR. Let’s assume that the true dynamics obey the ODE:</p>
\[\dot{x}_t = Ax_t + B_\star u_t + w_t \,,\]
<p>though we computed the optimal controller with the matrix $B$. Define an error signal, $e_t = x_t - \hat{x}_t$, that measures the current deviation between the actual state and the estimate. Then, using the fact that $u_t = -K \hat{x}_t$, we get the closed loop dynamics</p>
\[\small
\frac{d}{dt} \begin{bmatrix}
\hat{x}_t\\
e_t
\end{bmatrix} = \begin{bmatrix} A-BK & LC\\ (B-B_\star) K & A-LC \end{bmatrix}\begin{bmatrix}
\hat{x}_t\\
e_t
\end{bmatrix} +
\begin{bmatrix} Lv_t\\ w_t-Lv_t \end{bmatrix}\,.\]
<p>When $B=B_\star$, the bottom left block is equal to zero. The system is then stable provided $A-BK$ and $A-LC$ are both stable matrices (i.e., have eigenvalues in the left half plane). However, small perturbations in the off-diagonal block can make the matrix unstable. For intuition, consider the matrix</p>
\[\begin{bmatrix} -1 & 200\\ 0 & -2 \end{bmatrix}\,.\]
<p>The eigenvalues of this matrix are $-1$ and $-2$, so the matrix is clearly stable. But the matrix</p>
\[\begin{bmatrix} -1 & 200\\ t & -2 \end{bmatrix}\]
<p>has an eigenvalue greater than zero if $t>0.01$. So a tiny perturbation significantly shifts the eigenvalues and makes the matrix unstable.</p>
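This is easy to check numerically (a two-line NumPy sketch):

```python
import numpy as np

# Eigenvalues of the example matrix as the (2,1) entry grows from zero.
def eigs(t):
    return np.linalg.eigvals(np.array([[-1.0, 200.0], [t, -2.0]]))

print(np.max(eigs(0.0).real))   # -1: comfortably stable
print(np.max(eigs(0.02).real))  # > 0: unstable after a perturbation of 0.02
```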
<p>Similar things happen in LQG. In Doyle’s example he uses the problem instance:</p>
\[A = \begin{bmatrix} 1 & 1\\ 0 & 1\end{bmatrix} \,,~~~ B = \begin{bmatrix} 0\\1 \end{bmatrix}\,, ~~~ C= \begin{bmatrix} 1 & 0\end{bmatrix}\]
\[Q = \begin{bmatrix} 5 & 5 \\ 5 & 5 \end{bmatrix} \,, ~~~ R = 1\]
\[\mathbb{E}\left[w_t w_t^\top\right]=\begin{bmatrix} 1 & 1 \\ 1 & 1\end{bmatrix} \,,~~~ \mathbb{E}\left[v_t^2\right]=\sigma^2\]
<p>The open loop system here is unstable, having two eigenvalues at $1$. We can stabilize the system only by modifying the second state. The state disturbance is aligned along the $[1;1]$ direction, and the state cost only penalizes states aligned with this disturbance. So the goal is simply to remove as much signal as possible in the $[1;1]$ direction without using too much control authority. We are only able to measure the first component of the state, and this measurement is corrupted by Gaussian noise.</p>
<p>What does the optimal policy look like? Perhaps unsurprisingly, it focuses all of its energy on ensuring that there is little state signal along the disturbance direction. The optimal $K$ and $L$ matrices are</p>
\[K = \begin{bmatrix} 5 & 5 \end{bmatrix}\,,~~~L=\begin{bmatrix} d\\ d \end{bmatrix}\,,~~~d:=2+\sqrt{4+\sigma^{-2}}\,.\]
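These formulae are easy to sanity-check against SciPy’s Riccati solvers; the Kalman gain comes from the dual CARE as described two posts hence. A sketch, with $\sigma=0.1$ an arbitrary choice of mine:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Q = 5.0 * np.ones((2, 2))
R = np.array([[1.0]])
W = np.ones((2, 2))            # state noise covariance
sigma = 0.1
V = np.array([[sigma ** 2]])   # measurement noise covariance

# LQR gain from the control CARE with data (A, B, Q, R).
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)
print(K)  # [[5., 5.]]

# Kalman gain from the dual (filtering) CARE with data (A^T, C^T, W, V).
S = solve_continuous_are(A.T, C.T, W, V)
L = S @ C.T @ np.linalg.inv(V)
d = 2 + np.sqrt(4 + sigma ** -2)
print(L.ravel(), d)  # both entries of L equal d
```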
<p>Now what happens when we have model mismatch? If we set $B_\star=tB$ and use the formula for the closed loop above, we see that closed loop state transition matrix is</p>
\[\begin{bmatrix}
1 & 1 & d & 0\\
-5 & -4 & d & 0\\
0 & 0 &1-d &1\\
5(1-t) & 5(1-t) & -d &1
\end{bmatrix}\,.\]
<p>It’s straightforward to check that when $t=1$ (i.e., no model mismatch), the eigenvalues of $A-BK$ and $A-LC$ all have negative real parts. For the full closed loop matrix, analytically computing the eigenvalues themselves is a pain, but we can prove instability by looking at the characteristic polynomial. For a matrix to have all of its eigenvalues in the left half plane, its characteristic polynomial must have all positive coefficients. If we look at the constant term in the polynomial, we see that we must have</p>
\[t < 1 + \frac{1}{5d}\]
<p>if we’d like any hope of having a stable system. Hence, we can guarantee that this closed loop system is unstable if $t\geq 1+\sigma$. This is a very conservative condition, and we could get a tighter bound if we’d like, but it’s good enough to reveal some paradoxical properties of LQG. The most striking is that if we build a sensor that gives us a better and better measurement, our system becomes more and more fragile to perturbation and model mismatch. For machine learning scientists, this seems to go against all of our training. How can a system become <em>less</em> robust if we improve our sensing and estimation?</p>
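Here is a numerical check of this threshold, assembling the closed loop from the block formula derived earlier (with $\sigma=0.1$, again an arbitrary choice):

```python
import numpy as np

# Closed-loop matrix for [x_hat; e] with true input matrix B_star = t * B.
# Gains K and L as in Doyle's example instance.
def closed_loop(t, sigma):
    A = np.array([[1.0, 1.0], [0.0, 1.0]])
    B = np.array([[0.0], [1.0]])
    C = np.array([[1.0, 0.0]])
    K = np.array([[5.0, 5.0]])
    d = 2 + np.sqrt(4 + sigma ** -2)
    L = np.array([[d], [d]])
    B_star = t * B
    top = np.hstack([A - B @ K, L @ C])
    bottom = np.hstack([(B - B_star) @ K, A - L @ C])
    return np.vstack([top, bottom])

sigma = 0.1
print(np.max(np.linalg.eigvals(closed_loop(1.0, sigma)).real))          # < 0: stable
print(np.max(np.linalg.eigvals(closed_loop(1.0 + sigma, sigma)).real))  # > 0: unstable
```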
<p>Let’s look at the example in more detail to get some intuition for what’s happening. When the sensor noise gets small, the optimal Kalman Filter is more aggressive. If the model is true, then the disturbance has equal value in both states, so, when $\sigma$ is small, the filter can effectively just set the value of the second state to be equal to whatever is in the first state. The filter is effectively deciding that the first state should equal the observation $y_t$, and the second state should be equal to the first state. In other words, it rapidly damps any errors in the disturbance direction $[1;1]$ and, as $d$ increases, it damps the $[0;1]$ direction less. When $t \neq 1$, we are effectively introducing a disturbance that makes the two states unequal. That is, $B-B_\star$ is aligned in the $[0;1]$ direction and can be treated as a disturbance signal. This undamped component of the error is fed errors from the state estimate $\hat{x}$, and these errors compound each other. Since we spend so much time focusing on our control along the direction of the injected state noise, we become highly susceptible to errors in a different direction, and these are exactly the errors that occur when there is a gain mismatch between the model and reality.</p>
<p>The fragility of LQG has many takeaways. It highlights that noiseless state measurement can be a dangerous modeling assumption, because it is then optimal to trust our model too much. Though we apparently got a freebie with LQR, for LQG, model mismatch must be explicitly accounted for when designing the controller.</p>
<p>This should be a cautionary tale for modern AI systems. Most of the papers I read in reinforcement learning consider MDPs where we get perfect state measurement. Building an entire field around optimal actions with perfect state observation builds too much optimism. Any realistic scenario is going to have partial state observation, and such problems are much thornier.</p>
<p>A second lesson is that it is not enough to just improve the prediction components in feedback systems that are powered by machine learning. I have spoken with many applied machine learning engineers who have told me that they have seen performance degrade in production systems when they improve their prediction model. They might spend months building some state of the art LSTM mumbo jumbo that is orders of magnitude more accurate in prediction, but in production yields worse performance than the legacy system with a boring ARMA model. It is quite possible that these performance drops are due to the Doyle effect: the improved prediction system is increasing sensitivity to a modeling flaw in some other part of the engineering pipeline.</p>
<p>The story turns out to be even worse than what I have described thus far. The supposed robustness guarantees we derived for LQR assume not just full noiseless state measurement, but that the sensors and actuators have infinite bandwidth. That is, they assume you can build controllers $K$ with arbitrarily large entries and that react instantaneously, without delay, to changes in the state. In the next post, I’ll show how realistic sampled data controllers for LQR, even with noiseless state measurement, also have no guarantees.</p>
Tue, 14 Jul 2020 00:00:00 +0000
http://benjamin-recht.github.io/2020/07/14/there-are-none/
Margin Walker

<p>I want to dive into some classic results in robust control and try to relate them to our current data-driven mindset. I’m going to try to do this in a modern way, avoiding any frequency domain analyses.</p>
<p>Suppose you want to solve some optimal control problem: you spend time modeling the dynamics of your system, how it responds to stimuli, and which objectives you’d like to maximize and constraints you must adhere to. Each of these modeling decisions explicitly encodes both your beliefs about reality and your mental criteria of success and failure. <em>Robustness</em> aims to quantify the effects of oversights on your system’s behavior. Perhaps your model wasn’t accurate enough, or perhaps you forgot to include some constraint in your objective. What are the downstream consequences?</p>
<p>In the seventies, it was believed that optimization-based frameworks for control had “natural robustness.” The solutions of optimal control problems were often robust to phenomena not explicitly modeled by the engineer. As a simple example, suppose you have an incorrect model of the dynamical system you are trying to steer. How accurate does your model need to be in order for the resulting policy to be reasonably successful?</p>
<p>To focus in on this, let’s study the continuous-time linear quadratic regulator (LQR). I know I’ve been arguing that we should be moving away from LQR in order to understand the broader challenges in learning and control, but the LQR baseline has so many lessons to teach us. Please humor me again for a few additional reasons: First, most of the history I want to tell arises from studying continuous-time LQR in the 1970s. It’s worth understanding that history from a modern perspective. Second, LQR does admit elegant closed form formulae that are helpful for pedagogy, and they are particularly nice in continuous time.</p>
<h2 id="lqr-in-continuous-time">LQR in Continuous Time</h2>
<p>Suppose we have a dynamical system that we model as an ODE:</p>
\[\dot{x}_t = Ax_t + Bu_t\,.\]
<p>Here, as always, $x_t$ is the state, $u_t$ is the control input signal, and $A$ and $B$ are matrices of appropriate dimensions. The goal of the continuous-time LQR problem is to minimize the cost functional</p>
\[J_{\text{LQR}}=\int_{0}^{\infty} (x_t^\top Qx_t + u_t^\top Ru_t) dt\]
<p>over all possible control inputs $u_t$. Let’s assume for simplicity that $Q$ is a positive semidefinite matrix and $R$ is positive definite.</p>
<p>The optimal LQR policy is <em>static state feedback</em>: there is some matrix $K$ such that</p>
\[u_t = -Kx_t\]
<p>for all time. $K$ has a closed form solution that can be found by solving a <em>continuous algebraic Riccatti equation</em> (CARE) for a matrix $P$:</p>
\[A^\top P + PA - PBR^{-1}B^\top P + Q = 0\,,\]
<p>and then setting</p>
\[K = R^{-1}B^\top P\,.\]
<p>Importantly, we take the solution of the CARE where $P$ is positive definite. If a positive definite solution of the CARE exists, then it is optimal for continuous time LQR. There are a variety of ways to prove this condition is sufficient, including an appeal to dynamic programming in continuous time. A simple argument I like uses the quadratic structure of LQR to derive the necessity of the CARE solution. (I found this argument in <a href="https://www.ece.ucsb.edu/~hespanha/linearsystems/">João Hespanha’s book</a>).</p>
<p>Regardless, showing a positive definite CARE solution exists takes considerably more work. It suffices to assume that the pair $(A,B)$ is controllable and the pair $(Q,A)$ is detectable. But proving these conditions are sufficient requires a lot of manipulation of linear algebra, and I don’t think I could cleanly distill a proof into a blog post. I mention this just to reiterate that while LQR is definitely the simplest problem to study, its analysis in continuous time on an infinite time horizon is nontrivial. LQR is not really “easy.” It’s merely the easiest problem in a space of rather hard problems.</p>
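For readers who want to play along numerically, here is the whole recipe on a hypothetical double-integrator instance; the matrices are my choices for illustration, not from the text:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Solve the CARE for P, set K = R^{-1} B^T P, and confirm A - BK is stable.
A = np.array([[0.0, 1.0], [0.0, 0.0]])  # double integrator
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)

# P is positive definite and the CARE residual vanishes.
print(np.linalg.eigvalsh(P))
print(np.abs(A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q).max())

# The closed-loop matrix A - BK has eigenvalues in the open left half plane.
print(np.linalg.eigvals(A - B @ K).real)
```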
<h2 id="gain-margins">Gain margins</h2>
<p>Let’s now turn to robustness. Suppose there is a mismatch between our modeled dynamics and reality. For example, what if the actual system is</p>
\[\dot{x}_t = Ax_t + B_\star u_t\,.\]
<p>for some matrix $B_\star$. Such model mismatches occur all the time. For example, in robotics, we can send a signal “u” to the joint of some robot. This would be some voltage that would need to be linearly transformed into some torque by a motor. It requires a good deal of calibration to make sure that the output of the motor is precisely the force dictated by the voltage output from our controller. Is there a way to guarantee some leeway in the mapping from voltage to torque?</p>
<p>An attractive feature of LQR is that we can quantify precisely how much slack we have directly from the CARE solution. We can use the solution of the CARE to build a <em>Lyapunov function</em> to guarantee stability of the system. Recall that a Lyapunov function is a function $V$ that maps states to real numbers, is nonnegative everywhere, is equal to $0$ only when $x=0$, and whose value is strictly decreasing along any trajectory of a dynamical system. In equations:</p>
\[V(x)\geq 0\,,~~~~V(x)=0~~\text{iff}~~x=0\,,~~~~\dot{V} <0\,.\]
<p>If you have a Lyapunov function, then all trajectories must converge to $x=0$: if you are at any nonzero state, the value of $V$ will decrease. If you are at $0$, then you will be at a global minimum of $V$ and hence can’t move to any other state.</p>
<p>Let $P$ be the solution of the CARE and let’s posit that $V(x) = x^\top P x$ is a Lyapunov function. Since $P$ is positive definite, we have $V(x)\geq 0$ and $V(x)=0$ if and only if $x=0$. To prove that the derivative of the Lyapunov function is negative, we can first compute the derivative:</p>
\[\frac{d}{dt} x_t^\top P x_t = x_t^\top \left\{(A-B_\star K)^\top P + P(A-B_\star K) \right\}x_t\,.\]
<p>Note that it is sufficient to show that $(A-B_\star K)^\top P + P(A-B_\star K)$ is a negative definite matrix as this would prove that the derivative is negative for all nonzero $x_t$. To prove that this expression is negative definite, let’s apply a bit of algebra to generate some sufficient conditions. Using the definition of $K$ and the fact that $P$ solves the CARE gives the following chain of equalities:</p>
\[\begin{aligned}
&(A-B_\star K)^\top P + P(A-B_\star K) \\
&= A^\top P + PA - K^\top B_\star^\top P - P B_\star K\\
&=PBR^{-1}B^\top P - Q - K^\top B_\star^\top P - P B_\star K\\
&=PBR^{-1}B^\top P - Q - PBR^{-1}B_\star^\top P - P B_\star R^{-1} B^\top P\\
&=P(B-B_\star)R^{-1}(B-B_\star)^\top P - PB_\star R^{-1} B_\star^\top P - Q
\end{aligned}\]
<p>Here, the first equality is simply expanding the matrix product. The second equation uses the fact that $P$ is a solution to the CARE. The third equality uses the definition of $K$. The final equation is an algebraic rearrangement.</p>
<p>With this final expression, we can cook up a huge number of conditions under which we get “robustness for free.” First, consider the base case where $B=B_\star$. Since $R$ is positive definite and $Q$ is positive semidefinite, the entire expression is negative definite, and hence we have proven the system is stable.</p>
<p>Second, there is a famous result that LQR has “large gain margins.” The gain margin of a control system is an interval $(t_0,t_1)$ such that for all $t$ in this interval, our control system is stable with the controller $tK$. Another way of thinking about the gain margin is to assume that $B_\star = tB$, and to find the largest interval such that the system $(A,B_\star)$ is stabilized by the control policy $K$. For LQR, there are very large margins: if we plug in the identity $B_\star=tB$, we find that $x^\top P x$ is a Lyapunov function provided that $t \in (\tfrac{1}{2},\infty)$. LQR control turns out to be robust to a wide range of perturbations of the matrix $B$. Intuitively, it makes sense that if we would like to drive a signal to zero and have more control authority than we anticipated, then our policy will still drive the system to zero. This corresponds to the range $t \in [1,\infty)$. The other part of the interval is perhaps more interesting: even if we only have half of the control authority we had planned for, we will still successfully stabilize our system from any initial condition.</p>
<p>In discrete time, you can derive similar formulae with essentially the same argument. Unfortunately, the expressions are not as elegant. Also, note that you cannot expect infinite gain margins in discrete time. In continuous time, a differential equation $\dot{x}_t = M x_t$ is stable if all of the eigenvalues of $M$ have negative real parts. In discrete time, you need all of the eigenvalues to have magnitude less than $1$. For almost any random triple $(A,B,K)$, $A-t B K$ is going to have large eigenvalues for $t$ large enough. Nonetheless, you can certainly derive analogous conditions characterizing which errors are tolerable.</p>
<p>There are a variety of other conditions that can be derived from our matrix expression. Most generally, the control system will be stable provided that</p>
\[(B-B_\star)R^{-1}(B-B_\star)^\top \prec B_\star R^{-1} B_\star^\top \,.\]
<p>The LQR gain margins fall out naturally from this expression when we assume $B_\star = t B$. However, we can guarantee much more general robustness using this inequality. For example, if we assume that $B_\star = BM$ for some square matrix $M$, then $K$ stabilizes the pair $(A,B_\star)$ if all of the eigenvalues of $M+M^\top $ are greater than $1$.</p>
<p>Perhaps more in line with what we do in machine learning, suppose we are able to collect a lot of data, do some uncertainty quantification, and guarantee a bound $\|B-B_\star\|_2<\epsilon$. Then as long as</p>
\[\epsilon \leq \lambda_\text{min}(R)\lambda_\text{min}\left(P^{-1} Q P^{-1}\right)\]
<p>we will be guaranteed stable execution. This expression depends on the matrices $P$, $Q$, and $R$, so it has a different flavor from the infinite gain margin conditions, which held irrespective of the dynamics or the cost. Moreover, if $P$ has large eigenvalues, then we are only able to guarantee safe execution for small perturbations to $B$. This foreshadows issues I’ll dive into in later posts. I want to flag here that these calculations reveal some fragilities of LQR: while the controller is always robust to perturbations along the direction of the matrix $B$, you can construct examples where the system is highly sensitive to tiny perturbations orthogonal to $B$. <a href="https://www.argmin.net/2020/07/14/there-are-none/">I’ll return in the next post</a> to start to unpack how optimal control has some natural robustness, but it has natural fragility as well.</p>
Wed, 08 Jul 2020 00:00:00 +0000
http://benjamin-recht.github.io/2020/07/08/gain-margin/
What We’ve Learned to Control

<p>I’m giving a keynote address at the <a href="https://www.ifac2020.org/">virtual IFAC congress this July</a>, and I submitted an abstract that forces me to reflect on the current state of research at the intersection of machine learning and control. 2020 is particularly appropriate for reflection. For personal reasons, I’ve been working in this space for about half a decade now and <a href="https://www.argmin.net/2018/06/25/outsider-rl/">wrote a blog series on the topic two years ago</a>, so the timing seemed ideal. For the broader community, 2020 happens to be the year we were promised fleets of self-driving cars. Of course, for a myriad of reasons, we’re nowhere close to achieving this goal. Full self-driving has been a key motivator of work in learning-enabled autonomous systems, and it’s important to note this example as a marker of how difficult the problems in this space really are.</p>
<p>The research community has come to terms with this difficulty, and has committed itself to addressing the many pressing challenges. Over the last year I attended several great meetings on this topic, including an <a href="https://ajwagen.github.io/adsi_learning_and_control/">NSF funded workshop at UW</a>, a plenary session at <a href="https://ita.ucsd.edu/ws/">ITA</a>, a workshop on intersections of <a href="https://www.ipam.ucla.edu/programs/workshops/intersections-between-control-learning-and-optimization/?tab=overview">learning, control, and optimization at IPAM</a>, and the <a href="https://sites.google.com/berkeley.edu/l4dc/home">second annual conference on Learning for Dynamics and Control</a>. There is clearly a ton of enthusiasm from researchers in many different disciplines, and we’re seeing fascinating results mixing techniques from machine learning, computer science, and control. Obviously, I’m going to be leaving out many incredible papers, but, focusing on the theoretical end of the spectrum, perhaps I could highlight <a href="https://arxiv.org/abs/1912.11899">new work on policy optimization</a> that demonstrates how simple optimization techniques can efficiently solve classic, nonconvex control-design problems or <a href="https://arxiv.org/abs/1902.08721">work connecting regret minimization and adaptive control</a> that provides nonasymptotic bounds.</p>
<p>This work has been very useful for establishing language to bridge communication between the diverse research camps interested in the space of learning, dynamics and automation. But, as I’ll discuss shortly, I’d argue that it hasn’t provided many promising approaches to improving large-scale autonomy. That’s ok! These problems are incredibly difficult and aren’t going to be solved by wishing for them to be solved. But I also think it might be worth taking a moment to reflect on which problems we are working on. It’s always a bit too easy for theorists to focus on improving technical results and lose sight of why someone proved those results in the first place.</p>
<p>As an illustrative example, my research group spent a lot of time studying the <a href="http://www.argmin.net/2018/02/08/lqr/">Linear Quadratic Regulator</a> (LQR). The point of this work was initially to establish baselines: LQR has a closed form solution when the model is known, so we wanted to understand how different algorithms might perform when the underlying model was unknown. It turns out that if you are willing to collect enough data, the best thing you can do for LQR is <a href="https://arxiv.org/abs/1902.07826">estimate the dynamical model, and then exploit this model as if it were true</a>. This so-called “certainty equivalent control” is what practitioners have been doing since the mid-60s to fly satellites and solve other optimal control problems. Proving this result required a bunch of new mathematical insights that established connections between high dimensional statistics and automatic control theory. But it did not bring us closer to solving new challenges in robotics or autonomous systems. Our work here merely showed that what the controls community had been doing for 50 years was already about as good as we could do for this important baseline problem.</p>
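The certainty equivalent recipe itself is simple enough to sketch in a few lines: estimate $(A, B)$ by least squares from data excited with random inputs, then design the controller as if the estimate were exact. The toy system and constants below are my own illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown to the controller) dynamics: x_{t+1} = A x_t + B u_t + w_t.
A = np.array([[1.01, 0.1], [0.0, 1.01]])
B = np.array([[0.0], [1.0]])

# Step 1: collect data by injecting random inputs.
X, U, Xnext = [], [], []
x = np.zeros(2)
for _ in range(500):
    u = rng.normal(size=1)
    xn = A @ x + B @ u + 0.01 * rng.normal(size=2)
    X.append(x); U.append(u); Xnext.append(xn)
    x = xn if np.linalg.norm(xn) < 10 else np.zeros(2)  # crude reset

# Step 2: least squares on the regression x_{t+1} ~ [A B] [x; u].
Z = np.hstack([np.array(X), np.array(U)])
Theta, *_ = np.linalg.lstsq(Z, np.array(Xnext), rcond=None)
Ahat, Bhat = Theta.T[:, :2], Theta.T[:, 2:]

# Step 3: design LQR for the estimate as if it were the truth.
def dlqr_gain(A, B, Q, R, iters=500):
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

K = dlqr_gain(Ahat, Bhat, np.eye(2), np.eye(1))

# The certainty equivalent gain stabilizes the *true* system.
rho = max(abs(np.linalg.eigvals(A - B @ K)))
print(f"closed-loop spectral radius on the true system: {rho:.3f}")
```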
<p>So what are the ways forward? Are there things that theory-minded folks can work on short term that might help us understand paths towards improving learning systems in complex feedback loops? Let me suggest a few challenges that I see as both very pressing, but also ones where we might be able to make near-term progress.</p>
<h2 id="machine-learning-is-still-not-reliable-technology">Machine Learning is still not reliable technology</h2>
<p>At the aforementioned <a href="https://www.ipam.ucla.edu/programs/workshops/intersections-between-control-learning-and-optimization/?tab=overview">IPAM meeting</a>, Richard Murray gave a <a href="https://www.youtube.com/watch?v=Wi8Y---ce28">fantastic survey of the sorts of standards of reliability imposed in aerospace engineering</a>. Go watch it! I don’t want to spoil it for you, but his discussion of Ram Air Turbines is gripping. Richard covers what is needed to get to the sorts of reliability we’d like in autonomous systems. Unfortunately, having <a href="https://arxiv.org/abs/2003.08237">88.5% Top-1 accuracy on ImageNet</a>—while a stunning achievement—doesn’t tell us how to get to systems with failure rates on the order of 1 in a billion. As Boeing has shown, cutting corners on autonomous system safety standards has horrible, tragic consequences.</p>
<p>How can we make machine learning more robust? How can we approach the failure rates needed for safe, reliable autonomy? And how can we establish testing protocols to assure we have such low failure rates?</p>
<h2 id="prediction-systems-in-feedback-loops">Prediction systems in feedback loops</h2>
<p>One particular aspect that I think is worth considering is how supervised learning systems can function as “sensors” in feedback loops. Even if you know everything about a dynamical system, when you observe the state via an estimator generated by a learned component, it’s not clear how to best take action on this observation. Most classic control and planning assumes that your errors in state-estimation are Gaussian or nicely uniformly bounded. Of course, the errors from machine learning systems are neither of these (I recommend checking out the <a href="https://youtu.be/A0cb7wZVFf4">crazy videos</a> of the <a href="https://twitter.com/greentheonly/status/1130956365063761920">confusion</a> that comes out of Tesla Autopilot’s vision systems). How to properly characterize the errors of machine learning systems for control applications seems like a useful, understudied problem. Using off-the-shelf machine learning analysis, you would have to densely sample all of the possible scenarios in advance in order to guarantee the sort of uniform error bounds desired by control algorithms. This isn’t practical, and, indeed, it’s clear that this sort of sensor characterization is not needed to make reasonable demos work. Though it’s a bit mundane, I think a huge contribution lies in understanding how much data we need to quantify the uncertainty in learned perception components. It’s still not clear to me if this is a machine learning question or a closed-loop design question, and I suspect both views of the problem will be needed to make progress.</p>
<h2 id="why-are-we-all-sleeping-on-model-predictive-control">Why are we all sleeping on model predictive control?</h2>
<p>I still remain baffled by how <a href="http://www.argmin.net/2018/05/02/adp/">model predictive control</a> (MPC) is consistently underappreciated. We’ll commonly see the same task at the same meeting, once done on a robot using some sort of deep reinforcement learning and once using model predictive control, and the disparity in performance is stark. It’s like the difference between watching an Olympic-level sprinter and me jogging in my neighborhood with a set of orthotics.</p>
<p>Here’s an example from the IPAM workshop. Martin Riedmiller presented work at DeepMind to catch a ball in a cup:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/LSkgLazbpko?start=2994" frameborder="0" allowfullscreen="" class="center"></iframe>
<p>This system uses two cameras, has a rather large “cup” (it’s a wastepaper basket), and yet still takes 3 days to train on the robot. Francesco Borrelli presented a different approach. Using only a single camera, simple Newtonian physics, and MPC, they were able to achieve this performance on the standard-sized “ball-in-a-cup” toy:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/ZFxmVDBYyDY?start=938" frameborder="0" allowfullscreen="" class="center"></iframe>
<p>If you only saw these two videos, I can’t fathom why you would invest all of your assets into deep RL. I understand there are still a lot of diehards out there, and I know this will offend them. But I want to make a constructive point: so many theorists are spending a lot of time studying RL algorithms, but few in the ML community are analyzing MPC and why it’s so successful. We should rebalance our allocation of mental resources!</p>
<p>Now, while the basic idea of MPC is very simple, the theory gets very hairy very quickly. It definitely takes some time and effort to learn about how to prove convergence of MPC protocols. I’d urge the MPC crowd to connect more with the learning theory crowd to see if a common ground can be found to better understand how MPC works and how we might push its performance even farther.</p>
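For what it’s worth, the core receding-horizon idea really is compact. In the unconstrained linear-quadratic case below (toy dynamics of my own choosing), each horizon problem has a closed form; real MPC adds state and input constraints, which is where the solvers and the convergence theory earn their keep:

```python
import numpy as np

def first_step_gain(A, B, Q, R, H):
    """Backward Riccati recursion over an H-step horizon. Returns the
    gain for the first step, the only control MPC actually applies."""
    P = Q.copy()
    for _ in range(H):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Toy dynamics: a discretized double integrator with dt = 0.2.
A = np.array([[1.0, 0.2], [0.0, 1.0]])
B = np.array([[0.0], [0.2]])
Q, R, H = np.eye(2), np.eye(1), 20

rng = np.random.default_rng(1)
x = np.array([5.0, 0.0])
for t in range(80):
    # Receding horizon: re-plan over the next H steps at every time step
    # and apply only the first input. (The plan happens to be constant
    # here; re-planning matters once constraints or model updates enter.)
    K = first_step_gain(A, B, Q, R, H)
    u = -K @ x
    x = A @ x + B @ u + 0.01 * rng.normal(size=2)
print("final state norm:", np.linalg.norm(x))
```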
<h2 id="perhaps-we-should-stop-taking-cues-from-alphago">Perhaps we should stop taking cues from AlphaGo?</h2>
<p>One of the grand goals in RL is to use function approximation algorithms to estimate value functions. The conventional wisdom asserts that the world is a giant Markov Decision Process and once you have its value function, you can just greedily maximize it and you’ll win at life. Now, this sort of approach clearly doesn’t work for robots, and I’m perplexed by why people still think it will work at all. Part of the motivation is that this approach was used to solve Go. But at some point I think we all have to come to terms with the fact that games are not the real world.</p>
<p>Now, I’d actually argue that RL <em>does</em> work in the real world, but it’s in systems that most people don’t actively think of as RL systems. Greedy value function estimation and exploitation is <em>literally</em> how all internet revenue is made. Systems simply use past data to estimate value functions and then choose the action that maximizes the value at the next step. Though seldom described as such, these are instances of the “greedy contextual bandit” algorithm, and this algorithm makes tech companies tons of money. But many researchers have also pointed out that this algorithm leads to misinformation, polarization, and radicalization.</p>
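A minimal sketch of such a greedy contextual bandit (the reward models and constants are illustrative): fit a least-squares value estimate per action from the logged data, and always pick the argmax, with no deliberate exploration at all.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 5, 3
theta_true = rng.normal(size=(n_actions, d))  # unknown per-action reward models

# Sufficient statistics for a ridge-regularized least-squares fit per action.
XtX = [1e-3 * np.eye(d) for _ in range(n_actions)]
Xty = [np.zeros(d) for _ in range(n_actions)]

total = 0.0
for t in range(2000):
    x = rng.normal(size=d)  # context, e.g. user features
    # Greedy: estimate each action's value and exploit the argmax.
    theta_hat = [np.linalg.solve(XtX[a], Xty[a]) for a in range(n_actions)]
    a = int(np.argmax([th @ x for th in theta_hat]))
    r = theta_true[a] @ x + 0.1 * rng.normal()  # observed reward
    XtX[a] += np.outer(x, x)  # update only the chosen action's statistics
    Xty[a] += r * x
    total += r
print("average reward:", total / 2000)
```

Nothing in this loop ever deliberately explores, which is exactly what makes the algorithm both profitable and, once it sits in a feedback loop with the world, worrisome.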
<p>Everyone tries to motivate RL by the success of AlphaGo, but they should be using the success of Facebook and Google instead. And if they did, I think it would be a lot clearer why RL is a terrifying and dangerous technology, one whose limitations we desperately need to understand so that we can build safer tools.</p>
<h2 id="lessons-from-the-70s-about-optimal-control">Lessons from the 70s about optimal control</h2>
<p>I have one set of ideas along these lines that I think is important but am still having a hard time articulating. Indeed, I might just take a few blog posts to work through my thoughts on this, but let me close this post with a teaser of discussions to come. As I mentioned above, optimal control was a guiding paradigm for a variety of control applications in the 60s and 70s. During this time, it seemed like there might even be hidden benefits to a full-on optimization paradigm: though you’d optimize a single, simple objective, you would often get additional robustness guarantees for free. However, it turned out that this was very misleading and that <a href="https://ieeexplore.ieee.org/document/1101812">there were no guarantees of robustness even for simple optimal control problems</a>. This shouldn’t be too surprising: if you devote a lot of resources towards one objective, you are likely neglecting some other objective. But showing how and why these fragilities arise is quite delicate. It’s not always obvious how you <em>should</em> be devoting your resources.</p>
<p>Trying to determine how to allocate engineering resources to balance safety and performance is the heart of “robust control.” One thing I’m fascinated by moving forward is whether any of the early developments in robust control might transfer over to a new kind of “robust ML.” Unfortunately for all of us, robust control is a rather encrypted literature. There is a lot of mathematics, but often no clear statement about <em>why</em> we study particular problems or what the fundamental limits of feedback are. While diligent young learning theorists have been scouring classic control theory textbooks for insights, these books don’t always articulate what we can and cannot do and which problems control theory might help solve. We still have a lot of work to do in communicating what we know and what problems remain challenging. I think it would be useful for control theorists to think about how best to communicate the fundamental concepts of robust control. I hope to take up this challenge in the next few months on this blog.</p>
<p><em>I’d like to thank Sarah Dean, Horia Mania, Nik Matni, and Ludwig Schmidt for their helpful feedback on this post. I’d also like to thank John Doyle for several inspiring conversations about robustness in optimal control and on the encrypted state of the control theory literature.</em></p>
Mon, 29 Jun 2020 00:00:00 +0000
http://benjamin-recht.github.io/2020/06/29/tour-revisited/
http://benjamin-recht.github.io/2020/06/29/tour-revisited/The Uncanny Valley of Virtual Conferences<p>We wrapped up two amazing days of <a href="http://www.l4dc.org/">L4DC 2020</a> last Friday. It’s pretty wild to watch this community grow so quickly: starting as a <a href="https://kgatsis.github.io/learning_for_control_workshop_CDC2018/">workshop</a> at <a href="https://kgatsis.github.io/learning_for_control_workshop_CDC2018/">CDC 2018</a>, the conference organizers put together an <a href="https://l4dc.mit.edu/">inaugural event at MIT</a> in only a few months and were overwhelmed by nearly 400 attendees. Based on a groundswell of support from the participants, we decided to add contributed talks and papers this year. We had passionate volunteers for our <a href="https://sites.google.com/berkeley.edu/l4dc/organizers-pc">70-person program committee</a>, and they did heroic work of reviewing 135 submissions for this year’s program.</p>
<p>Then, of course, the pandemic hit, forcing us to cancel our in-person event. Like most conferences in a similar situation, we decided to move to a virtual setting. I think that had we not had contributed papers, we would have simply canceled this year (I’ll return to this later). But to respect the passion and hard work of our contributors, we tried to come up with a reasonable plan for running this conference virtually.</p>
<p>When we started planning to go virtual, there were too many options to sort through: Zoom webinars and breakout rooms? Sli.do Q&As? Google Hangouts? Slack channels? We had so many tools for virtual community building, each with their own pluses and minuses. Our main constraints were that we wanted to highlight the best contributed papers as talks in some way, to give visibility to the wonderful set of accepted papers without burdening the authors with more work, to be inclusive to the broader community of folks interested in learning and automation, and, importantly, to not charge registration fees.</p>
<p>We eventually settled on the following scheme:</p>
<ol>
<li>We had a Zoom room for invited and contributed speakers and moderators.</li>
<li>This Zoom was <a href="https://www.youtube.com/watch?v=b_sJb1k9dVY">live streamed to Youtube</a>.</li>
<li>Questions were gathered by grad student moderators who scanned the YouTube live chat and then relayed inquiries back to the speakers.</li>
<li>We tried to keep the live part under four hours per day and to provide ample breaks. We recognize how hard it is to sit in front of a live stream for much more than that.</li>
<li>Further discussion was then done on <a href="https://openreview.net/group?id=L4DC.org/2020/Conference">OpenReview</a>, where we hosted all accepted papers of the conference.</li>
<li>The proceedings of the conference were subsequently archived by <a href="http://proceedings.mlr.press/">Proceedings of Machine Learning Research</a>.</li>
</ol>
<p>Though it took a lot of work to tie all these pieces together, everything went super smoothly in the end. I was basically able to run the entire AV setup from my garage.</p>
<p class="center"><img src="/assets/command_station.jpg" alt="where the magic happens" width="250px" /></p>
<p>The only things that cost money here were the Zoom account (20 dollars/month, though subsidized by Berkeley) and my home internet connection. I know that Zoom and YouTube have well-documented issues, and I think it’s imperative that they continue to strive to fix these problems, but I also think it’s easy to forget how empowering this technology is. This format opens up conferences to those who can’t travel for financial or logistical reasons, and lowers the barrier to engaging with cutting edge research. Being able to sit in my garage and run a virtual conference with speakers spanning 10 time zones and nearly 2000 viewers is a wonder of modern times.</p>
<h2 id="second-life-still-has-a-long-way-to-go">Second Life still has a long way to go.</h2>
<p>Still, many parts of the online conference felt hollow and incomplete. I still don’t know how to run a virtual poster session effectively. Most of our papers have not yet received any comments on <a href="https://openreview.net/group?id=L4DC.org/2020/Conference">OpenReview</a>, though comments are still open and I’d encourage you to drop by and ask questions! Partly, I think this lack of engagement stems from the considerable amount of effort required to participate, especially when it is compared to somewhat aimlessly ambling through a poster session.</p>
<p>Indeed, many aspects of live conferences are simply not replicable with our current tools, whether they be chance encounters or meetings with friends from far away. On the other hand, maybe we shouldn’t try to replicate this experience! Maybe we need to think harder about what opportunities our technology has for building communities and how we can better support these facets of academic interaction. When I think back on the decades of conferences I’ve attended, I can think of only a few posters that really got me interested in reading a paper, and <a href="https://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf">one later won a test of time award at NeurIPS</a>. Poster sessions always felt like an anachronistic means to justify a work travel expense rather than an effective means of academic knowledge dissemination. Is there a better way forward that uses our current technological constraints to amplify the voices of young scholars with cutting edge ideas? I don’t have great ideas for how to do this yet, but new interaction structures may emerge as we deal with at least one more year without meetings with hundreds of people.</p>
<h2 id="how-much-should-conferences-cost">How much should conferences cost?</h2>
<p>We were able to do L4DC, with the proceedings and all, for free. Obviously, the program committee put in tons of work in reviewing and organizing the logistics. But reviewing labor isn’t compensated by any conference. All peer reviewed conferences rely on the volunteer service labor of a dedicated program committee. The main line items we expected for L4DC were for renting physical space, paying an AV crew, and food. But in the virtual world, these expenses drop to near zero.</p>
<p>I’m supposed to give a plenary talk at the <a href="https://www.ifac2020.org/">Virtual IFAC Congress</a> in July. I have to say, I am troubled: IFAC is charging <a href="https://www.ifac2020.org/registration/">380 euros per person</a> for registration. [<strong>UPDATE (06/28/20):</strong> <em>IFAC has reduced their registration fee to 80 euros for those who wish to watch videos but not upload a paper. 40 euros for students. Kudos to them for reducing the fees.</em>]
What does one get for this sum? Access to video streams and the ability to publish papers. This seems exorbitantly expensive. Why would anyone watch a talk I give at IFAC when I promise to just release it on YouTube at the same time? What value is IFAC providing back to the academic community?</p>
<h2 id="decoupling-papers-from-talks">Decoupling papers from talks</h2>
<p>One of the main things the registration fee at many conferences provides is a stamp of academic approval. It is a de facto publication fee. Led by computer science, conferences in engineering are replacing journals as the archives where CV-building work is cataloged. Though this wasn’t the initial purpose of conferences in computer science, conferences do have many attractive features over journals for rapidly evolving fields: Conferences have speedy turn-around times and clearly delineated submission and decision dates. This archival aspect of conferences, however, has nothing to do with community building or scholarly dissemination. Why do we need to couple a talk to a publication? Can’t we separate these two as is done in every other academic field?</p>
<p>Our collective pandemic moment gives us an opportunity not only to rethink community-building but also our publication model. With 10000-person mega-conferences like <a href="http://icml.cc">AI Summer</a> and <a href="http://neurips.cc">AI Winter</a>, why can’t we keep all of the deadlines the same but remove all of the talks? We’d still have the same reviewing architecture, which has been wholly virtual for over a decade. And we could still publish all of the proceedings online for free, which has been done for multiple decades.</p>
<p>The decoupling proposal here would have effectively zero overhead on our communities: the deadlines, CMTs, program committees, and proceedings could all function exactly the same way (though, to be fair, these systems all have warts worth improving upon). New archival, fast-turnaround journals could easily start using the same tools. Indeed, I’ve always been enamored with the idea of an arxiv-overlay journal that is simply a table of contents pointing towards particular versions of arxiv papers as “accepted.” And a really radical idea would be to solicit <em>talks</em>—not papers—for virtual conferences where potential speakers would submit slides or videos to demonstrate proficiency in the medium in which they’d present.</p>
<p>I tend to dismiss most of the bloviation about how coronavirus permanently changes everything about how we live our lives. But it does provide us an opportunity to pause and assess whether current systems are functioning well. I’d argue that the current conference system hasn’t been functioning well for a while, but this simple decoupling of papers and talks might clear up a lot of the issues currently facing the hard-charging computing world.</p>
<p><em>Many thanks to my dedicated, passionate L4DC Co-organizers: Alex Bayen, Ali Jadbabaie, George Pappas, Pablo Parrilo, Claire Tomlin, and Melanie Zeilinger. I’d also like to thank Rediet Abebe, Jordan Ellenberg, Eric Jonas, Angjoo Kanazawa, Adam Klivans, Nik Matni, Chris Re, and Tom Ristenpart for their helpful feedback on this post.</em></p>
Mon, 22 Jun 2020 00:00:00 +0000
http://benjamin-recht.github.io/2020/06/22/virtual-conferences/
http://benjamin-recht.github.io/2020/06/22/virtual-conferences/You Cannot Serve Two Masters: The Harms of Dual Affiliation<p>Facebook would like to have computer science faculty in AI committed to work 80% of their time in industrial jobs and 20% of their time at their university. They call this scheme “<a href="https://newsroom.fb.com/news/2018/07/facebook-ai-research-expands/">co-employment</a>” or “<a href="https://www.facebook.com/schrep/posts/10156638732909443">dual</a> <a href="https://www.businessinsider.com/facebook-yann-lecun-dual-affiliation-model-ai-experts-2018-8">affiliation</a>.” This model assumes people can slice their time and attention like a computer, but people can’t do this. Universities and companies are communities, each with their particular missions and values. The values of these communities are often at odds, and researchers must choose where their main commitment lies. By committing researchers to a particular company’s interests, this new model of employment will harm our colleagues, our discipline, and everyone’s future. Like many harms, it comes with benefits for some. But the harm in this proposal outweighs the benefits. If industry wants to support and grow academic computer science, there are much better ways to achieve this.</p>
<p>The proposal will harm our discipline, because it will distract established talent from the special role of academics: curiosity-driven research. Academic scholarship has an excellent record of pursuing ideas into places that are exciting and productive, even if they don’t result in immediate, tangible benefits and especially if they ruffle the feathers of established, powerful institutions. You can’t do that if 80% of your time is spent not annoying a big company. Though big companies belabor promises of complete intellectual freedom to faculty, that can’t and won’t happen because the purpose of companies is to make money for shareholders.</p>
<p>The proposal harms our students directly. Our faculty at their best secure everyone’s future by teaching talented students how to understand the challenges facing the broader world. Such mentorship is enriched by the courage, independence, security, and trained judgement of senior scholars to guide students’ perspectives on what is worth doing, what is likely irrelevant, and what is wrong. Engaging with a student body requires an all-in commitment, both in teaching and advising roles. Faculty primarily working elsewhere means cancelled classes. Faculty wedded to a company means advice that’s colored by the interest of the company.</p>
<p>The proposal harms our future because it will stifle innovation. University researchers have a great historical record of disruptive entrepreneurism — for example, Google dates back to a paper from the Stanford digital library project. Smooth transitions from academic research to industrial practice are widely encouraged: most universities allow faculty to consult at 20% time, do year-long sabbaticals in industry, or take leave to start companies in order to promote such transitions. But there’s a big difference between an industrial leave and a long-term commitment. You can’t do disruptive entrepreneurism if 80% of what you do is owned by a big company. Part of the point of being a big company is to control your environment by crushing, containing, or co-opting inconvenient innovations. Faculty who sign on are subject to a huge gravitational force and are <a href="https://newsroom.fb.com/news/2017/12/hard-questions-is-spending-time-on-social-media-bad-for-us/">hard pressed not to annoy the big company they work for</a>.</p>
<p>Like many really dangerous bargains, the harms are diffuse, and the benefits are focused. One kind of benefit is for faculty who sign on: in addition to the higher industrial salaries, working at a big company provides a chance to lead a team of research engineers to execute large-scale projects that may be used by millions. But another, more alarming, benefit is for big companies: all those potentially disruptive or potentially annoying ideas are now owned or controlled by the big company. Perhaps that’s
<del>the point of</del> why management supports the proposal.</p>
<p>If industry really wants to help scale and advance computer science research, it’s easy to do. Do what many companies are already doing, but do much more of it. Give fellowships to graduate students and scholarships to undergraduate students. Employ students as interns. Pay for named chairs and new buildings. Give lots of faculty small amounts of research money. Make and publish open datasets. Give us easy access to industrial scale computing resources. But don’t raid our faculty and tell us it’s good for us.</p>
<p><em>We have made a small edit to clear up a misunderstanding raised by a colleague. We have noted this change with strikethrough. Though comments are closed, you can follow the discussion on <a href="https://twitter.com/beenwrekt/status/1027915117076336640">Twitter</a>, <a href="https://www.reddit.com/r/MachineLearning/comments/963pek/r_you_cannot_serve_two_masters_the_harms_of_dual/">Reddit</a> and <a href="https://news.ycombinator.com/item?id=17734877">Hacker News</a>.</em></p>
Thu, 09 Aug 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/08/09/co-employment/
http://benjamin-recht.github.io/2018/08/09/co-employment/An Outsider's Tour of Reinforcement Learning<h2 id="table-of-contents">Table of Contents.</h2>
<ol>
<li><a href="http://www.argmin.net/2018/01/29/taxonomy/">Make It Happen.</a> Reinforcement Learning as prescriptive analytics.</li>
<li><a href="http://www.argmin.net/2018/02/01/control-tour/">Total Control.</a> Reinforcement Learning as Optimal Control.</li>
<li><a href="http://www.argmin.net/2018/02/05/linearization/">The Linearization Principle.</a> If a machine learning algorithm does crazy things when restricted to linear models, it’s going to do crazy things on complex nonlinear models too.</li>
<li><a href="http://www.argmin.net/2018/02/08/lqr/">The Linear Quadratic Regulator.</a> A quick intro to LQR and why it is a great baseline for benchmarking Reinforcement Learning.</li>
<li><a href="http://www.argmin.net/2018/02/14/rl-game/">A Game of Chance to You to Him Is One of Real Skill.</a> Laying out the rules of the RL Game and comparing to Iterative Learning Control.</li>
<li><a href="http://www.argmin.net/2018/02/20/reinforce/">The Policy of Truth.</a> Policy Gradient is a Gradient Free Optimization Method.</li>
<li><a href="http://www.argmin.net/2018/02/26/nominal/">A Model, You Know What I Mean?</a> Nominal control and the power of models.</li>
<li><a href="http://www.argmin.net/2018/03/13/pg-saga/">Updates on Policy Gradients.</a> Can we fix policy gradient with algorithmic enhancements?</li>
<li><a href="http://www.argmin.net/2018/03/20/mujocoloco/">Clues for Which I Search and Choose.</a> Simple methods solve apparently complex RL benchmarks.</li>
<li><a href="http://www.argmin.net/2018/04/19/pid/">The Best Things in Life Are Model Free.</a> PID control and its connection to optimization methods popular in machine learning.</li>
<li><a href="http://www.argmin.net/2018/04/24/ilc/">Catching Signals That Sound in the Dark.</a> PID for iterative learning control.</li>
<li><a href="http://www.argmin.net/2018/05/02/adp/">Lost Horizons.</a> Relating popular techniques from RL to methods from Model Predictive Control.</li>
<li><a href="http://www.argmin.net/2018/05/11/coarse-id-control/">Coarse-ID Control.</a> Combining high-dimensional statistics and robust optimization for the data-driven control of uncertain systems.</li>
<li><a href="http://www.argmin.net/2018/06/25/rl-tour-fin/">Towards Actionable Intelligence.</a></li>
</ol>
<p><strong>Bonus Post:</strong> <a href="http://www.argmin.net/2018/03/26/performance-profiles">Benchmarking Machine Learning with Performance Profiles</a>. The Five Percent Nation of Atari Champions.</p>
Mon, 25 Jun 2018 00:00:01 +0000
http://benjamin-recht.github.io/2018/06/25/outsider-rl/
http://benjamin-recht.github.io/2018/06/25/outsider-rl/Towards Actionable Intelligence<p>I’m going to close my outsider’s tour of Reinforcement Learning by announcing the release of a <a href="https://arxiv.org/abs/1806.09460">short survey of RL</a> that coalesces my views from the perspectives of continuous control.
Though the RL and controls communities remain practically disjoint, I’ve learned from writing this series that the two have much more to learn from each other than either cares to admit. I think that some of the most pressing and exciting open problems in machine learning lie at the intersection of these fields. How do we damp dangerous feedback loops in machine learning systems? How do we build safe autonomous systems that reliably improve human conditions? How do we design systems that automatically adapt to changing environments and tasks? These are all challenges that will only be solved with novel innovations in machine learning <em>and</em> controls.</p>
<p>Perhaps the intersection of machine learning and controls needs a new name so that researchers can stop arguing about territory. I personally am fond of <em>Actionable Intelligence</em> as it sums up not only robotics but smarter, safer analytics. But at the end of the day, I don’t really care what we call the new area: the important part is that there is a large community spanning multiple disciplines that is invested in making progress on these problems. Hopefully this tour has set the stage for a lot of great research at the intersection of machine learning and controls, and I’m excited to see what progress the communities can make working together.</p>
<h2 id="unbounded-acknowledgements">Unbounded Acknowledgements</h2>
<p>There are countless individuals who helped to shape this blog series and the accompanying survey. I greatly appreciated the lively debates started on this blog and continued on Twitter. I hope that even those who disagree with my perspectives here find their input incorporated into follow-ups and into the survey. Indeed, most of the material in the survey first appeared on this blog, though for the survey I’ve dropped the “outsider” bit. Through writing this blog and through the many lively discussions with people inside and outside RL, I feel like I finally understand the nuances of the area and the challenges the field faces moving forward.</p>
<p>I’d like to thank Chris Wiggins for sharing his taxonomy on machine learning, Roy Frostig for shaping my views on direct policy search, Pavel Pravdin for consulting on how to get policy gradient methods up and running, and Max Raginsky for perspectives on adaptive control and translations from Russian. I’d like to thank Moritz Hardt, Eric Jonas, and Ali Rahimi for helping to shape the language, rhetoric, and focus of the blog series. I’d also like to thank Nevena Lazic, Gergely Neu, and Stephen Wright for many helpful suggestions for improving the readability and accuracy of the survey. This work was generously supported in part by two forward-looking programs at DOD, namely the Mathematical Data Science program at ONR and the Foundations and Limits of Learning program at DARPA.</p>
<p>Additionally, I’d like to thank my other colleagues in machine learning and control for many helpful conversations and pointers: Murat Arcak, Karl Astrom, Francesco Borrelli, John Doyle, Andy Packard, Anders Rantzer, Lorenzo Rosasco, Shankar Sastry, Yoram Singer, Csaba Szepesvari, and Claire Tomlin. I’d also like to thank my colleagues in robotics, Anca Dragan, Leslie Kaelbling, Sergey Levine, Pierre-Yves Oudeyer, Olivier Sigaud, Russ Tedrake, and Emo Todorov, for sharing their perspectives on what sorts of RL and optimization technology works for them and what challenges they face in their research. Hopefully this survey provides a blueprint for all of these folks and more to begin further collaborations.</p>
<p>I’d like to thank everyone who took CS281B with me in the Spring of 2017 where I first tried to make sense of the problems in learning to control. And most importantly, a big thanks to everyone in my research group who has been wrestling with these ideas with me for the past several years. They have done much of the research highlighted here, have provided invaluable criticism of my writings, and have shaped my views on this space more than anyone else. In particular, Ross Boczar, Nick Boyd, Sarah Dean, Animesh Garg, Aurelia Guy, Qingqing Huang, Kevin Jamieson, Sanjay Krishnan, Laurent Lessard, Horia Mania, Nik Matni, Becca Roelofs, Ugo Rosolia, Ludwig Schmidt, Max Simchowitz, Stephen Tu, and Ashia Wilson.</p>
<p>Finally, a very special thanks to <a href="http://www.camoncoffee.de/">Camon Coffee</a> in Berlin for letting me haunt their shop while writing. Be sure to stop by next time you’re in Berlin.</p>
Mon, 25 Jun 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/06/25/rl-tour-fin/
http://benjamin-recht.github.io/2018/06/25/rl-tour-fin/Coarse-ID Control<p><em>This is the thirteenth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 14 is <a href="http://www.argmin.net/2018/06/25/rl-tour-fin">here</a>. Part 12 is <a href="http://www.argmin.net/2018/05/02/adp/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>Can poor models be used in control loops and still achieve near-optimal performance? In recent posts, we’ve seen the answer is certainly “maybe.” <a href="http://www.argmin.net/2018/02/26/nominal">Nominal control</a> could learn a poor model of the double-integrator with 10 samples and still achieve high performance. Is this optimal for the LQR problem? Is it really just as simple as fitting parameters and treating your estimates as true?</p>
<p>The answer is not entirely clear. To see why, let’s revisit my very fake datacenter model: a three state system where the state $x$ represents the internal temperature of the racks and the control $u$ provides local cooling of each rack. We modeled this dynamical system with a linear model</p>
\[x_{t+1} = Ax_t + Bu_t+w_t\]
<p>where</p>
\[A = \begin{bmatrix} 1.01 & 0.01 & 0\\ 0.01 & 1.01 & 0.01 \\ 0 & 0.01 & 1.01 \end{bmatrix}
\qquad \qquad B = I\]
<p>For $Q$ and $R$, I set $Q = I$ and $R= 1000 I$, modeling that the operator wanted to really reduce the electricity bill.</p>
<p>This example seems to pose a problem for nominal control: note that all of the diagonal entries of the true model are greater than $1$. If we drive the system with noise, the states will grow exponentially, and consequently, you’ll get a fire in your data center. So active cooling must certainly be applied. However, a naive least-squares solution might fit one of the diagonal entries to be less than $1$. Then, since we are placing such high cost on the controls, we might not try to cool that mode too much, and this would lead to a catastrophe.</p>
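This instability is easy to check numerically. Here’s a quick sketch (just numpy and the model above) confirming that the open-loop $A$ has spectral radius greater than one:

```python
import numpy as np

# The fake datacenter model from above: rack temperatures grow on
# their own (diagonal entries > 1) and couple weakly to neighbors.
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])

# Spectral radius: the largest eigenvalue magnitude. Anything above 1
# means the noise-driven state grows exponentially without cooling.
rho = np.max(np.abs(np.linalg.eigvals(A)))
print(rho)  # about 1.0241
```

Amusingly, even though every diagonal entry exceeds one, one eigenvalue of this $A$ is slightly below one; the instability comes from the dominant mode at roughly $1.024$. Small perturbations of the entries can thus flip which modes look stable, which is exactly the danger with a naive fit.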
<p>So how can we include the knowledge that our model is just an estimate and not accurate with a small sample count? My group has been considering an approach to this problem called “Coarse-ID Control” that tries to incorporate such uncertainty.</p>
<h2 id="coarse-id-ingredients">Coarse-ID Ingredients</h2>
<p>The general framework of Coarse-ID Control consists of the following three steps:</p>
<ol>
<li>Use supervised learning to learn a coarse model of the dynamical system to be controlled. I’ll refer to the system estimate as the <em>nominal system</em>.</li>
<li>Using either prior knowledge or statistical tools like the bootstrap, build probabilistic guarantees about the distance between the nominal system and the true, unknown dynamics.</li>
<li>Solve a <em>robust optimization</em> problem that optimizes control of the nominal system while penalizing signals with respect to the estimated uncertainty, ensuring stable, robust execution.</li>
</ol>
<p>This approach is an example of <em>Robust Control</em>. In robust control, we try to find a controller that works not only for one model, but all possible models in some set. In this case, as long as the true behavior lies in this set of candidate models, we’ll be guaranteed to find a performant controller. The key here is that we are using machine learning to identify not only the plant to be controlled, <em>but the uncertainty as well</em>.</p>
<p>The coarse-ID procedure is well illustrated through the case study of LQR. First, we can estimate $A$ and $B$ by exciting the system with a little random noise, measuring the outcome, and then solving a least-squares problem. We can then guarantee how accurate these estimates are <a href="https://arxiv.org/abs/1802.08334">using some heavy-duty probabilistic analysis</a>. And for those of you out there who smartly don’t trust theory bounds, you can also use a simple bootstrap approach to estimate the uncertainty set. Once we have these two estimates, we can pose a robust variant of the standard LQR optimal control problem that computes a controller that stabilizes all of the models that would be consistent with the data we’ve observed.</p>
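Here’s a rough sketch of the first two steps (illustrative only, not the code from the paper): excite the system with random inputs, fit $(A,B)$ by least squares, and use a residual bootstrap as a crude stand-in for the uncertainty quantification:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, T = 3, 3, 200  # state dim, input dim, number of transitions

A_true = np.array([[1.01, 0.01, 0.00],
                   [0.01, 1.01, 0.01],
                   [0.00, 0.01, 1.01]])
B_true = np.eye(d)

# Step 1: drive the system with random inputs and record transitions.
X = np.zeros((T + 1, d))
U = rng.normal(size=(T, p))
for t in range(T):
    w = 1e-2 * rng.normal(size=d)
    X[t + 1] = A_true @ X[t] + B_true @ U[t] + w

# Fit x_{t+1} ~ [A B] [x_t; u_t] by least squares.
Z = np.hstack([X[:-1], U])
Theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
A_hat, B_hat = Theta[:d].T, Theta[d:].T

# Step 2: residual bootstrap as a crude uncertainty estimate.
# Refit on resampled residuals and record how far the estimate moves.
resid = X[1:] - Z @ Theta
eps = []
for _ in range(100):
    Xb = Z @ Theta + resid[rng.integers(0, T, size=T)]
    Theta_b, *_ = np.linalg.lstsq(Z, Xb, rcond=None)
    eps.append(np.linalg.norm(Theta_b - Theta, 2))
model_error = max(eps)  # handed to the robust synthesis in step 3
```

Step 3, the robust LQR synthesis, is a semidefinite program and would need a solver, so it is left out of this sketch.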
<p>Putting all these pieces together, and leveraging some new results in control theory, my students Sarah Dean, Horia Mania, and Stephen Tu, post-doc Nik Matni, and I were able to combine this into the first <a href="https://arxiv.org/abs/1710.01688">end-to-end guarantee for LQR</a>. We derived non-asymptotic bounds that guaranteed finite performance on the infinite time horizon, and were able to quantitatively bound the gap between our solution and the best controller you could design if you knew the model exactly.</p>
<p>To be a bit more precise, suppose that we have state dimension $d$ and $p$ control inputs. Our analysis guarantees that after $O(d+p)$ iterations, we can design a controller that will have low cost on the infinite time horizon. That is, we can guarantee that we stabilize the system (we won’t cause fires) after seeing only a finite amount of data.</p>
<h2 id="proof-is-in-the-pudding">Proof is in the pudding</h2>
<p>Let’s return to the data center problem to see how this does on real data and not just in theory. To solve the robust LQR problem, we end up solving a small semidefinite programming problem as <a href="https://arxiv.org/abs/1710.01688">described in our paper</a>. Though I know that most people are scared to run SDPs, for the size of the problems we consider, these are solved on my laptop in well under a second.</p>
<p>In the plots below we compare nominal control to two versions of the robust LQR problem. The blue line denotes performance when we tell the robust optimization solver what the actual distance is from the nominal model to the true model. The green curve depicts what happens when we estimate this difference between the models using a bootstrap simulation. Note that the green curve is worse, but not that much worse:</p>
<p class="center"><img src="/assets/rl/coarse-id/datacenter_cost_inf_600_iter.png" alt="controller performance" width="250px" />
<img src="/assets/rl/coarse-id/datacenter_stabilizing_600_iter.png" alt="stabilizing" width="250px" /></p>
<p>Note also that nominal control frequently yields controllers that fail to stabilize the true system. The robust optimization really helps here to provide controllers that are guaranteed to stabilize the system. On the other hand, in industrial practice nominal control does seem to work quite well. I think a great open problem is to find reasonable assumptions under which the nominal controller is stabilizing. This will involve some hairy analysis of perturbations of Riccati equations, but it would really help to fill out the picture of when such methods are safely applicable.</p>
<p>And of course, let’s not leave out model-free RL approaches:</p>
<p class="center"><img src="/assets/rl/coarse-id/datacenter_cost_inf_5000_iter.png" alt="controller performance zoom out" width="220px" />
<img src="/assets/rl/coarse-id/datacenter_stabilizing_5000_iter.png" alt="stabilizing zoom out" width="220px" />
<img src="/assets/rl/coarse-id/legend.png" alt="legend" width="110px" /></p>
<p>Here we again see that they are indeed far off from their model-based counterparts. The x-axis has increased by a factor of 10, and yet even the approximate dynamic programming approach LSPI is not finding decent solutions. It’s worth remembering that not only are model-free methods sample hungry, but they also fail to be safe. And safety is much more critical than sample complexity.</p>
<h2 id="pushing-against-the-boundaries">Pushing against the boundaries</h2>
<p>Since Coarse-ID Control works so well on LQR, I think it’s going to be very interesting to try to push its limits. I’d like to understand how this works on <em>nonlinear</em> problems. Can we propagate parametric uncertainties into control guarantees? Can we model nonlinear problems with linear models and estimate the nonlinear uncertainties? There are a lot of great open problems following up this initial work, and I want to expand on the big set of unsolved problems in the next post.</p>
Fri, 11 May 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/05/11/coarse-id-control/
http://benjamin-recht.github.io/2018/05/11/coarse-id-control/Lost Horizons<p><em>This is the twelfth part of <a href="http://www.argmin.net/outsider-rl.html">“An Outsider’s Tour of Reinforcement Learning.”</a> Part 13 is <a href="http://www.argmin.net/2018/05/11/coarse-id-control/">here</a>. Part 11 is <a href="http://www.argmin.net/2018/04/24/ilc/">here</a>. Part 1 is <a href="http://www.argmin.net/2018/01/29/taxonomy/">here</a>.</em></p>
<p>This series began by describing a view of reinforcement learning as optimal control with unknown costs and state transitions. In the case where everything is known, we know that dynamic programming generically provides an optimal solution. However, when the models and costs are unknown, or when the full dynamic program is intractable, we must rely on approximation techniques to solve RL problems.</p>
<p>How you approximate the dynamic program is, of course, the hard part. Bertsekas recently released a revised version of his seminal book on <a href="http://web.mit.edu/dimitrib/www/dpchapter.html">dynamic programming and optimal control</a>, and Chapter 6 of Volume 2 has a comprehensive survey of data-driven methods to approximate dynamic programming. Though I don’t want to repeat everything Bertsekas covers here, I think describing his view of the problem builds a clean connection to receding horizon control, and bridges the complementary perspectives of classical controls and contemporary reinforcement learning.</p>
<h2 id="approximate-dynamic-programming">Approximate Dynamic Programming</h2>
<p>While I don’t want to belabor a full introduction to dynamic programming, let me try, in as short a space as possible, to review the basics.</p>
<p>Let’s return to our classic optimal control problem:</p>
\[\begin{array}{ll}
\mbox{maximize}_{u_t} & \mathbb{E}_{e_t}[ \sum_{t=0}^N R[x_t,u_t] ]\\
\mbox{subject to} & x_{t+1} = f(x_t, u_t, e_t)\\
& \mbox{($x_0$ given).}
\end{array}\]
<p>Though we can solve this directly on finite time horizons using some sort of batch solver, there is often a simpler strategy based on <em>dynamic programming</em> and the <em>principle of optimality</em>: if you’ve found an optimal control policy for a time horizon of length $N$, $\pi_1,\ldots, \pi_N$, and you want to know the optimal strategy starting at state $x$ at time $t$, then you just have to take the optimal policy starting at time $t$, $\pi_t,\ldots,\pi_N$. Dynamic programming then lets us recursively find a control policy by starting at the final time and solving for policies at earlier times.</p>
<p>On the infinite time horizon, letting $N$ go to infinity, we get a clean statement of the principle of optimality. If we define $V(x)$ to be the value obtained from solving the optimal control problem with initial condition $x$, then we have</p>
\[V(x) = \max_u \mathbb{E}_{e}\left[R[x,u] + V(f(x,u,e))\right]\,.\]
<p>This equation, known as Bellman’s equation, is almost obvious given the structure of the optimal control problem. But it defines a powerful recursive formula for $V$ and forms the basis for many important algorithms in dynamic programming. Also note that if we have a convenient way to optimize the right hand side of this expression, then we can find the optimal action by finding the $u$ that maximizes the right hand side.</p>
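For a small discrete problem, Bellman’s equation can be solved by fixed-point iteration on $V$, which is exactly value iteration. Here’s a minimal sketch on a made-up three-state, two-action MDP; I’ve added a discount factor $\gamma$ so the infinite-horizon sum becomes a contraction, which the undiscounted formulation above doesn’t have:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[u, x, y] = probability of landing in y after action u in state x.
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
              [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.0, 0.1]]])
R = np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])  # reward R[x, u]

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman update: V(x) <- max_u R[x, u] + gamma * E[V(next state)]
    Q = R + gamma * np.einsum('uxy,y->xu', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new

pi = Q.argmax(axis=1)  # greedy policy extracted from the value function
```

At the fixed point, $V$ satisfies the Bellman equation up to tolerance, and the greedy $u$ in each state is the optimal action. TD and Q-learning replace the exact expectation in the update with sampled transitions.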
<p>Classic reinforcement learning algorithms like TD and Q-learning take the Bellman equation as a starting point, and try to iteratively solve for the value function using data. These ideas also form the underpinnings of now-popular methods like DQN. I’d again highly recommend Bertsekas’ survey describing the many different approaches one can take to approximately solve this Bellman equation. Rather than covering this, I’d like to use this as jumping off point to compare this viewpoint to that of receding horizon control.</p>
<h2 id="receding-horizon-control">Receding Horizon Control</h2>
<p>As we discussed in the previous posts, 95% of controllers are PID control. Of the remaining 5%, 95% of those are probably based on receding horizon control (RHC). RHC, also known as <em>model predictive control</em> (MPC), is an incredibly powerful approach to controls that marries simulation and feedback.</p>
<p>In RHC an agent makes a plan based on a simulation from the present until a short time into the future. The agent then executes one step of this plan, and then, based on what it observes after taking this action, returns to short-time simulation to plan the next action. This feedback loop allows the agent to link the actual impact of its choice of action with what was simulated, and hence can correct for model mismatch, noise realizations, and other unexpected errors.</p>
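Here’s what that loop looks like in its simplest form, sketched for a linear model with quadratic costs (an illustrative sketch, not any particular MPC package; with a fixed linear model and no constraints the plan comes out the same every step, but the plan/execute/observe pattern is the point):

```python
import numpy as np

# Model and costs (the datacenter example from a previous post).
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B = np.eye(3)
Q, R = np.eye(3), 1000 * np.eye(3)
H = 20  # planning horizon

def plan_first_action(A, B, Q, R, H, x):
    # "Simulate to the horizon": the backward Riccati recursion computes
    # the optimal finite-horizon policy; we keep only the first action.
    P = Q
    for _ in range(H):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return -K @ x

rng = np.random.default_rng(0)
x = np.ones(3)
for t in range(100):
    u = plan_first_action(A, B, Q, R, H, x)  # plan, execute first action
    w = 1e-3 * rng.normal(size=3)            # disturbance the plan never saw
    x = A @ x + B @ u + w                    # observe, then re-plan
```

Even though the open-loop plant is unstable, the replanned first actions keep the state bounded; the feedback corrects for the disturbances that the short-horizon plan didn’t model.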
<p>Though I have heard MPC referred to as “classical control” whereas techniques like LSTD and Q-learning are more in the camp of “postmodern reinforcement learning,” I’d like to argue that these are just different variants of approximate dynamic programming.</p>
<p>Note that a perfectly valid expression for the value function $V(x_0)$ is the maximal value of the optimization problem</p>
\[\begin{array}{ll}
\max_{u_t} & \mathbb{E}_{e_t}[ \sum_{t=0}^N R[x_t,u_t] + V(x_{N+1})]\\
\mbox{subject to} & x_{t+1} = f(x_t, u_t, e_t)\\
& \mbox{($x_0$ given).}
\end{array}\]
<p>Here we have just unrolled the cost beyond one step, but still collect the cost-to-go $N$ steps in the future. Though this is trivial, it is again incredibly powerful: the longer we make the time horizon, the less we have to worry about the value function $V$ being accurate. Of course, now we have to worry about the accuracy of the state-transition map, $f$. But, especially in problems with continuous variables, it is not at all obvious which accuracy is more important in terms of finding algorithms with fast learning rates and short computation times. There is a tradeoff between learning models and learning value functions, and this is a tradeoff that needs to be better understood.</p>
<p>Though RHC methods appear fragile to model mismatch, because they are only as good as the model, the repeated feedback inside RHC can correct for many modeling errors. As an example, it’s very much worth revisiting the robotic locomotion tasks inside the MuJoCo framework. These tasks actually were designed to test the power of a nonlinear RHC algorithm developed by <a href="https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf">Tassa, Erez, and Todorov</a>.</p>
<p>Here’s a video of such a controller in action from 2012:</p>
<div style="text-align: center">
<iframe width="315" height="315" src="https://homes.cs.washington.edu/~todorov/media/TassaIROS12.mp4" frameborder="0" allowfullscreen=""></iframe></div>
<p>Fast forward to 2:50 to see the humanoid model we discussed in the <a href="http://www.argmin.net/2018/03/20/mujocoloco">random search post</a>. Note that the controller works to keep the robot upright, even when the model is poorly specified. Hence, the feedback inside the RHC loop is providing a considerable amount of robustness to modeling errors. Also note that this demo does not estimate the value function at all. Instead, they simply truncate the infinite time-horizon problem. The receding horizon approximation is already quite good for the purpose of control.</p>
<p>Moreover, the controller in the video linked above was computed at about 7x slower than real time in 2012. Which is really not bad, and with a dedicated engineer, this could probably be made real time on up-to-date hardware. However, note that in 2013, the same research group published a <a href="https://homes.cs.washington.edu/~todorov/papers/ErezHumanoids13.pdf">cruder version of their controller that they used during the DARPA robotics challenge</a>. The video here is just as impressive:</p>
<div style="text-align: center">
<iframe width="420" height="315" src="https://homes.cs.washington.edu/~todorov/media/ErezHumanoids13.mp4" frameborder="0" allowfullscreen=""></iframe></div>
<p>All these behaviors were generated by MPC in real-time. The walking is not as smooth as what can be obtained from computationally intensive long-horizon trajectory optimization, but it looks considerably better than the sort of direct policy search gaits <a href="http://www.argmin.net/2018/03/20/mujocoloco">we discussed in a previous post</a>.</p>
<h2 id="learning-in-rhc">Learning in RHC</h2>
<p>Is there a middle ground between expensive offline trajectory optimization and real time model-predictive control? I think the answer is yes in the very same way that there is middle ground between learning dynamical models and learning value functions. Performance of a receding control system can be improved by better modeling of the value function which defines the terminal cost. The better a model you make of the value function, the shorter a time horizon you need for simulation, and the closer you get to real-time operation. Of course, if you had a perfect model of the value function, you could just solve the Bellman equation and you would have the optimal control policy. But by having an approximation to the value function, high performance can still be extracted in real-time.</p>
<p>So what if we <em>learn</em> to iteratively improve the value function while running RHC? This idea has been explored in a project by my Berkeley colleagues <a href="https://arxiv.org/abs/1610.06534">Rosolia, Carvalho, and Borrelli</a>. In their “Learning MPC” approach, the terminal cost is learned by nearest neighbors. The terminal cost of a state is the value obtained last time you tried that state. If you haven’t visited that state, the cost is infinite. This formulation constrains the terminal condition to be in a state observed before. You can explore new ways to decrease your cost on the finite time horizon as long as you reach a state that you have already demonstrated is safe.</p>
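Here’s a sketch of that bookkeeping (my own illustrative rendering, not the authors’ implementation): store the states visited on earlier runs along with their realized cost-to-go, and define the terminal cost by nearest-neighbor lookup:

```python
import numpy as np

class NearestNeighborTerminalCost:
    """Terminal cost for a learning-MPC-style scheme: the cost of a
    planned end state is the realized cost-to-go of its nearest
    previously visited state, and infinite if nothing visited is near."""

    def __init__(self, radius):
        self.radius = radius
        self.states = []  # states visited on previous runs
        self.costs = []   # realized cost-to-go from each stored state

    def add_run(self, states, stage_costs):
        # Cost-to-go of each visited state: sum of stage costs after it.
        ctg = np.cumsum(np.asarray(stage_costs)[::-1])[::-1]
        self.states.extend(np.asarray(s, dtype=float) for s in states)
        self.costs.extend(ctg)

    def __call__(self, x):
        if not self.states:
            return np.inf  # nothing visited yet: no safe terminal set
        dists = [np.linalg.norm(x - s) for s in self.states]
        i = int(np.argmin(dists))
        return self.costs[i] if dists[i] <= self.radius else np.inf

# After one run, end states near previously visited states get finite cost.
V_term = NearestNeighborTerminalCost(radius=0.5)
V_term.add_run([np.zeros(2), np.ones(2)], [1.0, 0.5])
```

A planner would then add this terminal cost on $x_{N+1}$ in the receding horizon problem, which constrains plans to end near states already demonstrated to be safe.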
<p>This nearest-neighbors approach to control works really well in practice. Here’s a demo of the method on an RC car:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/4kHDv9senpE" frameborder="0" allowfullscreen="" class="center"></iframe>
<p>After only a few laps, the learned controller works better than a human operator. Simple nearest-neighbors suffices to learn rather complex autonomous actions. And, if you’re into that sort of thing, you can even prove a monotonic increase in control performance. Quantifying the actual learning rate remains open and would be a great problem for RL theorists out there to study. But I think this example cleanly shows how the gap between RHC methods and Q-learning methods is much smaller than it first appears.</p>
<h2 id="safety-while-learning">Safety While Learning</h2>
<p>Another reason to like this blended RHC approach to learning to control is that one can hard code in constraints on controls, states, and easily incorporate models of disturbance directly into the optimization problem. Some of the most challenging problems in control are how to execute safely while continuing to learn more about a system’s capability, and an RHC approach provides a direct route towards balancing safety and performance. <a href="http://www.argmin.net/2018/05/11/coarse-id-control/">In the next post</a>, I’ll describe an optimization-based approach to directly estimate and incorporate modeling errors into control design.</p>
Wed, 02 May 2018 00:00:00 +0000
http://benjamin-recht.github.io/2018/05/02/adp/
http://benjamin-recht.github.io/2018/05/02/adp/