arg min blog
Musings on systems, information, learning, and optimization.
http://benjamin-recht.github.io/
Nesterov's Punctuated Equilibrium
<p><em>Ed. Note: this post is co-written with <a href="https://cs.stanford.edu/~rfrostig/">Roy Frostig</a>.</em></p>
<p>Following the remarkable success of AlphaGo, there has been a groundswell of interest in reinforcement learning for <a href="https://arxiv.org/abs/1702.06230">games</a>, <a href="https://research.googleblog.com/2016/03/deep-learning-for-robots-learning-from.html">robotics</a>, <a href="https://arxiv.org/abs/1611.01578">parameter tuning</a>, and even <a href="http://dl.acm.org/citation.cfm?id=3005750">computer networking</a>. In a landmark <a href="https://arxiv.org/abs/1703.03864">new paper</a> by Salimans, Ho, Chen, and Sutskever from OpenAI, the authors show that a particular class of genetic algorithms (called Evolutionary Strategies) gives excellent performance on a variety of reinforcement learning benchmarks. As optimizers, the application of genetic algorithms raises red flags and usually causes us to close browser windows. But fear not! As we will explain, the particular algorithm deployed happens to be a core method in optimization, and the fact that this method is successful sheds light on the peculiarities of reinforcement learning more than it does about genetic algorithms in general.</p>
<h2 id="evolution-strategies-is-gradient-ascent">Evolution Strategies is Gradient Ascent</h2>
<p>Let’s look at the Evolution Strategies (ES) algorithm proposed in the paper. The goal is to maximize some reward function $R(x)$, where $x$ is $d$-dimensional. The algorithm computes the reward function at small perturbations away from its current state and then aggregates the returned function values into a new state. To be precise, it samples a collection of $n$ random directions $\epsilon_i$, normally distributed with mean zero and covariance equal to the identity, and then updates its state according to the rule</p>
<script type="math/tex; mode=display">x_{t+1} = x_{t} + \frac{\alpha}{\sigma n} \sum_{i=1}^n R(x_t + \sigma \epsilon_i) \epsilon_i \,.</script>
<p>Why is this a reasonable update? Let’s simplify this and consider the case where $n=1$ first. In this case, the update reduces to this simple iteration</p>
<script type="math/tex; mode=display">x_{t+1} = x_{t} + \alpha g_\sigma^{(1)}(x_t)</script>
<p>where</p>
<script type="math/tex; mode=display">g_\sigma^{(1)}(x)= \frac{1}{\sigma} R(x + \sigma \epsilon) \epsilon\,.</script>
<p>This still looks weird! What is it saying, exactly? It says that you should move along the direction $\epsilon$ proportionally to the reward: larger rewards mean you should move farther along that direction. Of course, if $R$ takes negative values, this can be strange: a large negative reward causes you to move a long way in the direction $-\epsilon$. An update that you may find simpler to reason about is the following</p>
<script type="math/tex; mode=display">g_\sigma^{(2)}(x) = \frac{R(x + \sigma \epsilon) - R(x - \sigma \epsilon) }{2\sigma} \epsilon\,.</script>
<p>This update says to compute a finite difference approximation to the gradient along the direction $\epsilon$ and move along the gradient. What’s not immediately obvious (though it’s a trivial calculation) is that $g_\sigma^{(1)}$ and $g_\sigma^{(2)}$ have the same expected value.</p>
<p>The finite difference interpretation also helps to reveal that this algorithm is essentially an instance of stochastic gradient ascent on the reward $R$. To see this, remember from calculus that</p>
<script type="math/tex; mode=display">\lim_{\sigma \downarrow 0} \frac{R(x + \sigma \epsilon) - R(x - \sigma \epsilon) }{2\sigma} = \nabla R(x)^T \epsilon</script>
<p>And, moreover</p>
<script type="math/tex; mode=display">\mathbb{E}_\epsilon\left[\epsilon\epsilon^T \nabla R(x)\right] = \nabla R(x)</script>
<p>So, for small enough $\sigma$, the update $g^{(2)}_\sigma$ acts like a stochastic approximation to the gradient.</p>
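To make this concrete, here is a minimal numpy sketch of the two estimators. The quadratic reward, the dimension, and all the names below are our own illustrative choices, not from the paper. One nice fact the sketch checks: for a quadratic reward the central difference is exact, so $g_\sigma^{(2)}(x) = \epsilon\epsilon^T \nabla R(x)$ holds for every single sample, not just in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# An arbitrary concave quadratic reward R(x) = 0.5 x^T Q x + p^T x + r (illustrative).
Q = -np.eye(d)
p = rng.standard_normal(d)
r = 100.0  # large nuisance offset

def R(x):
    return 0.5 * x @ Q @ x + p @ x + r

def grad_R(x):
    return Q @ x + p

def g1(x, eps, sigma):
    # one-sided estimator: (1/sigma) * R(x + sigma*eps) * eps
    return R(x + sigma * eps) * eps / sigma

def g2(x, eps, sigma):
    # antithetic (central-difference) estimator
    return (R(x + sigma * eps) - R(x - sigma * eps)) / (2 * sigma) * eps

x = rng.standard_normal(d)
eps = rng.standard_normal(d)
sigma = 0.1

# For a quadratic R, the central difference is exact:
# g2(x) equals (eps . grad R(x)) * eps for this single sample.
assert np.allclose(g2(x, eps, sigma), (eps @ grad_R(x)) * eps)
```

Averaging either estimator over many draws of $\epsilon$ recovers the gradient in expectation; the difference between the two, as discussed below, is variance.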
<p>In their experiments, Salimans et al. always use $g_\sigma^{(2)}$ rather than $g_\sigma^{(1)}$. They refer to $g_\sigma^{(2)}$ as <em>antithetic sampling</em>, a rather clever term borrowed from the Monte Carlo literature. Antithetic sampling dramatically improves performance in their experiments.</p>
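The full batched update with antithetic sampling is only a few lines of numpy. The sketch below is illustrative (toy concave reward, made-up constants, our own variable names) and simply checks that the iteration climbs the reward:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma, alpha, steps = 10, 20, 0.1, 0.05, 200

def R(x):
    # toy concave reward, maximized at x = 0 (illustrative, not from the paper)
    return -np.sum(x ** 2)

x = rng.standard_normal(d)
start = R(x)
for _ in range(steps):
    eps = rng.standard_normal((n, d))
    # antithetic pairs: evaluate R at x +/- sigma*eps and difference the rewards
    fplus = np.array([R(x + sigma * e) for e in eps])
    fminus = np.array([R(x - sigma * e) for e in eps])
    g = ((fplus - fminus)[:, None] * eps).sum(axis=0) / (2 * sigma * n)
    x = x + alpha * g
assert R(x) > start  # the reward improved
```

Note that the loop only ever queries $R$; no derivatives of the reward are used anywhere.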
<p>This particular algorithm (ES with antithetic sampling) is precisely equivalent to the <a href="https://link.springer.com/article/10.1007/s10208-015-9296-2">derivative-free optimization method</a> analyzed by Nesterov and Spokoiny in 2010. Noting this equivalence allows us to explain some of the observed advantages of ES, and to suggest some possible enhancements.</p>
<h2 id="reduce-your-variants">Reduce your variants</h2>
<p>Why does $g_\sigma^{(2)}$ perform better than $g_\sigma^{(1)}$? The answer is simply that, though they have the same expected value, $g_\sigma^{(2)}$ has significantly lower variance. To see why, let’s study the very boring but fundamental problem of maximizing a quadratic function</p>
<script type="math/tex; mode=display">R(x)=\frac{1}{2}x^TQx +p^Tx + r</script>
<p>Then we can explicitly write out the two updates:</p>
<script type="math/tex; mode=display">g_\sigma^{(1)}(x)= \frac{1}{\sigma} R(x)\, \epsilon+ \epsilon\epsilon^T\nabla R(x) + \frac{\sigma}{2}\, \epsilon \epsilon^T Q\epsilon</script>
<script type="math/tex; mode=display">g_\sigma^{(2)}(x)= \epsilon\epsilon^T \nabla R(x)</script>
<p>Note that $g_\sigma^{(2)}$ has two fewer terms, and those extra terms can be quite detrimental to convergence. First, the $R(x)$ term depends on the nuisance offset $r$: large values of $r$ essentially tell an algorithm using $g_\sigma^{(1)}$ that all directions are equally good. No optimization algorithm worth its salt should be sensitive to this offset. Second, the term $\epsilon \epsilon^T Q\epsilon$ has variance proportional to $d^3$, which is quite undesirable.</p>
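A quick Monte Carlo sketch (again with an illustrative quadratic and made-up constants of our own) makes the gap visible: the large offset $r$ inflates the variance of $g_\sigma^{(1)}$ by orders of magnitude while leaving $g_\sigma^{(2)}$ untouched.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, trials = 20, 0.1, 5000

# Illustrative quadratic R(x) = 0.5 x^T Q x + p^T x + r with a big offset r.
Q = -np.eye(d)
p = rng.standard_normal(d)
r = 1000.0  # nuisance offset that should not matter to any sane optimizer

def R(x):
    return 0.5 * x @ Q @ x + p @ x + r

x = rng.standard_normal(d)
samples1, samples2 = [], []
for _ in range(trials):
    eps = rng.standard_normal(d)
    samples1.append(R(x + sigma * eps) * eps / sigma)                                # g1
    samples2.append((R(x + sigma * eps) - R(x - sigma * eps)) / (2 * sigma) * eps)   # g2

var1 = np.var(np.array(samples1), axis=0).mean()
var2 = np.var(np.array(samples2), axis=0).mean()
# The one-sided estimator's variance is dominated by (r/sigma)^2 per coordinate.
assert var1 > 100 * var2
```

With $r = 1000$ and $\sigma = 0.1$, the $R(x)\epsilon/\sigma$ term alone contributes variance on the order of $10^8$ per coordinate, swamping the signal.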
<p>What happens when we batch, as Salimans et al. do in their paper? Nesterov does not study this in detail in his 2010 paper. In this case, we average over a collection of directions. If the $\epsilon_i$ were all orthogonal, this would be akin to moving along a finite difference approximation of the gradient restricted to a random subspace. And when $n$ is much smaller than $d$, random Gaussian directions are nearly orthogonal, so this is pretty much exactly what happens: we move along a finite difference approximation to the gradient of $R$ in a random subspace. The algorithm is thus very similar to random coordinate ascent, and we wouldn’t be too surprised if choosing random coordinates rather than random subspace directions performed comparably well on these problems.</p>
<p>Now this is where things start to get interesting. In this <a href="http://videolectures.net/deeplearning2016_abbeel_deep_reinforcement/?q=abbeel">excellent tutorial</a>, Pieter Abbeel describes using finite difference methods for solving reinforcement learning problems. This is a well-studied idea that, for some reason, fell out of favor relative to cross-entropy and policy gradient methods. We haven’t quite figured out <em>why</em> it fell out of favor, but in light of this recent work from OpenAI, perhaps the reason is that the overhead of computing the finite difference approximation along <em>all</em> of the coordinates was too costly. As the experiments cleanly show, using a small subset of the coordinate directions is computationally inexpensive and finds excellent directions for improving reward on many benchmarks in the OpenAI Gym.</p>
<p>Nesterov’s theoretical analysis helps to elucidate how many coordinates one should descend upon. Nesterov shows that his random search algorithm requires no more than $d$ times the iterations required by the gradient method. If you minibatch with batch size $m$, the number of iterations goes down by roughly a factor of $m$. But there are diminishing returns with respect to batch size, and eventually you are better off computing full gradients. Moreover, even when there is variance reduction, the number of function calls stays the same: each minibatch requires $m$ function evaluations, so the total number of function evaluations is still $d$ times the number of steps required by the gradient method.</p>
<p>Thus, in a serial setting, minibatching might hurt you. In theory, you can’t get a linear reduction in iterations with minibatches, and batches that are too large will slow down convergence. In the extreme, you are essentially just computing a finite difference approximation of the gradient. But in the parallel case, minibatching is great, since you can take advantage of embarrassing parallelism and receive a significant reduction in wall clock time even if the total number of function evaluations is larger than in the serial case.</p>
<h2 id="accelerated-evolution">Accelerated Evolution</h2>
<p>One of our favorite features of an optimization-centric viewpoint is that we can apply other widgets from the optimization toolkit to improve the performance of algorithms. A natural addition to this gradient-free algorithm is <em>momentum</em>, which can accelerate convergence. Acceleration is likely what Nesterov is best known for. Adding acceleration simply requires changing the procedure to</p>
<script type="math/tex; mode=display">x_{t+1} = (1+\beta) x_{t} - \beta x_{t-1}+ \frac{\alpha}{\sigma n} \sum_{i=1}^n R(x_t + \sigma \epsilon_i) \epsilon_i</script>
<p>This one-line change is simple to implement in the parallel algorithm proposed by Salimans et al. and could provide further speedups over standard policy gradient methods. I suppose if we wanted to merge universes, we could call this “Nesterov’s accelerated evolution.”</p>
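To illustrate how small the change is, here is a toy ES loop with the momentum term added. We use the antithetic gradient estimate, and the reward, constants, and momentum coefficient below are illustrative choices of ours, not values from any paper:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, sigma, alpha, beta, steps = 10, 20, 0.1, 0.05, 0.5, 200

def R(x):
    # toy concave reward, maximized at x = 0 (illustrative)
    return -np.sum(x ** 2)

x_prev = x = rng.standard_normal(d)
start = R(x)
for _ in range(steps):
    eps = rng.standard_normal((n, d))
    # antithetic gradient estimate
    diffs = np.array([R(x + sigma * e) - R(x - sigma * e) for e in eps])
    g = (diffs[:, None] * eps).sum(axis=0) / (2 * sigma * n)
    # the one-line change: extrapolate with momentum coefficient beta
    x, x_prev = (1 + beta) * x - beta * x_prev + alpha * g, x
assert R(x) > start  # the reward improved
```

Everything except the final update line is identical to the un-accelerated loop.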
<h2 id="use-your-gradients">Use your gradients</h2>
<p>Would this random search technique work in training neural nets for supervised learning? The answer depends on how much time you have: if your neural net model has a few million parameters, this finite difference approach would likely need a million times as many iterations as gradient descent. As Nesterov says “if you have gradients, you should use them!”</p>
<p>A deeper question is: why do finite difference methods work well for reinforcement learning in the first place? We’ll propose reasons in our next post. Essentially, model-free reinforcement learning <em>is</em> derivative free optimization. If the only access you have to the behavior of a system is through querying the reward given a policy, you never get derivatives of the reward. The conceit of classic methods like policy gradient is that they convince you that you are doing gradient descent, but the gradient you descend upon is not the gradient of the function you are trying to optimize! We will flesh this out in more detail in our next post.</p>
Mon, 03 Apr 2017 07:00:00 +0000
http://benjamin-recht.github.io/2017/04/03/evolution/
The Fall of BIG DATA
<p>I’m still in total shock from the decision my country made last Tuesday. We elected a hateful, bigoted, misogynistic, incompetent demagogue to lead us into a dark and foreboding future. While the internet has been flooded with hot takes about why this happened, I’d like to reflect a bit about why I am so crushed by this outcome.</p>
<p>I have been a machine learning researcher for nearly 15 years. I have been enthralled by the promise of data-driven methods to enrich our lives and make the impossible possible. This election is a resounding indictment of the information infrastructures we’ve built to inform ourselves. And I am reaching out to the machine learning community to come to terms with this fact and to do better.</p>
<p>There are three major failures of this cycle that are mostly the fault of our infatuation with data. The first is polling. The science of polling was shown to be beyond fallible, with completely incorrect voter screens and projections. <a href="http://andrewgelman.com/2016/11/11/election-surprise-three-ways-thinking-probability/">Gelman</a> and others argue that we can learn from a mistake much as we learn from the black box of a crashed plane. But we currently fly <a href="http://www.transtats.bts.gov/Data_Elements.aspx?Data=2">tens of thousands of flights per day</a> in our domestic airspace and have had zero fatalities in 2016. This was achieved by rigorous scientific analysis, careful engineering, extensive regulatory oversight, and long training, not simply by reverse-engineering crashed planes, one after another. Statisticians pointing out that rare events occur does not give us a way forward for making our methods of devoting resources to voter turnout or persuasion more robust. Moreover, we treat polling like BIG DATA, with sophisticated poll manipulation and averaging, even though we have fewer than 10 relevant elections to use to fit our models. There is simply no way to analyze the polls without overfitting.</p>
<p>The second major failure is in targeted news on social media – virality is proving fatal to truth in political discourse. Here, the success of BIG DATA led to a major failure in the democratic process. I’m disheartened to hear that <a href="https://www.washingtonpost.com/national/zuckerberg-that-facebook-influenced-election-is-crazy/2016/11/11/2ff14280-a822-11e6-ba46-53db57f0e351_story.html">Mark Zuckerberg won’t acknowledge</a> the role Facebook played in spreading disinformation in the 2016 campaign. <a href="http://www.niemanlab.org/2016/05/pew-report-44-percent-of-u-s-adults-get-news-on-facebook/">More than half of the country</a> gets its news from social media, and when that news is targeted it simply feeds into confirmation bias. Our community has developed remarkably effective tools to microtarget advertisements. <em>But if you use ad models to deliver news, that’s propaganda.</em> And just because we didn’t intend to spread rampant misinformation doesn’t mean we are not responsible.</p>
<p>And the third major failure has been a general apathy about politics amongst my colleagues here in the Bay Area. When many of the best minds in machine learning have decided that the most existential threat to civilization is the <a href="http://www.newyorker.com/magazine/2016/10/10/sam-altmans-manifest-destiny">rise of Skynet</a>, we have had a major episode of groupthink. Many ML researchers are more concerned with trying to bring about The Singularity than with solving real problems. People are suffering all around us, and many of them are suffering precisely because of our <a href="http://www.themoneyillusion.com/?p=31847">advances in automation</a>. On top of this, 2016 is going to be the warmest year on record. If we devote the majority of our talents and resources to sci-fi navel gazing, then we are gravely failing the world with our neglect.</p>
<p>But I think we can be better. I think machine learning can be a powerful tool for social good. I think scientific minds are crucial to moving the world in a more positive direction. But we must now make this decision as a community. I am heartened by <a href="https://twitter.com/sama/status/796259060521652224">Sam Altman’s call to action</a>. But now is the time to put your money and talents where your mouths are.</p>
<p><a href="http://www.motherjones.com/kevin-drum/2016/11/three-things-remember">Kevin Drum</a> made some very important points that I want to reiterate and expand upon here:</p>
<ul>
<li>
<p>“We have elected a loudmouth, race-baiting game show host president of the United States.” This man has appointed an openly racist, antisemitic, misogynist as senior advisor. This man has appointed a climate change denier to head his EPA transition. This man has said on national television that he wants to repeal Roe vs Wade.</p>
</li>
<li>
<p>This election was very close. This is not a universal condemnation of progressive values. A few thousand votes in a few places would have resulted in a completely different outcome. And if we want to strive to achieve that outcome, it is time to become more active and vigilant. We have to be better. And we have to be better now.</p>
</li>
<li>
<p>Regardless of our mobilization, there are many people threatened by this new regime in America. Our Muslim and Latino friends have been openly targeted. The president-elect has called for nationalizing “stop-and-frisk.” We have a lot of resources in the machine learning community. Some of us are very wealthy. Others hold positions of influence. We need to use this power to protect those who are threatened by this new regime.</p>
</li>
</ul>
<p>We must act, and we must act now. I am hoping that we can put our heads together to work for good in spite of this tremendous setback. I want to write a blogpost about the rise of BIG DATA. About how we used our technical acumen to help each other, protect those endangered, and save our fragile environment. That requires action and mobilization. And it has to happen now.</p>
<p>If you are up for a constructive conversation on how to move forward, please leave a comment. I don’t think any of us have a concrete plan yet on how to act, but I hope we can work towards something positive together.</p>
Mon, 14 Nov 2016 07:00:00 +0000
http://benjamin-recht.github.io/2016/11/14/fall-of-big-data/
Embracing the Random
<p><em>Ed. Note: this post is again in my voice, but co-written with <a href="http://people.eecs.berkeley.edu/~kjamieson/about.html">Kevin Jamieson</a>. Kevin provided all of the awesome plots, and has a <a href="http://people.eecs.berkeley.edu/~kjamieson/hyperband.html">great tutorial</a> for implementing the algorithm I’ll describe in this post.</em></p>
<p>In the last post, I argued that random search is a competitive method for black-box parameter tuning in machine learning. This is actually great news! Random search is an incredibly simple algorithm, and if it is as powerful as anything else we’ve come up with so far, we can devote our time to optimizing random search for the particularities of our workloads rather than worrying about baking off hundreds of new algorithmic ideas.</p>
<p>In some recent work, Kevin Jamieson and Ameet Talwalkar pursued a <a href="http://arxiv.org/abs/1502.07943">very nice direction</a> in accelerating random search. Their key insight is that most of the algorithms we run in machine learning are iterative, so if we are running a set of hyperparameters and the progress looks terrible, it might be a good idea to quit early and just try a new set of hyperparameters.</p>
<p>One way to implement such a scheme is called <em>successive halving</em>, and the idea is remarkably simple. We first try out $N$ hyperparameter settings, running each for some fixed amount of time $T$. Then we keep the $N/2$ best performing settings and run each of those for time $2T$. Repeating this procedure $\log_2(M)$ times, we end up with $N/M$ configurations, each run for time $MT$.</p>
<p>The total amount of computation in each halving round is $NT$, and there are $\log_2(M)$ rounds in total. If we restricted ourselves to pure random search with the same computation budget (i.e., $NT\log_2(M)$ time steps) and required each of the chosen parameter settings to be run for $MT$ steps, we would only be able to run $N \log_2(M)/M$ hyperparameter settings. Thus, in the same amount of time, successive halving sees a factor of $M/\log_2(M)$ more parameter configurations than pure random search!</p>
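The procedure above can be sketched in a few lines of Python. This is our own illustrative stand-in, not Kevin's actual implementation: the `run(config, t)` interface and the toy "loss" (which pretends training a learning rate near 0.1 is best) are hypothetical.

```python
import random

def successive_halving(configs, run, T):
    """Run all configs for time T, keep the best half, double the time, repeat.

    `run(config, t)` returns a loss after training `config` for time t
    (smaller is better). Returns the single surviving config.
    """
    t = T
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: run(c, t))
        configs = scored[: max(1, len(configs) // 2)]  # keep the best half
        t *= 2                                         # double the budget
    return configs[0]

# Toy demo: each "config" is a learning rate; the pretend loss after time t
# favors rates near 0.1 (purely illustrative, no real training here).
random.seed(0)
configs = [10 ** random.uniform(-4, 0) for _ in range(16)]
loss = lambda lr, t: abs(lr - 0.1) + 1.0 / t
best = successive_halving(configs, loss, T=1)
assert best == min(configs, key=lambda lr: abs(lr - 0.1))
```

With 16 starting configurations, this runs four halving rounds (16, 8, 4, 2 survivors) before returning the winner.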
<p>Now, the problem here is that just because a parameter setting looks bad at the beginning of a run of SGD doesn’t mean that it won’t be optimal at the end of the run. We see this a lot when tuning learning rates: slow learning rates are often poor for the first couple of epochs, but end up with the lowest test error after a hundred passes over the data.</p>
<p>A simple way to deal with this tradeoff between breadth and depth is to start the halving process later. We could run $N/2$ parameter settings for time $2T$, then the top $N/4$ for time $4T$, and so on. This adapted halving scheme gives slow learners more of a chance to survive before being cut, but the total amount of time per halving round is still $NT$ and the number of rounds is at most $\log_2(M)$. Running multiple instances of successive halving with different starting times thus increases depth while narrowing breadth.</p>
<p>Following up on Kevin and Ameet’s initial work, Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar recently provided a <a href="http://arxiv.org/abs/1603.06560">simple, automatic way to balance these breadth-versus-depth tradeoffs</a>. The algorithm is remarkably simple: see Kevin’s <a href="http://people.eecs.berkeley.edu/~kjamieson/hyperband.html">project page</a> for the 7 lines of python code. The only parameters you need to specify to generate a search protocol are the minimum amount of time you’d like to run your model before checking it against other models and the maximum amount of time you’d ever be interested in running for. In the examples above, the minimum time was $T$ and the maximum time was $MT$. In some sense, these parameters are more like constraints: there is some overhead in checking on a process, and the minimum time should be larger than that overhead. The maximum runtime is there because we all have deadlines. The only parameter that needs to be given to Kevin’s code is $M$ (which he calls <code class="highlighter-rouge">max_iter</code>).</p>
<p>Successive halving was inspired by an earlier heuristic by Evan Sparks and coauthors, which showed that the simple idea of killing iterative jobs based on early progress <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2015/07/163-sparks.pdf">worked really well in practice</a>. The version described above is an adaptation of the algorithm proposed by Karnin, Koren, and Somekh for <a href="http://jmlr.org/proceedings/papers/v28/karnin13.pdf">stochastic multi-armed bandits</a>. Li <em>et al.</em> provide a multiround scheme (which they call Hyperband) that adapts the scheme of Karnin <em>et al.</em> to finite-horizon, non-stochastic search. Li <em>et al.</em> describe a number of extensions of the scheme for other machine learning workloads, many interesting theoretical guarantees, and implications for stochastic infinite-armed bandit problems (if you’re into that sort of thing). Let’s wrap up this blogpost with some empirical evidence that this algorithm actually works.</p>
<h2 id="neural-net-experiments">Neural net experiments</h2>
<p>Let’s look at three small image classification benchmarks: CIFAR-10, the Street View House Numbers (SVHN), and rotated MNIST with background images (MRBI). The CIFAR-10 and SVHN data sets contain 32 × 32 RGB images, while MRBI contains 28 × 28 grayscale images. Each dataset is split into a training, validation, and test set: (1) CIFAR-10 has 40,000, 10,000, and 10,000 instances; (2) SVHN has close to 600,000, 6,000, and 26,000 instances; and (3) MRBI has 10,000, 2,000, and 50,000 instances for training, validation, and test, respectively.</p>
<p>For the experts out there, the goal is to tune the basic <a href="https://code.google.com/p/cuda-convnet/">cuda-convnet model</a>, searching for the optimal learning rate, learning rate decay, $\ell_2$ regularization parameters on different layers, and parameters of the response normalizations. For all three datasets, the basic unit of time, $T$, was 10,000 examples. For CIFAR-10 this was one-fourth of an epoch, and $M$ was 300, equivalent to 75 epochs over the 40,000 example training set. For SVHN, $T$ corresponded to one-sixtieth of an epoch and $M$ was set to 600, equivalent to 10 epochs over the 600,000 example training set. For MRBI, $T$ was one epoch and $M$ was 300. The full details of these experiments are described in the paper. The plots below compare the performance of the Hyperband algorithm to a variety of other hyperparameter tuning algorithms. In particular, as raised in the comments on the previous post, we are comparing to <a href="https://github.com/JasperSnoek/spearmint">Spearmint</a>, a very popular Bayesian optimization scheme, and to <a href="http://ijcai.org/Proceedings/15/Papers/487.pdf">SMAC-early</a>, a variant of SMAC that is designed to incorporate early stopping. The following plots are curves of the mean of 10 trials (<em>not the min of 10 trials</em>).</p>
<p class="center"><img src="/assets/hyperband/cifar10-compare.png" alt="Comparison of methods on CIFAR-10" />
<img src="/assets/hyperband/svhn-compare.png" alt="Comparison of methods on SVHN" /></p>
<p class="center"><img src="/assets/hyperband/mrbi-compare.png" alt="Comparison of methods on MRBI" /></p>
<p>First, note that Random-2x is again a very competitive algorithm. None of the pure Bayesian optimization methods outperform Random-2x on all three data sets. Only SMAC-early, with its ability to stop underperforming jobs, is able to consistently outperform Random-2x.</p>
<p>The comparison to Hyperband, on the other hand, is striking. On average, Hyperband finds a decent solution in a fraction of the time of all of the other methods. It also finds the best solution overall in all three cases. On SVHN, it finds the best solution in a fifth of the time of the other methods. And, again, the protocol is just 7 lines of python code. Hyperband is just a first step and might not be the ideal solution for your particular workload. But I think these plots nicely illustrate how simple enhancements of random search can go a very long way.</p>
Thu, 23 Jun 2016 07:00:00 +0000
http://benjamin-recht.github.io/2016/06/23/hyperband/
The News on Auto-tuning
<p><em>Ed. Note: this post is in my voice, but it was co-written with <a href="http://people.eecs.berkeley.edu/~kjamieson/about.html">Kevin Jamieson</a>. Kevin provided the awesome plots too.</em></p>
<p>It’s all the rage in machine learning these days to build complex, deep pipelines with thousands of tunable parameters. I don’t mean parameters that we learn by stochastic gradient descent, but rather architectural concerns, like the value of the regularization parameter, the size of a convolutional window, or the breadth of a spatio-temporal tower of attention. Such parameters are typically referred to as <em>hyperparameters</em>, to contrast them with the parameters learned during training. These structural parameters are not learned, but rather descended upon by a lot of trial-and-error and fine tuning.</p>
<p>Automating such hyperparameter tuning is one of the holy grails of machine learning, and people have tried for decades to devise algorithms that can quickly prune bad configurations and maximally overfit to the test set. This problem is ridiculously hard, because the problems in question become mixed-integer, nonlinear, and nonconvex. The default approach to the hyperparameter tuning problem is to resort to <em>black-box optimization</em>, where one tries to find optimal settings by only receiving function values, without using much other auxiliary information about the optimization problem.</p>
<p>Black-box optimization is hard. It’s hard in the most awful senses of optimization. Even when we restrict our attention to continuous problems, black-box optimization is completely intractable in high dimensions. To guarantee that you are within a factor of two of optimality requires an exponential number of function evaluations: roughly, the number of queries scales as $O(2^d)$, where $d$ is the dimension. What’s particularly terrible is that it is easy to construct “needle-in-the-haystack” problems where this exponential complexity is real; that is, where no algorithm will ever find a good solution. Moreover, it is hard to construct an algorithm that outperforms random guessing on these problems.</p>
<h2 id="bayesian-inference-to-the-rescue">Bayesian inference to the rescue?</h2>
<p>In recent years, I have heard that there has been a bit of a breakthrough for hyperparameter tuning based on Bayesian optimization. Bayesian optimizers model the uncertainty of the performance of hyperparameters using
priors about the smoothness of the hyperparameter landscape. When one tests a set of parameters, the uncertainty of the cost near that setting shrinks. Bayesian optimization then tries to explore places where the uncertainty remains high and the prospects for a good solution look promising. This certainly sounds like a sensible thing to try.</p>
<p>Indeed, there has been quite a lot of excitement about these methods, and there has been a lot of press about how well these methods work for tuning deep learning and other hard machine learning pipelines. However, <a href="http://arxiv.org/abs/1603.06560">recent evidence</a> on a benchmark of over a hundred hyperparameter optimization datasets suggests that such enthusiasm really calls for much more scrutiny.</p>
<p>The standard way these methods are evaluated in papers is by using rank plots. Rank plots aggregate statistics across datasets for different methods as a function of time: at a particular time, the solver with the best setting gets one point, the algorithm in second place two points, and so forth. Consider the following plots:</p>
<p class="center"><img src="/assets/hyperband/rank_chart.png" alt="Rank chart of various hyperparameter methods" />
<img src="/assets/hyperband/bar_plot_sample.png" alt="Bar plot comparing final test errors" /></p>
<p>On the left, we show the rank chart for all algorithms, and on the right, we show the actual achieved function values of the various algorithms. The first plot represents the average score across 117 datasets collected by <a href="http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning">Feurer et al., NIPS 2015</a> (lower is better). For clarity, the second plot is for a subset of these data sets, but all of the data sets have nearly identical results. We compare the state-of-the-art Bayesian optimization methods SMAC and TPE to the method I suggested above: <em>random search</em>, where we just try random parameter configurations and don’t use any of the prior experiments to help pick the next setting.</p>
<p>What are the takeaways here? While the rank plot seems to suggest that state-of-the-art Bayesian optimization methods SMAC and TPE resoundingly beat random search, the bar plot shows that they are achieving nearly identical test errors! That is, SMAC and TPE are only a teensy bit better than random search. Moreover, and more troubling, Bayesian optimization is completely outperformed by random search <em>run at twice the speed</em>. That is, if you just set up two computers running random search, you beat all of the Bayesian methods.</p>
<p>Why is random search so competitive? This is just a consequence of the curse of dimensionality. Imagine that your space of hyperparameters is the unit hypercube in some high dimensional space. Just to get the Bayesian uncertainty to a reasonable state, one has to essentially test all of the corners, and this requires an exponential number of tests. What’s remarkable to me is that the early <a href="http://arxiv.org/abs/0912.3995">theory</a> <a href="https://hal.inria.fr/hal-00654517/">papers</a> on Bayesian optimization are very up front about this exponential scaling, but this <a href="http://blog.sigopt.com/post/144221180573/evaluating-hyperparameter-optimization-strategies">seems to be ignored</a> by the current excitement in the Bayesian optimization community.</p>
<p>There are three very important takeaways here. First, if you are planning on writing a paper on hyperparameter search, you should compare against random search! If you want to be even more fair, you should compare against random search with twice the sampling budget of your algorithm. Second, if you are reviewing a paper on hyperparameter optimization that does not compare to random search, you should immediately reject it. And, third, as a community, we should be devoting a lot of time to accelerating pure random search. If we can speed up random search to try out more hyperparameter settings, perhaps we can do even better than just running parallel instances of random search.</p>
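For reference, pure random search (and the "Random-2x" baseline, which is just the same loop with twice the budget) fits in a few lines. The toy objective and the sampling ranges below are illustrative stand-ins of ours, not a real training pipeline:

```python
import random

def random_search(objective, sample, budget, seed=0):
    """Pure random search: draw `budget` random configurations, keep the best."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(budget):
        cfg = sample(rng)
        val = objective(cfg)
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

# Toy objective over two hyperparameters (illustrative, not a real model):
objective = lambda cfg: (cfg["lr"] - 0.1) ** 2 + (cfg["reg"] - 0.01) ** 2
# Sample on a log scale, as is common for learning rates and regularizers.
sample = lambda rng: {"lr": 10 ** rng.uniform(-4, 0), "reg": 10 ** rng.uniform(-5, -1)}

_, v1 = random_search(objective, sample, budget=50)
_, v2 = random_search(objective, sample, budget=100)  # the "Random-2x" baseline
assert v2 <= v1  # doubling the budget with the same seed can only help
```

Because the second call replays the same random stream, its first 50 draws coincide with the first run's, so the 2x budget is guaranteed to do at least as well here.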
<p>In my next post, I’ll describe some <a href="http://arxiv.org/abs/1603.06560">very nice recent work</a> by Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar on accelerating random search for iterative algorithms common in machine learning workloads. I will dive into the details of their method and show how it is very promising for quickly tuning hyperparameters.</p>
Mon, 20 Jun 2016 07:00:00 +0000
http://benjamin-recht.github.io/2016/06/20/hypertuning/
The Role of Convergence Analysis<p>This year marks the retirement of Dimitri Bertsekas from MIT. Dimitri is an idol of mine, having literally written the book on every facet of optimization. His seminal works on distributed optimization, dynamic programming, and Lagrangian methods remain the best references available. I had the privilege of taking Dimitri’s convex analysis course in grad school, and he would frequently burst into class beaming because he had stayed up until 2AM the night before simplifying an argument of Rockafellar’s down to elementary calculus.</p>
<p>My last post on Lagrangians was based on Chapter 3 of Dimitri’s Nonlinear Programming Book. Chapter 2 also happens to feature one of my favorite passages about the delicate balance between theory and practice in optimization. One of the trickiest parts about optimization (and a point I intend to repeatedly hammer on this blog) is realizing how many of the theorems are “qualitative” rather than “quantitative.” I wanted to just quote Dimitri’s text in full here, as I don’t think I could write it better. Best wishes to you in retirement!</p>
<h2 id="the-role-of-convergence-analysis-by-dimitris-bertsekas">The Role of Convergence Analysis by Dimitri Bertsekas</h2>
<p>The following subsection gives a number of mathematical propositions relating
to the convergence properties of gradient methods. The meaning of these propositions is usually quite intuitive but their statement often requires complicated mathematical assumptions. Furthermore, their proof often involves tedious $\epsilon$-$\delta$ arguments, so at first sight students may wonder whether “we really have to go through all this.”</p>
<p>When Euclid was faced with a similar question from King Ptolemy of Alexandria, he replied that “there is no royal road to geometry.” In our case, however, the answer is not so simple because we are not dealing with a pure subject such as geometry that may be developed without regard for its practical application. In the eyes of most people, the value of an analysis or algorithm in nonlinear programming is judged primarily by its practical impact in solving various types of problems. It is therefore important to give some thought to the interface between convergence analysis and its practical application. To this end it is useful to consider two extreme viewpoints; most workers in the field find themselves somewhere between the two.</p>
<p>In the first viewpoint, convergence analysis is considered primarily a mathematical subject. The properties of an algorithm are quantified to the extent possible through mathematical statements. General and broadly applicable assertions, and simple and elegant proofs are at a premium here. The rationale is that simple statements and proofs are more readily understood, and general statements apply not only to the problems at hand but also to other problems that are likely to appear in the future. On the negative side, one may remark that simplicity is not always compatible with relevance, and broad applicability is often achieved through assumptions that are hard to verify or appreciate.</p>
<p>The second viewpoint largely rejects the role of mathematical analysis. The rationale here is that the validity and the properties of an algorithm for a given class of problems must be verified through practical experimentation anyway, so if an algorithm looks promising on intuitive grounds, why bother with a convergence analysis. Furthermore, there are a number of important practical questions that are hard to address analytically, such as roundoff error, multiple local minima, and a variety of finite termination and approximation issues. The main criticism of this viewpoint is that mathematical analysis often reveals (and explains) fundamental flaws of algorithms that experimentation may miss. These flaws often point the way to better algorithms or modified algorithms that are tailored to the type of practical problem at hand. Similarly, analysis may be more effective than experimentation in delineating the types of problems for which particular algorithms are well-suited.</p>
<p>Our own mathematical approach is tempered by practical concerns, but we note that the balance between theory and practice in nonlinear programming is particularly delicate, subjective, and problem dependent.</p>
Fri, 10 Jun 2016 07:00:00 +0000
http://benjamin-recht.github.io/2016/06/10/analysis-in-optimization/
Mechanics of Lagrangians<p><a href="http://www.argmin.net/2016/05/18/mates-of-costate/">In my last post</a>, I used a Lagrangian to compute derivatives of constrained optimization problems in neural nets and control. I took it for granted that the procedure was correct. But why is it correct? I suppose the simplest answer is because we arrived at the same procedure as back propagation. But that’s not a particularly satisfying answer, and it doesn’t give you any room to generalize.</p>
<p>In fact, if I’m really honest about it, none of the manipulations we do with Lagrangians in optimization are decidedly intuitive. Mechanistically, Lagrangians give powerful methods to derive algorithms, understand sensitivities to assumptions, and generate lower bounds. But the functionals themselves always just seem to pop out of thin air. Why are Lagrangian methods so effective in optimization, even when the associated problems are nonconvex?</p>
<p>In this post, I’m going to “derive” Lagrangians in two very different ways: one by pattern matching against the implicit function theorem and one via penalty functions. This basically follows the approach in Chapter 3 of Bertsekas’ <a href="http://www.athenasc.com/nonlinbook.html">Nonlinear Programming Book</a> where he introduces Lagrange multipliers and the KKT conditions. Most people know the KKT conditions as necessary conditions for optimality in nonlinear programming. How do they also arise in computing derivatives? It turns out that the two are actually quite connected, and if you have ever worked out a proof of the KKT conditions, you probably have also derived a correctness proof for the method of adjoints.</p>
<h2 id="implicit-functions">Implicit functions</h2>
<p>Let’s begin by attempting to formalize what it means to take a derivative of a function subject to constraints. Suppose we have a function $F:\mathbb{R}^{n+d} \rightarrow \mathbb{R}$ which we write as $F(x,z)$ where $x$ is $n$-dimensional and $z$ is $d$ dimensional. Additionally, assume we have a constraint function $H:\mathbb{R}^{d+n} \rightarrow \mathbb{R}^d$ which we want to be identically zero. If we want to take a derivative of $F(x,z)$ with respect to $x$ subject to the constraint $H(x,z)=0$, this means that we want to first eliminate the variable $z$ using the nonlinear equations $H(x,z)=0$. Let $\varphi(x)$ denote the solution of $H(x,z)=0$ with respect to $z$ (and let’s assume such a $z$ exists and is unique). Once we have solved for $z$, we then want to take a derivative of the <em>unconstrained</em> function $F(x,\varphi(x))$ with respect to $x$. Now, by the chain rule</p>
<script type="math/tex; mode=display">\nabla_x F(x,\varphi(x)) = \nabla_x F(x,z) + \nabla_x \varphi(x) \nabla_z F(x,z)\,.</script>
<p>What about the gradient of this function $\varphi$? We can compute its gradient by applying <a href="https://en.wikipedia.org/wiki/Implicit_function_theorem">the implicit function theorem</a>. Indeed, if $\nabla_z H(x,z)$ is invertible, the implicit function theorem gives an explicit formula for the gradient:</p>
<script type="math/tex; mode=display">\nabla_x \varphi(x) = - \nabla_x H(x,z)[\nabla_z H(x,z)]^{-1} \,.</script>
<p>With this expression in hand, we can apply some magical pattern matching. Define $p:= - [\nabla_z H(x,z)]^{-1} \nabla_z F(x,\varphi(x))$ and plug it into the formula above. Then, if $z=\varphi(x)$, we have</p>
<script type="math/tex; mode=display">\nabla_x F(x,\varphi(x)) = \nabla_x F(x,z) + \nabla_x H(x,z) p\,.</script>
<p>In other words, if we define the Lagrangian $\mathcal{L}(x,z,p) = F(x,z) + p^T H(x,z)$, we have that</p>
<script type="math/tex; mode=display">\nabla_x F(x,\varphi(x)) = \nabla_x \mathcal{L}(x,z,p)</script>
<p>where $(z,p)$ satisfy</p>
<script type="math/tex; mode=display">\nabla_z \mathcal{L}(x,z,p)=0~\mbox{and}~\nabla_p \mathcal{L}(x,z,p)=0\,.</script>
<p>The equations $\nabla \mathcal{L}=0$ are called the <em>KKT conditions</em> for the optimization problem. Any solution must satisfy these equations. But, following this derivation, it is obvious why the KKT conditions must hold: they are merely asserting that the derivative with respect to $x$ is zero once you have eliminated the constraints.</p>
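<p>To make the pattern matching concrete, here is a small numerical sanity check with scalar $x$ and $z$; the particular $F$ and $H$ below are made-up illustrative choices, not anything from the derivation above:</p>

```python
# Made-up example: F(x,z) = x^2 + xz + z^2 and H(x,z) = z - x^2,
# so the constraint H(x,z) = 0 is solved by z = phi(x) = x^2.
F = lambda x, z: x**2 + x*z + z**2

dF_dx = lambda x, z: 2*x + z
dF_dz = lambda x, z: x + 2*z
dH_dx = lambda x, z: -2*x
dH_dz = lambda x, z: 1.0

def grad_via_lagrangian(x):
    z = x**2                              # solve H(x,z) = 0 for z
    p = -dF_dz(x, z) / dH_dz(x, z)        # p = -[grad_z H]^{-1} grad_z F
    return dF_dx(x, z) + dH_dx(x, z) * p  # grad_x of the Lagrangian

def grad_via_elimination(x, eps=1e-6):
    # central difference of the unconstrained function F(x, phi(x))
    f = lambda t: F(t, t**2)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

print(grad_via_lagrangian(0.7), grad_via_elimination(0.7))
```

<p>Both quantities agree with the hand-computed derivative $2x + 3x^2 + 4x^3$ of the eliminated function $F(x,\varphi(x)) = x^2 + x^3 + x^4$.</p>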
<p>Even though I can explain this derivation and its consequences, I still find this pattern matching to be bizarrely coincidental. How exactly did the Lagrangian pop up here? Let me now derive the same optimality conditions in a completely different way, starting with a Lagrangian and yet recovering the exact same formula and see if this provides any additional insights.</p>
<h2 id="penalty-functions">Penalty functions</h2>
<p>My personal favorite motivation of the Lagrangian is in terms of saddle point problems. Consider the joint optimization problem</p>
<script type="math/tex; mode=display">\mbox{minimize}_{x,z}~\mbox{max}_{p}~F(x,z) +p^TH(x,z)</script>
<p>In the inner maximization problem, the supremum is infinite if $H(x,z)$ is nonzero. Thus, you only get a finite value when $H(x,z)=0$. In this case, the minimum value with respect to $x$ and $z$ is just the minimum value of $F(x,z)$ <em>subject to</em> $H(x,z)=0$. That is, it is completely equivalent to the constrained optimization problem we have been analyzing. So the Lagrangian penalty function enforces the equality constraint via a min-max game.</p>
<p>But why this penalty function? I like to think of the Lagrangian as a limit of more obvious penalty functions. If we set up the unconstrained minimization problem</p>
<script type="math/tex; mode=display">\mbox{minimize}_{x,z}~F(x,z) + \frac{1}{2\alpha} \| H(x,z) \|^2 \,,</script>
<p>with $\alpha>0$, it’s clear that as $\alpha$ tends to zero, the penalty forces $H(x,z)$ to be small. In the limit, we would expect that $H(x,z)$ would be zero and the corresponding minimizer should minimize $F$ subject to the constraint that $H$ vanishes. Let’s call this unconstrained minimization problem the “penalty formulation.”</p>
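<p>Here is a tiny numerical illustration of the penalty formulation on a made-up problem: minimize $x^2 + z^2$ subject to $x + z = 1$, whose solution is $x = z = 1/2$. The problem, step sizes, and iteration counts are all illustrative choices:</p>

```python
def penalty_minimize(alpha, steps=20000, lr=None):
    """Gradient descent on F(x,z) + (1/(2*alpha)) * H(x,z)^2 with
    F(x,z) = x^2 + z^2 and H(x,z) = x + z - 1."""
    if lr is None:
        lr = alpha / 4       # smaller alpha => stiffer problem => smaller step
    x = z = 0.0
    for _ in range(steps):
        h = x + z - 1.0              # H(x, z)
        gx = 2 * x + h / alpha       # d/dx [F + (1/(2*alpha)) H^2]
        gz = 2 * z + h / alpha
        x, z = x - lr * gx, z - lr * gz
    return x, z

# The penalty minimizer is x = z = 1/(2*(alpha+1)),
# which tends to the constrained solution 1/2 as alpha -> 0.
for alpha in (1.0, 0.1, 0.01):
    print(alpha, penalty_minimize(alpha))
```

<p>Shrinking $\alpha$ drives the iterates toward the constrained solution, exactly as the limiting argument suggests; the cost is that the unconstrained problem becomes increasingly ill-conditioned, which is why the step size above shrinks with $\alpha$.</p>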
<p>Now, consider the penalized min-max problem</p>
<script type="math/tex; mode=display">\mbox{min}_{x,z}~\mbox{max}_{p}~F(x,z) +p^TH(x,z) - \frac{\alpha}{2}\|p\|^2</script>
<p>This is an “augmented Lagrangian.” When $\alpha=0$, this is just the Lagrangian above, but when $\alpha>0$, the inner maximization problem always has finite values. In fact, the maximizing $p$ is always $H(x,z)/\alpha$. If we plug in this value, we can eliminate the Lagrange multiplier. But after this substitution, we are left with the penalty formulation! <em>The Lagrangian formulation of the optimization problem is the limit of penalized formulations as $\alpha$ goes to zero.</em></p>
<p>For the penalty formulation, the gradient of the cost function is zero at all of the stationary points. In particular, for the $z$ variable,</p>
<script type="math/tex; mode=display">\nabla_z F(x,z) + \frac{1}{\alpha} [\nabla_z H(x,z) ] H(x,z) = 0</script>
<p>But this means that</p>
<script type="math/tex; mode=display">p = \frac{1}{\alpha} H(x,z) = - [\nabla_z H(x,z)]^{-1} \nabla_z F(x,z)</script>
<p>This is exactly the same formula for $p$ as we derived using the implicit function theorem. This shouldn’t be too surprising as the optimal value of $p$ has a simple interpretation. Note that for very small $\Delta z$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
F(x,z + \Delta z )
&\approx F(x,z) + \frac{\partial F}{\partial z} \Delta z \\
&=F(x,z) - p^T \nabla_z H(x,z)^T \Delta z
\end{aligned} %]]></script>
<p>so each coordinate of $p$ controls how much the cost function changes as we perturb the constraints. $p$ measures how sensitive the cost is to small perturbations of the constraints.</p>
<p>Taking the limit as $\alpha$ goes to zero, we see that all local solutions must satisfy the KKT conditions $\nabla \mathcal{L}=0$, and the Lagrange multipliers have the form predicted by the implicit function theorem.</p>
<p>It’s important to note that this derivation for $p$ only holds at the stationary points of the Lagrangian. For actually computing derivatives, the implicit function theorem approach gives the correct form for the gradient at any point $x$. But I find it amazing that these two very differently motivated derivations arrive at exactly the same formulae.</p>
Tue, 31 May 2016 07:00:00 +0000
http://benjamin-recht.github.io/2016/05/31/mechanics-of-lagrangians/
Song of the week - Both Hands<p>David Bazan of Pedro the Lion fame just released a fantastic new record <em>Blanco</em>, and “Both Hands” is the leadoff track.</p>
<iframe src="https://embed.spotify.com/?uri=spotify:track:7qXAvcB4enZGEBBxu8GnTw" width="300" height="380" frameborder="0" allowtransparency="true"></iframe>
<p>I haven’t followed Bazan’s work since Pedro the Lion, but in the interim he’s shifted away from emo indie rock to washed out synth pop. There is a heavy low-pass filter over all of the instrumentation, and this makes for an appropriate pairing with Bazan’s disaffected baritone.</p>
<blockquote>
<p>If I’m not losing sleep/
I’m probably over it.</p>
</blockquote>
<p>I know how you feel, David.</p>
<p>For those of you without Spotify, you can check out Track 2, <em>Oblivion</em>, on Bazan’s Soundcloud:</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/263460958&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false"></iframe>
<p>This song is also very high quality.</p>
Sat, 21 May 2016 00:07:00 +0000
http://benjamin-recht.github.io/2016/05/21/both-hands/
Mates of Costate<p>There have been four thousand new frameworks for deep learning thrown onto the market in the past year, and I bet you were wondering what you needed to jump into this hot marketplace. Essentially, there are two components required for most mortals who aim to train neural nets: a unit that efficiently computes derivatives of functions that are compositions of many sub-functions and a unit that runs stochastic gradient descent. I can write the stochastic gradient descent part in ten lines of python. I’ll sell it to the highest bidder in the comments. But what about the automatic differentiator?</p>
<p>Automatic differentiation does seem like a bit of a black box. Some people will just scoff and say “it’s just the chain rule.” But evaluating the chain rule efficiently requires being careful about reusing information, and not having to handle special cases. The backpropagation algorithm handles these recursions well. It is a dynamic programming method to compute derivatives, and uses clever recursions to aggregate the gradients of the components. However, I always find the derivations of backprop to be confusing and too closely tied to neuroscientific intuition that I sorely lack. Moreover, for some reason, dynamic programming always hurts my brain and I have to think about it for an hour before I remember how to rederive it.</p>
<p>A few years ago, <a href="http://pages.cs.wisc.edu/~swright/">Steve Wright</a> introduced me to an older method from optimal control, called the method of adjoints, which is equivalent to backpropagation. It’s also easier (at least for me) to derive. This is because the core of the method is <em>Lagrangian duality</em>, a topic at the foundation of everything we optimizers do.</p>
<h2 id="deep-neural-networks">Deep neural networks</h2>
<p>Before we get to Lagrangian duality, we need a constrained optimization problem. There’s no Lagrangian without some constraints! So let’s transform a deep learning optimization problem into a constrained optimization problem.</p>
<p>The standard deep learning goal is to solve optimization problems of the form</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{minimize}_{\varphi} &\frac{1}{n} \sum_{k=1}^n \mathrm{loss}(\varphi(x_k),y_k) ,
\end{array} %]]></script>
<p>where $\varphi$ is a function from features to labels that has an appropriate level of expressivity. In deep learning, we assume that $\varphi$ is a giant composition:</p>
<script type="math/tex; mode=display">\varphi(x;\vartheta) = f_\ell \circ f_{\ell-1} \circ f_{\ell-2} \circ \cdots \circ f_1(x)</script>
<p>and each $f_i$ has a vector of parameters $\vartheta_{i}$ which may be optimized. In this case, we can rewrite the unconstrained minimization problem as a constrained one:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{minimize}_{\vartheta} &\frac{1}{n} \sum_{k=1}^n \mathrm{loss}(z_k^{(\ell)},y_k) \\
\mbox{subject to} & z_k^{(\ell)} = f_\ell(z_{k}^{(\ell-1)}, \vartheta_{\ell})\\
& z_k^{(\ell-1)} = f_{\ell-1}(z_{k}^{(\ell-2)}, \vartheta_{\ell-1})\\
& \vdots\\
& z_k^{(1)} = f_1(x_k, \vartheta_{1}).
\end{array} %]]></script>
<p>Why does this help? Explicitly writing out the composition in stages is akin to laying out a computation graph for the function. And once we have a computation graph, we can use it to compute derivatives.</p>
<h2 id="the-method-of-adjoints">The method of adjoints</h2>
<p>The method of adjoints reveals the structure of the backpropagation algorithm by constructing a Lagrangian and computing the KKT conditions for the constrained optimization formulation. To simplify matters, let’s restrict our attention to the case where $n=1$. This corresponds to when there is a single $(x,y)$ data pair as you’d have if you were running stochastic gradient descent.</p>
<p>To derive the KKT conditions we first form a Lagrangian function with Lagrange multipliers $p_i$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{L} (z,\vartheta,p) &:= \mathrm{loss}(z^{(\ell)},y) \\
&\qquad\quad - \sum_{i=1}^{\ell} p_i^T(z^{(i)} - f_i(z^{(i-1)},\vartheta_i)),
\end{aligned} %]]></script>
<p>The derivatives of this Lagrangian are given by the expressions:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\nabla_{z^{(i)}} \mathcal{L} &= - p_{i} + \nabla_{z^{(i)}} f_{i+1}(z^{(i)},\vartheta_{i+1})^T p_{i+1} , \\
\nabla_{z^{(\ell)}} \mathcal{L} &= -p_\ell + \nabla_{z^{(\ell)}} \mathrm{loss}(z^{(\ell)},y) , \\
\nabla_{\vartheta_i} \mathcal{L} &= \nabla_{\vartheta_i} f_i(z^{(i-1)},\vartheta_i)^Tp_i ,\\
\nabla_{p_i} \mathcal{L} &= z^{(i)} - f_i(z^{(i-1)},\vartheta_i).
\end{aligned} %]]></script>
<p>The Lagrange multipliers $p_i$ are also known as the <em>adjoint variables</em> or <em>costates</em>. To compute the gradient, we just have to solve the set of nonlinear equations</p>
<script type="math/tex; mode=display">\nabla_{p_i} \mathcal{L} = 0~\mbox{and}~ \nabla_{z^{(i)}} \mathcal{L} =0</script>
<p>and then read off the gradient of the loss with respect to the parameters: $\nabla_{\vartheta_i} \mathrm{loss}(\varphi(x;\vartheta),y)= \nabla_{\vartheta_i} f_i(z^{(i-1)},\vartheta_i)^Tp_i$.
(I’ll explain why later… trust me for a second.)</p>
<p>The structure here is particularly nice. If we solve for $\nabla_{p_i} \mathcal{L}=0$, this just amounts to satisfying the constraints $z^{(i)} = f_i(z^{(i-1)},\vartheta_i)$. This is called the <em>forward pass</em>. We can then compute $p_i$ from the equations $\nabla_{z^{(i)}} \mathcal{L} =0$, starting from</p>
<script type="math/tex; mode=display">p_\ell = \nabla_{z^{(\ell)}} \mathrm{loss}(z^{(\ell)},y)</script>
<p>and running the recursion $p_i = \nabla_{z^{(i)}} f_{i+1}(z^{(i)},\vartheta_{i+1})^T p_{i+1}$ backward from $i=\ell-1$ down to $i=1$. This is the <em>backward pass</em>. The gradients with respect to the parameters can then be computed by adding up linear functions of the adjoint variables.</p>
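<p>As a sanity check, here is a minimal scalar sketch of the forward and backward passes; the layers $f_i(z, \vartheta_i) = \tanh(\vartheta_i z)$ and the squared-error loss are made-up illustrative choices, not anything prescribed by the post:</p>

```python
import math

thetas = [0.8, -1.3, 0.5]   # one scalar parameter per layer
x, y = 0.9, 0.2             # a single (input, label) pair

def forward(x, thetas):
    """Forward pass: satisfy z^{(i)} = f_i(z^{(i-1)}, theta_i)."""
    zs = [x]
    for th in thetas:
        zs.append(math.tanh(th * zs[-1]))
    return zs               # zs[0] = x, zs[i] = z^{(i)}

def backward(zs, thetas, y):
    """Backward pass: costate recursion for loss = (z^{(ell)} - y)^2 / 2."""
    p = zs[-1] - y          # p_ell = grad of loss w.r.t. z^{(ell)}
    grads = []
    for i in reversed(range(len(thetas))):
        th, z_in, z_out = thetas[i], zs[i], zs[i + 1]
        dtanh = 1.0 - z_out**2        # tanh'(u) = 1 - tanh(u)^2
        grads.append(p * dtanh * z_in)  # d loss / d theta_i
        p = p * dtanh * th              # costate for the previous layer
    return list(reversed(grads))

zs = forward(x, thetas)
grads = backward(zs, thetas, y)

# check the adjoint gradients against central finite differences
def loss(thetas):
    return 0.5 * (forward(x, thetas)[-1] - y) ** 2

eps = 1e-6
for i, g in enumerate(grads):
    hi = list(thetas); hi[i] += eps
    lo = list(thetas); lo[i] -= eps
    fd = (loss(hi) - loss(lo)) / (2 * eps)
    assert abs(g - fd) < 1e-7
```

<p>One forward sweep caches the states $z^{(i)}$, one backward sweep propagates the costate, and the parameter gradients fall out along the way, which is exactly backpropagation.</p>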
<p>What is particularly nice about the method of adjoints is that it suggests a convenient set of working variables that enable fast gradient computation. It explicitly builds in a caching strategy for subunits of the computation. Two different constrained formulations will lead to different computation graphs and sets of costates, but they will return the same gradient.</p>
<p>There are tons of ways to generalize this. We could have a more complicated computation graph. We could share variables among layers (this would mean adding up variables). We could penalize hidden variables or states explicitly in the cost function. Regardless, we can always read off the solution from the same forward-backward procedure. The computation graph always provides a “forward model” describing the evolution of an input to the output. The adjoint equation involves the adjoint (“transpose”) of the Jacobians of this equation, which measures the sensitivity of one node to the previous node.</p>
<h2 id="adjoints-in-optimal-control">Adjoints in Optimal Control</h2>
<p>As I mentioned already, the method of adjoints originates in the study of controls. According to <a href="http://arc.aiaa.org/doi/abs/10.2514/3.25422">Dreyfus</a>, this was first proposed by Bryson in a paper called “A Gradient Method for Optimizing Multi-Stage
Allocation Processes” that appeared in the <em>Proceedings of the Harvard University Symposium
on Digital Computers and Their Applications</em> in 1961. I was unable to find this proceedings in our Engineering Library, but the Lagrangian derivation plays a prominent role in Bryson and Ho’s 1969 book <a href="http://www.amazon.com/Applied-Optimal-Control-Optimization-Estimation/dp/0891162283">Applied Optimal Control</a>. Note that Bryson’s paper appeared only a year after Kalman’s absurdly influential <a href="http://fluidsengineering.asmedigitalcollection.asme.org/article.aspx?articleid=1430402">A New Approach to Linear Filtering and Prediction Problems</a>. This use of duality was very much at the birth of modern control theory.</p>
<p>Let’s take the simplest and most studied optimal control problem and see what backpropagation computes. In optimal control, we have a dynamical system with state variable $x_t$ and input $u_t$. We assume the state evolves according to the linear dynamics</p>
<script type="math/tex; mode=display">x_{t+1} = A x_t + B u_t~\mbox{for}~t=0,1,\ldots\,.</script>
<p>where $A$ and $B$ are known matrices defining the state evolution.</p>
<p>Suppose we would like to find a sequence of inputs $u_t$ that minimizes some quadratic cost over the trajectory:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{minimize}_{u_t,x_t} \, & \tfrac{1}{2}\sum_{t=0}^N \left\{x_t^TQ x_t + u_t^T R u_t\right\} \\
& \qquad + \tfrac{1}{2} x_{N+1}^T S x_{N+1}, \\
\mbox{subject to} & x_{t+1} = A x_t+ B u_t, \\
& \qquad \mbox{for}~t=0,1,\dotsc,N,\\
& \mbox{($x_0$ given).}
\end{array} %]]></script>
<p>The Lagrangian for this system has a similar form to that for the neural network</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{L} (x,u,p) &:= \sum_{t=0}^N \left[ \tfrac{1}{2} x_t^TQ x_t + \tfrac{1}{2}u_t^T R u_t \right.\\
&\qquad\qquad \left. - p_t^T (x_{t+1}-A x_t - B u_t) \right]\\
&\qquad\qquad +\tfrac{1}{2} x_{N+1}^T S x_{N+1}.
\end{aligned} %]]></script>
<p>The gradients of the Lagrangian are given by the expressions</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\nabla_{x_t} \mathcal{L} &= Qx_t - p_{t-1} + A^T p_t , \\
\nabla_{x_{N+1}} \mathcal{L} &= -p_N + S x_{N+1} , \\
\nabla_{u_t} \mathcal{L} &= R u_t + B^T p_t , \\
\nabla_{p_t} \mathcal{L} &= -x_{t+1} + Ax_t + B u_t.
\end{aligned} %]]></script>
<p>Again, to satisfy $\nabla_{p_t} \mathcal{L}=0$, we simply run the dynamical system model forward in time to compute the trajectory $x_t$. Then, we can solve for the costates $p_t$ by running the <em>adjoint dynamics</em></p>
<script type="math/tex; mode=display">p_{t-1} = A^T p_t + Q x_t</script>
<p>with the initial condition $p_N = Sx_{N+1}$. For the optimal control problem, the Lagrange multipliers are a trajectory of a related linear system called the <em>adjoint</em> or <em>dual</em> system. The dynamics are linear in the costate $p_t$, with time running in reverse and the state transition matrix being the transpose (also known as the adjoint) of $A$. The costate is driven by the forward trajectory $x_t$. This gives us a clear way to think about how later states are sensitive to earlier states. In the special case when $Q$ and $R$ are zero, we are computing the sensitivity of the end state $x_{N+1}$ to the inputs $u_t$. If $A$ is <em>stable</em>, meaning all of its eigenvalues have magnitude strictly less than $1$, then early inputs have little effect on the terminal state. But if $A$ is <em>unstable</em>, the costate dynamics may diverge, and hence the gradient with respect to $u_t$ for small $t$ can grow exponentially large.</p>
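<p>Here is a small numerical sketch of this forward/adjoint procedure; the matrices $A$ and $B$ and the horizon $N$ are random illustrative choices. It checks that $R u_t + B^T p_t$ matches a finite-difference gradient of the cost:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 3, 2, 5
A = 0.5 * rng.standard_normal((n, n))   # state transition matrix
B = rng.standard_normal((n, m))         # input matrix
Q, R, S = np.eye(n), np.eye(m), np.eye(n)
x0 = rng.standard_normal(n)
us = [rng.standard_normal(m) for _ in range(N + 1)]

def rollout(us):
    xs = [x0]
    for t in range(N + 1):
        xs.append(A @ xs[t] + B @ us[t])
    return xs

def cost(us):
    xs = rollout(us)
    c = sum(0.5 * xs[t] @ Q @ xs[t] + 0.5 * us[t] @ R @ us[t]
            for t in range(N + 1))
    return c + 0.5 * xs[N + 1] @ S @ xs[N + 1]

# forward pass, then the adjoint recursion p_{t-1} = A^T p_t + Q x_t
xs = rollout(us)
p = [None] * (N + 1)
p[N] = S @ xs[N + 1]
for t in range(N, 0, -1):
    p[t - 1] = A.T @ p[t] + Q @ xs[t]
grads = [R @ us[t] + B.T @ p[t] for t in range(N + 1)]

# central finite-difference check on one input coordinate
eps = 1e-6
up = [u.copy() for u in us]; up[2][0] += eps
um = [u.copy() for u in us]; um[2][0] -= eps
fd = (cost(up) - cost(um)) / (2 * eps)
print(grads[2][0], fd)
```

<p>The whole gradient costs one forward and one backward sweep, $O(N)$ work, regardless of the horizon.</p>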
<p>In the special case where the cost involves tracking an observation $y_t$, we arrive at the cost function of Kalman’s Filter:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{array}{ll}
\mbox{minimize}_{u_t,x_t} \, & \tfrac{1}{2}\sum_{t=0}^N \left\{\|x_t-y_t\|^2+ u_t^T R_t u_t \right\}\\
&\qquad\qquad+ \tfrac{1}{2}x_0^T S x_0\\
\mbox{subject to} & x_{t+1} = A x_t+ B u_t, \\
& \qquad \mbox{for}~t=0,1,\dotsc,N\,.
\end{array} %]]></script>
<p>One could solve the Kalman Filtering problem by performing gradient descent on the cost and computing the gradient via the method of adjoints. This would be a totally reasonable solution, akin to solving a tridiagonal system via conjugate gradient. However, the special structure of this system enables us to solve the normal equations in linear time, so most people don’t compute their filters this way. On the other hand, the method of adjoints is far more general than the Kalman Filter as it immediately applies to nonlinear dynamical systems or the nonquadratic costs. Moreover, the iterations require only $O(N)$ operations even in the general case. This method is quite useful when the constraints are defined by partial differential equations, as there is an associated adjoint PDE that enables optimization in this setting as well. Lions has a <a href="http://www.springer.com/us/book/9783642650260">whole book</a> on this topic.</p>
<p>And, if you wanted to be crazy and make the control policy $u_t$ the output of a neural network applied to $x_t$, one could still compute gradients using the method of adjoints.</p>
<h2 id="why-is-this-the-derivative">Why is this the derivative?</h2>
<p>So why is this Lagrangian procedure correct? The KKT conditions are a necessary condition for stationarity in nonlinear programming. It’s not particularly obvious why this should also give a way to compute derivatives. In the next post, I will show how the method of adjoints is intimately connected to the KKT conditions. I will describe how the proof of the KKT conditions also provides a proof of correctness for the method of adjoints. And I’ll also describe other algorithms that naturally arise when one views a cascade of function compositions as a dynamical system.</p>
Wed, 18 May 2016 07:00:00 +0000
http://benjamin-recht.github.io/2016/05/18/mates-of-costate/