If I ever teach this class again, I think I’m going to rename it “Applied Gradient Descent.” I’ll just go through the infinite set of cases in machine learning, optimization, and decision making where gradient descent solves the problem. I’ve been struck this semester by how often some dumb gradient method works about as well as anything else, even when there’s no cost function to minimize. Yesterday, we found another surprising example of the power of the gradient method: PID Control.
PID control is the backbone of commercial automation, and one of its most common applications is forcing signals to be constant. I want my house to be 20℃. I want the water in my sous vide bath to be 54℃. I want the water in my espresso machine to be 95℃. Despite the different sorts of heating elements and sensors these devices use, I can get all of these temperatures set precisely using PID control.
PID control only has three parameters. You measure the actual temperature and subtract off your target temperature. This difference is called the error signal. You compute the integral of the error signal, meaning the total sum of the error so far. You estimate the derivative of the error signal. And then you set your heating element to be some weighted sum of the error signal, its integral, and its derivative.
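In Python-flavored pseudocode, the whole loop is only a few lines. Here's a minimal sketch, assuming hypothetical `read_temperature` and `set_heater` interfaces and made-up gains; note that with the error defined as actual minus target, the gains for a heater come out negative (too hot should mean less power):

```python
import time

def run_pid(kp, ki, kd, setpoint, read_temperature, set_heater, dt=1.0):
    """Drive a heater with a weighted sum of the error, its integral, and its derivative."""
    integral = 0.0
    prev_error = None
    while True:
        error = read_temperature() - setpoint          # actual minus target
        integral += error * dt                         # total error so far
        derivative = 0.0 if prev_error is None else (error - prev_error) / dt
        # For a heater, kp, ki, kd < 0 with this sign convention.
        set_heater(kp * error + ki * integral + kd * derivative)
        prev_error = error
        time.sleep(dt)
```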
PID control has many nuances, but it comes with a remarkable guarantee: if PID control doesn’t burn down your kitchen, it will get the error signal to zero. That we can say something so universal about a control policy with only three parameters is pretty surprising. Why does it work?
The magic is in that integral term. There are more sophisticated arguments than the one I’m about to give, but let’s stick with simple for the blog: suppose you are running a PI controller (no D term),

$$u_t = k_P\, e_t + k_I \sum_{s=1}^{t} e_s,$$
and it settles into some steady-state configuration. Call the steady-state error $e_0$. Now suppose $u_t$ converges to some constant value $u_0$ as well. Then, for large $t$, we have

$$u_t \approx k_P\, e_0 + k_I\, t\, e_0.$$
The left-hand side converges to a constant as $t$ goes to infinity. But if $e_0$ isn’t equal to zero, the right-hand side diverges. The only possibility is that the steady-state error is zero.
It’s kind of wild. The arguments in favor of PID control don’t need sophisticated system models. If you can get the PID controller to work, it gets your error to zero, no matter what the system. In the loathsome parlance of reinforcement learning, PID control is model-free. Getting PID controllers to work certainly needs practice and intuition. But the fact that 95% of automation is PI control suggests that we can regularly implement robust, functional feedback loops.
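If you want to watch the argument happen numerically, here's a small sketch with a made-up first-order thermal model (the plant, gains, and setpoint are all invented for illustration, and the actuator is idealized so it can push in either direction): proportional-only control settles with a persistent offset, while adding the integral term drives the steady-state error to zero.

```python
def simulate(kp, ki, steps=20000, dt=0.1, setpoint=54.0, ambient=20.0, a=0.05, b=0.5):
    """Toy water bath: temp' = -a*(temp - ambient) + b*u, closed with a PI law."""
    temp, integral = ambient, 0.0
    for _ in range(steps):
        error = temp - setpoint           # actual minus target, as above
        integral += error * dt
        u = kp * error + ki * integral    # PI control law (negative gains for a heater)
        temp += dt * (-a * (temp - ambient) + b * u)   # idealized actuator, may go negative
    return temp - setpoint                # steady-state error

print("P-only steady-state error: %+.3f" % simulate(kp=-0.5, ki=0.0))    # stuck below target
print("PI     steady-state error: %+.3f" % simulate(kp=-0.5, ki=-0.05))  # driven to zero
```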
What does this have to do with gradient descent? Gradient descent is running integral control when your error signal is the gradient of a function. For pretty much the same reason I just described, if you run gradient descent and it doesn’t diverge, it must converge to a point with zero gradient. I can make this more explicit by writing out how we’d implement an integral controller in discrete logic. If we define a signal

$$q_t = \sum_{s=1}^{t} e_s,$$
Then

$$u_t = k_P\, e_t + k_I\, q_t, \qquad q_t = q_{t-1} + e_t.$$
Writing out the update rule for q explicitly shows our control law is internally running code identical to gradient descent. Instead of plugging in the “gradient of f,” it plugs in the error signal.
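Here's a small sketch of that identity on a made-up quadratic objective: an integral controller whose error signal happens to be the scaled, negated gradient performs exactly the same arithmetic as gradient descent.

```python
import numpy as np

target = np.array([3.0, -1.0])

def grad_f(x):
    # gradient of the toy objective f(x) = 0.5 * ||x - target||^2
    return x - target

eta = 0.1           # gradient descent step size
x = np.zeros(2)     # gradient descent iterate
q = np.zeros(2)     # integral controller's accumulated state

for _ in range(100):
    x = x - eta * grad_f(x)    # gradient descent update
    e = -eta * grad_f(q)       # the "plant" reports the scaled, negated gradient as the error
    q = q + e                  # integral control update: accumulate the error

print(x)   # -> approximately [ 3. -1.]
print(q)   # -> the same iterates, step for step
```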
You can make this formally true. Laurent Lessard, Andy Packard, and I not only showed that the algorithms are the same, but that the more advanced analyses of control systems could be applied to gradient-based optimization methods as well. Much of the analysis developed for control systems was identical to that developed for optimization algorithms. The two fields just used slightly different terms.
I still don’t have a great answer for why “gradient descent” is so universal. The initial intuition from calculus of chasing the direction of steepest descent is only part of the story. We’ve seen that gradient descent methods, when framed as iterative error correction, can solve prediction problems even when they are not descending at each iteration. Gradient descent accrues low regret in a variety of decision problems. And now, apparently, we’ve found gradient descent inside the vast majority of automated control systems in production. I definitely could fill a semester with more examples.
I don't like to be the type of guy who advertises his own work, but I guess I'll have to be that type of guy here. I've been obsessed with the question of why gradient descent (or mirror descent, more generally) is so universal for over a decade. In fact, right after I joined Illinois, I started having regular chats about this with Angelia Nedić and Alex Olshevsky, but we didn't really get anywhere. Then both Angelia and Alex moved to other universities, and in the meantime I made some progress on that question together with Belinda Tzen, Anant Raj, and Francis Bach:
https://ieeexplore.ieee.org/document/10120759
Indeed, the "steepest descent" story is only local, but we were able to show in what sense mirror descent is optimal by viewing it as an infinite-horizon steady-state stabilization problem.