This is the live blog of Lecture 18 of my graduate class “Convex Optimization.” A Table of Contents is here.
I have a love-hate relationship with convergence analysis. One of the core parts of optimization theory, convergence analysis is the mathematics of which problems algorithms can solve and how much computation they require. On the one hand, it’s good to know that your code will terminate. Code that might never terminate is unavoidable in general, but it might be avoidable for your particular problem. It would also be good to know how precisely we can hope to predict the running times of algorithms.
Optimization is sadly not linear algebra. In numerical linear algebra, we can often derive exact estimates of how long computations will take. For example, we can count precisely how many operations a given algorithm uses to multiply or factor matrices. Optimization algorithms differ from these linear algebra routines because they are almost always iterative, and iterative methods on real-valued problems aren’t amenable to such operation counting. We can count how many operations a particular iteration takes. We have a much harder time pinning down how many iterations we’ll need.
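To make the contrast concrete, here is a minimal sketch in Python with NumPy. The matrix product has a flop count you can write down before running anything; the iteration count of the gradient loop depends on the instance. The matrix Q and the tolerances below are assumptions chosen purely for illustration.

```python
import numpy as np

# Fixed cost: classical multiplication of two n x n matrices always takes
# about 2*n^3 floating-point operations, regardless of the data.
n = 500
A, B = np.random.randn(n, n), np.random.randn(n, n)
C = A @ B  # ~2*n^3 flops, known before we run it

# Iterative cost: gradient descent on f(x) = 0.5 * x^T Q x stops when the
# gradient is small, and the number of iterations depends on Q.
Q = np.diag(np.linspace(1.0, 100.0, 50))  # condition number 100 (illustrative)
x = np.ones(50)
step = 1.0 / 100.0  # step size 1/L, where L is the largest eigenvalue
iters = 0
while np.linalg.norm(Q @ x) > 1e-6:  # gradient of f is Q x
    x -= step * (Q @ x)              # gradient step
    iters += 1
print(iters)  # not knowable in advance without analyzing Q
```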
So what’s the point of predicting the number of iterations? I have co-written a book and several papers on this topic. I’ve taught classes solely devoted to this question. And yet, after over a decade of work, I don’t have a satisfying answer. Maybe what I’d say is that convergence analyses, at their best, pin down the sorts of instances where you expect your method to do well. They are guides, not predictions: accessible indications of which optimization models we should expect to solve in reasonable time.
These analyses can be very helpful. Consider the famous example of Dantzig’s simplex algorithm for solving linear programs. Dantzig proved that it converged, but not how many iterations to expect. A simple counterexample, the Klee–Minty perturbed hypercube, shows that the method might take an exponential amount of time to find a solution. While the simplex method seems to work well on most instances, we can construct linear programs where it won’t converge in a reasonable amount of time. Do these instances look like your instances? Maybe not! But it’s worth identifying the pathological cases so that you know what to look out for when applying the simplex method.
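For reference, here is the textbook statement of that perturbed hypercube, the Klee–Minty cube:

$$
\begin{aligned}
\text{maximize} \quad & \sum_{j=1}^{n} 2^{\,n-j} x_j \\
\text{subject to} \quad & 2\sum_{j=1}^{i-1} 2^{\,i-j} x_j + x_i \le 5^i, \quad i = 1,\dots,n, \\
& x_j \ge 0, \quad j = 1,\dots,n.
\end{aligned}
$$

The feasible region is a slightly squashed $n$-dimensional cube, and Dantzig’s original pivot rule visits all $2^n$ vertices before reaching the optimum.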
A proof of exponential worst-case running time merely highlights that if your solver is taking a long time, it might be a problem with the instance, not the solver. If your code is running forever, maybe you can jigger your model a bit and make it run faster. If you know that a simple reformulation should lead to a faster solution, you should implement it!
The analyses of gradient descent tell a similar story: any reasonably chosen step-size sequence will converge eventually. I’ll present one such analysis in class today. Unavoidable in the analysis of gradient methods is the condition number, a quantity describing the local geometry of the optimization landscape. In these convex problems, a large condition number reliably predicts poor performance, and the convergence proof highlights why. If you can formulate your problem to be well-conditioned, you should.
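One standard version of the result (not necessarily the exact form I’ll present in class): if $f$ is $\mu$-strongly convex with $L$-Lipschitz gradient, then gradient descent with step size $1/L$ satisfies

$$
f(x_{k+1}) - f^\star \le \left(1 - \frac{\mu}{L}\right)\bigl(f(x_k) - f^\star\bigr),
$$

so reaching accuracy $\epsilon$ takes on the order of $\kappa \log(1/\epsilon)$ iterations, where $\kappa = L/\mu$ is the condition number. Large $\kappa$ means slow progress, which is exactly the poor conditioning the proof highlights.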
Similarly, the analyses of Newton’s method suggest that if the algorithm converges, it converges extremely quickly. But the analysis also suggests that the region where this rapid convergence occurs is small, and we can’t guarantee much of anything outside that region. This reflects what we see in practice and motivates the damped Newton schemes used inside most interior-point solvers.
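Here is a minimal sketch of such a damped scheme, assuming a smooth, strictly convex objective with callable gradient and Hessian. The function names, Armijo parameters, and tolerance are illustrative, not any particular solver’s API.

```python
import numpy as np

def damped_newton(f, grad, hess, x, alpha=0.25, beta=0.5, tol=1e-10):
    """Newton's method with backtracking (Armijo) line search."""
    while True:
        g, H = grad(x), hess(x)
        step = np.linalg.solve(H, -g)  # Newton direction
        decrement = -g @ step          # squared Newton decrement
        if decrement / 2.0 <= tol:
            return x
        t = 1.0                        # try the full Newton step first
        while f(x + t * step) > f(x) + alpha * t * (g @ step):
            t *= beta                  # damp until sufficient decrease
        x = x + t * step               # near the solution t stays at 1,
                                       # and convergence is quadratic

# Illustrative use on f(x) = sum(cosh(x_i)), minimized at the origin:
x0 = np.array([3.0, -2.0, 1.5])
sol = damped_newton(lambda x: np.sum(np.cosh(x)),
                    np.sinh,
                    lambda x: np.diag(np.cosh(x)),
                    x0)
```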
The problem arises when we take convergence analyses too literally. The danger is that the rates they deliver are worst-case. A convergence analysis needs two ingredients: an optimization algorithm and a set of optimization instances. You pick a class of functions broad enough to contain the problems you care about, but that class may contain a few instances, like the pathological examples for the simplex method, where your algorithm fails. The analysis tells us about pathologies that might occur. Whether they do occur on your problem can be hard to know in advance.
Reading too much into these analyses is bad practice. For example, a decade ago there was an academic frenzy about “Nesterov’s optimal method.” Optimizers wrote thousands of papers achieving “optimal rates of convergence” using an admittedly brilliant and intricate proof technique invented by Yurii Nesterov. But the algorithms were “optimal” in a very limited sense. Nesterov required the algorithms to form each new iterate from a linear combination of the gradients at past iterates. For this restricted class of algorithms, there are functions where the number of iterations required to achieve a given precision is bounded below. However, the example Nesterov provides to prove the lower bound, minimizing a tridiagonal quadratic form, is trivially solvable by other means. Such instances can be solved by Newton’s method in linear time, since the single Newton step on a quadratic amounts to one tridiagonal solve. So yes, Nesterov’s particular instance breaks gradient-based methods, but that doesn’t mean you need to derive the optimal gradient-based method. Instead, it might just mean you should switch to a Newton method.
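To see just how easy the hard instance is for other methods, here is a sketch, assuming Nesterov’s worst-case quadratic up to scaling and boundary details (the dimension below is arbitrary). Minimizing it is one tridiagonal solve.

```python
import numpy as np
from scipy.linalg import solve_banded

# Nesterov's hard instance, up to scaling: f(x) = 0.5 * x^T A x - x[0],
# with A tridiagonal (2 on the diagonal, -1 off it). Gradient methods
# grind on this function, but the minimizer solves A x = e1, which a
# banded solver handles in O(n) time.
n = 100_000
ab = np.zeros((3, n))   # banded storage: super-, main, and subdiagonal
ab[0, 1:] = -1.0
ab[1, :] = 2.0
ab[2, :-1] = -1.0
e1 = np.zeros(n)
e1[0] = 1.0
x_star = solve_banded((1, 1), ab, e1)   # one linear-time solve
```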
Worst-case bounds give a useful but imperfect way to compare algorithms. A guide to practice need not be perfect.
"For example, we know precisely how long our algorithms take to multiply and factor matrices." . At least in the case of matrix multiplication the technically correct statement is we have an upper bound on how long it takes to multiply matrices. Current best is O(n^2.37) and I believe that O(n^2) is still believed to be the optimat number (perhaps n^{2 +\epsilon} is correct)