*This is the live blog of Lecture 4 of my graduate class “Convex Optimization.” A Table of Contents is here.*

Though I made a big deal last week about how we could reduce all optimization problems to minimizing a linear cost subject to constraints, most people think of cost functions as the central part of optimization. Even though it’s easier to build models by specifying constraints, the act of optimization is easier to conceptualize using functions. Optimization implies a notion of pricing and the existence of a solution of minimal price. Moreover, if I can formulate an optimization problem as finding some point that achieves minimum cost, I can search for such a point by finding clever ways to reduce the cost at each step.

Since Newton, mathematicians have realized that dynamic search methods can find solutions of minimum value. The most popular search method, usually attributed to Cauchy, is the method of steepest descent. This very greedy method tells you to find the direction that maximally decreases your cost and follow that direction.

Steepest descent is something you learn in calculus. The steepest descent direction at any point is in the direction of the negative gradient. If the gradient is zero, there is no local information about how to make improvements. However, there are weird functions even in 1D, like *x*^{3}, where the gradient equals zero at the point *x*=0, but any finite step in the negative direction reduces the cost. Convex functions are the ones for which Cauchy’s method always finds the minimum cost solution. And the proof almost follows from the definition.
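To make Cauchy’s method concrete, here is a minimal sketch of steepest descent on a convex quadratic. The matrix, vector, step size, and iteration count are my own illustrative choices, not anything from the lecture:

```python
import numpy as np

# Steepest descent on the convex quadratic
# f(x) = 0.5 * x^T A x - b^T x, whose gradient is A x - b.
# All data here are illustrative.

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive definite
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

x = np.zeros(2)
step = 0.1  # small enough relative to the eigenvalues of A
for _ in range(500):
    x = x - step * grad(x)  # follow the negative gradient

# The unique minimizer solves A x = b; the iterates converge to it.
x_star = np.linalg.solve(A, b)
print(np.allclose(x, x_star, atol=1e-6))
```

Because the cost is convex, the greedy method lands on the global minimizer rather than getting stuck at a spurious stationary point.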

A function is convex if and only if the plane tangent to the graph of the function lies below the graph at every point.

Writing this in math says that for all x and all v,

f(v) >= f(x) + ∇f(x)^T (v - x).

So if the gradient is equal to zero at x0, then f(v) >= f(x0) for all v. If you get your definitions right, all theorems are trivial.

Last week, we needed to talk about feasible cones and separating hyperplanes to test optimality. But the condition here for convex functions is so much simpler. A point is a global minimizer of a cost function if the gradient of the function at that point is equal to zero.
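As a sanity check, the tangent-plane inequality behind this optimality condition can be verified numerically on a concrete convex function. Log-sum-exp is my own illustrative choice here:

```python
import numpy as np

# Numerically spot-check f(v) >= f(x) + <grad f(x), v - x> for the
# convex function f(x) = log(sum(exp(x))). An illustration, not a proof.

def f(x):
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    e = np.exp(x)
    return e / np.sum(e)  # the softmax is the gradient of log-sum-exp

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    x = rng.normal(size=5)
    v = rng.normal(size=5)
    # small slack for floating-point rounding
    ok &= f(v) >= f(x) + grad_f(x) @ (v - x) - 1e-12
print(ok)
```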

If you can model your cost with convex functions, a simple algorithm always finds the optimal solution. The only question that remains is whether you can model your problem with convex costs. Just like last week, the answer will be to have a brute force definition that is sometimes easy to check and a bunch of rules for building complex convex functions from simple ones.

For this modeling program, we should start with the more common definition of convex functions: a function is convex if, for any two points, the line segment connecting their function values lies above the graph of the function. That is, a convex function is one where the area above the graph is a convex set. Here’s the simplest picture.

Convex functions are all boring like this. Their boringness is their appeal.

Even in high dimensions, convex functions are roughly valley-shaped, perhaps with some sharper edges. But the functions you can prove convex are wild. Feel free to skim these examples, but here are some of the more exotic ones I’ve seen (oddly, all of them introduced to me by Joel Tropp).

- If H is Hermitian and X is positive definite, f(X) = -trace(exp(H + log X)) is convex. Here exp and log are the *matrix* exponential and logarithm.
- If X and Z are positive definite, f(X, Z) = trace(X(log X - log Z) - (X - Z)) is convex.
- If f is any complex differentiable function on the strip of complex numbers with real parts between *a* and *b*, then g(x) = sup_{y} |f(x + iy)| is convex on the interval [a, b].

Wut. Verifying the convexity of these particular examples is a real pain. The first one even has a name associated with it (Lieb’s Theorem). But for those of us who just want to model, it’s better to take the building block approach, again looking at operations that preserve convexity and using these to build up examples.

**Nonnegative Combinations**. Any linear combination of convex functions with nonnegative coefficients is convex. You can prove this using the fact that inequalities are preserved under nonnegative combinations. This means that integrals of convex functions are convex, as are expected values.

**Composition with Affine Functions.** If f(x) is convex, A is a matrix, and b is a vector, then g(x) = f(Ax+b) is convex.

**Maximization.** The pointwise maximum of several convex functions is itself a convex function. In mathematics: if f1, f2, …, fk are convex, then g(x) = max_{i} f_{i}(x) is convex.

**Partial minimization.** If f(x, z) is jointly convex in (x, z), then g(x) = inf_z f(x, z) is convex in x.

**Composition with scalar functions.** If f is convex and g is convex and nondecreasing, then h(x) = g(f(x)) is convex.
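A quick way to build intuition for these rules is a midpoint-convexity spot check on a function assembled from them. The particular function and data below are arbitrary choices of mine:

```python
import numpy as np

# g(x) = max( |c^T x|, ||A x + b||^2 ) combines composition with affine
# functions, nonnegative combination, and maximization, so it is convex.
# We spot-check midpoint convexity on random pairs. Data are arbitrary.

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
b = rng.normal(size=3)
c = rng.normal(size=4)

def g(x):
    return max(abs(c @ x), np.sum((A @ x + b) ** 2))

ok = True
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    # convexity implies g of the midpoint is at most the average of g
    ok &= g((x + y) / 2) <= (g(x) + g(y)) / 2 + 1e-9
print(ok)
```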

With these simple rules, you can build up a giant library of convex functions. You could argue the only ones you really need are affine functions a^{T} x + b. If you just think about this geometrically, a convex function is equal to the pointwise maximum of all of the affine functions that lie below its graph. This is the same as saying that a convex set is the intersection of all halfspaces that contain it.
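The max-of-affine-minorants picture can be seen numerically on a one-dimensional example. For f(x) = x^2, the tangent line at x0 is 2·x0·x - x0^2, and the pointwise max over a grid of tangents hugs the graph from below (the grid and tolerances are illustrative):

```python
import numpy as np

# f(x) = x^2 as the pointwise max of its tangent lines 2*x0*x - x0^2.

x0s = np.linspace(-3, 3, 61)   # tangent points
xs = np.linspace(-3, 3, 301)   # evaluation grid

f = xs ** 2
# one tangent line per row, evaluated on the grid
tangents = 2 * x0s[:, None] * xs[None, :] - (x0s ** 2)[:, None]
approx = tangents.max(axis=0)

print(np.all(approx <= f + 1e-12))  # every tangent stays below the graph
print(np.max(f - approx) < 0.01)    # their max nearly recovers f
```

With a finer grid of tangent points, the max converges to f itself, which is the function analogue of a convex set being the intersection of its supporting halfspaces.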

But such sophisticated mathematical abstractions aren’t necessary for modeling. Instead, we just need the composition rules. We can get some outlandish functions with the building blocks alone:

- The sum of the squared residual errors in a linear model
- The sum of the k largest components of a vector
- The negative geometric mean of a bunch of concave functions
- The Kullback-Leibler divergence between two probability distributions
- The maximum eigenvalue of a positive semidefinite matrix

All of these functions are convex and can be proven so using the simple composition rules. And that means that they can be minimized by local search. Disciplined modeling guarantees efficient ways to find global minimizers. The only question (and it’s a hard one) is whether you can find a good model for your problem only using these rules.
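To close, one of the examples above can itself be spot-checked by the same midpoint test. The sum of the k largest components is the maximum of linear functions (one per choice of k coordinates), so the rules say it is convex; the test below is an illustrative numerical check, not the proof:

```python
import numpy as np

# Midpoint-convexity spot check for the sum of the k largest components.

def sum_k_largest(x, k=3):
    return np.sum(np.sort(x)[-k:])

rng = np.random.default_rng(2)
ok = True
for _ in range(1000):
    x, y = rng.normal(size=8), rng.normal(size=8)
    mid = sum_k_largest((x + y) / 2)
    ok &= mid <= (sum_k_largest(x) + sum_k_largest(y)) / 2 + 1e-9
print(ok)
```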