This is the live blog of Lecture 12 of my graduate class “Convex Optimization.” A Table of Contents is here.
My favorite linear map is function evaluation. If I have a real-valued function f and a point x, the map from f to f(x) is linear in f. If I have two scalars a and b and two functions f and g then (a f + b g) (x) = af(x) + bg(x). This little observation lets us solve a variety of optimization problems where the optimization variable is a function.
The foundational problems we can pose are interpolation problems. Interpolation asks us to find a function that goes through a specified set of outputs on a specified set of inputs. Particular xs must map to particular ys. We often call these “function fitting” problems as we constrain the function to fit a specified data set. For any fixed x and y, “f(x)=y” is a linear constraint. This means that problems that constrain the values of f on a list of points are convex feasibility problems. Interpolation is a convex problem. And if you want approximate interpolation, you can add the constraint f(x)=y+w and minimize some penalty on the w’s, just like we did in last week’s lecture on inverse problems.
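To make this concrete (in my notation here, written loosely): given data points (x_1, y_1), …, (x_n, y_n), exact interpolation is the feasibility problem

find f such that f(x_i) = y_i for i = 1, …, n,

and the approximate version is

minimize over f and w: sum_i penalty(w_i), subject to f(x_i) = y_i + w_i for i = 1, …, n,

where the penalty is any convex function of the slacks. Every constraint here is linear in f, so both problems are convex.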
We can use interpolation constraints as building blocks of general function fitting problems. I can take linear combinations of interpolation constraints and get more linear constraints. For example, if I want the average of f over some set of points to equal some value, this will also be a linear constraint. Similarly, constraints on the moments of f, given by integrals of the form

∫ x^k f(x) dx,

are linear constraints on f.
I can write restrictions on the variability of f as constraints as well. The derivative operator, mapping f to the derivative of f, is a linear operator. Thus, interpolation constraints on the gradient of a function are linear constraints. Moreover, constraints on the magnitude of the gradient at particular points are convex constraints. Similarly, you can constrain the smoothness of the function at points by penalizing the Hessian (the result of applying the derivative twice). Bounds on the Hessian at points are convex constraints.
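A couple of concrete instances, written loosely: "∇f(x_j) = g_j" is a linear constraint, "||∇f(x_j)|| ≤ L" is a convex constraint, and "∇²f(x_j) ⪯ M·I" (a bound in the semidefinite order) is a convex constraint, all because the map from f to its derivatives at a fixed point is linear.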
Now, can we solve the convex optimization problems associated with fitting functions? Function spaces are usually infinite dimensional, so we need to make some extra assumptions to get problems we can tractably solve with computers. The easiest and most common way to get tractability is to force f to be in the span of a finite set of basis functions,

f(x) = c_1 f_1(x) + c_2 f_2(x) + … + c_d f_d(x),

where the f_i(x) are fixed functions and the variables are now the coefficients c_i. Now the problem becomes a search for coefficients and is d-dimensional. All of the constraints I listed above become convex constraints on the coefficients of a basis expansion. This basis expansion could just be linear (where f_i(x) is the ith component of x). You could choose low-degree polynomials, and the problem would be linear in the coefficients of the polynomial. If you were doing signal processing, you could choose trigonometric functions. Now you'd be interpolating with sinusoids. You could make a mesh and fit piecewise linear functions or splines to it. As long as the representation is linear in the parameters, you can probably pose your interpolation problem as a convex problem.
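As a concrete illustration, here is a minimal sketch (not code from the class; it assumes cvxpy is available and uses made-up data, a degree-4 polynomial basis, and an arbitrary derivative bound):

```python
import numpy as np
import cvxpy as cp

# Made-up data points to (approximately) interpolate.
x = np.array([0.0, 0.3, 0.5, 0.8, 1.0])
y = np.array([0.0, 0.6, 0.4, 0.9, 1.0])

d = 4  # polynomial degree: f(x) = c_0 + c_1 x + ... + c_d x^d
A = np.vander(x, d + 1, increasing=True)  # A[i, k] = x_i**k, so f(x_i) = (A @ c)[i]

c = cp.Variable(d + 1)   # the coefficients are the optimization variable
w = cp.Variable(len(x))  # slacks: f(x_i) = y_i + w_i

constraints = [A @ c == y + w]

# Derivative values are also linear in c: f'(x_i) = sum_k k * c_k * x_i**(k-1).
D = np.column_stack([k * x ** (k - 1) if k > 0 else np.zeros_like(x)
                     for k in range(d + 1)])
constraints.append(cp.abs(D @ c) <= 5.0)  # arbitrary bound on |f'| at the data points

# Penalize the slacks, as in the approximate-interpolation setup above.
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()
print(np.round(c.value, 3))
```

Swapping the Vandermonde matrix for a matrix of sinusoid or spline evaluations gives the other examples above without changing the rest of the code.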
What if you don’t want to specify a basis? With a little bit of hand waving, we can take apparently infinite-dimensional problems and make them finite-dimensional. No matter the dimension of the primal variable, the dual problem of a convex optimization problem has a number of variables equal to the number of constraints in the primal problem. This means the dual problem of an interpolation problem will necessarily be finite-dimensional. Provided that you can compute the dual function, which requires solving an unconstrained minimization over a function space, you can often back out the function you care about using the KKT conditions.
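To give a flavor of the argument (this is one standard instance, not necessarily the formulation from lecture, and it glosses over the functional-analytic care discussed next): take minimum-norm interpolation in a Hilbert space of functions where evaluation at x_i is an inner product with some element k_i, so f(x_i) = ⟨f, k_i⟩:

minimize (1/2)||f||² subject to ⟨f, k_i⟩ = y_i for i = 1, …, n.

Minimizing the Lagrangian (1/2)||f||² − sum_i λ_i(⟨f, k_i⟩ − y_i) over f gives f = sum_i λ_i k_i. The dual is then an n-dimensional problem in λ, and the optimal function is a finite combination of the k_i, which is exactly the shape of the representer theorem mentioned below.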
The sketchy argument in this last paragraph plays a little fast and loose with the KKT conditions on function spaces, and strong duality only holds with appropriate convex optimization rigor in place. I managed to write a high-level blog here without a ton of equations, and it felt like a shame to spoil it with too many derivations. If there’s interest, I could write out the gory details of how this works with a bunch of equations tomorrow. Let me know in the comments.
Regardless, duality gives you a computational means to solve these problems. This idea reduces classic problems in robust control theory to interpolation problems. It is what enables the fitting of cubic splines. In machine learning, this weird dual fact was called the “Representer Theorem.” It’s why reproducing kernels are unavoidable in the theory of machine learning. It’s also why we used to spend so much time on the dual problem of support vector machines in machine learning classes. Convex optimization lets you solve a ton of interpolation problems, and maybe interpolation is all you need.
As course material, I would suggest examining the contents of the wonderful book Optimization by Vector Space Methods (Luenberger) from the convex optimization viewpoint.
Luenberger explains everything required without explicit use of duality; yet he uses the dual spaces of various spaces in his solutions, and he leans heavily on separating hyperplane arguments. I was mostly able to follow his reasoning and explanations, but while reading the text it seemed to me that he invents new methods as he goes through the chapters. Once I realized that the techniques described in the book can be explained simply through the dual problem formulation, in the context of this convex optimization course, I reached a state of somewhat more calm and clarity.
For example, the optimal control example on page 124, starting with "Example 2: Consider the problem of selecting the field current u(t) on [0,1] to drive a motor governed by d^2/dt^2 \theta(t) + d/dt \theta(t) = u(t) from the initial conditions \theta(0) = \dot{\theta}(0) = 0 to \theta(1) = 1, \dot{\theta}(1) = 0 in such a way as to minimize max |u(t)| over 0 <= t <= 1," is, in my opinion, an excellent example of the importance of constraints and of how and why we need to interpolate them when the goal is to minimize the norm of an unknown function.
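Concretely (sketching this out myself, so the presentation may differ from the book's): integrating the ODE with the zero initial conditions gives \dot{\theta}(t) = \int_0^t e^{-(t-s)} u(s) ds, so the two terminal conditions become two linear functionals of u,

\int_0^1 e^{-(1-t)} u(t) dt = 0 (from \dot{\theta}(1) = 0) and \int_0^1 (1 - e^{-(1-t)}) u(t) dt = 1 (from \theta(1) = 1),

and the problem is to minimize the max norm of u (or, in the next example, its L_2 norm) subject to these two interpolation-style constraints.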
Luenberger also solves the same problem while minimizing the energy of u(t) on the [0,1] interval (page 66, Example 1), which is more standard, since it is the classical minimum-norm solution in L_2 space and has a closed-form solution. (Students with a good linear algebra background, who know the minimum-norm solution of an underdetermined linear system, should appreciate this even more.)
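As a quick numpy sketch of that finite-dimensional analogue (made-up sizes, not from the book; the two constraint rows stand in for the two terminal conditions):

```python
import numpy as np

# Minimum-norm solution of an underdetermined system A x = b:
# 2 equality constraints, many unknowns, pick the x of smallest norm.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 50))  # made-up constraint functionals
b = np.array([1.0, 0.0])          # analogous to theta(1) = 1, theta'(1) = 0

x_min_norm = A.T @ np.linalg.solve(A @ A.T, b)  # x* = A^T (A A^T)^{-1} b
print(np.allclose(A @ x_min_norm, b), np.linalg.norm(x_min_norm))
```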
"In machine learning, this weird dual fact was called the “Representer Theorem.” It’s why reproducing kernels are unavoidable in the theory of machine learning. It’s also why we used to spend so much time on the dual problem of support vector machines in machine learning classes."
Hmmm, if I may permit myself to add some color, I'd take a slightly different perspective.
Looking back at my thesis (https://dspace.mit.edu/handle/1721.1/17549), it was easy enough to apply the Representer Theorem to the primal SVM problem (Eqs. 2.13-2.15 in my thesis). The reason to derive the dual (Eqs. 2.23-2.25) was that the constraint structure was simpler to work with in the dual --- in the primal you had an inequality constraint per data point, each of which involved *all* the c_i's, while in the dual you had a single equality constraint on the alpha_i's and simple interval inequalities on each alpha_i individually. This made it easy to write a custom solver which exploited the sparsity of the alphas (the c's are of course equally sparse, but we didn't see how to exploit this), allowing us to "train" SVMs in less time than it took to compute the entire kernel matrix. And *that's* why we used the dual for SVMs. Or at least why I did. At least if I postrationalize it. At the beginning, I used it because my predecessor Edgar Osuna had used it.
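For readers who haven't seen it, the constraint structure being described is that of the standard soft-margin kernel SVM dual (written from memory, so the details may differ from Eqs. 2.23-2.25 in the thesis):

maximize over alpha: sum_i alpha_i − (1/2) sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)
subject to: sum_i y_i alpha_i = 0 and 0 ≤ alpha_i ≤ C for each i,

i.e., one equality constraint plus a simple interval constraint per variable, versus the primal's one coupled inequality per data point.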