This is the second part of the live blog of Lecture 3 of my graduate class “Convex Optimization.” A Table of Contents is here.
In the last lecture, I argued we could verify a solution to a convex optimization problem was optimal by showing that two sets were separated by a hyperplane. If this is your first time seeing it, it’s a weird way of thinking about proof. We take something that feels like we need epsilons, deltas, and calculus and transform it into a question of geometry.
Take two disjoint convex sets, C and D. A separating hyperplane is one where C is contained on one side of the hyperplane and D on the other. In algebraic terms, C and D are separated by a hyperplane if there is an affine function h(x) that is nonpositive on C and nonnegative on D.
Suppose there are two points, c from C and d from D, whose distance equals the distance between C and D: c is the closest point in the set C to the set D and vice versa. Take the line segment that connects c and d. Define the hyperplane that cuts through the midpoint of that segment and whose normal vector is the direction from c to d. Then this hyperplane separates C and D. Proof by picture:
One halfspace contains C and the other D. If you don’t like proofs by picture, you can verify this with a few lines of algebra (see Section 2.5.1 in BV).
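If you want to see the construction in coordinates, here is a minimal numerical sketch with a made-up closest pair c and d: the affine function built from the midpoint and the direction d − c comes out negative at c and positive at d, exactly as the picture suggests.

```python
import numpy as np

# Minimal sketch of the construction above with a hypothetical closest pair.
c = np.array([1.0, 0.0])      # closest point of C to D (made up for illustration)
d = np.array([3.0, 2.0])      # closest point of D to C (made up for illustration)

a = d - c                     # normal vector: the direction from c to d
b = a @ (c + d) / 2           # offset so the hyperplane passes through the midpoint

h = lambda x: a @ x - b       # affine function defining the hyperplane h(x) = 0
print(h(c), h(d))             # -4.0 4.0: nonpositive at c, nonnegative at d
```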
It came up in class that this picture looks like the one shown in machine learning classes when introducing maximum margin classification. If you are into the machine learning terminology, c and d are support vectors, and the distance is twice the margin between the sets C and D. This isn’t a coincidence: the connections between convex programming and pattern classification have been around since the original heyday of the Perceptron. In 1961, Bill Highleyman noted that two sets were linearly separable in the sense of Rosenblatt if their convex hulls were disjoint. In 1965, Olvi Mangasarian then realized that pattern classification could be solved by linear programming, hence inventing what would be renamed support vector machine in the 1990s.
In the blasphemous language of machine learning, in order to have a strict and robust separation of two convex sets, you need to have a good margin. But what about when you don’t? The reason I defined separating in terms of “nonpositive” and “nonnegative” rather than “negative” and “positive” is that there are disjoint sets where there is no hyperplane achieving strict positivity and negativity. Take this example (which I couldn’t come up with in my head in class yesterday, but Dan Glaser and Aathreya Kadambi suggested it after class):
If you prefer equations to pictures, the two sets here are C = {(x, y) : x ≤ 0} and D = {(x, y) : x > 0, y ≥ 1/x}.
They are certainly disjoint, but there is no margin between the sets. The only separating hyperplane is the one where x=0. But this does not strictly separate the sets. The affine function is equal to zero for some points in C.
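For a quick sanity check that the gap really does close (assuming, as in the picture, that C is the halfplane {(x, y) : x ≤ 0} and D is the region {(x, y) : x > 0, y ≥ 1/x} above the hyperbola), note that the points (t, 1/t) all lie in D, and their distance to C is just t:

```python
# Sketch: the points (t, 1/t) lie in D, and the distance from (t, 1/t) to the
# halfplane C = {(x, y) : x <= 0} is just t. The gap shrinks to zero, so no
# hyperplane separates C and D with positive margin.
for t in [1.0, 0.1, 0.01, 0.001]:
    point_in_D = (t, 1.0 / t)
    dist_to_C = t
    print(f"t = {t:6.3f}   distance to C = {dist_to_C:6.3f}")
```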
One case where there is always margin is separating a point x from a closed convex set C.1 This leads to a wild corollary. I argued in Lecture 1 that a closed convex set is equal to the intersection of all of the halfspaces that contain it. To see that this is true, take any point not in the set. Then, by the separating hyperplane theorem, there is a halfspace containing the convex set that doesn’t contain the point.
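Here's a tiny illustration of that corollary with a concrete set of my choosing (the closed unit ball): for any point x outside the ball, the halfspace {z : xᵀz ≤ ‖x‖} contains the whole ball by Cauchy–Schwarz, but does not contain x, since xᵀx = ‖x‖² > ‖x‖.

```python
import numpy as np

# Sketch with the closed unit ball as the convex set (my choice, not the
# lecture's): for a point x with ||x|| > 1, the halfspace {z : x^T z <= ||x||}
# contains the ball (x^T z <= ||x|| ||z|| <= ||x||) but excludes x itself
# (x^T x = ||x||^2 > ||x||).
x = np.array([2.0, 1.0])                      # a point outside the unit ball
in_halfspace = lambda z: x @ z <= np.linalg.norm(x)

print(in_halfspace(np.array([0.6, -0.8])))    # a point of the ball: True
print(in_halfspace(x))                        # the excluded point: False
```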
Now, why should we care about these separating hyperplane theorems other than because of these cute proofs? Here’s an example that came up in class (note to self: I will learn the name of the person who suggested this on Tuesday for proper attribution), but I had forgotten the rigorous statement:
Does the system of equations Ax = b have a solution x with all nonnegative entries?
If you have a candidate solution, you can check it by verifying it’s nonnegative and plugging it into the system. How can you check if no solution exists? We can find a separating hyperplane.
Let C be the set of all nonnegative combinations of the columns of A. If you can find a hyperplane separating C and the point b, then the system doesn’t have a solution. If you work through the algebra, you’ll see that this amounts to finding a vector y such that Aᵀy ≥ 0 and bᵀy < 0. We can verify one polyhedron is empty by finding a single point in another polyhedron.
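If you want to hunt for such a certificate in practice, one way (a sketch, not the lecture's recipe) is to solve a small auxiliary linear program: minimize bᵀy over the box −1 ≤ y ≤ 1 subject to Aᵀy ≥ 0. If the optimal value is negative, the optimal y is a Farkas certificate that Ax = b has no nonnegative solution.

```python
import numpy as np
from scipy.optimize import linprog

def farkas_certificate(A, b):
    """Sketch: look for y with A^T y >= 0 and b^T y < 0, a certificate that
    Ax = b, x >= 0 is infeasible. The box -1 <= y <= 1 just normalizes the
    scale of the certificate."""
    m, n = A.shape
    res = linprog(c=b,                           # minimize b^T y
                  A_ub=-A.T, b_ub=np.zeros(n),   # -A^T y <= 0, i.e., A^T y >= 0
                  bounds=[(-1, 1)] * m,
                  method="highs")
    if res.status == 0 and res.fun < -1e-9:
        return res.x                             # certificate of infeasibility
    return None                                  # no certificate found

# Example: b has a negative entry, so it is not a nonnegative combination
# of the columns of the identity matrix.
A = np.eye(2)
b = np.array([-1.0, 2.0])
print(farkas_certificate(A, b))                  # e.g. [1. 0.]: A^T y >= 0, b^T y = -1
```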
This particular separation result is called Farkas’ Lemma. It is the first time convex programming duality has reared its head in the class. We can turn problems of minimization into problems of checking emptiness. We can turn problems of checking emptiness into searching for separating hyperplanes. And then we can turn problems of searching for separating hyperplanes into problems of maximization. To every convex optimization problem, there’s a dual convex optimization problem. This is a central theme in convex optimization.
Duality is why convex optimization is a relatively easy problem. Checking if there exists a point with optimal value less than some number can be done by plugging in a solution. Proving there exists no point with optimal value less than some other number can be done by providing a separating hyperplane. You can verify a hyperplane is valid by plugging it into the dual problem and making sure it’s feasible. This means that verifying upper bounds and lower bounds of the objective function is easy.2 This duality puts us one step away from proving we have efficient algorithms.
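To make that concrete, here is a sketch for the standard-form linear program minimize cᵀx subject to Ax = b, x ≥ 0 (my choice of canonical form, not the lecture's notation): a feasible x certifies the upper bound cᵀx just by plugging in, and any y with Aᵀy ≤ c certifies the lower bound bᵀy, because bᵀy = (Ax)ᵀy = xᵀ(Aᵀy) ≤ xᵀc. Both checks are a few lines of arithmetic.

```python
import numpy as np

# Sketch: certificate checks for the LP  minimize c^T x  s.t.  Ax = b, x >= 0.
def check_upper_bound(A, b, c, x, tol=1e-9):
    feasible = bool(np.all(x >= -tol)) and np.allclose(A @ x, b)
    return feasible, float(c @ x)      # if feasible, c @ x upper-bounds the optimum

def check_lower_bound(A, b, c, y, tol=1e-9):
    feasible = bool(np.all(A.T @ y <= c + tol))
    return feasible, float(b @ y)      # if feasible, b @ y lower-bounds the optimum

A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
print(check_upper_bound(A, b, c, np.array([1.0, 0.0])))   # (True, 1.0)
print(check_lower_bound(A, b, c, np.array([1.0])))        # (True, 1.0): the bounds meet
```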
Since C is closed and x is not in C, the complement of C is open, so we can put a ball of some radius r around x that does not intersect C. Separating the ball from C strictly separates the point from C.
For the complexity theorists, it means we’re in NP ⋂ coNP.
Mangasarian 1965 is a trip. If I'm reading it correctly (and I'm not sure I am), it's not quite support vector machines, in that the linear program simply decides separability of the sets, rather than finding a maximum margin separator? And to do the latter, one needs a quadratic program (to minimize the norm of the separator)?
Also interesting how I find myself recoiling a bit at the explicit bias terms everywhere. It's just noise, man! Particularly if you're just trying to certify separability?
I’m slightly confused as to why finding y such that (A^T)y >= 0 and (b^T)y <= 0 implies that the system Ax=b has no non-negative solutions. As an example, if you take A to be the 2x2 identity matrix and b to be the first standard basis vector then I can set y to be the second standard basis vector and satisfy the conditions above. However, setting x to be the first standard basis vector gives a non-negative solution to Ax=b. Should the inequalities in the condition be made strict, since the conic combination of a finite set of vectors will give us a closed set, and you explained in this post how the separating hyperplane between a point and a closed set will always have some margin?