Mangasarian 1965 is a trip. If I'm reading it correctly (and I'm not sure I am), it's not quite support vector machines, in that the linear program simply decides separability of the sets, rather than finding a maximum margin separator? And to do the latter, one needs a quadratic program (to minimize the norm of the separator)?
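If it helps to see the flavor of it, here's the separability check as I understand it, written as an LP feasibility problem in scipy. This is my reconstruction of the idea, not Mangasarian's exact program:

```python
# Sketch (my reconstruction, not Mangasarian's actual formulation): decide
# whether two point sets are linearly separable by checking feasibility of
#   find w, gamma  s.t.  A @ w + gamma >= 1  and  B @ w + gamma <= -1.
# (The ">= 1" normalization is harmless: strict separation lets you rescale.)
import numpy as np
from scipy.optimize import linprog

def linearly_separable(A, B):
    """True iff some hyperplane w.x + gamma strictly separates A from B."""
    m, d = A.shape
    k, _ = B.shape
    # Variables: [w (d), gamma (1)]. linprog wants A_ub @ z <= b_ub.
    #   -(A @ w + gamma) <= -1   (points of A on the positive side)
    #    (B @ w + gamma) <= -1   (points of B on the negative side)
    A_ub = np.vstack([np.hstack([-A, -np.ones((m, 1))]),
                      np.hstack([ B,  np.ones((k, 1))])])
    b_ub = -np.ones(m + k)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))  # w, gamma unrestricted
    return res.status == 0  # 0 = feasible, 2 = infeasible

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 2.0], [1.0, 2.0]])
print(linearly_separable(A, B))  # separable by the line x2 = 1
```

No objective at all (`c = 0`): the LP is a pure certificate of separability, which I think is the point of the paper.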

Also interesting how I find myself recoiling a bit at the explicit bias terms everywhere. It's just noise, man! Particularly if you're just trying to certify separability?

Ah, the good old days when you had to worry about whether your hyperplane went through the origin.

You're right that the program only decides the separation of sets. Smith proposed to fix this by minimizing the hinge loss in 1968 (https://ieeexplore.ieee.org/abstract/document/1687347). Grinold showed how you could add one extra constraint to Mangasarian's program and get a good margin separator in the separable case, and minimize the max error in the non-separable case (https://www.jstor.org/stable/168677?seq=1).
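The hinge-style objective is what keeps this in LP territory, since the slacks enter linearly. A rough sketch of that flavor of program (my paraphrase, not Smith's exact formulation):

```python
# Sketch of a Smith-1968-flavored program (my paraphrase, not his exact
# formulation): minimize the total slack needed to satisfy the separation
# constraints. Linear in (w, gamma, s), so it's an LP.
import numpy as np
from scipy.optimize import linprog

def min_total_slack(X, y):
    """min sum_i s_i  s.t.  y_i (w.x_i + gamma) >= 1 - s_i,  s_i >= 0."""
    n, d = X.shape
    # Variables: [w (d), gamma (1), s (n)]; only the slacks are penalized.
    c = np.concatenate([np.zeros(d + 1), np.ones(n)])
    # Rewrite the margin constraints as  -y_i (w.x_i + gamma) - s_i <= -1.
    A_ub = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.fun  # 0 exactly when the data are separable with margin

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])  # XOR labels: not linearly separable
print(min_total_slack(X, y))  # strictly positive, since XOR can't be split
```

On separable data the optimum is exactly zero, so this subsumes the pure separability certificate.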

I don't think I can make time to dive too deeply into classification problems from the 1965-1970 period, but I spent a few minutes with Smith 1968. It looks like the bulk of the paper is talking about an iterative method for finding W, which is going to lead to some weight regularization just by virtue of running out of steps? Sort of an implicit early stopping argument?
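To make my hand-waving slightly more concrete, here's the sort of implicit-norm-bound argument I have in mind. This is my illustration of the early-stopping intuition, not Smith's actual procedure:

```python
# My illustration (not Smith's algorithm): an iterative update with a fixed
# step budget bounds ||w|| by (steps) * (step size) * (max example norm),
# so "running out of steps" acts like an implicit norm constraint.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = np.sign(X @ np.array([1.0, -1.0, 0.0, 0.0, 0.0])
            + 0.1 * rng.normal(size=50))

w = np.zeros(5)
eta, steps = 0.1, 25
for _ in range(steps):
    mask = y * (X @ w) < 1                 # points violating the margin
    if not mask.any():
        break
    g = -(y[mask, None] * X[mask]).mean(axis=0)  # hinge-loss subgradient
    w -= eta * g                           # each step moves w by at most
                                           # eta * max_i ||x_i||
print(np.linalg.norm(w))
```

Each update changes `w` by at most `eta * max_i ||x_i||`, so after `steps` iterations the weight norm is bounded regardless of the data, which is the "regularization by running out of steps" effect.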

Later in the paper I see at least one point where they mention minimizing the L1 norm of w, but I can't really tell quickly how this is connected to anything algorithmic.

I guess for linear classifiers, you can probably replace the L2-norm regularizer with an L1-norm regularizer and get away with a linear program rather than a quadratic program? (I'm pretty sure someone actually did this at some point and it pissed me off; if I'd known about these 1967-era papers it probably would've pissed me off more? I could be misremembering?)
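At least the reduction itself goes through: with the usual splitting trick for |w_j|, hinge loss plus an L1 penalty is a single LP. A quick sketch (a hypothetical formulation of mine, not any particular paper's):

```python
# Sketch: an L1-regularized hinge-loss linear classifier really is an LP.
# Handle |w_j| with auxiliary variables u_j via u_j >= w_j, u_j >= -w_j.
import numpy as np
from scipy.optimize import linprog

def l1_svm_lp(X, y, lam=0.1):
    n, d = X.shape
    # Variables: [w (d), u (d), gamma (1), s (n)];
    # minimize sum(s) + lam * sum(u).
    c = np.concatenate([np.zeros(d), lam * np.ones(d), [0.0], np.ones(n)])
    rows, b = [], []
    # Hinge constraints:  -y_i (w.x_i + gamma) - s_i <= -1.
    rows.append(np.hstack([-y[:, None] * X, np.zeros((n, d)),
                           -y[:, None], -np.eye(n)]))
    b.append(-np.ones(n))
    # Absolute-value constraints:  w_j - u_j <= 0  and  -w_j - u_j <= 0.
    rows.append(np.hstack([ np.eye(d), -np.eye(d), np.zeros((d, n + 1))]))
    rows.append(np.hstack([-np.eye(d), -np.eye(d), np.zeros((d, n + 1))]))
    b += [np.zeros(d), np.zeros(d)]
    bounds = [(None, None)] * (2 * d + 1) + [(0, None)] * n
    res = linprog(c, A_ub=np.vstack(rows), b_ub=np.concatenate(b),
                  bounds=bounds)
    return res.x[:d]  # the learned weight vector

X = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 0.0], [1.0, 2.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])  # only the second coordinate matters
w = l1_svm_lp(X, y)
print(np.round(w, 3))  # the L1 penalty zeroes the irrelevant coordinate
```

The sparsity you'd expect from L1 shows up for free: the irrelevant first coordinate gets weight exactly zero.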

But once you move to the nonlinear case, you're reliant on the representer theorem, which turns your L2 regularizer into c^T K c, and then you're basically forced into quadratic programming?
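For concreteness, here's that representer-theorem bookkeeping in the easiest case I know, squared loss, where the c^T K c objective has a closed form instead of needing a QP solver. The kernel and parameters below are just made-up choices for illustration:

```python
# Sketch: by the representer theorem, f(x) = sum_i c_i k(x_i, x), and the
# L2 (RKHS) regularizer ||f||^2 becomes c^T K c. With squared loss this is
# kernel ridge regression and has a closed form; with hinge loss you'd get
# a QP in c instead.
import numpy as np

def rbf_kernel(X, Z, gamma=10.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
K = rbf_kernel(X, X)
lam = 1e-3
# argmin_c ||K c - y||^2 + lam * c^T K c  has stationarity condition
# (K + lam I) c = y  (when K is invertible), so:
c = np.linalg.solve(K + lam * np.eye(len(X)), y)
print(float(c @ K @ c))  # the regularizer's value at the optimum
```

Swap the squared loss for hinge loss and the c^T K c term stays, but the closed form disappears and you're back to quadratic programming, which I think is the point.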

I’m slightly confused as to why finding y such that (A^T)y >= 0 and (b^T)x <= 0 implies that the system Ax=b has no non-negative solutions. As an example, if you take A to be the 2x2 identity matrix and b to be the first standard basis vector then I can set y to be the second standard basis vector and satisfy the conditions above. However, setting x to be the first standard basis vector gives a non-negative solution to Ax=b. Should the inequalities in the condition be made strict, since the conic combination of a finite set of vectors will give us a closed set, and you explained in this post how the separating hyperplane between a point and a closed set will always have some margin?

Good catch! You need (b^T)y to be strictly less than zero. I had a typo, which I have now fixed. Thanks!

Meant (b^T)y <= 0 above, rather than (b^T)x.
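For concreteness, here's the counterexample checked in numpy, with the non-strict condition written as b^T y <= 0 per the correction:

```python
# Check of the counterexample: with the non-strict condition b^T y <= 0,
# y = e2 "certifies" infeasibility even though x = e1 is a non-negative
# solution of Ax = b. The strict condition b^T y < 0 rules y = e2 out.
import numpy as np

A = np.eye(2)
b = np.array([1.0, 0.0])  # e1
y = np.array([0.0, 1.0])  # e2
x = np.array([1.0, 0.0])  # e1

print(np.all(A.T @ y >= 0))  # A^T y >= 0 holds
print(b @ y <= 0)            # b^T y <= 0 holds, but b^T y = 0, not < 0
print(np.allclose(A @ x, b), np.all(x >= 0))  # yet Ax = b, x >= 0 is feasible
```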