As course material, I might suggest examining the wonderful book Optimization by Vector Space Methods (Luenberger) from the convex optimization viewpoint.
Luenberger explains everything needed without explicit use of duality; instead, he works with the dual spaces of the various function spaces involved, and he leans heavily on separating hyperplanes for his solutions. I was able to follow his reasoning and explanations for the most part, but while reading the text it seemed to me that he was inventing new methods as he went through the chapters. Once I realized that the techniques described in the book can be explained simply through the dual problem formulation, in the context of this convex optimization course, I reached somewhat more calm and clarity.
For example, the optimal control example on page 124, which begins "Example 2: Consider the problem of selecting the field current u(t) on [0,1] to drive a motor governed by \ddot{\theta}(t) + \dot{\theta}(t) = u(t) from the initial conditions \theta(0) = \dot{\theta}(0) = 0 to \theta(1) = 1, \dot{\theta}(1) = 0 in such a way as to minimize max |u(t)| over 0 <= t <= 1", is in my opinion an excellent example of the importance of the constraints and of how and why the solution must interpolate them when the goal is to minimize the norm of an unknown function.
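To make this concrete, here is a small numerical sketch of my own (not from the book; the grid size and quadrature weights are arbitrary choices): integrating the equation with zero initial conditions turns the two endpoint conditions into linear functionals of u, and minimizing max |u(t)| on a grid becomes a small linear program.

```python
# A minimal numerical sketch (mine, not Luenberger's): discretize u on a grid and
# solve min max|u| subject to the two endpoint constraints as a linear program.
# For theta'' + theta' = u with theta(0) = theta'(0) = 0, integration gives
#   theta(1)  = \int_0^1 (1 - e^{s-1}) u(s) ds = 1
#   theta'(1) = \int_0^1 e^{s-1} u(s) ds       = 0
import numpy as np
from scipy.optimize import linprog

n = 400
s = np.linspace(0.0, 1.0, n)
w = np.full(n, 1.0 / n)           # crude quadrature weights (assumed adequate here)

k1 = 1.0 - np.exp(s - 1.0)        # kernel of the functional giving theta(1)
k2 = np.exp(s - 1.0)              # kernel of the functional giving theta'(1)

# Variables: [u_1, ..., u_n, t]; minimize t.
c = np.zeros(n + 1); c[-1] = 1.0

# |u_i| <= t  <=>  u_i - t <= 0  and  -u_i - t <= 0
A_ub = np.block([[np.eye(n), -np.ones((n, 1))],
                 [-np.eye(n), -np.ones((n, 1))]])
b_ub = np.zeros(2 * n)

# Quadrature versions of the two endpoint constraints.
A_eq = np.vstack([np.append(w * k1, 0.0),
                  np.append(w * k2, 0.0)])
b_eq = np.array([1.0, 0.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * n + [(0, None)])
print("approximate min of max|u(t)|:", res.x[-1])
```

The LP dual of this discretization is the finite-dimensional counterpart of the L_1 / L_infinity pairing that Luenberger's argument exploits.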
Luenberger also solves the same problem by minimizing the energy of u(t) on the interval [0,1] (page 66, Example 1), which is more standard, since it is the classical minimum-norm solution in L_2 and has a closed-form solution. (Students with a good linear algebra background, who know the minimum-norm solution of an underdetermined linear system, should appreciate this even more.)
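For the L_2 version, a minimal sketch (again mine, reusing the two kernels derived in the sketch above): the minimum-energy u* lies in the span of the two constraint kernels, so everything reduces to a 2x2 Gram system, the continuous analogue of the pseudoinverse solution of an underdetermined linear system.

```python
# A minimal sketch of the minimum-energy (L_2) version: the minimum-norm u*
# satisfying <k1, u> = 1 and <k2, u> = 0 is a linear combination of k1 and k2,
# with coefficients found by solving the 2x2 Gram system.
import numpy as np
from scipy.integrate import quad

k1 = lambda s: 1.0 - np.exp(s - 1.0)   # kernel of the functional giving theta(1)
k2 = lambda s: np.exp(s - 1.0)         # kernel of the functional giving theta'(1)

G = np.array([[quad(lambda s: k1(s) * k1(s), 0, 1)[0],
               quad(lambda s: k1(s) * k2(s), 0, 1)[0]],
              [quad(lambda s: k2(s) * k1(s), 0, 1)[0],
               quad(lambda s: k2(s) * k2(s), 0, 1)[0]]])
alpha, beta = np.linalg.solve(G, np.array([1.0, 0.0]))

u_star = lambda t: alpha * k1(t) + beta * k2(t)
energy = quad(lambda t: u_star(t) ** 2, 0, 1)[0]
print("u*(t) = %.4f*(1 - e^(t-1)) + %.4f*e^(t-1), energy = %.4f" % (alpha, beta, energy))
```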
"In machine learning, this weird dual fact was called the “Representer Theorem.” It’s why reproducing kernels are unavoidable in the theory of machine learning. It’s also why we used to spend so much time on the dual problem of support vector machines in machine learning classes."
Hmmm, if I may permit myself to add some color, I'd take a slightly different perspective.
Looking back at my thesis (https://dspace.mit.edu/handle/1721.1/17549), it was easy enough to apply the Representer Theorem to the primal SVM problem (Eqs. 2.13-2.15 in my thesis). The reason to derive the dual (Eqs. 2.23-2.25) was that the constraint structure was simpler to work with in the dual --- in the primal you had an inequality constraint per data point, each of which involved *all* the c_i's, while in the dual you had a single equality constraint on the alpha_i's and simple interval inequalities on each alpha_i individually. This made it easy to write a custom solver which exploited the sparsity of the alphas (the c's are of course equally sparse, but we didn't see how to exploit this), allowing us to "train" SVMs in less time than it took to compute the entire kernel matrix. And *that's* why we used the dual for SVMs. Or at least why I did. At least if I postrationalize it. At the beginning, I used it because my predecessor Edgar Osuna had used it.
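For readers who don't have the thesis at hand, the standard soft-margin kernel SVM pair shows the contrast (I'm writing the usual textbook form here rather than copying the thesis equations verbatim):

\min_{c, b, \xi} \ \tfrac{1}{2} \sum_{i,j} c_i c_j K(x_i, x_j) + C \sum_i \xi_i \quad \text{s.t.} \quad y_i \big( \sum_j c_j K(x_i, x_j) + b \big) \ge 1 - \xi_i, \ \xi_i \ge 0 \ \text{for each } i,

\max_{\alpha} \ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \ 0 \le \alpha_i \le C \ \text{for each } i.

Each primal inequality couples all of the c_j's through the kernel sum, while in the dual the \alpha_i's interact only through the single equality constraint plus their individual box constraints, which is exactly the structure that working-set/decomposition solvers exploit.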