*This is the live blog of Lecture 12 of my graduate class “Convex Optimization.” A Table of Contents is here.*

My favorite linear map is function evaluation. If I have a real-valued function f and a point x, the map from f to f(x) is linear in f. If I have two scalars a and b and two functions f and g then (a f + b g) (x) = af(x) + bg(x). This little observation lets us solve a variety of optimization problems where the *optimization variable* is a function.

The foundational problems we can pose are *interpolation* problems. Interpolation asks us to find a function that goes through a specified set of outputs on a specified set of inputs. Particular xs must map to particular ys. We often call these “function fitting” problems as we constrain the function to fit a specified data set. For any fixed x and y, “f(x)=y” is a linear constraint. This means that problems that constrain the values of f on a list of points are convex feasibility problems. *Interpolation* is a convex problem. And if you want approximate interpolation, you can add the constraint f(x)=y+w and minimize some penalty on the w’s, just like we did in last week’s lecture on inverse problems.

We can use interpolation constraints as building blocks of general function fitting problems. I can take linear combinations of interpolation constraints and get more linear constraints. For example, if I want the average of f over some set to equal some value, this will also be a linear constraint. Similarly, the moments of f, given by integrals of the form

∫ xᵏ f(x) dx,

are linear constraints on f.

I can write restrictions on the variability of f as constraints as well. The derivative operator, mapping f to the derivative of f, is a linear operator. Thus, interpolation constraints on the gradient of a function are linear constraints. Moreover, constraints on the magnitude of the gradient at particular points are *convex* constraints. Similarly, you can constrain the smoothness of the function at points by penalizing the Hessian (the derivative applied twice). Bounds on the Hessian at points are convex constraints.

Now, can we solve the convex optimization problems associated with fitting functions? Function spaces are usually infinite dimensional, so we need to make some extra assumptions to get problems we can tractably solve with computers. The easiest and most common way to get tractability is to force f to be in the span of a fixed set of basis functions,

f(x) = c₁ f₁(x) + c₂ f₂(x) + ⋯ + c_d f_d(x),

where f_{i}(x) are fixed functions and the variables are now the coefficients c_i. Now the problem becomes a search for coefficients and is d-dimensional. All of the constraints I listed above become convex constraints on the coefficients of a basis expansion. This basis expansion could just be linear (where f_{i}(x) is the ith component of x). You could choose low-degree polynomials, and the problem would be linear in the coefficients of the polynomial. If you were doing signal processing, you could choose trigonometric functions. Now you’d be interpolating with sinusoids. You could make a mesh and fit piecewise linear functions or splines to it. As long as the representation is *linear* in the parameters, you can probably pose your interpolation problem as a convex problem.
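To make this concrete, here’s a minimal pure-Python sketch of interpolation in a polynomial basis. The data points and the choice of basis are invented for illustration; in practice you would hand the constraints to a convex solver, but with exactly as many interpolation constraints as coefficients, the problem reduces to a plain linear system.

```python
# Interpolation as linear constraints on basis coefficients.
# Basis functions f_i(x) = x**i; each constraint f(x_j) = y_j is linear in c.

def solve_linear(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    out = [0.0] * n
    for k in reversed(range(n)):
        out[k] = (M[k][n] - sum(M[k][j] * out[j] for j in range(k + 1, n))) / M[k][k]
    return out

xs, ys = [0.0, 1.0, 2.0], [1.0, 2.0, 5.0]        # invented interpolation data
A = [[x ** i for i in range(3)] for x in xs]      # Vandermonde matrix
coeffs = solve_linear(A, ys)                      # c0 + c1*x + c2*x^2

def f(x):
    return sum(ci * x ** i for i, ci in enumerate(coeffs))

print([f(x) for x in xs])  # fits the data exactly
```

The same code shape works for any representation that is linear in its parameters; only the matrix A changes.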

What if you don’t want to specify a basis? With a little bit of hand waving, we can take apparently infinite-dimensional problems and make them finite-dimensional. No matter the dimension of the primal variable, the *dual problem* of a convex optimization problem has a number of variables equal to the number of constraints in the primal problem. This means the dual problem of an interpolation problem will necessarily be finite-dimensional. Provided that you can compute the dual function, which requires solving an unconstrained minimization over a function space, you can often back out the function you care about using the KKT conditions.

The sketchy argument in this last paragraph plays a little fast and loose with the KKT conditions on function spaces, and strong duality only holds with appropriate convex optimization rigor in place. I managed to write a high-level blog here without a ton of equations, and it felt like a shame to spoil it with too many derivations. If there’s interest, I could write out the gory details of how this works with a bunch of equations tomorrow. Let me know in the comments.

Regardless, duality gives you a computational means to solve these problems. This idea reduces classic problems in robust control theory to interpolation problems. It is what enables the fitting of cubic splines. In machine learning, this weird dual fact was called the “Representer Theorem.” It’s why reproducing kernels are unavoidable in the theory of machine learning. It’s also why we used to spend so much time on the dual problem of support vector machines in machine learning classes. Convex optimization lets you solve a ton of interpolation problems, and maybe interpolation is all you need.
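Here’s a toy sketch of the representer-theorem idea. The kernel choice and the two data points are mine, purely for illustration: the minimum-norm interpolant in the associated function space has the form f(x) = Σᵢ cᵢ k(xᵢ, x), so the apparently infinite-dimensional problem collapses to solving a linear system for n coefficients.

```python
import math

# Representer-theorem sketch: the data-fit constraints pin down n coefficients
# via the kernel system K c = y, with K_ij = k(x_i, x_j). Two points here, so
# the 2x2 system is solved with an explicit inverse.

def kern(a, b):
    return math.exp(-(a - b) ** 2)   # Gaussian kernel (an illustrative choice)

xs, ys = [0.0, 1.0], [1.0, 3.0]      # invented interpolation data
K = [[kern(p, q) for q in xs] for p in xs]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
c = [(K[1][1] * ys[0] - K[0][1] * ys[1]) / det,
     (-K[1][0] * ys[0] + K[0][0] * ys[1]) / det]

def f(x):
    return sum(ci * kern(xi, x) for ci, xi in zip(c, xs))

print(f(0.0), f(1.0))   # reproduces the data
```

The function f is defined everywhere, but we only ever solved for two numbers.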

*This is the third part of the live blog of Lecture 11 of my graduate class “Convex Optimization.” A Table of Contents is here.*

A common theme in the comments and offline discussions of inverse problems has been how to deal with nonconvexity. I realized that for the application section of the class, it might be worth having a few complementary posts airing *the limits* of convexity. There’s a lot of upside to sticking with convex formulations, but there’s a lot we currently don’t know how to do with convex optimization. Perhaps this is because it’s impossible, but perhaps it’s because no one has figured it out yet. You can consider these “limits of convexity” posts to be free course project ideas.

I could do a whole series on the limits of convexity in inverse problems, and it deserves more than a single blog post to do it real justice. But I like artificial constraints on my blogging, so I’m going to do my best within my live lecture blogging model. Consider the following to be stream-of-consciousness scattered thoughts.1 Maybe we can use the comment section as a supplement. If you have good references for state-of-the-art solutions to inverse problems and how people handle nonconvexity, throw them there.

As a reminder, there *are* many upsides to posing your inverse problem as a convex optimization problem of the sort I described last week. The declarative convex programming approach lets you modularly combine a wide set of modeling primitives to describe your inverse problem. You have dozens of solvers to choose from, specialized to particularly common patterns. These solvers always return a global minimum, so debugging your model doesn’t require second-guessing optimization quality. There are decades of analyses that give insights into the solution quality you should expect given what you know about your processes, whether they be the forward model, the noise process, or the state description.

But there are definitely inverse problems that I don’t know how to pose as convex optimization problems. Sometimes, you don’t have a clean convex model of state plausibility. If you know that your state can be decomposed into a small combination of simple states, then you can apply the techniques from last week. But what if you don’t know how to characterize the simple states? What if you’d like to estimate the best set of states from the data as you solve the inverse problem? This problem is called *dictionary learning* and, to my knowledge, is always done using some sort of nonconvex optimization. The most famous example of dictionary learning is *principal component analysis,* which requires solving a singular value decomposition. The SVD is one of the few nonconvex problems we can provably, reliably, and efficiently solve.

Sometimes, your forward model isn’t well approximated by a linear model. For example, if your measurement is a convolution of a state with an unknown filter, the inverse problem, called a *blind deconvolution problem*, is not convex. And some forward models are just not well approximated by linear maps. A notably popular model that computer vision people have been into for the past few years, called *radiance fields*, is a light absorption model that generalizes the forward model in computational tomography. People use radiance fields for all sorts of inverse problems that map a collection of 2D photos into 3D (and higher dimensional) models. Though this model is a very crude approximation of light scattering, it makes compelling pictures. I don’t know of a clean way to linearize it.

When you hit these nonconvex boundaries (as you always will if you are an intrepid modeler), should you then turn to LLMs? Maybe? This feels like overkill. On the other hand, denoising lies at the heart of diffusion models. I mentioned the denoising problem last week, and it is one of the core primitives of inverse problems: lots of people have observed that if you can remove noise from a signal, you can solve the inverse problem too.2 This is the main idea behind diffusion models that use cascades of denoising algorithms (using complicated neural net gunk) to build robust image models. I don’t hate it!

Such complex models are probably overkill for most applications. If your forward model is nonlinear, you can just try nonlinear optimization. Since all of the problems I posed last week have equality constraints, you can eliminate the constraints to yield an unconstrained optimization problem and apply gradient descent. Gradient-based optimization has proved fast and powerful in radiance fields (e.g., this or this) and many other applications.

Finally, here’s a mindless application of machine learning to inverse problems that can be surprisingly effective. At the end of the day, our goal is to map measurements to states. We assume we know the mapping from state to measurement. This means we can simulate an infinite collection of pairs of plausible states and their associated measurements. I literally mean simulate here. You could call this a “training set” and then build a machine learning model that fits a map from measurement to states on the training set. I have seen this work well in more applications than I can count. In some sense, this is all of machine learning. You might say that all machine learning problems ask for the inverse of the map from hidden state to feature representation.
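Here’s a one-dimensional sketch of that simulate-then-regress recipe. The forward model y = a·x + w is assumed known; the value of a, the noise level, and the sample count are all invented for illustration.

```python
import random

# "Simulate a training set" for a known forward model, then fit the inverse.
random.seed(0)
a = 3.0
states = [random.uniform(-1, 1) for _ in range(1000)]      # plausible states
meas = [a * x + random.gauss(0, 0.05) for x in states]     # simulated measurements

# "Train" a linear inverse map x_hat = b*y by least squares on the pairs.
b = sum(x * y for x, y in zip(states, meas)) / sum(y * y for y in meas)

print(b)  # close to 1/a, so the learned map approximately inverts the model
```

Once b is learned, each new inverse is a single multiplication; no per-measurement optimization is required.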

In inverse problems, we’re even better off than in general machine learning. We strongly believe the forward model is real and hence that pattern recognition is possible. The downside to this pattern recognition approach to inverse problems is you spend a ton of time training the model. However, with a model in hand, computation is super cheap for new inverses. You just apply your neural net to the measurement. In optimization land, every inverse costs the same, requiring solving an optimization problem on each new measurement. Even in convex inverse problems, this application of nonlinear machine learning could prove very useful.

1. And less lucid than normal as my brain fatigues from this extended heatwave.

2. See this paper with Mahdi and Samet for some cool insights about convex and nonconvex denoisers, as well as a long reference list.

*This is the second part of the live blog of Lecture 11 of my graduate class “Convex Optimization.” A Table of Contents is here.*

This is my first time trying to teach Boyd and Vandenberghe Chapters 6 through 8 in a class, and I’m now realizing each chapter could be a semester. I’m definitely struggling to figure out how to do these surveys in 80 minutes. Yesterday’s linear inverse problems survey was much more rushed than I’d have liked.1

Sleeping on it, I wonder if I should have approached this lecture a bit differently. The spirit of the book is recognizing modeling design patterns and showing how simple building blocks can be fused in surprising ways to tackle a diverse set of problems. Let me do that today for inverse problems, taking a pragmatic look at throwing together convex models of linear inverse problems.

Recall from yesterday that the buy-in requires assuming a linear forward model from the state of the world to measurement. In notation, y is the measurement, x is the state of the world, and w is the noise. We assume “y=Ax+w” for some linear transformation A. We pose problems today where we are always solving for x and w at the same time. Those are our decision variables.

Linearity is a big assumption but is a good approximation with unreasonable frequency. Moreover, if you can get by with a linear model, you are guaranteed a tractable optimal solution to your inverse problem. Sometimes that tradeoff is worth it!

I’ll stick to two implausibility functions: sum of squares and sum of absolute values. Yes, the first is a squared norm and the second a norm. There’s all sorts of fun theory and geometry you can extract from those facts. But to build practical optimization models for inverse problems, the norm stuff doesn’t matter. I had considered writing the functions “𝜎” for the sum of squares and “𝛼” for the sum of absolute values, but I worried this would just confuse everyone. To keep the exposition clear, I’m going to write sum of squares and sum of absolute values as norms today, but I want to emphasize this is purely for notation. Keep that in mind as we see what problems we can pose with these two penalties.

**OLS.** Ah, ordinary least squares. Everyone’s favorite way to lie with data. In this case, we assume we’d be able to find x if it weren’t for some exogenous noise w. This suggests we should minimize the size of the noise that is consistent with the measurements. We penalize the sum of the squares of the noise because we don’t have any preferred direction in which the noise might point. I’ll refer to this as a “white noise” assumption today. The associated optimization model is

minimize ‖w‖²   subject to   y = Ax + w.

You can solve this one in closed form: the solution is x = (AᵀA)⁻¹Aᵀy when A has full column rank.
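A minimal numeric check of the normal-equations closed form x = (AᵀA)⁻¹Aᵀy, on a made-up three-measurement, two-parameter problem (intercept and slope):

```python
# OLS via the normal equations on a tiny invented example.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # design matrix (intercept, slope)
y = [1.0, 3.0, 5.0]                        # measurements, here an exact line

# Form A^T A (2x2) and A^T y (2-vector).
AtA = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
Aty = [sum(A[k][i] * y[k] for k in range(3)) for i in range(2)]

# Invert the 2x2 matrix explicitly and apply it.
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x = [( AtA[1][1] * Aty[0] - AtA[0][1] * Aty[1]) / det,
     (-AtA[1][0] * Aty[0] + AtA[0][0] * Aty[1]) / det]

print(x)  # recovers intercept 1 and slope 2
```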

**Robust regression.** Suppose that you think the noise might be a combination of two signals, a white noise signal and a sparse signal with high amplitude. Or perhaps you think that some of the measurements are “outliers” and should be thrown away entirely. We could penalize the sparse and white noises separately, as in this model:

minimize ‖w‖² + 𝝀‖v‖₁   subject to   y = Ax + w + v.

This works really well! For my robust statistics folks out there, this optimization problem is equivalent to using the *Huber loss function* in regression. The parameter 𝝀 is a *hyperparameter* that you have to tune to separate the outliers from the rest of the noise. I’ll use 𝝀 freely below as a parameter you’ll have to set by modeling.
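If you want to convince yourself of the Huber equivalence numerically, here’s a crude scalar check. It brute-forces the minimization over the sparse noise component on a grid; the grid and the test residuals are arbitrary illustration choices.

```python
# Numeric check: for a single scalar residual r,
#   min over v of (r - v)^2 + lam*|v|
# equals the Huber penalty with threshold lam/2.

def split_penalty(r, lam):
    vs = (k * 0.0001 - 5.0 for k in range(100001))     # brute-force grid for v
    return min((r - v) ** 2 + lam * abs(v) for v in vs)

def huber(r, lam):
    d = lam / 2.0
    return r * r if abs(r) <= d else lam * abs(r) - d * d

lam = 1.0
for r in [-3.0, -0.2, 0.1, 2.0]:
    assert abs(split_penalty(r, lam) - huber(r, lam)) < 1e-4
print("split penalty matches Huber")
```

Small residuals are charged quadratically (attributed to white noise); large ones only linearly (attributed to an outlier).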

**Ridge Regression.** Ridge regression is useful for high-dimensional regression problems where, even when w is small, there are many choices for the state estimate x. In this case, we can now penalize the sum of squares of the state too, yielding the problem

minimize ‖w‖² + 𝝀‖x‖²   subject to   y = Ax + w.

This one is also nice because it can be solved in closed form: x = (AᵀA + 𝝀I)⁻¹Aᵀy.
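In one dimension the ridge solution is easy to eyeball: eliminating w leaves (y − ax)² + 𝝀x², which is minimized at x = ay/(a² + 𝝀). A quick sketch of the resulting shrinkage as 𝝀 grows (the numbers are invented):

```python
# One-dimensional ridge regression: minimize w^2 + lam*x^2 s.t. y = a*x + w.
def ridge_1d(a, y, lam):
    return a * y / (a * a + lam)

a, y = 1.0, 2.0
print([ridge_1d(a, y, lam) for lam in [0.0, 1.0, 10.0]])  # shrinks toward zero
```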

**LASSO.** In many problems, we’d expect the state itself to be a sparse vector. In this case, we can penalize the sum of the absolute values of the state. This gives us a problem commonly called LASSO regression:

minimize ‖w‖² + 𝝀‖x‖₁   subject to   y = Ax + w.
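One way to see why the absolute-value penalty produces sparsity: in the scalar case with A equal to the identity, the LASSO solution is soft thresholding, which sets small measurements exactly to zero. A sketch (the threshold and test values are arbitrary):

```python
# Scalar LASSO with A = I: minimize (y - x)^2 + lam*|x|.
# The minimizer is the soft-threshold of y at lam/2.
def soft_threshold(y, lam):
    if y > lam / 2:
        return y - lam / 2
    if y < -lam / 2:
        return y + lam / 2
    return 0.0   # small measurements are zeroed out exactly

print([soft_threshold(y, 1.0) for y in [-2.0, -0.3, 0.1, 0.4, 3.0]])
```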

**Anything goes.** We really start cooking with gas once we allow ourselves to penalize weighted combinations of absolute values and squares. Let’s consider the general problem where we introduce two weighting matrices T and U and solve the optimization problem

minimize ‖w‖² + 𝝀₁‖Tx‖₁ + 𝝀₂‖Ux‖²   subject to   y = Ax + w.

If T and U are both the identity matrices, this is called *elastic net regression*. If T and U are diagonal matrices, this formulation lets you weigh different components of the state differently, encouraging some coefficients to be small and some to be large. For example, suppose that the coefficients of x correspond to frequencies in a signal decomposition. Then maybe you’d put higher weights on higher frequencies to emphasize their implausibility.

Non-diagonal weights are also helpful. If you introduce finite differencing operations, you can penalize signals with large variations. For example, consider the matrix D that looks at the difference between neighboring components, (Dx)ᵢ = xᵢ₊₁ − xᵢ. Penalizing the sum of the absolute values of Dx is akin to preferring an x with a sparse derivative, i.e., a piecewise constant function. You can also penalize an approximation of the second derivative using the finite differencing matrix D₂ with (D₂x)ᵢ = xᵢ₊₁ − 2xᵢ + xᵢ₋₁.

Penalizing the sum of squares of the 2nd derivative encourages smooth models. These penalties even suggest the *denoising problem*, where we have a signal x corrupted by noise (so that A is the identity matrix):

minimize ‖w‖² + 𝝀‖Ux‖²   subject to   y = x + w.

Here we have a lot of noise, but a strong prior on what the state can be. When there is no absolute-value penalty (T=0 above), this problem can be solved in closed form, multiplying the measurement y by the matrix (I + 𝝀UᵀU)⁻¹. This matrix “filters” the measurement to remove the noise from the state.
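Here’s a small sketch of that closed-form filter with the second-difference matrix as the weighting (the signal, its length, and 𝝀 are made up; the linear system is solved with plain Gaussian elimination):

```python
# Closed-form smoothing: solve (I + lam * D2^T D2) x = y for a noisy signal y.

def solve(A, b):
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    out = [0.0] * n
    for k in reversed(range(n)):
        out[k] = (M[k][n] - sum(M[k][j] * out[j] for j in range(k + 1, n))) / M[k][k]
    return out

n, lam = 6, 5.0
D2 = [[0.0] * n for _ in range(n - 2)]
for i in range(n - 2):
    D2[i][i], D2[i][i + 1], D2[i][i + 2] = 1.0, -2.0, 1.0   # second differences

# F = I + lam * D2^T D2
F = [[(1.0 if i == j else 0.0) + lam * sum(D2[k][i] * D2[k][j] for k in range(n - 2))
      for j in range(n)] for i in range(n)]

y = [0.0, 1.2, 0.1, 1.1, 0.0, 1.0]   # invented noisy measurements
x = solve(F, y)
print([round(v, 3) for v in x])       # visibly smoother than y
```

Because the rows of D₂ sum to zero, the filter preserves the total mass of the signal while shrinking its wiggles.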

There are infinite possibilities here, and I’m not going to exhaust them in a blog post. All I can do is give you a taste of the rich, modular toolbox that convex optimization brings to the study of linear inverse problems.

1. Apologies to everyone in lecture if that felt too much like drinking from a firehose. Ask me questions here if any of it was too confusing or if there were other modeling problems you had hoped I would cover.

*This is the live blog of Lecture 11 of my graduate class “Convex Optimization.” A Table of Contents is here.*

The most underappreciated part of Boyd & Vandenberghe is the endless trove of examples. Three chapters of the book are dedicated to modeling examples, and we’re now moving into the part of the course where we apply the theory to modeling.

We start with a topic dear to my heart: *linear inverse problems*. I’m casting a wide net with this term here.1 Let’s skip the linear part for a second. What do I mean by an inverse problem? *Imaging* is perhaps the easiest problem to describe. We imagine some 2D or 3D representation of the world we’d like to capture through a measurement device and want to use what we know about our measurements to build the image.

In a CT scan, we send lines of X-rays through your body and then use a computer to decode an image of your insides. MRI is similar, except it uses complex measurements of the magnetic fields induced by molecules in your body. Neither of these modalities takes “images” in the way we think a camera does, yet we can use algorithms to decode what’s inside. These are canonical inverse problems. We measure the state of the world through some process, and we’d like to determine the state from our measurements.

Even cameras solve inverse problems these days. Algorithms in your phone can take multiple blurry snapshots and use algorithms to make them into a deblurred image. Similarly, multiple images can be combined to yield high dynamic range when the lighting is unfavorable.

All of these modalities combine a *forward model,* which simulates the physics of what happens between the world and the imaging sensor, and an *image model,* which builds in what we think should be true about our final image. Algorithms then combine these together to produce a final reconstruction.2

Surprisingly, all of the examples above can be computed using linear models. The mapping from the world to the measurement is a linear function insofar as, if you “add stuff” in the scene, the resulting measurements add as well. For example, the measured output in a CT scan is the amount of X-rays that made it through the body; after a logarithmic transformation, this decreases linearly in the amount of absorbing material between the X-ray emitter and detector. An image blur is a linear operation that adds together multiple views of a scene. A linear inverse problem is one where the mapping from reality to our measurement is modelable as a linear transformation.
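A two-line check of what linearity of the forward model means, using a toy circular moving-average blur (the blur kernel and signals are arbitrary choices for illustration):

```python
# Linearity check for a simple blur forward model: blur(a + b) = blur(a) + blur(b).
def blur(sig):
    n = len(sig)
    return [(sig[i] + sig[(i + 1) % n]) / 2 for i in range(n)]   # 2-tap average

a = [1.0, 0.0, 2.0, 0.0]
b = [0.0, 3.0, 0.0, 1.0]
lhs = blur([p + q for p, q in zip(a, b)])
rhs = [p + q for p, q in zip(blur(a), blur(b))]
print(lhs == rhs)
```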

There are surprisingly many problems that can be posed as linear inverse problems. A few examples include finding a model of a dynamical system that predicts time series data, estimating biomarkers relevant to particular disease outcomes, and finding the delays in GPS signals so that you can triangulate and find your location. It’s popular to use linear maps to model your content preferences based on your past consumption behavior and all of the consumption behavior of everyone else on the internet. I could do a whole blog series on inverse problems and their applications. For this course, we’ll settle with one post.

If we have a linear model, we are saying

`measurement = forward model ( image ) + noise`

where the “forward model” function is a linear mapping. Our goal is to figure out what the image is by removing the noise and inverting the forward model. The problem here is that there is an infinite set of images and noises that produce the same measurement. What do you attribute to noise? What do you attribute to signal? The answer is to come up with some cost function that balances attribution. The general linear inverse problem is then given by an optimization problem of the form

```
minimize (1-h) * implausibility(noise) + h * implausibility(image)
subject to measurement = forward model (image) + noise
```

where h is a constant between 0 and 1. The implausibility functions are different for the noise and the image and are part of the modeling process. You want to pick functions that are easy to optimize (hint: convex) and large for implausible realizations of the signals you care about. If you were Bayesian, you might call these *priors* about the signals in your problem. But prior is a loaded word in Bayesian land, implying things about probabilities and whatnot. Today, I’m going to stick with calling them implausibility functions to avoid statistical mess.

The most common implausibility functions we use date back to Gauss. One example is the mean-sum-of-squares, which captures some notion of a signal’s variability. Another implausibility penalty would be the sum of absolute deviations. This function is useful when you expect the noise process to be bursty or sparse.
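A quick scalar illustration of the difference (with invented data): the sum of squares is minimized by the mean, which one outlier drags around, while the sum of absolute deviations is minimized by the median, which barely notices it.

```python
# One bursty outlier, two penalties. Minimizing sum (d - c)^2 over c gives the
# mean; minimizing sum |d - c| gives the median.
data = [1.0, 1.1, 0.9, 1.0, 50.0]   # invented data with one gross outlier

mean = sum(data) / len(data)
median = sorted(data)[len(data) // 2]
print(mean, median)   # the mean gets dragged toward the outlier; the median stays put
```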

Similarly, different convex implausibility functions encourage particular image structures. Sparse images are encouraged by penalizing the sum of absolute values. Piecewise constant images are encouraged by penalizing the sum of absolute values of the derivative. Smooth images are encouraged by penalizing the sum of the squares of the second derivative.

The modeling choices here are intimidating as it’s not clear what implausibility penalties you should pick. But there’s a rule of thumb that I’ve found helpful in linear inverse problems (and which I may have written a dozen or so papers about). The implausibility function is trying to encourage your algorithm to choose a simple signal. One meaning of simple is “the signal can be written as a short sum of basic signals.” For example, if you assume your signal is just a spike train, then it can be written as a sum of spikes. Sparse images are a sum of a few basic point sources. A low rank matrix is a sum of a few rank one matrices. In all of these examples, the assumption is that the signal is a sum of a few *atoms*. If you know the atoms, a reasonable implausibility function penalizes the coefficients needed to write a signal as a sum of atoms. In math, for those who like equations, the implausibility of x given a set of atoms A is

inf { Σₐ |cₐ| : x = Σₐ cₐ a, a ∈ A }.

When the atoms are the unit-norm vectors, this is the Euclidean norm (root mean squared error). When the atoms are the standard basis vectors, this is the sum of absolute values (the l1-norm). There are tons of other options. No matter which atomic set you choose, this implausibility function is convex. The general linear inverse problem is thus posed as a convex optimization problem. This gives you a general cookbook for posing and solving inverse problems.3

Now what about that constant h? It is called a *hyperparameter* and is part of your convex model. Sometimes you know what that parameter should be, but other times you need to figure it out experimentally. But remember, this whole model is made up, so don’t fret if there are parts you can’t specify by pure reason alone.

1. I wish instead of having to use the dumb terms “machine learning” or “artificial intelligence,” we could say, “I work on inverse problems.” If only that carried the same cachet. Maybe I’ll work on improving our field’s marketing.

2. Today, I’m going to skip over the philosophy of why we’ve decided to trust these algorithmic reconstructions of reality. Why do we believe that what we see in an MRI image is really what’s inside a body? A fun post for another day!

3. If you want to read more about atomic decompositions and inverse problems, one of my favorite papers on my CV, *The Convex Geometry of Linear Inverse Problems*, with Chandrasekaran, Parrilo, and Willsky, describes this general view and its implications.

*This is the live blog of Lecture 10 of my graduate class “Convex Optimization.” A Table of Contents is here.*

Last week, I showed how weak duality is an almost tautological condition. But convex problems also (for the most part) have *strong* duality, where the optimal value of the dual problem is equal to the optimal value of the primal problem. There are many different intuitions for why strong duality holds for convex problems, but the argument I like the most (and the one in Boyd and Vandenberghe) leans on separating hyperplanes.

Today’s blog has more notation than I’d like. Duality, it seems, is hard to motivate without getting technical! But my goal is to set up two pictures that illustrate why we should expect strong duality for convex problems and not nonconvex ones. Hopefully, those illustrations provide some intuition.

Let’s focus on a simple optimization problem today:

minimize f_0(x)   subject to   f_1(x) ≤ 0.

Let p* denote the optimal value of this problem. To prove strong duality for this problem, we want to show that there is some value of the Lagrange multiplier, 𝜆 ≥ 0, such that the Lagrangian everywhere upper bounds the optimal value:

f_0(x) + 𝜆 f_1(x) ≥ p* for all x.

Let’s think about what this condition says geometrically. Consider the *graph* of the optimization problem, the set of pairs of constraint and objective values the problem can achieve:

G = { (f_1(x), f_0(x)) : x in the domain of the problem }.

Then strong duality holds if we can find a 𝜆 ≥ 0 where

t + 𝜆u ≥ p*

for all (u,t) in the graph. (The feasible points are those with u nonpositive.)

This means we are trying to find a hyperplane that contains the graph of the optimization problem. Since we are looking at a problem with half spaces, we should turn the graph into a convex set. Let me define an “epigraph” of the optimization problem:

A = { (u, t) : there exists an x with f_1(x) ≤ u and f_0(x) ≤ t }.
The desired strong duality condition amounts to a hyperplane that supports A at (0, p*) and is sloping downward as in this picture:

Stare at the picture until you convince yourself it is the right condition. I always have to wrap my head around the mapping from the declarative form of the optimization into this geometric object. From this geometric view, the optimization problem is equivalent to minimizing the value of t in A with u=0. That is,

p* = inf { t : (0, t) ∈ A }.
The dual problem is maximizing the t-intercept of hyperplanes containing A. Indeed, for any value of 𝜆, the dual function value is the t-intercept of the associated hyperplane.
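If you want to see the t-intercept story in numbers, here’s a throwaway example (the problem and the grid are my own choices): minimize x² subject to 1 − x ≤ 0, which has p* = 1 at x = 1. The dual function g(𝜆) = min over x of x² + 𝜆(1 − x) equals 𝜆 − 𝜆²/4, the t-intercept of the hyperplane with slope −𝜆, and maximizing it recovers p*.

```python
# Numeric check of strong duality on: minimize x^2 subject to 1 - x <= 0.
# Dual function: g(lam) = min_x x^2 + lam*(1 - x) = lam - lam^2/4 (at x = lam/2).

def g(lam):
    return lam - lam ** 2 / 4.0

# Crude grid search over the dual variable.
lams = [k * 0.001 for k in range(10001)]
best = max(lams, key=g)
print(best, g(best))   # the maximizing multiplier is 2, and g(2) = 1 = p*
```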

A picture also shows us what might prevent strong duality in the case of nonconvex functions.

Here, the t-intercept can’t reach the optimal primal value as it’s blocked by the nonconvex geometry.

These pictures suggest a path to verifying strong duality by showing there is a supporting hyperplane at the point (0, p*). To do this, define another set

B = { (0, t) : t < p* }.

A and B don’t intersect, so there is always a separating hyperplane: a nonzero pair (𝜶, β) and a scalar γ with

𝜶t + βu ≥ γ for all (u, t) in A, and 𝜶t + βu ≤ γ for all (u, t) in B.
By the definition of A, 𝜶 and β both have to be nonnegative or else the affine function they define would be unbounded below on A. If 𝜶 is *positive*, then we can set 𝜆 = β/𝜶, and we will have proven strong duality.

To guarantee that 𝜶 is positive, you have to assume some extra facts about the optimization problem. These assumptions are called *constraint qualifications*. The only two that most people need to know are (a) linear programs always have strong duality and (b) if there is a point with f_1(x) *strictly* less than zero, you have strong duality. The second condition is called *Slater’s condition*.

Geometrically, strong duality was almost inevitable for convex functions. The Lagrangian was a hyperplane in disguise. We turned a problem about a convex set into a problem about the hyperplanes that contain that set. The duality between convex sets and the half spaces containing them is the fundamental property of convexity that enables all convex optimization.

A decade ago this month, Apple released the iPhone 6. It would become the best-selling iPhone of all time. Though you can point to camera upgrades and battery life, the 6 didn’t have that many new features. It was just the first iPhone robust enough to handle our tech-addled physical abuse and Apple’s incessant OS upgrading for any extended period.

And it was fatefully the last iPhone with a headphone jack. Despite the new camera widgets and annoying AI add-ons, our experience with mobile devices hasn’t changed in a decade.

“What if the real end of history was the iPhone 6?” This was the open question Jay Kang asked to conclude our conversation on the Time To Say Goodbye podcast last month. That conversation with Jay and Tyler and Leif Weatherby was about how we’ve all grown tired of the tyranny of data. 2014 was also the peak of the promise of data for everything. Data was BIG, and that BIG data was the new oil. People were in love with neural nets again. Everything was going to get so much better by moving to “AI first.” Part of data’s loss of shine is that we’ve been in a decade of technological stagnation.

And yet, in rich countries, our experience of computation hasn’t changed in years. Computers are not only not getting faster, they are not perceptibly different. If anything, the medium has gotten more frustrating. A tab in a browser that views a 5KB text file needs 200 megabytes of RAM. When you buy a printer, you have to pay a subscription fee to use the toner cartridge. Everything now has a touchscreen, whether or not that makes the interface more frustrating. You can’t serve ads without a touchscreen. New technology is less tactile, has longer lags, and has more ads.

People have been making similar complaints since the fateful release of the iPhone6. In September 2014, The Baffler magazine brought together two very different thinkers to discuss their disappointment with the stagnancy of technological progress and debate their alternative proposals for the future. On one side was Peter Thiel, the billionaire capitalist wizard, founder of PayPal and Palantir, funder of Facebook, and champion of Silicon Valley.1 On the other side was David Graeber, the prolific anthropologist and one of the best-known, most eloquent voices in the Occupy Wall Street movement.

You’d figure these two would be at each other’s throats, but both men had penned essays motivated by the question of why, in 2014, we didn’t have flying cars. The Baffler decided to get them together to see how two very different intellectuals came to the same conclusion.

Moderator John Summers framed the debate with the question: “What's the matter with America, and what does technology have to do with it?” Why did none of the predictions of the 1960s pan out?

Graeber began his presentation with a discussion on the prescience of science fiction. Verne and Wells talked about “flying machines and submarines and rockets and talking boxes” that all came to be within the next 50 years. But the sci-fi of the 60s promised “anti-gravity sleds and teleportation devices and Mars bases and robot androids that could do chores for you and immortality drugs.” None of this came to pass. Why was it that the science fiction of the early 20th century was predictive, but the 1960s fiction was not?

You can say, “because it’s fiction,” but science fiction reflects our present into possible futures. It is a genre that distills cultural optimism. These books were set in an ostensibly near future. *2001* was supposed to have a space odyssey, remember? Why did the optimism of the 1960s diverge so far from the actual future of the 2010s?

Graeber blames the stagnation on the simultaneous corporatization and bureaucratization of research and development and the virtualization of money. It’s not just that post-war bureaucracies have metastasized. It’s their systematic obsession with key performance indicators, forcing everyone to constantly engage in a game of marketing and competition.

Thiel doesn’t disagree on the problem but spars with Graeber about the solution. Whereas Graeber argues for a wide sharing of resources so that people can be unburdened from constant self-marketing, Thiel thinks these resources should go to a small group (his friends). Thiel’s aggressive venture capitalism embraces a revolutionary politics, one that envisions funding misfits to attack our institutions to change things for the better. In Thiel’s mind, startups are radical means of reshaping society through rule-breaking. If you get a *small* number of people together with a vision, they can break through technocratic regulatory logjams. He cites PayPal, Uber, and Airbnb. Thiel argues that the only way to fix sclerotic academia is to abandon it, giving resources to the most talented, visionary young people instead. He trumpets his Thiel Fellows program.

Despite their anarchistic optimism, the decade after the debate gave us more of the same virtualized, financialized information technology that makes the rich richer and our spirits poorer. We still don’t have flying cars or teleportation devices. What comes next?

I have no idea what comes next. I don’t make predictions because I’m always wrong. But I can reflect a bit on this conversation.

Hindsight shows Thiel’s model isn’t the solution. It’s part of the problem. It makes some people very rich and brings us more of the same. All of his exciting startups bring us further virtualization, financialization, and exploitation. Thiel fellows have brought us… Figma and Ethereum. Even Techbro Übermenschen couldn’t free us from the vicious chain of technocracy. Part of the reason why the big technologies of the past decade have been crypto and AI is that the tech itself has stalled.

Frustratingly, Graeber didn’t propose an alternative path. He vaguely argued for a liberal anarchistic utopia where people were given the means to figure things out for themselves, but he didn’t make a case for how to get there. Graeber was ideologically against proposing policies. In 2018’s *Bullshit Jobs*, he wrote

“If an author is critical of existing social arrangements, reviewers will often respond by effectively asking ‘so what are you proposing to do about it, then?’”

I’ve been thinking about how to write something about this remarkable debate for a month. I’ve failed to publish it because I wanted to have some sort of optimistic end. But the 10th anniversary of the iPhone 6 is over tomorrow. Tomorrow is the future.

I’ll have to settle for channeling my inner Graeber and accept that sometimes you just have to articulate a problem. I’m not proposing anything other than a broader awareness here. Critiques like those leveled by Graeber and Thiel have been around since the 1980s. Whether it’s the postmodernists or Neil Postman, we’ve had plenty of substantial articulations of the downsides of digitizing everything. But now, with Moore’s Law over, with the internet a mess of AI slop churned out by energy-hungry GPU farms, and with few cool ideas on the horizon, maybe we can think again about what comes after data. I like rethinking things, and a call for dialog can itself be a positive ending.

1

2014 was before Thiel had been outed for bankrolling Hulk Hogan’s lawsuit against Gawker. It was before he came out as a full-throated Trump supporter. It was before he was advocating doomsday prepping in New Zealand and drinking young blood. A kinder, gentler, younger Thiel. But he was clearly a right of center thinker in 2014, having written for the National Review and publicly espoused arguably radical libertarian politics.

*This is the second part of the live blog of Lecture 9 of my graduate class “Convex Optimization.” A Table of Contents is here.*

What a difference a lecture makes. Yesterday morning, I was apprehensive about class because I couldn’t find a satisfying and intuitive explanation of duality. But… now I think I have one? Let me try it out on you, my dear readers, and let me know if this clarifies anything for you or if it makes things even more confusing.

Advance warning: I’ve been trying to keep these blogs on the less technical side, but Friday posts have ended up being extra technical. Today’s Friday post is not an exception to this emergent rule.

Earlier this week, I discussed what it would mean to prove that you’ve found a solution to an optimization problem. If you have a candidate solution, you can plug it in and check feasibility. This is usually easy enough to compute. However, *verifying* optimality was more involved because we had to check that conditions held *for all* points in the feasible set. I gave a few examples of ways to construct such proofs of optimality. Duality theory provides a path that generalizes all of them.

The main idea is to construct lower bounds. If you had a way of rigorously generating lower bounds on the optimization problem, and the lower bound equaled the objective value of your candidate solution, then you would have a proof of optimality. You need a potential solution and a lower bound saying that all points have objective value greater than or equal to the one you are proposing.

Duality starts with an explicit construction of lower bounds to optimization problems. We find a family of lower bounds and then search for the best one. The problem of finding the best lower bound is what we’ll call the dual problem.

Let’s consider the constrained optimization problem

$$\begin{array}{ll} \text{minimize} & f_0(x) \\ \text{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\ & h_j(x) = 0, \quad j = 1, \ldots, p. \end{array}$$

For now, I won’t assume anything about the convexity of these functions. We can think of constrained optimization as unconstrained optimization if we let ourselves work with functions that can map a point to infinity. Solving the optimization problem is the same as minimizing the unconstrained function

$$p(x) = \begin{cases} f_0(x) & \text{if } f_i(x) \le 0 \text{ for all } i \text{ and } h_j(x) = 0 \text{ for all } j \\ \infty & \text{otherwise.} \end{cases}$$

You might not have a good algorithm to deal with infinities, but from a conceptual standpoint, this *extended real-valued function* captures what the optimization problem *means*. The only x with finite objective are the feasible points.

I can define a family of lower bounds for my extended real-valued function p(x) by introducing the *Lagrangian*:

$$L(x, \alpha, \beta) = f_0(x) + \sum_{i=1}^{m} \alpha_i f_i(x) + \sum_{j=1}^{p} \beta_j h_j(x).$$

The Lagrangian has three arguments. It takes a primal value x and two *Lagrange multipliers* 𝜶 and β. For fixed Lagrange multipliers, the Lagrangian is a lower bound of the function p as long as 𝜶 is greater than or equal to zero. This is because if you plug in a feasible x, the first summation will be nonpositive and the second summation will be zero. Therefore, the Lagrangian yields a value less than or equal to f_{0}(x). For a nonfeasible x, the Lagrangian will be some number less than infinity.

How tight are these lower bounds? For a fixed x, we have

$$\sup_{\alpha \ge 0,\, \beta} L(x, \alpha, \beta) = p(x).$$

That is, for each x, there is a sequence of Lagrange multipliers so that the value of the Lagrangian converges to p(x). To see why, note that if x is feasible, you’re going to want to set 𝜶_{i} to zero whenever an inequality constraint is strictly satisfied. But if x is not feasible, the supremum must be infinite. If f_{i}(x) is greater than 0, the supremum over 𝜶_{i} is infinity. Similarly, if an h_{j}(x) is nonzero, the supremum with respect to β_{j} is infinite. Following this argument to its logical end, I’ve argued that the original optimization problem we cared about is equivalent to the minimax problem

$$\underset{x}{\text{minimize}} \; \sup_{\alpha \ge 0,\, \beta} L(x, \alpha, \beta).$$

Now let’s think about the quality of the lower bound provided by each fixed assignment of the Lagrange multipliers. First, minimizing the Lagrangian over x is an unconstrained problem, so you might try to run gradient descent to find a minimum. Since this Lagrangian function is a lower bound of our optimization problem, if the minimum that you find is feasible, you have found an optimal solution of the optimization problem. That’s pretty powerful already! If you instead find a nonfeasible point when minimizing the Lagrangian, you still get a lower bound on the optimal value of the original optimization problem. For each 𝜶 and β, we can quantify the value of this bound with the *dual function*

$$g(\alpha, \beta) = \inf_{x} L(x, \alpha, \beta).$$

The value of g is always a lower bound on the optimal value of the optimization problem. Over all choices of the Lagrange multipliers, the best lower bound is also a lower bound. The maximin problem

$$\underset{\alpha \ge 0,\, \beta}{\text{maximize}} \; \inf_{x} L(x, \alpha, \beta)$$

is called the *Lagrangian dual problem* or usually just the *dual problem*. Our original problem is then retroactively dubbed the *primal problem*. The optimal value of the dual problem is always a lower bound of the primal problem. This inequality is called *weak duality* and follows from the simple argument I wrote yesterday. Today’s exposition has motivated duality as a way of generating lower bounds, so this shouldn’t be too surprising.
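To make the lower-bounding story concrete, here’s a tiny worked example of my own (not from the lecture): minimize x² subject to x ≥ 1, i.e., f_1(x) = 1 − x ≤ 0. The dual function has a closed form, and every multiplier value gives a valid lower bound on the primal optimum:

```python
# Toy problem: minimize f0(x) = x^2 subject to f1(x) = 1 - x <= 0.
# Primal optimum: x* = 1 with optimal value 1.
# Lagrangian: L(x, a) = x^2 + a*(1 - x) for a >= 0.
# Its unconstrained minimizer over x is x = a/2, so the dual function is
# g(a) = a - a^2/4.

def dual(a):
    return a - a**2 / 4.0

primal_opt = 1.0
for a in [0.0, 0.5, 1.0, 2.0, 3.0]:
    # Weak duality: every nonnegative multiplier yields a lower bound.
    assert dual(a) <= primal_opt + 1e-12

# The best lower bound is attained at a = 2, where g(2) equals the primal optimum.
print(dual(2.0))  # 1.0
```

Here the best lower bound matches the primal optimum exactly, which is the strong duality phenomenon discussed below.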

Now, why should we care about *this* family of lower bounds? The Lagrangian is an affine function of the Lagrange multipliers. The dual function is an infimum of affine functions. That means it is concave, no matter what f_{i} and h_{j} are. The dual problem is always a convex optimization problem.1 We’ve derived a powerful tool to construct lower bounds for intractable problems.

In convex programming, there’s a second remarkable benefit. If the primal problem is convex and properly conditioned, then the dual and primal optimal values are *equal*. This is called *strong duality.* With strong duality, we have arrived at our initial goal of proving a particular solution of the primal problem is optimal. A dual optimal solution certifies the optimality of a primal optimal solution. Moreover, these primal and dual solutions often come in pairs, where you can compute one from the other. I will cover strong duality and such optimality conditions next week.

1

I hate the terminology, but we’re stuck with it: maximizing a *concave* function over a convex set is technically a *convex* optimization problem.

*This is the live blog of Lecture 9 of my graduate class “Convex Optimization.” A Table of Contents is here.*

Duality theory in optimization is deep, beautiful, and mysterious. For every minimization problem, we can construct a *convex* maximization problem whose optimal value is less than or equal to the optimal value of the minimization problem. The minimization problem is called *the primal problem* because it’s the one we primarily care about. The maximization problem is called *the dual problem,* and it helps us reason about the primal problem.

Lower bounds are useful: if you have a point that you think is optimal for the primal problem and a potential solution for the dual problem with the same objective value, you have found a solution for *both* the primal problem and the dual problem. In fact, we’ll show that any feasible point for the minimization problem has an objective value at least as large as that of any feasible point for the new maximization problem. And even if your primal problem is nonconvex, the dual problem is always convex, so it can provide insights into what might be achievable with your intractable primal problem.

Duality theory also gives us insights into the robustness of solutions to specification and the sensitivity of different modeling assumptions. It inspires algorithmic strategies for solving both convex and nonconvex problems. It’s a powerful theoretical and applied tool and is essential to understand if you want to be a practicing optimizer.

And yet, I’ve been struggling all morning to figure out how to introduce the topic without it seeming magical. There are different paths to introduce it, and all of them feel weird and confusing. Boyd and Vandenberghe jump right in with a Lagrangian function, but how did Lagrange come up with those multipliers in the first place? It takes some work to motivate this! Some people start with abstract convex geometry, relating graphs of optimization problems and their characterization in terms of separating hyperplanes. I love this derivation, but it takes me an hour of confusion to remember how to explain it (apologies to Dimitri Bertsekas). The final introduction, which parallels the historical origin of optimization duality, is through minimax theory. But, again, minimax theorems are also mysterious.

Well, let me try to go through the minimax theorem because it did really start this whole thing rolling. I’ll introduce it the same way von Neumann discovered it: in terms of a game. The game has two players, Player One and Player Two. Player One and Player Two have a known joint function F(x,y). Player One wants to choose x from a set X to make F as large as possible. Player Two wants to choose y from a set Y to make F as small as possible. At the end of the game, Player One gets F(x,y) points, and Player Two gets -F(x,y) points. Does it matter who plays first?

If Player One goes first and plays x1, then Player Two will minimize with respect to their available choices. That is, Player Two chooses y to achieve a payout

$$\inf_{y \in Y} F(x_1, y).$$

Therefore, to anticipate this move by Player Two, Player One should pick the x that maximizes this infimum. Player One’s best strategy yields a final payout of

$$\sup_{x \in X} \inf_{y \in Y} F(x, y).$$

From the same analysis, if Player Two chooses first, then the final score is

$$\inf_{y \in Y} \sup_{x \in X} F(x, y).$$

Should Player One opt to go first or second? How does a maximum of a minimum compare to a minimum of a maximum? The following nearly tautological analysis shows that Player One should always choose to go second.

Take *any* function F(x,y) and *any* sets X and Y. Then for a fixed x0 and y0,

$$\inf_{y \in Y} F(x_0, y) \le F(x_0, y_0) \le \sup_{x \in X} F(x, y_0).$$

Now, this means that if you take the supremum of both sides with respect to x, this inequality is valid

$$\sup_{x \in X} \inf_{y \in Y} F(x, y) \le \sup_{x \in X} F(x, y_0).$$

Since this holds for all values of y0, we have

$$\sup_{x \in X} \inf_{y \in Y} F(x, y) \le \inf_{y \in Y} \sup_{x \in X} F(x, y).$$

That is, minimums of maximums are always greater than or equal to maximums of minimums.
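This inequality is easy to spot-check numerically when the strategy sets are finite. Here’s a quick sketch of mine with random payoff matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    # F[i, j] is the payoff when Player One plays i and Player Two plays j.
    F = rng.standard_normal((5, 7))
    max_min = F.min(axis=1).max()  # Player One commits first
    min_max = F.max(axis=0).min()  # Player Two commits first
    # Going second never hurts the maximizer.
    assert max_min <= min_max + 1e-12
```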

In 1928, John von Neumann realized that the two values were often equal. In this game, suppose the players choose their moves *at random*. Even if players declare their strategies in advance (each player declares their sampling distribution), *there is no advantage to the order*. In math, if the expected payoff is x^{T}Ay for some payoff matrix A and X and Y are both probability simplexes, von Neumann proved that

$$\max_{x \in X} \min_{y \in Y} x^T A y = \min_{y \in Y} \max_{x \in X} x^T A y.$$

If players play these random strategies, they can even tell their opponent in advance what their strategy is and still be optimal.

Von Neumann derived this while studying parlor games, but its impact has been felt far beyond game theory. In fact, if F is concave in x and convex in y, X is a compact convex set, and Y is convex, then the minimax theorem is still true. The min max equals the max min.
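You can check the matrix-game version of the theorem by solving both players’ problems as linear programs. This is a sketch of my own formulation (assuming SciPy’s `linprog` is available), computing the game value from each side and verifying they agree:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
F = rng.standard_normal((4, 5))  # payoff to Player One (the maximizer)
m, n = F.shape

# Player One: maximize v subject to (F^T p)_j >= v for all j, p in the simplex.
# Variables z = (p, v); linprog minimizes, so the objective is -v.
res1 = linprog(
    c=np.r_[np.zeros(m), -1.0],
    A_ub=np.c_[-F.T, np.ones(n)],  # v - (F^T p)_j <= 0 for each column j
    b_ub=np.zeros(n),
    A_eq=np.r_[np.ones(m), 0.0].reshape(1, -1),
    b_eq=[1.0],
    bounds=[(0, None)] * m + [(None, None)],
)

# Player Two: minimize u subject to (F q)_i <= u for all i, q in the simplex.
res2 = linprog(
    c=np.r_[np.zeros(n), 1.0],
    A_ub=np.c_[F, -np.ones(m)],  # (F q)_i - u <= 0 for each row i
    b_ub=np.zeros(m),
    A_eq=np.r_[np.ones(n), 0.0].reshape(1, -1),
    b_eq=[1.0],
    bounds=[(0, None)] * n + [(None, None)],
)

max_min = -res1.fun
min_max = res2.fun
assert abs(max_min - min_max) < 1e-6  # von Neumann: the two values coincide
```

It’s no accident that these two linear programs are duals of each other; the minimax theorem for matrix games and LP duality are two faces of the same result.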

What does this have to do with optimization? In class today, I’ll show how to write minimization problems as a min max. The dual problem will then be the associated problem when we switch the order of minimum and maximum. Like I said, it’s mysterious. I’ll write back tomorrow to report on how it went.

*This is the live blog of Lecture 8 of my graduate class “Convex Optimization.” A Table of Contents is here.*

The two central concepts of constrained optimization are feasibility and optimality. A point is feasible if it satisfies all of the constraints. A point is optimal if it is feasible *and* its objective value is less than or equal to the objective value of all other feasible points. To check if a point is feasible, you evaluate each constraint and verify that it holds. This is almost always a straightforward numerical procedure based on a few primitives (e.g., Is some function less than zero? Is the vector nonnegative? Is a matrix positive semidefinite?).

Checking optimality is trickier because of the associated logic. Feasibility is a “there exists” problem. You can check feasibility by plugging in values to an appropriate formula. Optimality is a “for all” problem.

**Feasibility**:

There exists an x that satisfies the constraints and has f(x) less than or equal to t.

**Optimality**:

For all x that satisfy the constraints, f(x) is greater than or equal to t.

How can you check that a point has a lower objective value than every other? For a point to be verifiably optimal, we need a “short proof” of optimality. We need to be able to plug something into a formula that proves optimality. In convex optimization, we turn optimality checking into feasibility problems.

The transformation is very subtle. The first thing to note is that a point is locally optimal for a convex optimization problem if the directional derivative is nonnegative in all directions that stay feasible. For if the directional derivative were negative in some direction, you could move your point a small amount and decrease the function value. A point is optimal if all such decreasing moves leave the feasible region.

For convex sets, the set of all feasible directions is a cone (if two directions move a point inside a convex set, their sum and scaling do too). I’ll call it the feasible cone. Here’s a nice picture of some feasible cones:

The grey shading indicates the directions that can stay feasible. In the left picture, any direction in a half space can move inside the circle. On the right, only a restricted set of directions can stay inside the polygon.

Everyone forgets this from multivariate calculus, but the directional derivative of f in the direction v has a formula. It is the dot product of v with the gradient of f:

$$f'(x; v) = \nabla f(x)^T v.$$

Rederive this for yourself! We want this dot product to be nonnegative for all v in the feasible cone. That means we have stumbled on a bizarre geometric fact. Verifying that x is optimal for a convex problem is thus transformed into checking whether the gradient of the objective lies in the dual of the feasible cone at x:

$$\nabla f(x) \in F(x)^*,$$

where F(x) denotes the feasible cone at x.

Fancy! We went from some simplistic geometric arguments to checking membership in a dual cone. But it is sort of neat: we transformed a “for all” problem into a “there exists” problem. If we can compute the dual cone of the feasible set, optimality checking is no harder than feasibility checking. This is only true for convex problems and is again a reason why we spend so much time on them.

For the circle above, the dual of the feasible cone is a single ray. It’s saying that the gradient of f must point along the x-axis. Here’s a picture showing the dual of the feasible cone for that polygon:

The cone is slightly wider than the polygon itself.

As is always the case in this class, we move back and forth between geometric intuition and algebraic verification. We can explore a few instances of what this condition means algebraically.

For unconstrained problems, every direction is feasible. This means the dual cone contains only the zero vector. That is, a point minimizes a convex function over R^{n} if and only if its gradient is equal to zero. I am sure many of you have seen that condition before. At least this extra geometric machinery recovers common sense.

When does x minimize f over the set where Ax=b? It turns out that this is true if and only if there exists a v such that

$$\nabla f(x) + A^T v = 0.$$

This v is called a Lagrange multiplier. We’ll have a lot more to say about these on Thursday.
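As a quick numerical illustration (a toy example of mine): minimizing ½‖x‖² subject to Ax = b has the closed-form minimum-norm solution, and we can exhibit the multiplier v explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)

# Minimize f(x) = 0.5 * ||x||^2 subject to A x = b.
# The minimum-norm solution is x* = A^T (A A^T)^{-1} b.
x_star = A.T @ np.linalg.solve(A @ A.T, b)
grad = x_star  # the gradient of 0.5 * ||x||^2 is x itself

# The optimality condition asks for a v with grad + A^T v = 0.
v = -np.linalg.solve(A @ A.T, b)
assert np.allclose(grad + A.T @ v, 0)
assert np.allclose(A @ x_star, b)  # feasibility
```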

What about minimizing functions over nonnegative vectors? The optimality conditions here start to look a bit more exotic. x minimizes f over the nonnegative vectors if and only if the gradient itself is a nonnegative vector, the gradient is equal to zero in the components where x is nonzero, and x is equal to zero in the components where the gradient is nonzero. There is, let’s say, a *complementarity* between where the gradient vector is zero and the x vector is zero. Think about this condition in 2d:

On the boundary of the orthant, the only way to move into the set is to be orthogonal to that particular axis.
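Here’s a minimal numeric check of that complementarity, using a toy example of mine: minimizing a separable quadratic over the nonnegative orthant, where the solution just clips the negative entries to zero:

```python
import numpy as np

# Minimize f(x) = ||x - c||^2 over the nonnegative orthant.
c = np.array([2.0, -1.0, 0.5, -3.0])
x = np.maximum(c, 0.0)  # the minimizer clips negative entries of c to zero
grad = 2 * (x - c)

assert np.all(grad >= 0)            # the gradient is a nonnegative vector
assert np.allclose(grad[x > 0], 0)  # gradient vanishes where x is positive
assert np.allclose(x[grad > 0], 0)  # x vanishes where the gradient is positive
```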

Now, if we had to derive optimality conditions on a case-by-case basis for all possible constraint sets, we wouldn’t be able to solve anything. On Thursday, we’ll dive into a systematic way to generate these sorts of optimality conditions: *convex programming duality.* The two ideas here, complementarity and Lagrange multipliers, will form the basis.

Yesterday, Semafor posted a silly article about a silly company using chatbots to simulate public opinion and predict elections.

I mean, good for them, I guess.

A dozen people sent me this article. My friend wrote me “It does feel like someone pitched their editor: ‘What if we make this one Berkeley guy really really angry.’” I found it too on the nose to be that annoyed, but it did seed a question in my head that I haven’t been able to shake. I asked on Twitter: “We all believe pollsters aren’t actually doing this. But how could someone actually tell the difference?”

I was serious about this question. What exactly *is* an opinion poll?

Pollsters want to estimate the percentage of people in a population who would answer yes to a yes or no question. In a perfect world, everyone in the population could potentially be asked this question and would always answer truthfully. If this was the case, you could pick a random sample of about 800 people, ask them the question, and compute the percentage that answers yes. By the unquestionable laws of frequentist statistics, we’d be guaranteed that the true answer would lie within 3% of the sampled answer for 95% of the potential random samplings.1
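That textbook guarantee is easy to simulate. Here’s a quick sketch of my own (emphatically not any pollster’s actual methodology), drawing idealized uniform samples of 800 respondents and checking how often the estimate lands within the margin of error:

```python
import numpy as np

# Idealized polls: n = 800 honest respondents, true support p = 0.5.
# The 95% margin of error is 1.96 * sqrt(p * (1 - p) / n), about 3.5 points
# at this sample size (the famous "3%" of polling folklore).
rng = np.random.default_rng(0)
n, p, trials = 800, 0.5, 5000
moe = 1.96 * np.sqrt(p * (1 - p) / n)

estimates = rng.binomial(n, p, size=trials) / n
coverage = np.mean(np.abs(estimates - p) <= moe)
assert 0.93 < coverage < 0.97  # roughly 95% of polls land within the margin
```

The guarantee holds exactly as advertised here, but only because the simulation bakes in everything the next paragraph explains real polls don’t have: uniform sampling, full response, and honest answers.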

But of course, no poll works this way. The process instead goes like this: the polling company has some means of getting people to answer questions. Maybe they can call up landlines. Maybe they can gather a panel of participants to click on a web form. Maybe they can harass people on the street. By hook or by crook, they gather a sample of people and hope they respond. Some people answer some of their questions truthfully. Some people tell them what they think they want to hear. Some people lie. Some people tell them to leave them alone.

With this pristinely collected data, the pollsters have to come up with a percentage to send to the press. They do not send you the raw percentage! Instead, they build a statistical model to impute the unanswered questions and adjust for sampling and nonresponse biases. Whatever this model says, that’s what they report. But that model has tons of choices and knobs. If you give different pollsters the same data, they give you wildly different answers. Nate Cohn tried this experiment in 2016. He gave 5 “good” pollsters the same data and found a 5% split between the pollsters about what numbers to return. The systematic bias of “house effects” is as large as the “margin of error.”

Of course, the pollsters still tell you the margin of error is 3%! This is at best misleading and at worst a lie. The 3% MOE happens if you sample uniformly from a population. If you do whatever weird data collection and post-processing procedures the pollsters do, that 3% frequentist guarantee goes out the window.

Polling is quantitative social science with less openness. And quantitative social science is a giant mess. I know not to trust any result in quantitative social science. I know you can’t fix quantitative social science with meta-analysis (here’s looking at you, poll averagers). Given what we have learned from the “replication crisis” and the infinite set of forking paths in model adjustment, why should we believe any of the numbers that come out of these polls? I’m going to go a step further. Why should we believe that pollsters actually talked to people? How could you or I know for sure that pollsters ever did a survey?

The answer is trust. We’re supposed to trust certain pollsters because certain media empires tell us that they should be trusted. The claim is this trust comes from “track record,” but what a poll in July tells us about a result in November is dubious at best. And a pollster’s track record is impossible to reliably validate or corroborate.

No, the trust here is established through incestuous politico-media relationships. But what is the relative news value of a poll to one of those obnoxious undecided voter panels? What is the value over an anonymous source? Just because they give you numbers, perhaps to three decimal places, doesn’t mean polls, pollsters, or poll analysts deserve our attention.

1

Every time I write out the definition of a confidence interval, god kills a kitten.

*This is the second part of the live blog of Lecture 7 of my graduate class “Convex Optimization.” A Table of Contents is here.*

At least once a semester, I end up with a lecture where I only get through half of what I had planned. It’s always worth reflecting on what made this content particularly challenging, so as a note to my future self, I’m just going to write what I had intended to lecture about in today’s post. Next time I’ll remember to split this over two lectures. Today will be closer to traditional lecture notes with too many mathematical formulas. I apologize in advance to all of you reading.

A convex cone is a set where arbitrary nonnegative linear combinations of the points in the set lie in the set. The canonical example we’ve already encountered is the cone of nonnegative vectors, the nonnegative orthant. Any nonnegative linear combination of nonnegative vectors is a nonnegative vector. Two other cones show up a lot in optimization: the second-order cone

$$\{(x, t) : \|x\|_2 \le t\}$$

and the positive semidefinite cone (the set of all positive semidefinite matrices).

These cones are central to optimization because their geometry makes it easy to verify optimality.

How do you check if a point x minimizes f(x) over K? It is necessary and sufficient that

$$\nabla f(x)^T (z - x) \ge 0 \quad \text{for all } z \in K.$$

This formula says that the directional derivative of f in all directions that stay feasible in K is nonnegative. That makes sense as an optimality condition! I’ll prove it in full generality on Tuesday. When K is a cone, this formula simplifies to the following check: x minimizes f over K if

$$\nabla f(x) \in K^* \quad \text{and} \quad \nabla f(x)^T x = 0,$$

where K* is the *dual cone* of K:

$$K^* = \{y : y^T z \ge 0 \ \text{for all } z \in K\}.$$

K* is itself a cone (it is closed under nonnegative combinations). Verifying optimality in a cone program amounts to testing if a point is in a dual cone.

The wild part about the three cones I introduced above is they are *self-dual*. The set of vectors whose dot product is nonnegative with every nonnegative vector is the set of nonnegative vectors. You can check the same holds for the other two cones too (Verifying this in class turned out to require more algebra than I had anticipated. Sigh). You can also check that any cartesian product of self-dual cones is self-dual. That is, the cone

$$K_1 \times K_2 = \{(x, y) : x \in K_1, \ y \in K_2\}$$

is self-dual if K1 and K2 are.
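Random sampling gives a cheap spot-check of self-duality for the second-order cone: any two points of the cone should have a nonnegative inner product. This is only a sanity check of mine, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_soc(dim):
    """Sample a point (x, t) from the second-order cone: ||x|| <= t."""
    x = rng.standard_normal(dim)
    t = np.linalg.norm(x) + rng.random()  # t is at least ||x||
    return np.r_[x, t]

# Self-duality implies z^T w >= 0 for any two points z, w of the cone:
# z^T w = x^T y + t*s >= -||x|| ||y|| + t*s >= 0.
for _ in range(1000):
    z, w = sample_soc(5), sample_soc(5)
    assert z @ w >= -1e-12
```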

If you have a cone where you can do local search, you can also verify local search has converged. That’s half of why I wanted to introduce the self-dual cones today. The other half is because you can pose 90% of the problems in the book as a linearly constrained cone program

$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax = b \\ & x \in K, \end{array}$$

where K is a Cartesian product of orthants, second-order cones, and positive semidefinite cones.

It took me too long to get to this point, so I ended up rushing all of the examples. But let me list them here, and you can check the relevant sections in the book for full details.

First, see Section 4.4.2 with regard to second-order cones. Any convex quadratic constraint can be posed as a second-order cone constraint. Convex quadratic programming is a special case of second-order cone programming. I made this so much more complicated in class than it needed to be! If you were confused, check out Section 4.4.2.

SOCPs also come up in robust linear programming. This discussion consumed more time than I had anticipated, but I think it was worth it in retrospect. I’ll add a whole lecture on robust optimization in the second third of the class.

Stochastic programming is a bit confusing. We want to devise a policy where we model the constraint as a random variable that we don’t get to see until after we declare our policy. We also assume our policy has no impact on the randomness. This lets us pose a *chance constrained* problem

$$\mathbb{P}(a^T x \le b) \ge \eta,$$

where the randomness only lies in the vector a. We choose x so that whatever a ends up being, the constraint is satisfied with high probability. It turns out that for an appropriate choice of uncertainty set U, this problem is equivalent to the *robust linear program* with the constraint

$$a^T x \le b \quad \text{for all } a \in U.$$

This problem is a linear program with an *infinite* set of constraints. There is a constraint for every a in U. But it captures what the stochastic problem was after. A probabilistic constraint asks about the worst possible occurrence in a confidence set. U is the appropriate confidence set. In the case that the probability is Gaussian, the confidence set is an ellipse, and both problems can be posed as second-order cone programs. Section 4.4.2 has all the gory details!

I ran out of time before I could discuss semidefinite programs. I was going to work through the matrix norm minimization problem from Section 4.6.3. Take a look and see if it makes sense! In particular, I wanted to work through some manipulations of Schur complements. This is a super useful manipulation that turns very nonconvex looking problems into SDPs. Those tricks will have to wait until after we cover duality, but you can see them in Section 4.6.3.

The part I’m most disappointed about is I didn’t get to explain that LPs and SOCPs are special cases of SDPs. Any Cartesian product of orthants, second-order cones, and positive semidefinite cones can be embedded in a large semidefinite cone.

If x is a nonnegative n-vector then diag(x) (the n x n matrix that has x on the diagonal) is positive semidefinite. If (x,t) is in the second-order cone, then

$$\begin{bmatrix} t I & x \\ x^T & t \end{bmatrix} \succeq 0.$$
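A quick eigenvalue check of these embeddings, as a sketch of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
t = np.linalg.norm(x) + 0.1  # so (x, t) lies in the second-order cone

# The arrow matrix [[t*I, x], [x^T, t]] should be positive semidefinite:
# its eigenvalues are t + ||x||, t - ||x||, and t (with multiplicity n - 1).
n = len(x)
M = np.block([[t * np.eye(n), x[:, None]],
              [x[None, :], np.array([[t]])]])
assert np.min(np.linalg.eigvalsh(M)) >= -1e-10

# And the diagonal matrix built from a nonnegative vector is PSD.
assert np.min(np.linalg.eigvalsh(np.diag(np.abs(x)))) >= 0
```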

In this sense, SDPs are the master problem for understanding the sorts of constraints over which we can tractably optimize. You might argue with what I mean by “tractable” when we try to implement SDP solvers. However, as a theoretical construction, understanding the inevitability of semidefinite programming is fundamental to understanding the state of convex optimization.

*This is the live blog of Lecture 7 of my graduate class “Convex Optimization.” A Table of Contents is here.*

Let me start off today with some of that dreaded linear algebra. The most important matrices in numerical computation are the positive semidefinite matrices. A square, symmetric matrix is positive semidefinite if all of its eigenvalues are greater than or equal to zero. In today’s class, I’m going to describe how almost all of the problems in Boyd and Vandenberghe can be posed as

$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & x_1 A_1 + \cdots + x_m A_m - B \succeq 0. \end{array}$$

Here A_{i} are all n x n matrices. n is whatever integer you want, but you should try to keep it as small as possible in your modeling. This problem is called a *semidefinite program* (SDP).1

Semidefinite programming generalizes linear programming to the positive semidefinite cone. In linear programming, the constraint “Ax≥b” is the same as writing

$$Ax - b \in \mathbf{R}^m_+,$$

that is, membership in the cone of nonnegative vectors.

In SDP, we’re just swapping nonnegative vectors of LP for positive semidefinite matrices.

The range of problems you can write as SDPs is amazing. Any linear program or convex quadratic program falls under this umbrella. Minimizing a univariate polynomial on the unit interval can be posed as an SDP. You can even pose ordinary least-squares as an SDP. Instead of solving

$$\underset{x}{\text{minimize}} \; \|Ax - b\|_2^2$$

you can solve

$$\begin{array}{ll} \underset{x,\, t}{\text{minimize}} & t \\ \text{subject to} & \begin{bmatrix} I & Ax - b \\ (Ax - b)^T & t \end{bmatrix} \succeq 0. \end{array}$$

You shouldn’t solve it this way.
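For the record, the sensible route looks like this (a sketch; `np.linalg.lstsq` is NumPy’s standard dense least-squares solver):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)

# Solve the least-squares problem directly with a dedicated solver.
x, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sanity check against the normal equations A^T A x = A^T b.
assert np.allclose(A.T @ A @ x, A.T @ b)
```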

Let me be very clear here: I think it’s important to learn about SDPs and their generality and their relationship to convex programming more broadly. But my general advice is to avoid SDPs at all costs. At the end of the class, I’ll describe general algorithms for solving SDPs. They technically run in polynomial time, but it’s a pretty big polynomial (O(m^6), maybe? It’s so large I’ve blocked it from my memory). If you tried to solve all of your problems with SDP solvers, you’d be waiting a very long time.

One of the utopian ideas of optimization research is that we can develop modeling languages that can parse the high-level models posed by practitioners into code efficiently executable by optimization solvers. Since the 1970s, people have proposed different algebraic modeling systems for this task. This ideal was a substantial motivation behind Boyd and Vandenberghe’s text. The ambitious agenda was that disciplined convex programming could solve the solver interface problem for a huge set of relevant models.

Despite the captivating vision, no universal modeling language ever really panned out. Someone should write a longer article examining the history of these languages and why they failed to capture broad adoption. There’s an important lesson for the optimization community in understanding why PyTorch became super popular, but GAMS remained niche for 50 years.

I have some guesses, but I need to think more about how to tell a coherent story. I’ll give a few scattered initial thoughts. First, I think targeting multiple solvers is problematic. Optimization often feels like a zoo of methods. Solvers force your hand in formulating problems in very particular ways. If you want to use gradient descent, you have to model everything as a giant objective function, eliminating the constraints. You might then try to use *projected* gradient descent, but this requires knowing how to project onto sets. Hence, you’d be limited to modeling your problem as an objective function subject to a single constraint that you choose from a limited menu. If you have a linear programming solver, you work to formulate your problem as a linear program. If you want to minimize a quadratic cost with your linear constraints, now you need to go and find a quadratic programming solver. And so on and so on.

There are too many equivalent ways to pose optimization problems, and we have no way of knowing in advance which formulation will be solved most efficiently and accurately by the solvers we installed on our workstation. This is even true if all we have is linear programming solvers. Software will solve two equivalent linear programs in vastly different times. Removing the practitioner from the solver interface forces them to invent weird tricks in the modeling language to get better performance. This leads to writing obfuscated models, betraying the ideal of letting modelers model in their natural language.

Matching numerical solvers to problem instances is even a problem in equation solving. Suppose we just want to solve for x in the equation Ax=b. There is a zoo of different possibilities for this too. How many different routines are there in LAPACK? More perniciously, there is the problem of *conditioning*. Suppose A is a positive semidefinite matrix. Then solver performance will vary depending on the eigenvalues of A. If the largest eigenvalue is much larger than the smallest, solvers will get bogged down in convergence and numerical errors. These problems are called *ill-conditioned*. Even in linear systems, you need to figure out how to pose the problems so the solvers won’t run into problems of conditioning.
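To see conditioning bite, here is a small numpy sketch (all numbers invented): solve Ax = b with the same routine for a well-conditioned and an ill-conditioned positive definite A and compare the relative error in the recovered solution.

```python
import numpy as np

rng = np.random.default_rng(0)

def pd_matrix(n, cond):
    """Random positive definite matrix with condition number `cond`."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    eigenvalues = np.geomspace(1.0, cond, n)  # spread from 1 up to cond
    return Q @ np.diag(eigenvalues) @ Q.T

errors = {}
for cond in (1e2, 1e12):
    A = pd_matrix(50, cond)
    x_true = rng.standard_normal(50)
    b = A @ x_true
    x_hat = np.linalg.solve(A, b)
    errors[cond] = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
    print(f"condition number {cond:.0e}: relative error {errors[cond]:.2e}")
```

The same call, the same dimensions, and the answer degrades by many orders of magnitude purely because of the eigenvalue spread.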

I’ll come back to this question about why the modeling language utopia never panned out in future posts, but let me close with something more constructive. Steve Wright thinks that optimization is best approached like an old tool chest in the garage. You’re going to have a wide assortment of formulations and methods. Keep that assortment well organized in your head. Learn about what all of the tools are, how to use them, and when they are best applied. Not every home improvement project needs a 6-axis CNC mill. Sometimes you just need a screwdriver. If you get a good feel for all of the tools in your chest, you never need to call a contractor.

1

I can hear Jeff Linderoth groaning already.

*This is the second part of the live blog of Lecture 6 of my graduate class “Convex Optimization.” A Table of Contents is here.*

The promise of optimization is a clean abstraction between the math and the modeling (again, this is a key reason why I’m much more excited to be teaching optimization than machine learning). Once you hand a solver a model, the problem solving is purely deductive. Either the problem is convex or it is not. Either the problem is well-conditioned, or it is not. Either I can guarantee efficient solving, or I can’t. There is no guesswork about what the future looks like. That’s the modeler’s problem.

But all of the models themselves, as we are constantly reminded by the ghost of George Box,1 are wrong. Every time I go through examples of optimization problems, I’m struck by how many idealizations need to be made to get a tractable problem. I want to belabor this point with the simplest example that every linear programming class starts with: the minimum cost diet problem.

Educators love the diet problem because it’s easy to describe and, at first blush, is clearly linear. I start with a list of nutritional requirements that my diet must satisfy. For example, I could specify a maximum number of calories, a minimum amount of vitamins, and a minimum amount of protein. I then get a list of foods from my diet tracking app and look at their nutritional content. This content is proportional to weight: every 100g of spinach contains 3g of protein, 2g of fiber, 100mg of vitamin C, etc. I can make a similar nutrient content list for every food I might consider consuming.

A nutritional requirement is then a linear constraint on the allocation of foods in my diet. If I eat twice as much of some food, I get twice the nutrients. If I eat two foods, I get the sum of their nutrients. Moreover, if I am buying this food slurry in bulk from the Berkeley Bowl, the price is proportional to the weight as well, so the cost is additive. With enough assumptions, I can formulate finding the diet of minimum cost that satisfies my nutritional requirements as a linear program.
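As a concrete (and entirely made-up) instance, here is a tiny version of that linear program in scipy. The three foods, their nutrient contents, and the daily minimums are all invented numbers for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: columns are foods, rows are nutrients (per 100g serving).
foods = ["spinach", "beans", "flour"]
N = np.array([[3.0,   8.0, 10.0],   # protein (g)
              [2.0,   6.0,  3.0],   # fiber (g)
              [100.0, 1.0,  0.0]])  # vitamin C (mg)
cost = np.array([0.50, 0.30, 0.10])      # dollars per 100g
minimums = np.array([50.0, 30.0, 90.0])  # daily minimums per nutrient

# minimize cost @ x  subject to  N @ x >= minimums  and  x >= 0.
# linprog expects A_ub @ x <= b_ub, so negate the nutrient constraints.
res = linprog(c=cost, A_ub=-N, b_ub=-minimums, bounds=[(0, None)] * 3)
print(dict(zip(foods, res.x)), "cost:", res.fun)
```

The solver hands back one cheapest allocation of the three foods that meets every minimum; change any invented number and the menu shifts.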

But is it really? The variation of nutrients in food is unknown. In packaged foods, the USDA allows for 20% errors in calorie count alone, and fresh food is far more variable. The nutritional content of an apple varies with size, ripeness, and variety. Tart apples have considerably more Vitamin C than sweet ones, and the amount of vitamin C decays over time after the apple is picked.

And what about the constraints on nutrients? How do we set those? The USDA minimum nutrition guidelines all contain guesswork. Recommended allowances of nutrients are based on averages, observing how much of the consumed nutrients come out the other side (yeah, I know, it’s gross). Many of the daily allowances are inflated by engineering factors just to be safe. If you *really* care about minimum cost, maybe you eat less than the recommended allowances. If you’re a multivitamin parishioner, maybe you want to exceed these numbers by a factor of 10.

I could go on.

Ironically, the diet problem was invented by economist George Stigler to argue that the entire idea of a minimum cost diet was an absurdity. Stigler, who would go on to be an influential figure in the Chicago School and would win the economics fake Nobel prize in 1982, was a rabid antiregulation, small government conservative. In his 1945 polemical screed, “The Cost of Subsistence,” he argued that USDA recommendations about minimum cost diets weren’t scientific. In the process, he wrote up one of the first linear programs—several years before George Dantzig would coin the term. But Stigler’s linear programming solution was polemical. He argued that there was an “almost infinite complexity of a refined and accurate assessment of nutritive value of a diet.” Moreover:

“...the particular judgments of the dieticians as to minimum palatability, variety, and prestige are at present highly personal and non-scientific, and should not be presented in the guise of being parts of a scientifically-determined budget… these cultural judgments, while they appear modest enough to government employees and even to college professors, can never be valid in such a general form.”

Stigler’s proposed minimum cost diet was particularly grotesque. Every day you get to eat one pound of flour, two ounces of evaporated milk, five ounces of cabbage, one ounce of spinach, and five cans of navy beans. Seasonings add to the cost, so you can’t have any of those. Have fun with that. But it cost $39.93 per year in 1939.2 It’s closer to 800 dollars a year today. Maybe this could work for you FIRE folks out there. Just put it in a blender and drink it like Soylent. For everyone else, this diet is a war crime.

Even though it was introduced as a polemical screed against the absurdity of government nutrition recommendations, we now start every linear programming class with the diet problem as an example. Even the simplest optimization model we present in our courses is loaded with approximations, idealizations, and value judgments. What do you think happens with any of the more complicated optimization problems we discuss? In class yesterday, I also introduced Markowitz portfolio optimization, which requires modeling the future returns on investments. Portfolio optimization rests on far shakier ground than diet planning.

A central lesson I want us to take away from this semester is that just because we can find optimal solutions to well posed optimization problems, the problems themselves are all idealized models. We have to interpret their policy suggestions accordingly. How optimization models are useful is a domain specific question that is necessarily outside the scope of the course. Having laid out my necessary caveat emptor, we can safely dig into the abstract convex geometry of duality theory.

1

Today’s blog only talks about dead dudes named George.

2

George Dantzig would later lead a team to solve Stigler’s LP exactly, finding that the optimal solution was $39.69. By hand, Stigler had solved an LP with 86 nonnegative variables and 9 equality constraints and was off by less than 1%. In machine learning today, we consider that solved to optimality.

*This is the live blog of Lecture 6 of my graduate class “Convex Optimization.” A Table of Contents is here.*

For Boyd and Vandenberghe, optimization is a form of declarative programming. We express a mathematical logic for a mapping rather than a procedure. This leads to an idiosyncratic approach to programming, hinging on mathematical modeling and algebraic manipulation.

Let me explain this a bit through a sequence of examples. A canonical first problem introduced in an optimization class is the *diet problem*. The goal is to find a diet that satisfies all of the recommended daily allowances of nutrients from the USDA. I won’t belabor it here, but for any set of foods, the allotments that get you your daily allowances are described by a set of linear inequalities. Hence, to find a potential diet, I can write the optimization problem

```
minimize 0
subject to the diet hits my daily allowances
```

There will be an infinite number of diets that meet your daily allowances, but if you pass this program to a solver, it will return a single diet, not the set of all diets that match the constraints. You have to understand the guts of the solver to know which one you will find. But maybe you don’t care and just want the solver to give you an answer.

If you don’t want to hand this ambiguity to the solver, you can narrow down the options by specifying an objective to rank all of your diets. Following the canonical example, you pick the diet of minimal cost. Specifying the prices of food and your desire to be cheap *declares* to the solver *which* diet you want it to compute for you.

In the idealized diet example, a modeler comes to the table with a clean mathematical model of their constraints and a clean argument about preferred solutions. The solver is then nothing more than a sorting machine. It effectively lists all possible solutions and extracts the minimum along some axis.

But optimization is far more powerful than this. It’s possible to write problems as optimizations where the feasible set has no apparent connection to the problem you’re trying to solve. Here’s the canonical *weird* example. Let’s say I have a list of roads, the points they connect, and their lengths, and I want to find the shortest route from A to B. I can do this with the following model. I’ll make a vector x where each component is an endpoint in my grid of roads. I’ll make an array of distances len, where len[u,v] is equal to the length of the road if there’s a road between u and v and equal to infinity otherwise. Then, I can solve the optimization problem:

```
maximize x[B]
subject to x[A] = 0
and x[u] - x[v] <= len[u,v]
```

Feeding this problem into a linear programming solver will return some vector. ~~The nonzero indices in this vector correspond to endpoints in the shortest path. ~~This vector doesn’t even give you the shortest path! While the B component will hold the *length* of the shortest path from A to B, you need to access something called the *dual solution* to retrieve the shortest path (Thanks to Matt Hoffman for spotting the error which I’ve left here in strikethrough).

This formulation is decidedly different from the diet problem. Your average modeler would never have figured out how to write the problem this way. A feasible solution to this problem doesn’t necessarily have anything to do with a valid path from A to B. For example, the vector of all zeros is feasible. The only interesting feasible point here is the optimal solution. And yet, once we’ve written the problem this way, we can feed an array of road lengths into a linear programming solver. It will compute the right solution. We have specified a function that maps distances to paths by specifying an optimization problem.
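Here is a sketch of that formulation on a made-up four-node road network, using scipy’s `linprog`. One modeling assumption I’ve added: since the roads are undirected, each road contributes the length constraint in both directions.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical road network: 4 endpoints, undirected roads (u, v, length).
roads = [(0, 1, 1.0), (1, 3, 2.0), (0, 2, 4.0), (2, 3, 1.0)]
A_node, B_node, n = 0, 3, 4

# maximize x[B]  <=>  minimize -x[B]
c = np.zeros(n)
c[B_node] = -1.0

# For each undirected road, impose x[u]-x[v] <= len and x[v]-x[u] <= len.
A_ub, b_ub = [], []
for u, v, length in roads:
    row = np.zeros(n)
    row[u], row[v] = 1.0, -1.0
    A_ub += [row, -row]
    b_ub += [length, length]

# Pin the start: x[A] = 0; all other potentials unconstrained in sign.
A_eq = np.zeros((1, n))
A_eq[0, A_node] = 1.0

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=[0.0],
              bounds=[(None, None)] * n)
print(res.x[B_node])  # length of the shortest path from A to B: 3.0
```

The short route here is 0 → 1 → 3 with total length 3, and that number pops out as the optimal value of x[B], exactly as described above: the length, not the path.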

Optimization gives us a rich and versatile modeling language for specifying functional mappings. We’ll see countless other examples of this in the class. In machine learning, the input is a collection of patterns, and the output is a prediction function. In portfolio optimization, the input is a model of asset prices and their correlations, and the output is an investment strategy. By writing functions this way, we can avoid thinking about algorithmic particulars and worrying about proving correctness. If our model is correct, then our function is correct. This simplifies the problem of verification. It’s like SQL but for policy.

Now, I’m belaboring these modeling choices today because the goal of Boyd and Vandenberghe is to give you as wide a modeling toolkit as possible. How can you turn an arbitrary problem into a specification of convex constraints and costs? Much of Chapter 4 focuses on how you can manipulate models to turn them into convex problems. One of my favorite lines from the chapter is:

“We call two problems equivalent if from a solution of one, a solution of the other is readily found, and vice versa. (It is possible, but complicated, to give a formal definition of equivalence.)”

This definition of equivalence is impossibly broad. It tells us to devote time to worrying about equivalence of formulations, operations that preserve optimality, and tricks like Ford’s formulation of shortest paths. If you can screw your head around enough algebra, a variety of complicated problems can be specified as an optimization pipeline: ETL fed into a convex solver followed by simple postprocessing. The goal of the class is to sketch out just how much you can do with that deceptively simple pipeline.

*This is the second part of the live blog of Lecture 5 of my graduate class “Convex Optimization.” A Table of Contents is here.*

I had slated a bit of time at the beginning of Thursday’s class to work through a few examples of how to prove functions were convex. But this ended up consuming the entire lecture. I always forget that every first-year grad class needs to fit in at least one stealth tutorial on linear algebra.

It was true for my cohort, and it’s still true now: everyone who comes to engineering research needs to take a class to understand how engineers think about linear algebra. For me, this was Detection and Estimation with Greg Wornell. For you, maybe it’s Convex Optimization with me. Regardless, it takes a lot of practice to get a feel for applying linear algebra. It requires a sophistication with algebra, geometry, and analysis, and sometimes it takes doing it a dozen times before it really sinks in.

Some facts are underappreciated and taxing. Linear algebra is noncommutative (AB is usually not equal to BA), so a lot of the intuitions you get from solving polynomial equations in high school go out the window. None of the formulas simplify the way they should. Linear equations like AX + XA^{T} = Q don’t have clean, interpretable solutions. Eigenvalues are weird, and you need to see a dozen examples of their applications before you appreciate why professors are so obsessed with them. Singular values are more important than eigenvalues, yet almost no one really learns what a singular value means in college. You can sort of do algebra on matrices. A^{2} makes sense. For positive definite matrices, A^{1/2}, the square root of A, is well-defined. However, the properties of these algebraic expressions don’t match our intuition from high school algebra.

In addition to Boyd and Vandenberghe’s books, one of the most magical PDFs I received in graduate school was a set of notes on linear algebra assembled by Tom Minka in 2000 (thanks to Ali Rahimi for sharing these magic tricks). Tom put together a bunch of facts that he thought were useful for statistics. The formulas contained therein are so esoteric and confusing that I still have to open this PDF on a regular basis. Let me give a crazy example.

The determinant is one of the more esoteric formulas we encounter in linear algebra. For a positive definite matrix, the determinant is just the product of the eigenvalues of the matrix. The determinant can be used to calculate ellipsoidal volumes, the entropy of probability distributions, and designs for experiments. Though a useful modeling tool, it’s also not a function we’d ever want to compute by hand for matrices with more than three rows and columns.

It turns out that the determinant is log-concave on the cone of positive definite matrices. That’s a wild fact. *Why* is it log-concave? Boyd and Vandenberghe have a short proof that uses the fact that convex functions are convex if and only if they are convex when restricted to lines. But the steps in the proof are pretty confusing if you don’t have linear algebra tricks at your fingertips. It definitely led to a lot of confusion in class yesterday! We needed to know that the determinant is a multiplicative homomorphism (det(AB) = det(A)det(B)), that every positive definite matrix has a square root, and that adding a multiple of the identity to a symmetric matrix adds that multiple to eigenvalues. For a first-year grad student, this is a wildly disparate set of facts. Linking the necessary linear algebraic patterns together takes a course or two to get straight.

Another way to prove the log-concavity of the determinant is to compute the Hessian. But this approach also takes us in weird directions! The Hessian has to be a matrix, but the input to the function is an n x n matrix. That means the Hessian is an n^{2} x n^{2} matrix. Tom computes this in equation 107 in his note:

∇^{2} log det(X) = -X^{-1} ⊗ X^{-1}

A tensor product? Yikes. There is something perpetually unintuitive about derivatives of matrix objects. I suppose this sort of feels right because the second derivative of the logarithm is -1/x^2. However, I’m guessing almost no one sees Kronecker products in undergraduate linear algebra.

If you believe the formula, this proves that the log determinant is concave. Since X is positive definite, its inverse is positive definite. The tensor product of two positive definite matrices is also positive definite. This then means the Hessian is negative definite, and the function is hence proven concave.
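If you’d rather trust a computer than a tensor product, here is a quick numerical sanity check (not a proof): sample random positive definite pairs and verify that log det satisfies the midpoint concavity inequality along the segment between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

def random_pd(n):
    """A comfortably positive definite random matrix."""
    G = rng.standard_normal((n, n))
    return G @ G.T + n * np.eye(n)

def logdet(M):
    # slogdet returns (sign, log|det|); for PD matrices the sign is +1.
    return np.linalg.slogdet(M)[1]

# Concavity means midpoints lie above chords:
# logdet((X+Y)/2) >= (logdet(X) + logdet(Y)) / 2.
for _ in range(100):
    X, Y = random_pd(n), random_pd(n)
    mid = logdet(0.5 * (X + Y))
    chord = 0.5 * (logdet(X) + logdet(Y))
    assert mid >= chord - 1e-10
print("midpoint concavity held on 100 random PD pairs")
```

This checks only finitely many random pairs, so it can refute concavity but never establish it; still, it’s a useful gut check before wading into Kronecker products.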

If that discussion doesn’t feel satisfying, don’t fret too much. I was a math major in college, and I came into graduate school knowing what tensor products were, what eigenvalues were, and what determinants were. Nonetheless, it still took me a couple of years of work to internalize computational linear algebra. There’s an underappreciated art to it that you learn the more you do optimization, statistics, or signal processing.

Now, a question in our age of LLMs is: do you need to know this? If you are going to do innovative research, the answer is decidedly yes. Deep neural networks have nothing to do with the brain. They have everything to do with pushing around derivatives of functions of tensors. If you want to understand what neural nets do and how to make them better, you need intuition for the matrix analysis under the hood.

The fun part of neural networks is the promise that no one has to understand linear algebra. You just have to be able to call an automatic differentiator and tweak some existing model, and you get a conference paper or push a feature to production. Automatic differentiation tools exist so you never have to compute derivatives anymore. Maybe you don’t need to think about eigenvalues and determinants and such. Maybe you can just pytorch pipette your way to papers. That’s possible. But if we choose this path, what do we lose?

*This is the live blog of Lecture 5 of my graduate class “Convex Optimization.” A Table of Contents is here.*

One of the main points of emphasis of this course is being able to *see* convexity in nonconvexity. Some modeling problems that feel very nonconvex can be massaged into a list of convex constraints and costs that are potentially solvable.

One of the simplest sets of nearby models are functions that are convex after applying an invertible function. Let’s say we have some invertible, increasing function h and consider the family of functions where h(f) is convex or concave. For lack of a better naming convention, we could call such a function h-convex or h-concave. Any such function should be easy to optimize because all you’d have to do is apply h and then perform local search. You can also use h-convex functions in constraints. If f is h-convex, the set f(x)≤B is convex because it is equal to the set h(f(x))≤h(B), and h preserves the inequality because it is increasing. Hence if you end up with an h-convex or h-concave function in your modeling problem, don’t fear.

The most common instance of this sort of thing is log-concavity. A function is log-concave if its logarithm is concave. Log-concavity gets its own special name because of its ubiquity in probability and statistics. Almost all of the probability distributions we learn in statistics are log-concave. The normal distribution, the exponential distribution, the uniform distribution, and the logistic distribution are all log-concave. For most parameter regimes of interest, the Wishart distribution, the Dirichlet distribution, the gamma distribution, the chi-square distribution, the beta distribution, and the Weibull distribution are also log-concave.

Finding the maximum of the density of a log-concave distribution can be done by local search. For many such models, maximum likelihood estimation is similarly a convex optimization problem (statisticians *love* maximum likelihood estimation). Another surprising feature of the log-concave distributions is that they are closed under statistical operations. A sum of independent log-concave random variables is log-concave. Moreover, the marginal distribution of log-concave distributions is also log-concave.1
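As a quick numerical illustration of what log-concavity looks like in samples, here is a check that the standard normal and logistic log-densities have nonpositive second differences on a grid. The formulas below are the standard densities written up to additive constants, which don’t affect concavity.

```python
import numpy as np

# Sample each log-density on a fine grid and check that the discrete
# second differences are nonpositive, i.e., the log-density is concave.
x = np.linspace(-8, 8, 2001)

log_normal = -0.5 * x**2                        # N(0,1), up to a constant
log_logistic = -x - 2 * np.log1p(np.exp(-x))    # logistic density, up to a constant

for logp in (log_normal, log_logistic):
    second_diff = np.diff(logp, n=2)
    assert np.all(second_diff <= 1e-9)
print("both log-densities have nonpositive second differences")
```

Running the same check on a non-log-concave density, like a Cauchy, would fail out in the tails.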

These computational facts explain more than anything else why we use these distributions. We know almost nothing is really normally distributed, but it’s so convenient to use the normal distribution that we throw up our hands and force ourselves to see the world as if it were normal. Hidden convexity explains many of our modeling habits.

Beyond log-concave functions, there are log-convex functions, exp-convex functions, exp-concave functions, and so on. You can and should add all of these to your modeling toolbox.

Beyond the simple notion of h-convexity, we can take a further step into the weird by looking at *quasiconvex* functions. Quasiconvexity generalizes the idea of *unimodality*. If a function is nonincreasing up to some point and then nondecreasing after that, it should be easy to optimize:

Such a function is called unimodal. One key feature of this function is all of the sublevel sets (the set of all x whose function values are less than B) are intervals.

Quasiconvexity generalizes this phenomenon to higher dimensions. A quasiconvex function is one whose sublevel sets are convex. In other words, constraints of the form f(x)≤B define convex sets for any value of B. This means that they can perhaps be used as constraints in convex optimization problems. The only tricky part is deriving algorithms to handle such constraints. We’ll see some ways of handling this later in the semester.

Unimodal functions can be optimized using a bracketing search that successively narrows down the interval where an extreme point occurs. In other words, these are functions that can be minimized using

`scipy.optimize.minimize_scalar()`

In higher dimensions, quasiconvex functions can often also be minimized by a generalization of this algorithm. If you can check if there is a point less than a prescribed value, then you can globally optimize a quasiconvex function.
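For example, `minimize_scalar` happily finds the minimum of a unimodal function that is not convex. The function below is a made-up example whose minimum sits at x = 2.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Unimodal but nonconvex: decreasing up to x = 2, increasing after.
f = lambda x: -np.exp(-(x - 2.0) ** 2)

# Bracketing search on a bounded interval containing the minimizer.
res = minimize_scalar(f, bounds=(0.0, 5.0), method="bounded")
print(res.x)  # close to 2
```

No gradient, no convexity certificate, just successive narrowing of the interval that must contain the minimizer.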

Quasiconvex functions are pretty weird, though. The function f(x,y) = xy is quasiconcave on the positive orthant (so its negation is quasiconvex there). Getting weirder, the ceiling function that returns the smallest integer greater than or equal to a point is quasiconvex:

Even weirder, the function that counts the number of nonzero elements in a vector is quasiconcave on the nonnegative orthant. Quasiconvexity is, in other words, probably *too* general. We’ll see examples of solvable problems in the class, but these tend to be special cases. Nonetheless, a core part of this course is identifying the nonconvex problems that we can solve with convex optimization. Many quasiconvex problems fall into this bucket.

1

Both of these facts require some annoying analytic manipulation of integrals. See this paper if you’re interested in the gory details.

*This is the live blog of Lecture 4 of my graduate class “Convex Optimization.” A Table of Contents is here.*

Though I made a big deal last week about how we could reduce all optimization problems to minimizing a linear cost subject to constraints, most people think of cost functions as the central part of optimization. Even though it’s easier to build models by specifying constraints, the act of optimization is easier to conceptualize using functions. Optimization implies a notion of pricing and the existence of a solution of minimal price. Moreover, if I can formulate an optimization problem as finding some point that achieves minimum cost, I can search for such a point by finding clever ways to reduce the cost at each step.

Since Newton, mathematicians have realized that dynamic search methods can find solutions of minimum value. The most popular search method, usually attributed to Cauchy, is the method of steepest descent. This very greedy method tells you to find the direction that maximally decreases your cost and follow that direction.

Steepest descent is something you learn in calculus. The steepest descent direction at any point is in the direction of the negative gradient. If the gradient is zero, there is no local information about how to make improvements. However, there are weird functions even in 1D, like *x*^{3}, where the gradient equals zero at the point *x*=0, but any finite step in the negative direction reduces the cost. Convex functions are the ones for which Cauchy’s method always finds the minimum cost solution. And the proof almost follows from the definition.

A function is convex if and only if the plane tangent to the graph of the function lies below the function at every point:


Writing this in math says that for all x and all v,

f(x) ≥ f(v) + ∇f(v)^T (x - v)

With v fixed, the right-hand side is the tangent plane at v, and the inequality says the graph of f sits above it.

So if the gradient is equal to zero at x0, that means f(x)>= f(x0) for all x. If you get your definitions right, all theorems are trivial.

Last week, we needed to talk about feasible cones and separating hyperplanes to test optimality. But the condition here for convex functions is so much simpler. A point is a global minimizer of a cost function if the gradient of the function at that point is equal to zero.

If you can model your cost with convex functions, a simple algorithm always finds the optimal solution. The only question that remains is whether you can model your problem with convex costs. Just like last week, the answer will be to have a brute force definition that is sometimes easy to check and a bunch of rules for building complex convex functions from simple ones.

For this modeling program, we should start with the more common definition of convex functions: a function is convex if, for any two points, the line segment connecting their function values lies above the graph of the function. That is, a convex function is one where the area above the graph is a convex set. Here’s the simplest picture.

Convex functions are all boring like this. Their boringness is their appeal.

Even in high dimensions, convex functions are roughly valley-shaped, perhaps with some sharper edges. But the functions you can prove convex are wild. Feel free to skim these examples, but here are some of the more exotic ones I’ve seen (oddly, all of them introduced to me by Joel Tropp).

If H is Hermitian and X is positive definite, f(X) = -trace(exp(H + log(X))) is convex. Here exp and log are the *matrix* exponential and logarithm.

If X and Z are positive definite, f(X,Z) = trace( X (log X - log Z) - (X - Z)) is convex.

If f is any complex differentiable function on the strip of complex numbers with real parts between *a* and *b*, then g(x) = sup_{y} | f(x+iy) | is convex on the interval [a,b].

Wut. Verifying the convexity of these particular examples is a real pain. The first one even has a name associated with it (Lieb’s Theorem). But for those of us who just want to model, it’s better to take the building block approach, again looking at operations that preserve convexity and using these to build up examples.

**Nonnegative Combinations**. Any linear combination of convex functions with nonnegative coefficients is convex. You can prove this using the fact that inequalities are preserved under nonnegative combinations. This means that integrals of convex functions are convex, as are expected values.

**Composition with Affine Functions.** If f(x) is convex, A is a matrix, and b is a vector, then g(x) = f(Ax+b) is convex.

**Maximization.** The pointwise maximum of several convex functions is itself a convex function. In mathematics: if f1, f2, …, fk are convex, then g(x) = max_{i} f_{i}(x) is convex.

**Partial minimization.** If f(x,z) is jointly convex, then inf_z f(x,z) is convex in x.

**Composition with scalar functions.** If f is convex, and g is convex and nondecreasing, then h(x) = g(f(x)) is convex.

With these simple rules, you can build up a giant library of convex functions. You could argue the only ones you really need are affine functions a^{T} x + b. If you just think about this geometrically, a convex function is equal to the pointwise maximum of all of the affine functions that lie below its graph. This is the same as saying that a convex set is the intersection of all halfspaces that contain it.
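That geometric claim is easy to see numerically. A sketch: approximate x^2 by the pointwise max of the tangent lines at 25 anchor points (the anchors and the evaluation grid are arbitrary choices for illustration).

```python
import numpy as np

# Tangent line of f(x) = x**2 at anchor a:  a**2 + 2*a*(x - a).
anchors = np.linspace(-3, 3, 25)
x = np.linspace(-3, 3, 601)

# One column per anchor; broadcasting builds all tangent lines at once.
tangents = anchors**2 + 2 * anchors * (x[:, None] - anchors)
approx = tangents.max(axis=1)  # pointwise max of the affine underestimators

gap = np.max(x**2 - approx)
print(f"largest gap between x^2 and the max of 25 tangents: {gap:.4f}")
```

Every tangent lies below the parabola, so the max never overshoots, and the gap shrinks like the square of the anchor spacing as you add more tangents.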

But such sophisticated mathematical abstractions aren’t necessary for modeling. Instead, we just need the composition rules. We can get some outlandish functions with the building blocks alone:

- The sum of the squared residual errors in a linear model
- The sum of the k largest components of a vector
- The negative geometric mean of a bunch of concave functions
- The Kullback-Leibler divergence between two probability distributions
- The maximum eigenvalue of a positive semidefinite matrix

All of these functions are convex and can be proven so using the simple composition rules. And that means that they can be minimized by local search. Disciplined modeling guarantees efficient ways to find global minimizers. The only question (and it’s a hard one) is whether you can find a good model for your problem only using these rules.
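For instance, the sum of the k largest components is the pointwise maximum, over all index sets of size k, of a linear function, so the maximization rule makes it convex. Here is a quick spot check of midpoint convexity on random data (a sanity test, not a proof).

```python
import numpy as np

rng = np.random.default_rng(0)

def sum_k_largest(x, k=3):
    """Sum of the k largest components: max over index sets of a linear map."""
    return np.sort(x)[-k:].sum()

# Midpoint convexity: f((a+b)/2) <= (f(a) + f(b)) / 2 on random pairs.
for _ in range(1000):
    a, b = rng.standard_normal(8), rng.standard_normal(8)
    lhs = sum_k_largest(0.5 * (a + b))
    rhs = 0.5 * (sum_k_largest(a) + sum_k_largest(b))
    assert lhs <= rhs + 1e-12
print("midpoint convexity held on 1000 random pairs")
```

Random testing can only falsify convexity, never certify it, which is exactly why the composition rules earn their keep.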

*This is the second part of the live blog of Lecture 3 of my graduate class “Convex Optimization.” A Table of Contents is here.*

In the last lecture, I argued we could verify a solution to a convex optimization problem was optimal by showing that two sets were separated by a hyperplane. If this is your first time seeing it, it’s a weird way of thinking about proof. We take something that feels like we need epsilons, deltas, and calculus and transform it into a question of geometry.

Take two disjoint convex sets, *C* and *D*. A separating hyperplane is one where *C* is contained on one side of the hyperplane and *D* is contained on the other. In algebraic terms, *C* and *D* are separated by a hyperplane if there is an affine function *h*(x) that is nonpositive on *C* and nonnegative on *D*.

Suppose there are two points, *c* from *C* and *d* from *D*, whose distance is equal to the distance between *C* and *D*: *c* is the closest point in the set *C* to the set *D* and vice versa. Take the line segment that connects *c* and *d*. Define the hyperplane that cuts through the midpoint of that segment and whose normal vector is the direction from *c* to *d*. Then this hyperplane separates *C* and *D*. Proof by picture:

One halfspace contains *C* and the other *D.* If you don’t like proofs by picture, you can verify this with a few lines of algebra (see Section 2.5.1 in BV).
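If you'd rather check with a computer than with algebra, here's a minimal numerical sketch using two disks of my own choosing: build the midpoint hyperplane from the closest points and verify its sign on samples from each set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two disjoint disks: C centered at (0,0), D centered at (4,0), both radius 1.
cC, cD, r = np.array([0.0, 0.0]), np.array([4.0, 0.0]), 1.0

# The closest points lie along the segment between the centers.
direction = (cD - cC) / np.linalg.norm(cD - cC)
c = cC + r * direction            # closest point of C to D
d = cD - r * direction            # closest point of D to C
midpoint = (c + d) / 2
normal = d - c                    # normal pointing from C toward D

h = lambda x: normal @ (x - midpoint)   # affine function defining the hyperplane

# Sample points from each disk and check the sign of h.
for _ in range(1000):
    p = rng.normal(size=2)
    p = p / np.linalg.norm(p) * rng.uniform(0, r)  # random point in a unit disk
    assert h(cC + p) <= 0         # h is nonpositive on C
    assert h(cD + p) >= 0         # h is nonnegative on D
```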

It came up in class that this picture looks like the one shown in machine learning classes when introducing maximum margin classification. If you are into the machine learning terminology, *c* and *d* are support vectors, and the distance between them is twice the margin between the sets *C* and *D*. This isn’t a coincidence: the connections between convex programming and pattern classification have been around since the original heyday of the Perceptron. In 1961, Bill Highleyman noted that two sets were linearly separable in the sense of Rosenblatt if their convex hulls were disjoint. In 1965, Olvi Mangasarian then realized that pattern classification could be solved by linear programming, inventing what would be renamed the support vector machine in the 1990s.

In the blasphemous language of machine learning, in order to have a strict and robust separation of two convex sets, you need to have a good margin. But what about when you don’t? The reason I defined separating in terms of “nonpositive” and “nonnegative” rather than “negative” and “positive” is that there are disjoint sets where there is no hyperplane achieving strict positivity and negativity. Take this example (which I couldn’t come up with in my head in class yesterday, but Dan Glaser and Aathreya Kadambi suggested it after class):

If you prefer equations to pictures, the two sets here are *C* = {(*x*, *y*) : *x* ≤ 0} and *D* = {(*x*, *y*) : *x* > 0, *y* ≥ 1/*x*}.

They are certainly disjoint, but there is no margin between the sets. The only separating hyperplane is the one where *x*=0. But this does not strictly separate the sets. The affine function is equal to zero for some points in *C*.
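Numerically, the lack of margin shows up as a gap between the sets that shrinks to zero. A quick check, taking as a concrete example the sets *C* = {(x, y) : x ≤ 0} and *D* = {(x, y) : x > 0, y ≥ 1/x}:

```python
import numpy as np

# Points (1/n, n) lie in D = {(x, y) : x > 0, y >= 1/x}; their distance to
# C = {(x, y) : x <= 0} is exactly 1/n, which goes to zero: no margin.
for n in [1, 10, 100, 1000]:
    p = np.array([1.0 / n, float(n)])
    assert p[0] > 0 and p[1] >= 1.0 / p[0] - 1e-9   # p is in D
    dist_to_C = max(p[0], 0.0)    # distance from p to the halfplane x <= 0
    assert abs(dist_to_C - 1.0 / n) < 1e-12

assert 1.0 / 1000 < 1e-2          # the gap is already below 0.01
```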

One case where there is always a margin is separating a point x from a closed convex set *C* that does not contain it.1 This leads to a wild corollary. I argued in Lecture 1 that a closed convex set is equal to the intersection of all of the halfspaces that contain it. To see that this is true, take any point not in the set. Then, by the separating hyperplane theorem, there is a halfspace containing the convex set that doesn’t contain the point.

Now, why should we care about these separating hyperplane theorems other than because of these cute proofs? Here’s an example that came up in class (note to self: I will learn the name of the person who suggested this on Tuesday for proper attribution), but I had forgotten the rigorous statement:

Does there exist a solution x to the system of equations Ax = b which has all nonnegative entries?

If you have a candidate solution, you can check it by verifying it’s nonnegative and plugging it into the system. How can you check if no solution exists? *We can find a separating hyperplane*.

Let *C* be the set of all nonnegative combinations of the columns of *A*. If you can find a hyperplane separating *C* and the point *b,* then the system doesn’t have a solution. If you work through the algebra, you’ll see that this amounts to finding a vector y such that *A*^{T}*y* ≥ 0 and *b*^{T}*y* < 0. We can verify one polyhedron is empty by finding a single point in another polyhedron.

This particular separation result is called Farkas’ Lemma. It is the first time convex programming duality has reared its head in the class. We can turn problems of minimization into problems of checking emptiness. We can turn problems of checking emptiness into searching for separating hyperplanes. And then we can turn problems of searching for separating hyperplanes into problems of maximization. To every convex optimization problem, there’s a dual convex optimization problem. This is a central theme in convex optimization.
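To make the certificate concrete, here's a small numerical example (the matrix, right-hand side, and certificate are my own): a system Ax = b with x ≥ 0 that is infeasible, together with a Farkas vector y witnessing it.

```python
import numpy as np

# A system Ax = b, x >= 0 with no solution: every entry in the first row of A
# is positive, so the first coordinate of Ax is nonnegative whenever x >= 0,
# but b's first entry is negative.
A = np.array([[1.0, 2.0, 1.0],
              [1.0, 0.0, -1.0]])
b = np.array([-1.0, 0.0])

# Farkas certificate: y with A^T y >= 0 and b^T y < 0 proves infeasibility,
# since x >= 0 and Ax = b would force 0 <= (A^T y)^T x = y^T (Ax) = y^T b < 0.
y = np.array([1.0, 0.0])
assert np.all(A.T @ y >= 0)
assert b @ y < 0
```

The two assertions are exactly the conditions in the text: the hyperplane with normal y contains the cone of nonnegative combinations of A's columns on one side and the point b strictly on the other.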

Duality is why convex optimization is a relatively easy problem. Checking if there exists a point with optimal value less than some number can be done by plugging in a solution. Proving there exists no point with optimal value less than some other number can be done by providing a separating hyperplane. You can verify a hyperplane is valid by plugging it into the dual problem and making sure it’s feasible. This means that verifying upper bounds *and* lower bounds of the objective function is easy.2 This duality puts us one step away from proving we have efficient algorithms.

1

Since *C* is closed, its complement is open, so we can put a ball of some radius *r* around *x* and not intersect *C*. Separating the ball from *C* strictly separates the point from *C.*

2

For the complexity theorists, it means we’re in NP ⋂ coNP.

*This is a live blog of Lecture 3 of my graduate class “Convex Optimization.” A Table of Contents is here.*

A minor quibble with my good friends Stephen and Lieven: I already have a disagreement with the lecture order. Next time I teach this class, I’d put today’s lecture first and Tuesday’s lecture second, because today I want to motivate *why* we care about convex sets.

Convex optimization, roughly speaking, is the set of all problems where naive *local* search always finds *global* solutions. Let’s say we want to minimize some cost function subject to some constraints. We start at a point that satisfies the constraints but maybe doesn’t have an optimal value. We look for a direction that (a) stays inside the set and (b) makes the function we want to minimize smaller. We then move along that direction, making sure we don’t go so far that the constraints are violated.

Repeating this process of direction finding and solution updating, one of two things will happen. We could keep finding directions of improvement forever because our cost function is unbounded below (e.g., minimizing a linear function in one variable) or because the cost keeps improving but we never hit a boundary (e.g., minimizing the exponential function in one variable). If we don’t get stuck in an infinite loop, we will get to a place where we can’t find any reasonable improvement anymore. Does this mean we’re at an optimum?

In general, we have no idea about where this naive local search will take us. A simple picture illustrates the issue:

Here, the shaded region is the feasible set, and I’m assuming we’re just trying to minimize the function *f*(*x*,*y*) = *y*. We just want to find the point closest to the bottom of the graph.1 If we start at the big black dot, there’s no way to make any local progress, but we can see far better solutions to this problem. This is, it turns out, only the tip of the iceberg of the pathologies you encounter in non-convex land.

But now look at this beautiful convex cartoon.

The only place where local search can terminate is precisely at the star. Local search has no choice but to find a global solution. We can prove this rigorously too. Suppose we have another point in the set with a lower y-value. Then along the line connecting the star to this point, the y-value decreases the entire way. Since the set is convex, the line connecting the star and our new candidate solution has to stay in our set. But that means there was a direction from the star that would have lowered our cost. This contradicts the star being a local solution.
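To watch this happen in code, here's a minimal sketch (my own example, step size, and iteration count): one concrete local-search scheme, projected gradient descent, minimizing f(x, y) = y over the unit disk. It has nowhere to stop except the bottom of the disk.

```python
import numpy as np

# Naive local search for min f(x, y) = y over the unit disk:
# take a small step downhill, then project back onto the feasible set.
def project_to_disk(p):
    n = np.linalg.norm(p)
    return p / n if n > 1 else p

p = np.array([0.5, 0.5])          # a feasible starting point
grad = np.array([0.0, 1.0])       # gradient of f(x, y) = y
for _ in range(2000):
    p = project_to_disk(p - 0.1 * grad)

# Local search lands at the global minimizer (0, -1).
assert np.allclose(p, [0.0, -1.0], atol=1e-2)
```

The step-then-project update is one way of implementing "move in an improving direction without leaving the set"; any reasonable variant terminates at the same star.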

The proofs that local solutions are globally optimal for convex optimization are deceptively simple. The minimal assumption you need seems to be convexity. I think of these proofs as *assumed* into the model. Convexity is forced upon us if we insist local search is valid.

One thing to keep in mind is that you can have a ton of globally optimal solutions. Here’s an example where there is an infinite collection of solutions at the bottom.

All of the solutions on the bold line are globally minimal solutions. We can’t guarantee a unique global solution. We can only guarantee that local search through the set terminates at one of the global minimizers.

Now, there’s one extra ingredient that forces our hand. We need to have some way of checking whether or not a direction will stay in the set. At a bare minimum, we must have an efficient way of checking if we’re in the convex set. This motivates what Stephen and Lieven call “*disciplined* convex programming.” The Logo rules in the class will ensure that the sets we construct are easy to search over.

Even more amazingly, the Logo rules of convexity let us *prove* that our solutions are local. If I’ve written some optimization code that is stuck at some solution and can’t find a way to improve, how can I know if it’s my problem or if the code has found a legitimate local solution? Maybe I’m just bad at finding search directions. In general, proving a point is a local minimum is ridiculously hard. I don’t put too much stock into P vs NP mumbo jumbo, but people can construct all sorts of pathological examples where there’s a nice direction downhill, but no one has a practical way of finding it.

In disciplined convex land, there’s often an algorithmic path to *verifying* local (and hence global) optimality. Define two sets: first, let *S_{d}* be the set of all directions that make the cost function smaller. In our cartoons, where the cost is *y*, these are all the directions that point downward. Second, take the set of all directions that keep us inside the feasible set.

One way to prove the sets are disjoint is to find a function *h* that is negative on all of *S_{d}* and nonnegative on the set of feasible directions. That is, we look for a separating hyperplane.

1

By adding one variable to the problem, I can always assume we’re just trying to minimize the linear function cost(x) = x0. Here’s the silly trick: to minimize a function f(x), I can instead minimize t over pairs (x, t) subject to f(x) <= t and x lying in the original constraint set.
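A quick numerical check of this epigraph trick on a toy function of my own choosing: minimizing f directly and minimizing t over the feasible pairs (x, t) give the same value.

```python
import numpy as np

# Epigraph trick: min_x f(x) equals min{ t : f(x) <= t } over pairs (x, t).
f = lambda x: (x - 1.0) ** 2

xs = np.linspace(-3, 3, 601)
ts = np.linspace(0, 20, 2001)

direct = f(xs).min()                  # minimize f over the grid directly

# Feasible pairs: every (x, t) on the grid with f(x) <= t; minimize t over them.
X, T = np.meshgrid(xs, ts)
epigraph_min = T[f(X) <= T].min()

# The two answers agree up to the grid resolution.
assert abs(direct - epigraph_min) < 2e-2
```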

*This is a live blog of Lecture 2 of my graduate class “Convex Optimization.” A Table of Contents is here.*

This first technical lecture is about the Logo of convex sets, a “programming language” for generating all possible convex sets. I’ll tell you three rules that preserve convexity, meaning that if you apply the rules, you get convex sets back. From these three rules, we can construct *all* convex sets from *exactly two* convex sets: a line and a ray. What I appreciate most about Boyd and Vandenberghe’s book is this algorithmic perspective of geometry.

So let’s first start with the definition of what it means to be convex. A set of vectors is convex if it contains all line segments between all pairs of vectors. That’s it. Here is an example of a convex set and a nonconvex set from BV’s book.

Convex polygons are a great example to keep in mind as a fundamental convex set. Another important example is the ellipsoid.

We’ll generate many more as we go on.

Convexity has such a deceptively simple definition. It’s always wild in math how you can take a simple concept and turn it into an expansive research area. In the next lecture, I’ll tell you why it’s the right primitive for algorithmic optimization. You will have to bear with me for some delayed gratification.

Now let me give you some simple rules for building convex sets.

**RULE 1: Concatenation** (also called Cartesian product). Take two convex sets *C1* and *C2* in any two spaces. The concatenation is the set of all pairs from *C1* and *C2* stacked on top of each other. That is, you take an *x* from *C1* and a *z* from *C2* and return the vector [*x*, *z*]. I’ll denote the set of all such concatenations by *C1* x *C2*. If *C1* and *C2* are both convex, then *C1* x *C2* is convex. This is because a line segment in *C1* x *C2* is just a concatenation of line segments in *C1* and *C2*.

**RULE 2: Intersection.** Now take *C1* and *C2* to be two convex sets in the same space. Then their intersection *C1* ∩ *C2* is convex. The intersection is just the set of all points that are in both *C1* and *C2*. If you have two points that are both in *C1* and *C2*, the line segment between them is in *C1*, and it’s also in *C2*. Easy peasy so far.

**RULE 3: Affine images and preimages.** This last rule is the trickiest to verify. You’ll have to break out a sheet of paper to check that I’m telling the truth here. Take a convex set *C*, a matrix *A*, and a vector *b*. Then I can apply *A* and add *b* to every vector in *C*. The resulting set of vectors is an *affine image* of *C*. Any affine image is a convex set. I can also look at the set of all points *x* such that *Ax* + *b* is in *C*. This is the *affine preimage* of *C*. Both the affine image and preimage are convex sets.

I think the first two rules are a bit easier to understand than the last rule, so let me explain a couple of primitive affine transformations. The first one is scaling. If you multiply every element of a convex set by a positive number, you are either enlarging or shrinking the set. A second affine transformation is translation. If you add the same vector to every element in a convex set, you are sliding the set around in space. Affine transformations are a modestly more complex mixture of these two primitives.
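Here's a minimal numerical sketch of Rules 2 and 3 (the sets and the affine map are my own choices): a midpoint check on an intersection of two convex sets, and the linearity identity that makes affine images convex.

```python
import numpy as np

rng = np.random.default_rng(2)

# Membership tests for two convex sets in the plane.
in_disk = lambda p: np.linalg.norm(p) <= 1.0    # the unit disk
in_half = lambda p: p[0] + p[1] <= 0.5          # a halfspace

# RULE 2: the intersection is convex; spot-check that midpoints stay inside.
def sample_intersection():
    while True:  # rejection sampling from the bounding box
        p = rng.uniform(-1, 1, size=2)
        if in_disk(p) and in_half(p):
            return p

for _ in range(200):
    p, q = sample_intersection(), sample_intersection()
    mid = (p + q) / 2
    assert in_disk(mid) and in_half(mid)

# RULE 3: an affine image of the disk (a sheared, shifted ellipse) is convex
# because the map preserves line segments: the midpoint of two images is the
# image of the midpoint, which lies in the disk, hence in the image.
A, b = np.array([[2.0, 1.0], [0.0, 1.0]]), np.array([3.0, -1.0])
u, v = rng.normal(size=2), rng.normal(size=2)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)  # points in the disk
mid_image = ((A @ u + b) + (A @ v + b)) / 2
assert np.allclose(mid_image, A @ ((u + v) / 2) + b)
```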

Now, let me construct all convex sets from a ray and a line using these three rules. I’ll do this in stages. The first step is to introduce a primitive *half space*. Let *H0* be the set of all vectors that are nonnegative in the first entry and arbitrary everywhere else. This set is the concatenation of a ray (the first coordinate) and a bunch of lines (all the other coordinates). *H0* is called a half space because it divides my vectors into two sets of “equal size”: half of my vectors are positive in the first coordinate, half are negative in the first coordinate, and we’re taking the first half.

Now, I can make other half spaces. If I rotate *H0*, I can cut space in half along any direction. Rotation is an affine transformation, so all of these are convex sets. And I can translate a half space around. These are still half spaces. I’m going to assume my space is infinite, so the origin was an arbitrary center of its universe. This funny generative process lets us see what a half space is: take any line in space, pick a point on the line to be the origin, and then take all the vectors on one side of that origin.

Alright, with two of the three rules, we have generated all possible half-spaces. And with the final rule, intersection, we get everything. In two dimensions, if I intersect a bunch of half-spaces, I’ll get convex polygons like the one pictured above. In higher dimensions, we call these objects *polyhedra*. We let polyhedra have infinite extent like in this picture:

Polyhedra are the most important convex sets, and we’ll spend a lot of time on them. We can get even crazier shapes if we let ourselves take an infinite number of intersections. For instance, the unit disk is the intersection of all half spaces translated a unit distance from the origin:

I’ll come back to this in the next lecture, but it’s more or less the case that a set is convex if and only if it’s an intersection of half spaces (though you might need an infinite intersection).
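We can approximate this infinite intersection with finitely many halfspaces and check it numerically (a sketch with 360 directions of my own choosing):

```python
import numpy as np

# The unit disk is the intersection of the halfspaces {x : u . x <= 1}
# over all unit vectors u. Approximate with finitely many directions.
thetas = np.linspace(0, 2 * np.pi, 360, endpoint=False)
U = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # unit normals

in_all_halfspaces = lambda p: np.all(U @ p <= 1.0 + 1e-6)

assert in_all_halfspaces(np.array([0.3, -0.4]))          # inside the disk
assert in_all_halfspaces(np.array([1.0, 0.0]))           # on the boundary
assert not in_all_halfspaces(np.array([1.2, 0.0]))       # outside the disk
```

The maximum of u · p over all unit vectors u is exactly the norm of p, which is why the infinite intersection recovers the disk precisely.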

The important takeaway from this first lecture is that once we have the rules of composition, we never have to write proofs arguing about line segments. Modeling convex sets is then just a matter of seeing what you can create with the primitive operations. And the answer is much vaster than you might first imagine.
