This is the live blog of Lecture 11 of my graduate class “Convex Optimization.” A Table of Contents is here.
The most underappreciated part of Boyd & Vandenberghe is the endless trove of examples. Three chapters of the book are dedicated to modeling examples, and we’re now moving into the part of the course where we apply the theory to modeling.
We start with a topic dear to my heart: linear inverse problems. I’m casting a wide net with this term here.1 Let’s skip the linear part for a second. What do I mean by an inverse problem? Imaging is perhaps the easiest problem to describe. We imagine some 2D or 3D representation of the world we’d like to capture through a measurement device and want to use what we know about our measurements to build the image.
In a CT scan, we send lines of X-rays through your body and then use a computer to decode an image of your insides. MRI is similar, except it uses complex measurements of the magnetic fields induced by molecules in your body. Neither of these modalities takes “images” in the way we think a camera does, yet we can use algorithms to decode what’s inside. These are canonical inverse problems. We measure the state of the world through some process, and we’d like to determine the state from our measurements.
Even cameras solve inverse problems these days. Algorithms in your phone can take multiple blurry snapshots and use algorithms to make them into a deblurred image. Similarly, multiple images can be combined to yield high dynamic range when the lighting is unfavorable.
All of these modalities combine a forward model, which simulates the physics of what happens between the world and the imaging sensor, and an image model, which builds in what we think should be true about our final image. Algorithms then combine these together to produce a final reconstruction.2
Surprisingly, all of the examples above can be computed using linear models. The mapping from the world to the measurement is a linear function, insofar as if you “add stuff,” the resulting measurement is their sum. For example, the measured output in a CT scan is the amount of X-rays that made it through the body. This decreases linearly in the amount of absorbing material between the X-ray emitter and detector. An image blur is a linear operation that adds together multiple views of a scene. A linear inverse problem is one where the mapping from reality to our measurement is modelable as a linear transformation.
There are surprisingly many problems that can be posed as linear inverse problems. A few examples include finding a model of a dynamical system that predicts time series data, estimating biomarkers relevant to particular disease outcomes, and finding the delays in GPS signals so that you can triangulate and find your location. It’s popular to use linear maps to model your content preferences based on your past consumption behavior and all of the consumption behavior of everyone else on the internet. I could do a whole blog series on inverse problems and their applications. For this course, we’ll settle with one post.
If we have a linear model, we are saying
measurement = forward model ( image ) + noise
Where the “forward model” function is a linear mapping. Our goal is to figure out what the image is by removing the noise and inverting the forward model. The problem here is that there is an infinite set of images and noises that produce the same measurement. What do you attribute to noise? What do you attribute to signal? The answer is to come up with some cost function that balances attribution. The general linear inverse problem is then given by an optimization problem of the form
minimize (1-h) * implausibility(noise) + h * implausibility(image)
subject to measurement = forward model (image) + noise
where h is a constant between 0 and 1. The implausibility functions are different for the noise and the image and are part of the modeling process. You want to pick functions that are easy to optimize (hint: convex) and large for implausible realizations of the signals you care about. If you were Bayesian, you might call these priors about the signals in your problem. But prior is a loaded word in Bayesian land, implying things about probabilities and whatnot. Today, I’m going to stick with calling them implausibility functions to avoid statistical mess.
The most common implausibility functions we use date back to Gauss. One example is the mean-sum-of-squares, which captures some notion of a signal’s variability. Another implausibility penalty would be the sum of absolute deviations. This function is useful when you expect the noise process to be bursty or sparse.
Similarly, different convex implausibility functions encourage particular image structures. Sparse images are encouraged by penalizing the sum of absolute values. Piecewise constant images are encouraged by penalizing the sum of absolute values of the derivative. Smooth images are encouraged by penalizing the sum of the squares of the second derivative.
The modeling choices here are intimidating as it’s not clear what implausibility penalties you should pick. But there’s a rule of thumb that I’ve found helpful in linear inverse problems (and which I may have written a dozen or so papers about). The implausibility function is trying to encourage your algorithm to choose a simple signal. One means of simple is “the signal can be written as a short sum of basic signals.” For example, you might assume your signal is just a spike train, then it can be written as a sum of spikes. Sparse images are a sum of a few basic point sources. A low rank matrix is a sum of a few rank one matrices. In all of these examples, the assumption is that the signal is a sum of a few atoms. If you know the atoms, a reasonable implausibility function penalizes the coefficients needed to write a signal as a sum of atoms. In math, for those who like equations:
When the atoms are unit vectors, this is the Euclidean norm (root mean squared error). When the atoms are standard basis vectors, this is the sum of absolute values (l1-norm). There are tons of other options. No matter which atomic set you choose, this implausibility function is convex. The general linear inverse problem is thus posed as a convex optimization problem. This gives you a general cookbook for posing and solving inverse problems.3
Now what about that constant h? It is called a hyperparameter and is part of your convex model. Sometimes you know what that parameter should be, but other times you need to figure it out experimentally. But remember, this whole model is made up, so don’t fret if there are parts you can’t specify by pure reason alone.
I wish instead of having to use the dumb terms “machine learning” or “artificial intelligence,” we could say, “I work on inverse problems.” If only that carried the same cachet. Maybe I’ll work on improving our field’s marketing.
Today, I’m going to skip over the philosophy of why we’ve decided to trust these algorithmic reconstructions of reality. Why do we believe that what we see in an MRI image is really what’s inside a body? A fun post for another day!
If you want to read more about atomic decompositions and inverse problems, one of my favorite papers on my CV, The Convex Geometry of Linear Inverse Problems, with Chandrasekaran, Parrilo, and Willsky, describes this general view and its implications.
Thank you for the great blog series! Regarding footnote 2, "Today, I’m going to skip over the philosophy of why we’ve decided to trust these algorithmic reconstructions of reality.", do you have any pointers for reading material on this/what terms I should search for?
By constant c in the last paragraph you're referring to the noise-image tradeoff term, which is different from the c_i coefficients in the equation right?