This is the live blog of Lecture 5 of my graduate class “Convex Optimization.” A Table of Contents is here.
One of the main points of emphasis of this course is being able to see convexity in nonconvexity. Some modeling problems that feel very nonconvex can be massaged into a list of convex constraints and costs, yielding problems we can potentially solve.
One of the simplest families of nearby models is functions that are convex after applying an invertible function. Let's say we have some invertible function h and consider the family of functions where h(f) is convex or concave. For lack of a better naming convention, we could call such a function h-convex or h-concave. Any such function should be easy to optimize because all you'd have to do is apply h and then perform local search. You can also use h-convex functions in constraints. If f is h-convex and h is increasing, the set f(x)≤B is convex because it is equal to the set h(f(x))≤h(B). Hence if you end up with an h-convex or h-concave function in your modeling problem, don't fear.
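Here's a minimal sketch of this recipe, using a made-up log-convex example (not one from lecture):

```python
# Sketch of the "apply h, then do local search" recipe on a made-up log-convex
# function: f(x) = exp(x^2 - 2x). With h = log, h(f(x)) = x^2 - 2x is convex,
# so a generic local method on the transformed function finds the global minimizer of f.
import numpy as np
from scipy.optimize import minimize

log_f = lambda x: x[0] ** 2 - 2 * x[0]           # h(f(x)) with h = log
result = minimize(log_f, x0=np.array([5.0]))     # local search on the convex surrogate
print(result.x)                                  # approximately [1.], which also minimizes f
```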
The most common instance of this sort of thing is log-concavity. A function is log-concave if its logarithm is concave. Log-concavity gets its own special name because of its ubiquity in probability and statistics. Almost all of the probability distributions we learn in statistics are log-concave. The normal distribution, the exponential distribution, the uniform distribution, and the logistic distribution are all log-concave. For most parameter regimes of interest, the Wishart distribution, the Dirichlet distribution, the gamma distribution, the chi-square distribution, the beta distribution, and the Weibull distribution are also log-concave.
Finding the maximum of the density of a log-concave distribution can be done by local search. For many such models, maximum likelihood estimation is similarly a convex optimization problem (statisticians love maximum likelihood estimation). Another surprising feature of log-concave distributions is that they are closed under common statistical operations. The density of a sum of independent log-concave random variables is log-concave. Moreover, the marginals of a log-concave joint distribution are also log-concave.1
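For instance, here's a small sketch of maximum likelihood as a convex problem, using simulated data from an exponential model (a made-up example, not one from the lecture):

```python
# Sketch: maximum likelihood in an exponential model as an unconstrained convex problem.
# With the rate written as lam = exp(theta), the negative log-likelihood
#   -n*theta + exp(theta)*sum(x)
# is convex in theta, so local search finds the global MLE lam = 1/mean(x).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)         # simulated data; true rate is 0.5

def neg_log_lik(theta):
    return -len(x) * theta[0] + np.exp(theta[0]) * x.sum()

result = minimize(neg_log_lik, x0=np.array([0.0]))
print(np.exp(result.x[0]), 1.0 / x.mean())       # both approximately 0.5
```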
These computational facts explain more than anything else why we use these distributions. We know almost nothing is really normally distributed, but it’s so convenient to use the normal distribution that we throw up our hands and force ourselves to see the world as if it were normal. Hidden convexity explains many of our modeling habits.
Beyond log-concave functions, there are log-convex functions, exp-convex functions, exp-concave functions, and so on. You can and should add all of these to your modeling toolbox.
Beyond the simple notion of h-convexity, we can take a further step into the weird by looking at quasiconvex functions. Quasiconvexity generalizes the idea of unimodality. If a function is nonincreasing up to some point and then nondecreasing after that, it should be easy to optimize:
Such a function is called unimodal. One key feature of such a function is that all of its sublevel sets (the set of all x whose function value is at most B) are intervals.
Quasiconvexity generalizes this phenomenon to higher dimensions. A quasiconvex function is one whose sublevel sets are convex. In other words, constraints of the form f(x)≤B define convex sets for any value of B. This means that they can perhaps be used as constraints in convex optimization problems. The only tricky part is deriving algorithms to handle such constraints. We’ll see some ways of handling this later in the semester.
Unimodal functions can be optimized using a bracketing search that successively narrows down the interval where the minimizer lies. In other words, these are functions that can be minimized using
scipy.optimize.minimize_scalar()
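For example, here's a minimal sketch on a made-up unimodal (but nonconvex) function:

```python
# Sketch: minimizing a unimodal but nonconvex function with a bracketing search.
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: np.sqrt(np.abs(x - 2.0))               # unimodal, not convex
result = minimize_scalar(f, bracket=(-10.0, 0.0, 10.0))
print(result.x)                                      # approximately 2.0
```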
In higher dimensions, quasiconvex functions can often also be minimized by a generalization of this algorithm. If you can check whether any point achieves a function value at most a prescribed level, then you can globally optimize a quasiconvex function by bisecting on that level.
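Here's a rough sketch of that bisection idea in one dimension, with a crude grid search standing in for the convex feasibility solve at each level:

```python
# Sketch of quasiconvex minimization by bisection on the objective value.
# At each step, ask: is there a point with f(x) <= t? If yes, lower t; if no, raise t.
# The feasibility check here is a crude grid search standing in for a convex feasibility solver.
import numpy as np

f = lambda x: np.sqrt(np.abs(x - 0.7))       # quasiconvex (unimodal), not convex

grid = np.linspace(-5.0, 5.0, 10001)

def sublevel_nonempty(t):
    return bool(np.any(f(grid) <= t))        # stand-in for a convex feasibility problem

lo, hi = 0.0, f(5.0)                         # lower and upper bounds on the optimal value
for _ in range(40):
    mid = 0.5 * (lo + hi)
    if sublevel_nonempty(mid):
        hi = mid                             # some x achieves f(x) <= mid
    else:
        lo = mid
print(hi)                                    # approximately 0, the optimal value
```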
Quasiconvex functions, and their mirror images, quasiconcave functions (whose superlevel sets are convex), are pretty weird, though. The function f(x,y) = xy is quasiconcave on the positive orthant, even though it is neither convex nor concave. Getting weirder, the ceiling function that returns the smallest integer greater than or equal to a point is quasiconvex:
Even weirder, the function that counts the number of nonzero elements in a vector is quasiconcave on the nonnegative orthant. Quasiconvexity is, in other words, probably too general. We'll see examples of solvable problems in the class, but these tend to be special cases. Nonetheless, a core part of this course is identifying the nonconvex problems that we can solve with convex optimization. Many quasiconvex problems fall into this bucket.
1. Both of these facts require some annoying analytic manipulation of integrals. See this paper if you're interested in the gory details.
"For most parameter regimes of interest, the Wishart distribution, the Dirichlet distribution, the gamma distribution, the chi-square distribution, the beta distribution, and the Weibull distribution are also log-concave."
I disagree! These distributions (which are mostly just gamma distributions after one change of variables or another) are most interesting when their shape parameters are less than one. The range of parameters where these distributions are both log-concave and "interesting" (in the sense of being hard to approximate well with a Gaussian) is actually pretty small IMO.
That said, for each of these distributions, there exists a change of variables that makes them both log-concave and unconstrained (a quick numerical check for the Gamma case follows this list), e.g.:
• Gamma: X ~ Gamma(α, β), Y = log(X), p(y) is log-concave
• Beta: X ~ Beta(α, β), Y = logit(X), p(y) is log-concave
• Dirichlet: Basically same as beta, but inverting the multinomial logistic function is annoying so I won't write it here
• Chi-Squared: Special case of gamma
• Weibull: Just a change of variables on the Gumbel, which is log-concave (and also just a change of variables on the Exponential, a special case of Gamma)
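For the Gamma case, here's a quick numerical check (a sketch assuming shape α and rate β, i.e., scale 1/β in scipy's parameterization):

```python
# Sketch: numerically check that the density of Y = log(X), X ~ Gamma(alpha, beta),
# is log-concave even when the shape alpha < 1 (where the Gamma density itself is not).
import numpy as np
from scipy import stats

alpha, beta = 0.3, 2.0                        # shape < 1: Gamma is not log-concave in x
y = np.linspace(-10, 3, 2001)
# density of Y = log X by change of variables: p(y) = f_X(e^y) * e^y
log_p = stats.gamma.logpdf(np.exp(y), a=alpha, scale=1.0 / beta) + y
# log-concavity <=> second differences of log p are nonpositive everywhere
second_diff = np.diff(log_p, 2)
print(second_diff.max() <= 1e-9)              # True: log p(y) = alpha*y - beta*e^y + const
```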
So I would argue that the lack of log-concavity in these distributions arises from looking at them the wrong way. In fact, I'd go further and argue that we don't really know how to construct useful distributions _except_ by changes of variables and compounds (e.g., student-t is just a scale-mixture of normals) applied to a handful of simple, log-concave distributions.
Hey Ben -- long time subscriber, second time commenter here. Having recently been granted the label of "statistician", I can confirm that statisticians *love* maximum likelihood. But what do you do if the negative log-likelihood is non-convex? After all, in this case it may not be computationally feasible to obtain your MLE!
I recently heard Richard Samworth give a talk which I really liked. He and his coauthors proposed the following idea in the context of linear regression: you first use your data to estimate the best convex approximation to the log-likelihood, and then you optimize that best convex approximation to get a point estimate. Pretty cool - figured you might like it too: https://arxiv.org/abs/2403.16688
It feels in line with the spirit of your post (or perhaps the next one) which is: when the objective isn't convex, CONVEXIFY