12 Comments
Matt Hoffman:

"For most parameter regimes of interest, the Wishart distribution, the Dirichlet distribution, the gamma distribution, the chi-square distribution, the beta distribution, and the Weibull distribution are also log-concave."

I disagree! These distributions (which are mostly just gamma distributions after one change of variables or another) are most interesting when their shape parameters are less than one. The range of parameters where these distributions are both log-concave and "interesting" (in the sense of being hard to approximate well with a Gaussian) is actually pretty small IMO.

That said, for each of these distributions, there exists a change of variables that makes them both log-concave and unconstrained, e.g.:

• Gamma: X ~ Gamma(α, β), Y = log(X), p(y) is log-concave

• Beta: X ~ Beta(α, β), Y = logit(X), p(y) is log-concave

• Dirichlet: Basically the same as beta, but inverting the multinomial logistic function is annoying so I won't write it out here

• Chi-Squared: Special case of gamma

• Weibull: Just a change of variables on the Gumbel, which is log-concave (and also just a change of variables on the Exponential, a special case of Gamma)

So I would argue that the lack of log-concavity in these distributions arises from looking at them the wrong way. In fact, I'd go further and argue that we don't really know how to construct useful distributions _except_ by changes of variables and compounds (e.g., student-t is just a scale-mixture of normals) applied to a handful of simple, log-concave distributions.
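
A quick numerical check of the gamma case above might look like the sketch below (assuming shape alpha = 0.3, rate 1, and unnormalized log-densities; the density of Y = log(X) picks up a Jacobian factor, giving log p(y) = alpha*y - exp(y) + const):

```python
import numpy as np

alpha = 0.3                               # shape < 1: the "interesting" regime
x = np.linspace(0.05, 5.0, 400)           # uniform grid in x
y = np.linspace(-3.0, 2.0, 400)           # uniform grid in y = log(x)

log_p_x = (alpha - 1.0) * np.log(x) - x   # unnormalized log-density of X
log_p_y = alpha * y - np.exp(y)           # unnormalized log-density of Y (Jacobian included)

# On a uniform grid, concavity shows up as nonpositive second differences.
print(np.all(np.diff(log_p_x, 2) <= 0))   # False: not log-concave in x when alpha < 1
print(np.all(np.diff(log_p_y, 2) <= 0))   # True: log-concave in y for any alpha > 0
```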

Ben Recht:

100%. This is a much better articulation of what I was trying to say.

Matt Hoffman:

Thanks! This topic is a bit of a hobby horse for me :)

Ben Recht:

What's your intuitive explanation for why log-concave distributions are easy to sample from?

Matt Hoffman:

Well, they're easy to upper-bound with piecewise-log-linear functions, which means rejection sampling is always a decent option.
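
A minimal sketch of that idea (assuming a Gamma(3, 1) target, a fixed set of three tangent points rather than the usual adaptive construction, and unnormalized densities throughout):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 3.0                                  # shape > 1, so the density is log-concave

def logp(x):                                 # unnormalized log-density of Gamma(alpha, 1)
    return (alpha - 1.0) * np.log(x) - x

def dlogp(x):                                # its derivative
    return (alpha - 1.0) / x - 1.0

# Tangents to logp at a few abscissae; concavity guarantees each tangent lies
# above logp, so their pointwise minimum is a piecewise-log-linear envelope.
xs = np.array([0.5, 2.0, 6.0])
slopes = dlogp(xs)
intercepts = logp(xs) - slopes * xs

# Breakpoints where consecutive tangents cross; between breakpoints one tangent is active.
z = (intercepts[:-1] - intercepts[1:]) / (slopes[1:] - slopes[:-1])
lo = np.concatenate([[0.0], z])
hi = np.concatenate([z, [np.inf]])

def seg_mass(a, b, l, h):
    """Integral of exp(a*x + b) over (l, h)."""
    upper = np.exp(a * h + b) if np.isfinite(h) else 0.0
    return (upper - np.exp(a * l + b)) / a if a != 0.0 else np.exp(b) * (h - l)

masses = np.array([seg_mass(a, b, l, h) for a, b, l, h in zip(slopes, intercepts, lo, hi)])

def sample_envelope():
    """One draw from the normalized piecewise-exponential envelope."""
    k = rng.choice(len(masses), p=masses / masses.sum())
    a, l, h = slopes[k], lo[k], hi[k]
    u = rng.uniform()
    if a == 0.0:
        return l + u * (h - l)
    top = np.exp(a * h) if np.isfinite(h) else 0.0
    return np.log(u * top + (1.0 - u) * np.exp(a * l)) / a   # inverse CDF on (l, h)

def sample_target():
    """Propose from the envelope, accept with probability p(x) / envelope(x)."""
    while True:
        x = sample_envelope()
        k = np.searchsorted(z, x)
        if np.log(rng.uniform()) < logp(x) - (slopes[k] * x + intercepts[k]):
            return x

draws = np.array([sample_target() for _ in range(20_000)])
print(draws.mean(), draws.var())             # both should be close to alpha = 3
```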

And if you've got a good gamma sampler you can turn those samples into most of the things you probably want:

• exponential (set shape=1)

• chi-squared (shape=dof/2)

• normal (via Box-Muller on an exponential, or by square-rooting a single one-degree-of-freedom chi-squared and randomizing the sign)

• multivariate normal (just a linear transformation of K independent normals)

• inverse-gamma (just take the reciprocal)

• student-t (sample an appropriate inverse-gamma, take its square root, and multiply by a normal)

• Dirichlet (sample K gammas and normalize)

• Beta (special case of Dirichlet)

• negative binomial (gamma mixture of Poissons; you do need a separate Poisson sampler though)

I can't think of many distributions you can't sample from by transforming some number of gamma samples—the only exceptions that come to mind are the generalized inverse-Gaussian (a nontrivial generalization of the gamma, inverse-gamma, and inverse-Gaussian) and the von Mises-Fisher.
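
For a few of the reductions in that list, a minimal sketch (using NumPy's gamma sampler as the assumed "good gamma sampler") could look like:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Dirichlet(alpha): sample K independent gammas and normalize.
alpha = np.array([2.0, 3.0, 5.0])
g = rng.standard_gamma(alpha, size=(n, 3))
dirichlet = g / g.sum(axis=1, keepdims=True)
print(dirichlet.mean(axis=0))            # ~ alpha / alpha.sum() = [0.2, 0.3, 0.5]

# Beta(a, b): the K = 2 special case.
a, b = 2.0, 5.0
ga = rng.standard_gamma(a, size=n)
gb = rng.standard_gamma(b, size=n)
beta = ga / (ga + gb)
print(beta.mean())                       # ~ a / (a + b) = 2/7

# Student-t with nu degrees of freedom: a normal scaled by the square root of
# an inverse-gamma(nu/2, scale=nu/2) variance.
nu = 5.0
inv_gamma = (nu / 2.0) / rng.standard_gamma(nu / 2.0, size=n)
t = rng.standard_normal(n) * np.sqrt(inv_gamma)
print(t.var())                           # ~ nu / (nu - 2) = 5/3
```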

But I actually wouldn't say that the gamma distribution is particularly easy to sample from, unless you've got access to a good software implementation! There are no simple tricks like Box-Muller; good gamma samplers rely on a cascade of very clever tricks involving obscure properties of beta distributions and such. Thankfully, at this point in human history, all that complexity is abstracted away behind simple APIs with high-quality implementations.

By contrast, there have been times in my life when I've badly wanted to sample from a generalized inverse-Gaussian (which is log-concave after a logarithmic change of variables), and found it extremely annoying due to the lack of a convenient, performant sampler implementation.

Andrija:

I always thought that sampling from any given (continuous) distribution is done "straightforwardly" by simply sampling from a uniform and then mapping into the distribution's domain via the inverse-CDF trick. I thought this approach doesn't care about the log-concavity property, and only has issues if there are regions of the distribution's domain with exactly zero probability (since then the inverse CDF is not uniquely defined).
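
For a case where the inverse CDF has a closed form, the trick really is a one-liner (a sketch assuming an Exponential(rate = 2) target):

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0
u = rng.uniform(size=100_000)       # U ~ Uniform(0, 1)
x = -np.log1p(-u) / rate            # inverse CDF of Exponential(rate)
print(x.mean())                     # should be close to 1 / rate = 0.5
```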

As far as I can see the generalized inverse-Gaussian seems to have a well behaved CDF, so why wouldn't the above approach simply work for it too?

Matt Hoffman:

This approach seems to be begging the question a bit—if you have a good inverse-CDF implementation, then yes, you have a good sampler.

Trouble is, CDFs aren't always easy to compute (https://en.wikipedia.org/wiki/Generalized_inverse_Gaussian_distribution doesn't offer any closed form, and https://warrenweckesser.github.io/mpsci/distributions/geninvgauss.html resorts to quadrature on the PDF). And inverting them using (say) Newton's method makes the proposition significantly more expensive and less numerically stable.

"Well behaved" sadly doesn't imply "easy to compute".

Christopher Harshaw:

Hey Ben -- long-time subscriber, second-time commenter here. Having recently been granted the label of "statistician", I can confirm that statisticians *love* maximum likelihood. But what do you do if the log-likelihood is non-convex? After all, in that case it may not be computationally feasible to obtain your maximum likelihood estimate!

I recently heard Richard Samworth give a talk that I really liked. They proposed the following idea in the context of linear regression: first use your data to estimate the best convex approximation to the log-likelihood, then optimize that convex approximation to get a point estimate. Pretty cool; figured you might like it too: https://arxiv.org/abs/2403.16688

It feels in line with the spirit of your post (or perhaps the next one), which is: when the objective isn't convex, CONVEXIFY.

Ben Recht:

At some point in the course, we'll get to problems that refuse to be convexified. But overall, I agree.

I'd also add that maximum likelihood is a bizarre fever dream of Fisher that rests on the shakiest of conceptual, philosophical, and mathematical grounds and is never necessary for any data analysis.

robin:

I'm interested to read more about this. I'm working on a project involving non-convex, multimodal (restricted) maximum likelihood with variance components, and it's a struggle.

Ben Recht:

I wrote a bit about it here: https://substack.com/home/post/p-142826531

The bottom line is there is no theoretical reason you should prefer a maximum likelihood estimator to any other estimator. All of the reasoning was invented post-hoc to justify some very unrigorous arguments of Fisher.

robin:

Why it's a bizarre fever dream, I mean.
