14 Comments

Just a quick question: the Huber loss is differentiable, but the formulation above that you relate it to is not?


Yes, this is revealing a neat fact used frequently in convex optimization and analysis. For any convex function f(x), the function

f_t(x) = inf_z 0.5 ||x-z||^2 + t f(z)

is differentiable for all values of t. It's called the "Moreau envelope" of f. The Moreau envelope of the absolute value is the Huber loss.
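Concretely, for f(z) = |z| in one dimension, the inner minimization is solved by soft-thresholding, and substituting the minimizer back in gives the Huber loss; a sketch of the standard computation:

```latex
f_t(x) = \inf_z \tfrac{1}{2}(x-z)^2 + t\,|z|,
\qquad
z^\star = \operatorname{sign}(x)\,\max(|x|-t,\,0),
\qquad
f_t(x) =
\begin{cases}
  \tfrac{1}{2}x^2, & |x| \le t,\\
  t\,|x| - \tfrac{1}{2}t^2, & |x| > t.
\end{cases}
```

The two pieces match in value and slope at |x| = t, which is why the envelope is differentiable even though |z| is not.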


Wow, this is really cool. I work on dictionary learning a lot, and people put all sorts of constraints on x. I am particularly interested in hierarchical dictionary learning. Do you know if people in the field of inverse problems deal with hierarchical problems as well? For example, in the last part where you introduce a weighting matrix T: doesn't this kind of introduce a "second layer"? Just as measuring x with A results in y, you now measure x with T, resulting in another variable, let's call it z, s.t. x = T^{-1} z. Can we set some additional constraint on the second-layer variable z, like ||z||_1? This would be similar to inference in hierarchical dictionary learning for multi-scale signals like images. And is it still convex?


Once we start having to solve for T and x at the same time, our convexity goes out the window, unfortunately. Even the most basic dictionary learning problems are nonconvex problems. I don't have a good explanation for *why* this is where the convexity always breaks down. Let me think about it a bit and maybe that will become another blog post of its own...


Thanks! To clarify, I think my question is: is the inference part of dictionary learning convex? I.e., fix T and A, but we want y = Ax and x = Tz, where x and z both need to be sparse. Is this convex?
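(For concreteness, with both A and T held fixed, this kind of two-layer inference can be written as the convex program below; a minimal CVXPY sketch in which the sizes, data, and penalty weights are purely illustrative assumptions, not anything from the post.)

```python
import cvxpy as cp
import numpy as np

# Illustrative sizes and data (assumptions, not from the post).
m, n, k = 50, 100, 80
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))   # fixed measurement matrix
T = rng.standard_normal((n, k))   # fixed "second layer" dictionary
y = rng.standard_normal(m)        # observations

x = cp.Variable(n)
z = cp.Variable(k)

# Least-squares data fit plus l1 penalties on both layers,
# tied together by the linear constraint x == T @ z.
# With A and T fixed, every term here is convex.
objective = cp.Minimize(
    cp.sum_squares(A @ x - y) + cp.norm(x, 1) + cp.norm(z, 1)
)
problem = cp.Problem(objective, [x == T @ z])
problem.solve()
```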


I’m really enjoying these posts, Ben. Regrettably, I never got very deep into optimization as it was a bit outside of the more hardware/experimental direction I took my grad school classes at Caltech. Reading your blog reminds me of the fun I had in Joel Tropp’s linear algebra class and the fun and beauty in mathematics, even though I never felt that I was very good at it. I wish more people would blog about their teaching like you do!

If you’re ever down in the LA area, come visit Harvey Mudd so I can buy you lunch :)


Thank you, Josh!


I love this because we work on these problems but don't really think about them in these terms. E.g. the first derivative operator reminds me of "slow feature analysis", which posits that neural activity changes slowly with persistent features, like the presence of an object in a scene, invariant to its transformations: https://www.cnbc.cmu.edu/~tai/readings/learning/wiskott_sejnowski_2002.pdf.

Also, could you call all of these denoisers? We sometimes talk about sparse coding this way.


The term "denoising" is a bit loaded, but I think that it most commonly refers to a source separation problem. You assume that y = signal + noise and want to figure out what is signal and what is noise. In the case of this post, that corresponds to an inverse problem where A is the identity matrix.

I also think of deconvolution or deblurring as removing noise. We can model those problems as y = signal * noise, where the "*" denotes a convolution. None of the formulations in this post can solve those sorts of deconvolution problems. (However, you can pose some deconvolution problems as more complicated convex optimization problems).
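(As a concrete illustration of the A = identity case, here is a minimal CVXPY denoising sketch; the signal, noise level, and penalty weight are my own illustrative assumptions.)

```python
import cvxpy as cp
import numpy as np

# Denoising as an inverse problem with A = I: y = signal + noise,
# recovered by trading data fidelity against a roughness penalty.
rng = np.random.default_rng(0)
n = 200
truth = np.sign(np.sin(np.linspace(0, 4 * np.pi, n)))  # piecewise-constant signal
y = truth + 0.3 * rng.standard_normal(n)               # noisy observation

x = cp.Variable(n)
lam = 2.0  # illustrative penalty weight
# Total-variation penalty: l1 norm of the first differences of x.
prob = cp.Problem(cp.Minimize(cp.sum_squares(x - y) + lam * cp.norm(cp.diff(x), 1)))
prob.solve()
x_hat = x.value  # the denoised estimate
```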


Really appreciate this great overview. Very intuitive.


FWIW - A million years ago, we needed to calculate coefficients on a series of home-grown Engle-Granger cointegration tests for a prop trading strategy (we were one of the first groups out there looking at coint strats). We weren't comfortable using standard (L2) E-G coefficients and instead developed something based on L1 norms with a bunch of embedded penalty functions to generate our critical values. This approach seemed quite reasonable at the time because strict out-of-sample testing yielded profitable results with livable sigma. The same held true when we went live. Everything was coded in APL & Gauss, and the morning run (of the Dow 30 and probably 2/3 of the S&P) took a little under an hour on an 8086 PC (my guess is that on my current rig, it would be done in the time it took to read this post). Convexity wasn't even a thought when we looked to solve the optimization problem. We went right to Nelder-Mead because sometimes you just gotta get stuff done :-) "But I've stopped worrying about this. People have been using the Nelder-Mead method or genetic algorithms since the 1960s," said someone once...


Not going to disagree with myself there, but if you could solve your problem with a convex solver in milliseconds vs. Nelder-Mead in an hour, would that change your mind about what to implement? Especially if the convex solver were freeware and easy to use?

https://www.cvxpy.org/

https://cvxgen.com/docs/index.html
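(For a sense of "easy to use": a minimal CVXPY sketch of a constrained least-squares fit with made-up data; at this scale it solves in milliseconds.)

```python
import cvxpy as cp
import numpy as np

# Nonnegative least squares in a handful of lines (data is illustrative).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

x = cp.Variable(10)
prob = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b)), [x >= 0])
prob.solve()
print(prob.value, x.value)
```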


To state the obvious, if I could have snapped my fingers and shaved 59 minutes and 59.99 seconds off the run-time while guaranteeing the optimal solution rather than just satisficing the objective function, I'm in. But when considering the OR vs. the optimization aspects of the problem, you need to evaluate the time and costs associated with developing a convex solution. This includes the opportunity cost of waiting for perfection versus going live sooner.

The hour-long runtime wasn't a major issue in day-to-day operations: we did one run in the morning, and the model was based on daily closing equity prices at day t-1. The use of live price feeds and dynamic adjustments to changes in levels was still several years off. The speed of execution did become a challenge during development, though. Changes often led to "hurry up and wait" scenarios when running analyses on the entire dataset. Although the problem's dimensionality was manageable, we anticipated that Nelder-Mead might have limited us when scaling it up.

That said, the penalty functions we used were often stochastic (to enhance the robustness of our search) and typically non-convex. What started as a straightforward deterministic convex optimization problem became significantly more complex. Our primary focus was on the econometrics (factors and the short-run dynamics of time series, which also incorporated long-run equilibrium relationships) and the underlying economic hypothesis we were looking to exploit. While I always lived by the creed of "KISS", it was precisely the penalty functions that aligned our approach with our economic reasoning (and that highlighted the difference between an econometric and a statistical approach to the problem).

Bottom line: while a millisecond solution would have been ideal, once we got creative with our priors we saw that a non-derivative search method was not only the only way to solve the problem, but it also enabled us to go live faster once we felt there was some good alpha to be teased out of the data. Had we not been shut down when the firm was taken over (and we moved on), we would have aimed to simplify further and looked to scale the problem while assessing the shadow prices (in this case, the sensitivity to constraints) associated with the transition from convex to non-convex solutions.

Sorry to have droned on (and to have gotten off topic). I haven't thought about this topic in years, and the Giants' game doesn't start until 4:15pm.


I would see this lecture as unifying insights on many topics, if I were enrolled in the course.

I feel that relatively strong linear algebra is a must even when you are dealing with basic OLS problems. With linear algebra, students get used to looking at the problem from a geometric or pictorial viewpoint, and that also pays off in some convex optimization problems. I see linear algebra more like training wheels while learning how to derive things. I think students should have fluency to the level of, say, R(A) being orthogonal to N(A^T) before this course, not just very basic linear algebra.

Also, in addition to all the information in this lecture, I would even try to give a probabilistic view of the penalty functions, interpreting them as likelihood and prior: the OLS set-up is simply the maximum likelihood problem for a white Gaussian noise vector "w" and a non-random vector "x", while ridge regression is Bayesian estimation with a Gaussian prior and a Gaussian likelihood. I am sure this would lose a few more in the audience, but it may also do good for some students. The lost ones will hopefully appreciate these comments later. They are not difficult at all…
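(For the record, the standard MAP calculation behind that claim; σ² and τ² are my labels for the noise and prior variances, not notation from the lecture.)

```latex
y = Ax + w,\quad w \sim \mathcal{N}(0,\sigma^2 I),\quad x \sim \mathcal{N}(0,\tau^2 I)
\;\Longrightarrow\;
\hat{x}_{\mathrm{MAP}}
= \arg\min_x \tfrac{1}{2\sigma^2}\|y - Ax\|_2^2 + \tfrac{1}{2\tau^2}\|x\|_2^2
= \arg\min_x \|y - Ax\|_2^2 + \lambda\|x\|_2^2,
\qquad \lambda = \sigma^2/\tau^2 .
```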

One final comment: I was amazed by the CVX examples on the official CVX website, especially the simple ones on filter design and other linear problems. It is possible to direct students to the CVX examples for experimentation.
