arg min

I have no dog in this fight!

Expand full comment

Sean O'Rourke

Sep 17

With respect, the only difference in the formula of a derivative between row and column-major representations is a product with a commutation matrix on the left -- any other forms of the derivative are incorrect, and Jan Magnus has taken significant pains to correct prior errors (including as recently as *this year* in the International Linear Algebra Society's bulletin [1]).

As to Ben's original point, the intuition of the Hessian is correct. A great walk-through proof of this particular derivative/Hessian can be found in the excellent book [2] by the (tragically) late Are Hjørungnes; it can be found across pages 50-53. He also reminds us that, for real valued functions of complex parameters, we should use derivatives w.r.t. the *complex conjugate* of a variable for gradient descent (see Theorem 3.4, pp 62-63). I highly recommend the book for anybody studying convex optimzation -- I couldn't have finished my Ph.D. without it (and Boyd & Vandenberghe, of course).

[1] J. R. Magnus, “Matrix Derivatives: Why and Where Did It Go Wrong?,” IMAGE: The Bulletin of the International Linear Algebra Society, no. 72, pp. 3–8, 2024, [Online]. Available: https://ilasic.org/wp-content/uploads/IMAGE/image72.pdf

[2] A. Hjørungnes, Complex-Valued Matrix Derivatives With Applications In Signal Processing And Communications. Cambridge University Press, 2011.

Expand full comment

Yaroslav Bulatov

Sep 20

What do you think about graphical notation? The difficulty with standard approach is that it requires flattening all variables. What if your variable is a matrix or a 3-tensor? You get different formulas depending on how you perform the flattening, the order matters for efficiency. We can defer this issue by keeping all the original indices

Expand full comment

Mahyar Fazlyab

Sep 13Edited

I can completely relate to this as I am teaching convex optimization this semester. I was proving the concavity of logdet the other day. Everyone was completely lost even math students.

After teaching convex optimization over the past four years, I have identified a trend: students background in linear algebra is getting worse every year; and they are more inclined to know the end application before caring about the foundations.

This course is getting more and more challenging to teach! Would appreciate your suggestions.

Expand full comment

Do you have any thoughts for why linear algebra backgrounds are getting worse even as undergrad programs add more and more linear algebra to their curricula?

Expand full comment

Alex

Amazing post. I am pretty mediocre at Linear Algebra but know it is a "magical" skill (almost like knowing how to code). The more resources, the better. Thanks!

Expand full comment

Amit Chakrabarti

Oct 11

What a terrific post! Thank you.

For more fun, yet another way to prove log-concavity of determinant is via information theory, by considering the (differential) entropy of a random mixture of independent random matrices drawn from N(0,A) and N(0,B), where A and B are PSD matrices. This proof is given in Cover & Thomas, chapter 17. Admittedly, when you look at what precisely you need from information theory, this proof isn't all that undergrad-friendly!

Expand full comment

Nathan Lambert

Sep 15

Fun post. I need to go back and study for the linear systems prelim again… could barely follow 😅

Expand full comment

Sep 15

It's never a bad time to brush up on linear algebra. But I also was being deliberately difficult to add some polemical seasoning.

Expand full comment

Jack Hanson

Sep 26

I recently started to review my undergraduate linear algebra course, but I took it a long time ago and my text is fairly old. Do you have any suggestions on good reference materials that also include some introduction to graduate-level work? My degree is in mathematics, but I wasn’t exposed to as much linear algebra as is probably the case now.

Expand full comment

Adam Ginensky

A^2 makes sense if A is a square matrix.

Expand full comment

Right! Another reason why linear algebra is always harder than we want it to be. Even for square matrices, A^2 is a weird operation.

Expand full comment

Tim Duff

Sep 13Edited

IMO, the weirdness of matrix derivatives that you mention presents a good argument for understanding things in a more general / coordinate free setting. Namely, once one accepts a more general definition (https://en.wikipedia.org/wiki/Fréchet_derivative), then the derivative at a point is just a linear operator between two normed vector spaces. Since the space of all such linear operators is itself a normed vector space, you get all higher derivatives from this definition for free. For example, the second derivative at a point linearly maps a point to a linear operator, corresponding to the Hessian bilinear form in the usual way (or, more abstractly, https://en.wikipedia.org/wiki/Tensor-hom_adjunction). Of course one has to coordinatize everything in the end, hence the appearance of Kronecker products, but it's a good mental framework nonetheless!

Expand full comment

Sep 13Edited

It's funny Tim, because I do tend to prefer coordinate free arguments, but Yaroslav upthread is arguing that coordinates make life easier for implementation. I think we need both.

Expand full comment

Tim Duff