Expressing derivatives in terms of Kronecker products is a bad idea. I say this after reading Magnus/Neudecker and using their notation for a couple of years. Turning everything into a matrix adds a "flattening" step, and your formulas will differ depending on whether you use the row-vec or col-vec operator to flatten. The math convention is col-vec, but GPU tensor layout is row-major, so you end up with "wrong order" formulas propagating into slow implementations (e.g., in https://github.com/tensorflow/kfac). The alternative is to keep the indices: a derivative w.r.t. a matrix variable has 2 indices, a Hessian has 4. If you don't want to come up with index names, you can use graphical notation as in https://github.com/thomasahle/tensorgrad
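(A quick numerical illustration of the flattening issue, added here rather than part of the original comment: the same linear map X ↦ AXB has Kronecker factors in the order Bᵀ ⊗ A under the column-stacking convention and A ⊗ Bᵀ under row-major flattening. The matrices below are arbitrary test data.)

```python
# Sketch: how the flattening convention changes the Kronecker formula for X -> A @ X @ B.
import numpy as np

rng = np.random.default_rng(0)
A, X, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
Y = A @ X @ B

# col-vec convention (math textbooks): vec(AXB) = (B^T ⊗ A) vec(X)
vec = lambda M: M.flatten(order="F")           # stack columns
assert np.allclose(vec(Y), np.kron(B.T, A) @ vec(X))

# row-vec convention (NumPy/GPU default layout): rvec(AXB) = (A ⊗ B^T) rvec(X)
rvec = lambda M: M.flatten(order="C")          # stack rows
assert np.allclose(rvec(Y), np.kron(A, B.T) @ rvec(X))
```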
I have no dog in this fight!
With respect, the only difference in the formula of a derivative between row- and column-major representations is a product with a commutation matrix on the left (see the numerical check after the references below) -- any other forms of the derivative are incorrect, and Jan Magnus has taken significant pains to correct prior errors (including as recently as *this year* in the International Linear Algebra Society's bulletin [1]).
As to Ben's original point, the intuition of the Hessian is correct. A great walk-through proof of this particular derivative/Hessian can be found in the excellent book [2] by the (tragically) late Are Hjørungnes; it can be found across pages 50-53. He also reminds us that, for real-valued functions of complex parameters, we should use derivatives w.r.t. the *complex conjugate* of a variable for gradient descent (see Theorem 3.4, pp. 62-63). I highly recommend the book for anybody studying convex optimization -- I couldn't have finished my Ph.D. without it (and Boyd & Vandenberghe, of course).
[1] J. R. Magnus, “Matrix Derivatives: Why and Where Did It Go Wrong?,” IMAGE: The Bulletin of the International Linear Algebra Society, no. 72, pp. 3–8, 2024, [Online]. Available: https://ilasic.org/wp-content/uploads/IMAGE/image72.pdf
[2] A. Hjørungnes, Complex-Valued Matrix Derivatives: With Applications in Signal Processing and Communications. Cambridge University Press, 2011.
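(To make the commutation-matrix point concrete, here is a small NumPy check; this is my own sketch, not from Magnus's paper. The commutation matrix K_{m,n} converts column-stacking into row-stacking, and conjugating by commutation matrices turns the col-vec Jacobian of X ↦ AXB into the row-vec one. In this example a commutation matrix appears on each side because both the input and the output are matrices.)

```python
# Sketch: the commutation matrix K_{m,n} maps vec(X) (column-stacked) to vec(X^T)
# (the row-major flattening of X), and relates the two Jacobian conventions
# for the map X -> A @ X @ B.  Test matrices are arbitrary.
import numpy as np

def commutation(m, n):
    # K_{m,n} satisfies K @ vec(X) = vec(X.T) for X of shape (m, n),
    # where vec stacks columns (Fortran order).
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[j + n * i, i + m * j] = 1.0
    return K

rng = np.random.default_rng(0)
A, X, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 4)), rng.standard_normal((4, 2))

# Converting between the two flattening conventions is left-multiplication by K:
assert np.allclose(commutation(3, 4) @ X.flatten(order="F"), X.flatten(order="C"))

# The row-vec Jacobian of X -> AXB is the col-vec one conjugated by commutation matrices:
J_col = np.kron(B.T, A)          # vec(AXB)  = J_col @ vec(X)
J_row = np.kron(A, B.T)          # rvec(AXB) = J_row @ rvec(X)
assert np.allclose(J_row, commutation(3, 2) @ J_col @ commutation(3, 4).T)
```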
What do you think about graphical notation? The difficulty with the standard approach is that it requires flattening all variables. What if your variable is a matrix or a 3-tensor? You get different formulas depending on how you perform the flattening, and the order matters for efficiency. We can defer this issue by keeping all the original indices.
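(As an illustration of what "keeping the indices" can look like in code — my own sketch, using f(X) = log det X since that is the function under discussion in this thread: the gradient keeps 2 indices and the Hessian keeps 4, with no flattening and no Kronecker products.)

```python
# Sketch: the 4-index Hessian of f(X) = log det X, kept as a tensor rather than
# flattened into an n^2 x n^2 matrix.  Uses the standard identities
#   d f      = tr(X^{-1} dX)       =>  grad[i, j]       =  (X^{-1})[j, i]
#   d X^{-1} = -X^{-1} dX X^{-1}   =>  hess[i, j, k, l] = -(X^{-1})[j, k] * (X^{-1})[l, i]
import numpy as np

rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # arbitrary well-conditioned test point
Xinv = np.linalg.inv(X)

grad = Xinv.T                                     # 2 indices
hess = -np.einsum("jk,li->ijkl", Xinv, Xinv)      # 4 indices, never flattened

# Finite-difference check of one Hessian entry: d grad[i, j] / d X[k, l]
i, j, k, l = 0, 1, 2, 3
eps = 1e-6
E = np.zeros((n, n))
E[k, l] = eps
fd = (np.linalg.inv(X + E).T - grad)[i, j] / eps
assert np.allclose(hess[i, j, k, l], fd, atol=1e-4)
```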
Amazing post. I am pretty mediocre at Linear Algebra but know it is a "magical" skill (almost like knowing how to code). The more resources, the better. Thanks!
I can completely relate to this, as I am teaching convex optimization this semester. I was proving the concavity of logdet the other day (the standard line-restriction argument is sketched below). Everyone was completely lost, even the math students.
After teaching convex optimization over the past four years, I have identified a trend: students' background in linear algebra is getting worse every year, and they are more inclined to want to see the end application before caring about the foundations.
This course is getting more and more challenging to teach! Would appreciate your suggestions.
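(For readers following along, here is a compressed version of that standard argument — a sketch, not necessarily the exact proof given in class. Fix \(A \succ 0\) and a symmetric \(B\), and restrict \(\log\det\) to the line \(t \mapsto A + tB\):

\[
g(t) = \log\det\!\big(A^{1/2}(I + tA^{-1/2}BA^{-1/2})A^{1/2}\big) = \log\det A + \sum_{i=1}^{n} \log(1 + t\lambda_i),
\]

where the \(\lambda_i\) are the eigenvalues of \(A^{-1/2}BA^{-1/2}\). Each term \(\log(1 + t\lambda_i)\) is concave in \(t\), so \(g\) is concave on every line through the positive definite cone, which is exactly concavity of \(\log\det\).)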
Do you have any thoughts for why linear algebra backgrounds are getting worse even as undergrad programs add more and more linear algebra to their curricula?
What a terrific post! Thank you.
For more fun, yet another way to prove log-concavity of the determinant is via information theory, by considering the (differential) entropy of a random mixture of independent Gaussian random vectors drawn from N(0,A) and N(0,B), where A and B are PSD matrices. This proof is given in Cover & Thomas, chapter 17. Admittedly, when you look at what precisely you need from information theory, this proof isn't all that undergrad-friendly!
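(Sketching that argument from memory, so consult Cover & Thomas for the precise statement: take independent \(X_1 \sim N(0, A)\) and \(X_2 \sim N(0, B)\) with \(A, B \succ 0\), a Bernoulli(\(\lambda\)) switch \(\theta\) independent of both, and let \(Z = X_1\) if \(\theta = 1\), else \(Z = X_2\). Then \(\mathrm{Cov}(Z) = \lambda A + (1-\lambda)B\). Since the Gaussian maximizes differential entropy for a given covariance, and conditioning cannot increase entropy,

\[
\tfrac{1}{2}\log\!\big((2\pi e)^n \det(\lambda A + (1-\lambda)B)\big) \;\ge\; h(Z) \;\ge\; h(Z \mid \theta) = \tfrac{\lambda}{2}\log\!\big((2\pi e)^n \det A\big) + \tfrac{1-\lambda}{2}\log\!\big((2\pi e)^n \det B\big).
\]

Cancelling the \((2\pi e)^n\) terms leaves \(\log\det(\lambda A + (1-\lambda)B) \ge \lambda \log\det A + (1-\lambda)\log\det B\), i.e., concavity of \(\log\det\).)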
Fun post. I need to go back and study for the linear systems prelim again… could barely follow 😅
It's never a bad time to brush up on linear algebra. But I also was being deliberately difficult to add some polemical seasoning.
I recently started to review my undergraduate linear algebra course, but I took it a long time ago and my text is fairly old. Do you have any suggestions on good reference materials that also include some introduction to graduate-level work? My degree is in mathematics, but I wasn’t exposed to as much linear algebra as is probably the case now.
A^2 makes sense if A is a square matrix.
Right! Another reason why linear algebra is always harder than we want it to be. Even for square matrices, A^2 is a weird operation.
IMO, the weirdness of matrix derivatives that you mention presents a good argument for understanding things in a more general / coordinate free setting. Namely, once one accepts a more general definition (https://en.wikipedia.org/wiki/Fréchet_derivative), then the derivative at a point is just a linear operator between two normed vector spaces. Since the space of all such linear operators is itself a normed vector space, you get all higher derivatives from this definition for free. For example, the second derivative at a point linearly maps a point to a linear operator, corresponding to the Hessian bilinear form in the usual way (or, more abstractly, https://en.wikipedia.org/wiki/Tensor-hom_adjunction). Of course one has to coordinatize everything in the end, hence the appearance of Kronecker products, but it's a good mental framework nonetheless!
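(To make that concrete with the running example of this thread — my own illustration: for \(f(X) = \log\det X\) on invertible matrices, the Fréchet derivative at \(X\) is the linear functional \(Df(X)[H] = \mathrm{tr}(X^{-1}H)\), and the second derivative is the bilinear form \(D^2 f(X)[H, K] = -\mathrm{tr}(X^{-1} K X^{-1} H)\). No flattening or Kronecker products appear until you pick a basis.)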
It's funny Tim, because I do tend to prefer coordinate free arguments, but Yaroslav upthread is arguing that coordinates make life easier for implementation. I think we need both.
As a self-identified 65% mathematician and 35% computer scientist, I completely agree!
Nice article. For a non-positive-definite matrix, the determinant is still the product of the eigenvalues.
These notes look amazing, thanks for sharing.
Another resource your students might find helpful is https://www.matrixcalculus.org/
I've used it many, many times when working on problems involving these sorts of calculations.
The linear algebra prescribed for computer scientists barely scratches the surface of linear algebra as taught by mathematics departments.