Convex Optimization in the Age of LLMs
Kicking off some live blogging of my fall semester course.
In 1948, at the meeting of the Econometric Society at the University of Wisconsin, Madison, George Dantzig first presented his formulation of Linear Programming. Dantzig proposed something radical at the time. Posing problems as goal maximization wasn’t particularly novel, nor were simple algorithms for finding extreme points. But no one had an algorithm for maximizing a goal subject to a complex model constraining potential solutions. Dantzig found he could model all sorts of important planning problems (the transportation of goods, the assignment of people to tasks, the computation of shortest paths) as maximizing linear functions subject to linear inequality constraints.
Dantzig’s breakthrough was showing that any such optimization problem could be solved by the same algorithm. His Simplex Method worked by solving an appropriate sequence of linear equations, like the ones we torture our high school students with. As long as you could model a problem using linear equations and linear inequality constraints, Dantzig’s algorithm could find the solution.
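To make the template concrete, here’s a toy version in code. The numbers are made up purely for illustration, and I’m using SciPy’s off-the-shelf linprog routine rather than anything Dantzig would recognize, but the shape of the problem, a linear objective with linear inequality constraints, is exactly his.

```python
from scipy.optimize import linprog

# Toy problem: maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x, y >= 0.
# linprog minimizes, so we negate the objective to maximize.
c = [-3.0, -2.0]
A_ub = [[1.0, 1.0],
        [1.0, 3.0]]
b_ub = [4.0, 6.0]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x)     # optimal point, here (4, 0)
print(-result.fun)  # optimal value, here 12
```

The solution sits at a corner of the feasible region, which is exactly the kind of extreme point the Simplex Method marches toward.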
After his talk, the first question from the audience was more of a belligerent comment. Mathematical statistician Harold Hotelling stood up and blustered, “We know the world is nonlinear.” He’s not wrong, of course. Nonlinear just means “not linear,” and it’s pretty clear that many (most?) worldly phenomena are not well-modeled by linear functions. Any curvy function isn’t linear. But before Dantzig could reply, John von Neumann interjected, “The speaker titled his talk ‘Linear Programming.’ Then he carefully stated his axioms. If you have an application that satisfies the axioms, use it. If it does not, then don’t.”
A second important question, unasked by Hotelling, was how you could solve problems that actually mattered. In middle school, we all learn linear functions with their slopes and intercepts. In high school, we learn to solve systems of equations with three variables. Dantzig’s practical planning problems required solving systems with far more than three variables. What if you had 10? That would be torture to do by hand. Dantzig knew that in order to make linear programming practical, he’d need someone to build a machine to solve them.
Dantzig admitted as much in his talk. The last sentence of his conference abstract should sound familiar to all computer scientists. “It is proposed that computational techniques such as those developed by von Neumann and the author be used in connection with large-scale digital computers to implement the solution of programming problems.” This is 1948, the nascent days of the ENIAC. Even before we had computers, we wanted computers to be “large-scale.” The only truism of computer science is that current computing is insufficient to solve the most interesting problems. The right amount of computing is always just a bit more than what we currently have access to. If there’s anything that defines the discipline of computer science, it’s constantly promising that bigger computers will solve all of our problems. This self-fulfilling prophecy is one of the few things we’ve consistently gotten right.
To Dantzig’s credit, he’d spend the next half decade as a program manager at the Air Force, funding projects to make the modern computer a reality.
This class builds on the lessons of 1948. Dantzig identified a class of generically solvable optimization problems. If you could model your problem in a way that obeyed his axioms, then Dantzig had a way for you to find a solution, assuming you had access to a powerful enough computer. This course, too ambitiously titled Convex Optimization, is the logical end of Dantzig’s program.
This course celebrates the 20th anniversary of the publication of the seminal text Convex Optimization by Stephen Boyd and Lieven Vandenberghe. I would need a whole post to describe how important this book was to my research. I have had a PDF of it on my laptop desktop since 2003, and I still frequently find myself consulting it.
Stephen and Lieven are up front with their motivations in the preface: they set out to describe a class of optimization problems that have a unified modeling language and can be efficiently solved by iterative linear equation solving. I’ll explain this more in future blogs, but convex optimization problems are effectively the “limit” of what you can do with linear models. This limit will yield some nonlinear problems, and my goal is to explain what those problems are. This is what I mean by the course being the logical end of Dantzig’s Linear Programming program.
Now, given the frenzied interest in everything deep learning, how much of this book remains relevant? To paraphrase Hotelling again, why should anyone learn about convex optimization when all of the exciting problems in the world are, so we are told, non-convex? In this course, we will take the perspective of John von Neumann, who defended Dantzig and linear programming from Hotelling’s broadside: “If you have an application that satisfies the axioms, use it. If it does not, then don’t.”
So we’ll proceed along the von Neumann path: first, what are the axioms of convex programming? The first part of the course covers these modeling axioms, showing how to build and recognize these models. Linear programs are a special case, and we’ll achieve far more generality than that. However, the achievable limits are not quite as simple to state as they are in linear programming. What we get instead are sufficient building blocks. We begin with simple primitives that are solvable and play a sort of optimization lego, plugging pieces together until they model what we want. If we use the lego rules, we can be assured that a computing tool will find us an optimal solution.
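To give you a taste of the lego in code, here’s a minimal sketch using CVXPY, a free convex optimization package for Python (more on software below). The data and the particular model, a nonnegative, sparsity-encouraging least-squares fit, are made up for illustration; what matters is that every piece is a convex brick and the assembly follows the lego rules.

```python
import cvxpy as cp
import numpy as np

# Made-up data for a small fitting problem.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x = cp.Variable(5)

# Each term is a convex "brick": a sum of squares, an l1 norm, linear constraints.
objective = cp.Minimize(cp.sum_squares(A @ x - b) + 0.1 * cp.norm1(x))
constraints = [x >= 0, cp.sum(x) <= 1]
problem = cp.Problem(objective, constraints)

print(problem.is_dcp())  # True: the lego rules are satisfied
problem.solve()          # hand the model to a solver
print(problem.value, x.value)
```

CVXPY checks those composition rules (the “disciplined convex programming” rules) before it ever calls a solver; if a model violates them, it refuses to solve rather than silently returning something dubious.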
The second part of the course focuses on applications, and we find many in decision-making, planning, data analysis, statistics, machine learning, control theory, and signal processing.
Finally, in the third part of the class, we’ll cover some algorithms. In the age of LLMs, “solve” has taken on a weird meaning of “works good enough I guess.” That’s not what I mean by solve. I will describe algorithms that find solutions of provably optimal value in your model (i.e., minimum or maximum, depending on which you care about).
I’ll blog my way through the class, again trying to keep the mathematics to a minimum on here. Stephen and Lieven have a great set of more technical materials. Beyond their book, there are lecture slides, videos, and many additional exercises to solve. There is free software that runs in Python. I’ll point you to those as we go.
Part of my goal in this class is to convince you that the solution of large, constrained optimization problems is a vastly underappreciated technological achievement of the 20th century. Optimization runs everything around me. We should understand what’s happening under the hood. The central tenet is that there are efficient computational solutions to properly modeled optimization problems. Maybe we don’t care about guaranteed solutions or efficient running times anymore as we train gigascale neural models in massive GPU data centers. That could be the case! But it’s worth understanding this central contribution of computer science before we leave it behind to feel the AGI.