Regretfully Yours

Sep 07, 2023

This is a live blog of Lecture 5 of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.

There’s no faster way to suck out the feeling of a lecture than an unintuitive optimization convergence analysis. I’m sure this will similarly kill engagement on this post. But I’ve decided to go all in for a day, and there will be equations.

I find it easy to motivate why descent algorithms converge. If my scheme to minimize a function decreases the function value at each step, it seems pretty reasonable that the algorithm will eventually make enough progress to get to the minimal value. But algorithms like Stochastic Gradient Descent aren’t guaranteed to decrease your function value. Why do they converge?

One of the most popular techniques uses an analysis attributed to Nemirovskii and Yudin’s 1983 masterwork on optimization complexity. I have a deep and long love-hate relationship with their argument. The foundation of their analysis has nothing to do with algorithms but instead relies on an elementary inequality in Euclidean space. Anyone who has seen inner products will understand the following derivation. But no one has ever explained to me why we do this derivation.

Let u₀, u₁ and, v₁,v₂,...,v_T also be any vectors. Define u_t+1 = u_t-v_t. Then we always have

\(|| u_{t+1} - u_0||^2 = || u_t - u0 ||^2 - 2 \langle v_t, u_t-u_0 \rangle + ||v_t||^2\)

This is just using how we defined the sequence u_t and then expanding the norm. If I continue to unwrap this expression, I’ll find

\(|| u_{t+1} - u_0||^2 = || u_1 - u_0 ||^2 - 2 \sum_{t=1}^T \langle v_t, u_t-u_0\rangle + \sum_{t=1}^T ||v_t||^2\)

Now, the left-hand side is a square, so it’s nonnegative. I can then rearrange terms and get

\(\sum_{t=1}^T \langle v_t, u_t-u_0\rangle \leq \frac{1}{2} || u_1 - u_0 ||^2 + \frac{1}{2} \sum_{t=1}^T ||v_t||^2\)

This one simple inequality has spored ten thousand papers in optimization.

But this inequality has nothing to do with optimization. It has something to do with sequences in Euclidean space. But barely anything. u₀, u₁, and the v’s are completely arbitrary here. And yet, I can use it to prove that Stochastic Gradient Descent converges with only a couple of extra lines.

I don’t want to belabor the details here, but you can find the rest of the derivation in Chapter 5 of PPA. If the v_t are stochastic gradients in a linear classification problem where the loss function is convex and has bounded gradients, then if you choose the stepsize correctly, the final expression reads

\(\frac{1}{T} \sum_{t=1}^T loss ( \langle w_t, x_t \rangle, y_t) - loss ( \langle w_*, x_t \rangle, y_t) \leq \frac{||w_*|| D}{\sqrt{T}}\)

Here D is an upper bound on the norm of the feature vectors x_i. Let me unpack what this bound says for machine learning.

First, the inequality measures the difference between the prediction quality of the iterate at step t and the optimal classifier. The comparison is only made on the current data point. It says that, on average, the difference between the prediction quality at step t and the best you could do if someone had handed you the optimal classifier is about 1 over the square root of T (1/√T) In particular, the average is going to zero. So it says that if I were to run sequential gradient descent on an infinite data set, the gap between the SGD predictor and the optimal predictor goes to zero. The inequality does not care which order the data steams through the algorithm. This inequality makes no assumptions about randomness. This bound holds for any ordering of the data you could imagine.

Second, the upper bound itself is very curious. Let me explain by connecting back to the Perceptron. We can interpret the mistake bound for the Perceptron as a sequential prediction bound. The mistake bound is equivalent to

\(\frac{1}{T} \sum_{t=1}^T err ( \langle w_t, x_t \rangle, y_t) - err ( \langle w_*, x_t \rangle, y_t) \leq \frac{||w_*||^2 D^2}{T}\)

Where err is equal to 1 if the prediction is correct and 0 if the prediction is incorrect. w_* is the max-margin hyperplane in this case. For the Perceptron, the error of w_* is always zero, but I’m belaboring the form of this expression so that we can compare the bounds. When data is separable, the sequential prediction error is the square of when the data is not separable. The term

\(||w_*|| D \)

governs the convergence. In the case of the perceptron, this term is precisely the margin. And if you dig through ML theory papers to look for guarantees on high-dimensional prediction problems, this term pops up. Given our current mental models of how prediction works, it seems to be fundamental and unavoidable.

Finally, as far as optimization bounds go, 1/√T is really bad. Steve Wright, inspired by one of his colleagues at NC State, always said that 1/√T is a negative result. I agree! It’s hard to find an algorithm that converges more slowly than 1/√T. There was an annoying trend in the optimization-for-machine learning heyday where people would prove that 1/√T rates are optimal for machine learning. These proofs would involve constructing some insane data sequence that would never occur in reality. But they convinced people that we had to settle for impossibly slow algorithms. And we’re now locked into incredibly inefficient algorithms because Google and Microsoft are willing to fight over who has more cloud capacity. This isn’t a good trend for the rest of us.

PS What does the title have to do with the rest of this post? This average gap between the sequential predictions and the best predictor is called regret. Regret is a term that’s pulled from gambling metaphors and is my least favorite term in optimization and machine learning. But I’ll yell about that on some other day.

arg min

Discussion about this post