Continuing our journey through my scatterplot, I want to stay on the y-axis today and explore slightly more complicated problems where actions don’t influence outcomes. Though the theoretical literature suggests algorithms and analyses need to be more complicated, intuition and practice suggest the opposite. Adding side information makes these decision problems easier and allows very naive methods to shine.
Contextual online optimization plays a repeated game similar to the multi-armed bandit. The goal is still to maximize total return, but now we can gather side information that might improve our performance. Before each action, we gain access to some data that might inform our decision. We then act and realize some return. The algorithm can use the past stream of returns, actions, and side information to make its next decision.
data, action, return, adaptation, data, action, return, adaptation, data, action, …
We will again assume that the actions played in the past don’t influence future data or returns. What are some applications of this flavor? Building on yesterday’s application, we could imagine an adaptive clinical trial where we consider different features of each patient that enters the trial and attempt to assign them to the best arm based on these characteristics. For better or for worse (mostly for worse), engagement problems on the internet are easy to pose as online contextual optimization. A tech company can take a person’s interaction history and, based on how similar they are to other app users, try to display the most engaging advertisement to maximize click-through rates. Sometimes I’m compelled to list realistic examples even if I don’t like them.
There is an overabundant supply of algorithms for contextual online optimization. Bietti, Agarwal, and Langford survey methods from the perspective of the bandit community, and Sadana et al. survey the myriad methods developed by the operations research community. But today, I will focus only on a single algorithm, one with terrible theoretical properties but surprisingly good performance in practice:
for each round:
- use the past data and the current context to predict the return of all possible actions
- choose the action that has the highest predicted return
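To make that concrete, here is a minimal sketch of what the loop might look like in code. Everything specific in it is my own illustrative assumption, not part of the recipe above: the simulated linear environment, the choice of ridge regression from scikit-learn as the predictor, and the warm-up round-robin that tries each action once so every model has data.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_actions, context_dim, n_rounds = 5, 10, 1000

# Hypothetical environment: each action's expected return is linear in the context.
true_weights = rng.normal(size=(n_actions, context_dim))

def observe_return(context, action):
    return float(true_weights[action] @ context + 0.1 * rng.normal())

# Per-action history of (context, return) pairs observed so far.
history = {a: {"X": [], "y": []} for a in range(n_actions)}
total_return = 0.0

for t in range(n_rounds):
    context = rng.normal(size=context_dim)  # side information arrives

    if t < n_actions:
        # Warm-up: try each action once so every predictor has at least one data point.
        action = t
    else:
        # Greedy step: predict the return of every action from past data,
        # then pick the action with the highest predicted return.
        predicted = np.empty(n_actions)
        for a in range(n_actions):
            model = Ridge(alpha=1.0).fit(np.array(history[a]["X"]), np.array(history[a]["y"]))
            predicted[a] = model.predict(context[None, :])[0]
        action = int(np.argmax(predicted))

    reward = observe_return(context, action)

    # Adaptation: record the observed (context, return) pair for the chosen action.
    history[action]["X"].append(context)
    history[action]["y"].append(reward)
    total_return += reward

print(f"average return per round: {total_return / n_rounds:.3f}")

The greedy part is nothing more than the argmax over predicted returns; you could swap the ridge regression for any predictor you like without changing the structure of the loop.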
This algorithm is known as the “greedy algorithm” in the online learning literature. It does no uncertainty quantification. At every stage, it builds a predictor (using PyTorch or whatever) from the past. It assumes the past is indicative of the future and treats the predictions as if they were true.
Why might the greedy algorithm work for contextual optimization? Max Simchowitz shared a high-level intuition with me a few years ago. I’ll try to explain it without equations, but if you’re interested in learning more, Moritz and I later wrote this up in detail in Chapter 12 of Patterns, Predictions, and Actions. First, because the policy is greedy, we know that the predicted return of the greedy action is at least as large as the predicted return of the optimal action. We’re choosing the greedy action precisely because it has the highest predicted return. Second, if the predictions are accurate, the optimal action’s true return can’t be much higher than its prediction, and the greedy action’s true return can’t be much lower than its prediction. Chaining these facts together, the true return of the optimal action minus the true return of the greedy action is at most twice the prediction error. If your prediction errors are small, the greedy action must be almost as good as the optimal action.
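For those who do want the equation, the whole argument is one chain of inequalities. In notation of my own choosing, write $R$ for the true return, $\hat{R}$ for the predicted return, $a_\star$ for the optimal action, and $a_g$ for the greedy action, and suppose every prediction is within $\epsilon$ of the truth. Then

$$
R(a_\star) - R(a_g) \le \big(\hat{R}(a_\star) + \epsilon\big) - \big(\hat{R}(a_g) - \epsilon\big) \le 2\epsilon,
$$

where the last inequality uses $\hat{R}(a_g) \ge \hat{R}(a_\star)$, which is exactly what it means for $a_g$ to be the greedy action.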
A good predictor is all you need for good online contextual optimization, but can you find a good predictor while running the greedy method? You might imagine that this greedy method focuses too much on past returns and would be vulnerable to unfamiliar data. In the worst case, this is definitely true, and there are contrived sequences of contexts and returns that defeat the method. For example, imagine if the context is the same at every round. In this case, you really just have a multi-armed bandit problem. The context is then useless, and the greedy algorithm can get stuck on a suboptimal arm because it never explores.
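To see that failure mode concretely, here is a tiny simulation of my own construction: two Bernoulli arms with no context at all, a greedy rule that always pulls the arm with the highest empirical mean, and a count of how often a run ends up mostly pulling the worse arm.

import numpy as np

rng = np.random.default_rng(1)
means = [0.5, 0.6]                  # arm 1 is the better arm
n_trials, horizon = 1000, 500

locked_on_worse_arm = 0
for _ in range(n_trials):
    counts = np.ones(2)             # pull each arm once to initialize
    totals = np.array([float(rng.binomial(1, m)) for m in means])
    for _ in range(horizon):
        a = int(np.argmax(totals / counts))   # greedy: pull the arm with the highest empirical mean
        counts[a] += 1
        totals[a] += rng.binomial(1, means[a])
    locked_on_worse_arm += int(counts[0] > counts[1])

print(f"fraction of runs that mostly pulled the worse arm: {locked_on_worse_arm / n_trials:.2f}")

An unlucky first pull of the better arm is enough to freeze it out, since greedy never goes back to check. Even a little exploration repairs this, which is part of why the worst case looks so different from practice.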
But, as with most worst-case analyses, this counterexample is remarkably contrived. People aren’t going to be using contextual information unless they believe the contexts are at least modestly predictive of return. In realistic applications, does the greedy algorithm still work? Yes. Not only does it work, but it’s about as good as any other algorithm in theory and practice. In the comparison of methods by Bietti and collaborators that I mentioned above, the greedy method performed exceptionally well on an extensive suite of benchmarks. And Sampath Kannan and colleagues showed that for an important class of contextual optimization problems, the greedy method was nearly optimal for most instances.
Once again, we see that if we can act frequently and incorporate sufficient information, simple algorithms shine. The greedy method is an example of certainty equivalence, where we take an uncertain, data-driven model and just run with it as if it were true. We don’t quantify uncertainty; we completely disregard it. We just hope that if we’re wrong, we’ll learn quickly to correct our errors. Certainty equivalence always feels wrong, but it works unreasonably well. I’ll highlight its surprising power over the next several blog posts.
It would be nice if the greedy method were indeed optimal. If you're running a recommender system, just recommend greedily and you'll automatically discover which content people like! I think this is too good to be true, unfortunately. If you act greedily, you really will under-explore your catalogue of options.
I think the papers that claim otherwise are interesting, but they make assumptions that are less innocuous than they appear. For instance, the assumptions in this paper rule out including an intercept term or nested categorical variables in a linear regression:
https://web.stanford.edu/~bayati/papers/greedy.pdf