On Monday, I riffed on a fake plot of decision problems organized by the impact and frequency of actions. That post focused on applications, but I could make a similar plot for optimization methods.

I plan to spend a few blogs going through some of these theoretical archetypes in depth to pick apart what separates them and what ties them together. These writings are an attempt to dispel some of my confusion about stochastic optimization that came up in last semester’s machine learning class. I’m also going to try to figure out what to put in that region marked “**???**.”

Let’s start with bandit problems, the worst-named class in all of optimization.

In a bandit problem, the goal is to maximize a total aggregate return over several rounds of action. After each action, some return is realized. The algorithm then has to decide whether to change its action or stick with the current one. The actions played have no impact whatsoever on future returns. The hope is that we can adapt to past results to achieve optimal future returns.

action, return, adaptation, action, return, adaptation, action, return, adaptation, …

The simplest bandit problem is called the *two-armed bandit*. You have only two actions to choose from to maximize your returns. The name comes from trying to pick between two slot machines to find the machine with the highest expected payout. But there are important non-gambling applications here. Think of doing an adaptive clinical trial. With each enrolled patient in the study, we have to choose whether to assign them to the treatment arm or the control arm. Based on the past outcomes of patients, can we be smart about how we assign the next patient to maximize their chance of benefit?

Assuming that every scenario we encounter is *independent and identically distributed*, there’s a straightforward algorithm for the two-armed bandit. Since we assumed that our actions do not impact future outcomes, we can rephrase the problem as an adaptive clinical trial. We first pick a cohort size to experiment on. We gather this cohort and assign arms at random. After the trial, we can look at the average return of each action. If Arm A is clearly better than Arm B, we run with Arm A and discard Arm B, and vice versa.

If you can’t tell which action is better after the first trial, you have a choice. You can decide that both are equally good. In this case, just pick whichever action suits you best. But if you’d rather get more information, double your trial cohort size and see if that distinguishes the two actions. If your initial cohort was size 100, the next will have 200 subjects. By continuing this doubling process, it will eventually become clear which action is best.

This method of iteratively growing randomized trials is not only intuitive but also has strong theoretical grounding. The theory uses confidence intervals to prove the algorithm works and to set the decision points correctly. Probability theory tells you when to keep exploring versus when to commit to a particular action. More or less, you construct confidence intervals after each trial period and test whether they overlap. If they overlap, you run a new trial with twice as many subjects.
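The loop above can be sketched in a few lines. This is a hypothetical illustration, not the canonical algorithm: I’m assuming rewards are bounded in [0, 1] so that Hoeffding confidence intervals apply, and the function and parameter names (`pull_a`, `pull_b`, `n0`, `delta`) are mine.

```python
import math
import random

def doubling_trial(pull_a, pull_b, n0=100, delta=0.05, max_rounds=10):
    """Run randomized trials of doubling size until the Hoeffding
    confidence intervals of the two arms separate.

    pull_a, pull_b: callables returning a reward in [0, 1].
    Returns the index (0 or 1) of the arm judged best.
    """
    sums = [0.0, 0.0]
    counts = [0, 0]
    cohort = n0
    for _ in range(max_rounds):
        # Assign the cohort to arms at random, half and half in expectation.
        for _ in range(cohort):
            arm = random.randrange(2)
            reward = pull_a() if arm == 0 else pull_b()
            sums[arm] += reward
            counts[arm] += 1
        means = [sums[i] / max(counts[i], 1) for i in range(2)]
        # Hoeffding half-width at confidence 1 - delta for [0, 1] rewards.
        widths = [math.sqrt(math.log(2 / delta) / (2 * max(counts[i], 1)))
                  for i in range(2)]
        # If the intervals no longer overlap, commit to the better arm.
        if abs(means[0] - means[1]) > widths[0] + widths[1]:
            return 0 if means[0] > means[1] else 1
        cohort *= 2  # Otherwise, double the cohort and keep experimenting.
    # Intervals still overlap: the arms are close; either choice is fine.
    return 0 if means[0] >= means[1] else 1
```

The doubling in the last line is what keeps the total number of subjects within a constant factor of what you’d need if you had known the right trial size in advance.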

Bandit theory is helpful here because it tells you how to set your cohort sizes. If you have a rough idea of how different A and B need to be for choosing one to matter (*coughs, power calculation*), and a rough idea of how many subjects you’d ever be able to enroll, then the theory will tell you what the cohort sizes should be and how different your estimated returns need to be in order to choose an arm.
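For a back-of-the-envelope version of that calculation, again assuming rewards in [0, 1] and Hoeffding intervals: the two intervals separate once the sum of their half-widths drops below the true gap, which pins down a samples-per-arm estimate. The function name and defaults here are hypothetical.

```python
import math

def samples_per_arm(gap, delta=0.05):
    """Rough samples-per-arm count so that two Hoeffding confidence
    intervals (rewards in [0, 1], each at confidence 1 - delta) separate
    when the true means differ by at least `gap`.

    Separation requires gap > 2 * sqrt(log(2/delta) / (2n)),
    i.e. n > 2 * log(2/delta) / gap**2.
    """
    return math.ceil(2 * math.log(2 / delta) / gap ** 2)
```

The quadratic dependence on the gap is the usual story: distinguishing arms that differ by 0.1 takes roughly a hundred times as many subjects as distinguishing arms that differ by 1.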

But I want to reiterate an important point here. If after a hundred actions, it’s clear that A is better than B, then getting the tightest confidence interval correct doesn’t buy you much. If after a million actions, it’s *not* clear whether A is better than B, then they’re equally good for all intents and purposes, and you’d be fine picking either one. In either case, precise uncertainty quantification adds marginal benefit at best.

With the no-impact assumption—that we can act indefinitely without disturbing the world—bandit theorists have shown that you can remove the annoying assumption that returns are iid. They have constructed bandit algorithms for when the returns are non-random and ordered arbitrarily. This is called the “adversarial setting,” and the associated algorithm is a version of stochastic gradient descent. In this case, you can find a policy that’s almost as good as if you saw all of the counterfactual possibilities of your potential actions in advance.
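One standard algorithm in this adversarial setting is Exp3, whose multiplicative weight update can be read as exponentiated (stochastic) gradient ascent on the log-weights of the arms. Here’s a minimal two-arm sketch; the fixed step size `eta` and the omission of the usual uniform-exploration mixing term are simplifications of mine, so treat this as illustrative rather than the textbook version.

```python
import math
import random

def exp3(rewards, eta=0.1):
    """Exp3 for two arms on an arbitrary (possibly adversarial) reward
    sequence. `rewards` is a list of (r0, r1) pairs in [0, 1]; only the
    reward of the arm actually played is observed. Returns total reward.
    """
    weights = [1.0, 1.0]
    total = 0.0
    for r0, r1 in rewards:
        z = weights[0] + weights[1]
        probs = [weights[0] / z, weights[1] / z]
        # Sample an arm from the current weights.
        arm = 0 if random.random() < probs[0] else 1
        reward = (r0, r1)[arm]
        total += reward
        # Importance-weighted reward estimate; the unplayed arm gets 0,
        # so only the played arm's weight moves.
        weights[arm] *= math.exp(eta * reward / probs[arm])
        # Rescale to keep the weights numerically stable.
        m = max(weights)
        weights = [w / m for w in weights]
    return total
```

The regret guarantee says the total reward is close to that of the best single arm in hindsight, even though the reward sequence was fixed by an adversary before play began.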

That you can design such powerful algorithms shows that the top corner of my imaginary scatter plot is probably the easiest spot. If you can act frequently and without consequence, you can find optimal actions under a variety of settings with no explicit uncertainty quantification. In the next blog, I’ll describe a more complicated bandit setting, the contextual bandit problem. We’ll see how, even in this more challenging scenario, a simple algorithm based solely on prediction is almost always optimal.
