The bandits framework highlights how hard it is to pose the problem of decision making under uncertainty. We act as though there is a reasonable metric to judge our actions in hindsight and then try to optimize this in foresight. Maybe I’ve just philosophically backed myself into a dumb corner on this, but I can’t get around how we take for granted that policy optimization problems aren’t well-posed.

Maybe let me motivate this by describing well-posed optimization problems. There are cut and dry answers to optimization problems when we don't have uncertainty. We seek the member of a set of acceptable options that minimizes a criterion we care about. For example, “Find the shortest person in the room” is a well-posed problem with a verifiable answer. I could list thousands of examples of optimization problems like this, from finding the shortest path on a map to finding minimum-cost diets.

But decision making under uncertainty has a radically different flavor. Here, we aim to minimize a cost we don’t know in advance. Our goal is to select the best action with incomplete information. This necessarily means our problem isn’t well-posed. And it leads us into the zoo of confusing formulations we’ve been covering in class. Let me expand on a few and drag you into my well of confusion.

**Stochastic Optimization**. In stochastic optimization we aim to minimize the *expected* cost of a decision. We choose a policy that picks an appropriate action from accessible observations. In math, we let π denote the policy, *x* the observed data, *a* the action, and *w* the *stochastic uncertainty,* and try to solve the problem.

Here, we assume (i) we can write down the cost if we know the action, the observation, and the realization of the uncertainty, (ii) we know that the uncertainty arises by sampling from a probability distribution. To devise a solution, we need to model this probability distribution.

Both of the ways I know to model the probability distribution are restrictive. The first explicitly builds a model. If the “probability” is a model of random noise in a network of sensors, perhaps we can characterize this explicitly and repeatedly. But if the probability distribution is a model of “people coming into a clinic,” this seems far less plausible.

**Stochastic Regret. **The second way to incorporate probability assumes that we want to solve this optimization problem multiple times and the probability model is the same every time. In this case, we can solve the problem sequentially, updating our action with every instance until we get the answer we want. This is complicated because our policy at the beginning will be much worse than the policy at the end.

The common convention is to judge the policy and the algorithm for finding it in terms of *regret*. Let’s say you try the policy several times. You can sum up the total cost accrued over the trials. You can then ask what the best policy would have been had you known the random noise beforehand. Regret is the difference between the actual cost accrued and the best cost you could have achieved had you known the randomness. This is exactly what we computed in class on Thursday for the multi-armed bandit.

The downside of this stochastic regret model is it still assumes repeated interaction with the *same* probability distribution. Perhaps this is true of slot machines in a casino (the problem that inspired the multi-armed bandit in the first place). But is this repeated probability distribution true for the stock market?

**Robust Optimization*** . *What if we just drop the probabilistic facade? Is it easier to model all possible realizations of the uncertainty and then plan for the option that makes our life the hardest?

In math, assume there is some set *W* where the uncertainty can lie. The problem is then to

People don't like robust optimization because it tends to be very conservative. If you are always planning for the absolute worst case, then more often than not, you won’t do anything at all. If we believe there’s some chance we can do better than the worst case, we’d like the option to take that chance. The probabilistic models we use formalize excuses to let us gamble. But perhaps there is a way to allow for some hope in our decision-making while not relying on a concrete probabilistic model of reality.

**Distributionally Robust Stochastic Optimization**. Here’s one possible way to balance robustness and stochastics. Assume there is some stochastic model of our next encounter, but we only have a rough notion of the probability distribution. We still assume that the uncertainty is probabilistic, but now we assume that this probability distribution is unknown. We’ll restrict our attention to a *set* of probability distributions and choose the action to minimize cost if nature behaves as the worst possible distribution in our set.

Again, in math, assume we've generated enough data to declare that the true distribution is in some set of distributions *P* and solve

Now, this could be very general. The probability distribution specification here could be something like “all distributions whose mean and variance lie in some range.” But the challenge then is whether we can solve these formulations efficiently. And does it give us a reasonable course of action? My issue with distributionally robust stochastic optimization is that I haven’t seen a “killer app” yet that shows this formulation in action. Please share one if you have it!

**Non-stochastic bandits.** The weird thing about algorithms designed for stochastic optimization is they seem to work well on arbitrary sequences. I am unable to wrap my head around this. We saw all the way back when we were discussing prediction that the perceptron and online gradient methods achieved good prediction on data sets no matter what order the data was presented to the algorithm or generated from. Though we call these methods “stochastic gradient descent” or “stochastic approximation,” the argument to verify they work (i.e., have low regret) uses no probability. How can these methods be so generally “decent?” No matter what the model of uncertainty, gradient methods give OK answers. Eventually, I’ll find this less confusing, but I’m not there yet.

Gradient-like methods build a policy sequentially over past data. At each time step, we strive to have an policy that would have done well on your past data but might also do well on future data. Gradient methods take a policy, evaluate how it does on the latest data, and slightly perturb the policy if the results aren’t good enough.

We can measure the performance of these models in terms of regret. We can sum up the costs accrued over T trials and compare them to the best we could have done if we knew the realization of uncertainty in advance. In math:

where we are comparing against some fixed policy π*.

Is low regret what we want? On the upside, this isn’t a worst-case policy. We are not only doing something, but we’re doing as well as we could have done had we known the uncertain outcomes in advance. There’s no randomness in sight. And somehow, gradient-like methods are able to yield low regret.

My main issue with regret is that it needs to judge a sequence of decisions rather than an individual decision. We can’t say anything about the individual decisions in our sequence, only that the average over all of our decisions is reasonable. Is there a better formulation that captures the adversarial problem but gets away from sample averages? Let me know in the comments if you have any proposed strategies.

## Quantifying those unknown unknowns

I don't suppose this counts as a killer app for DRO? https://arxiv.org/pdf/2306.09222.pdf