We now begin the final phase of the class, finding policies that interact with a changing world. For our first step in dynamical decision making, we return to the case of binary actions and no model of the world. The one thing we change is how we evaluate. So far, we have judged policies with respect to their final answer. Today, we investigate what changes if we evaluate them in an *online* manner.

This turn brings us back to the notion of *regret*, previously discussed in the context of prediction and perceptrons. Regret compares the cumulative reward gained with a set of samples to the reward that the best constant policy could have achieved had it known the probability distribution.

a* here denotes the best constant action. We compare it to the sequence of actions we try on the data that we observe. We’d like to have as close to zero regret as possible. If the reward function is between 0 and 1, the worst this sum can be is T. For the perceptron, where we aimed to minimize the number of prediction errors, the regret was constant.

Regret judges how well we use each observation. We want an algorithm that does as well as if we knew the best option in advance. We can’t go out and try completely crazy policies for long stretches of time. We care about our performance on each sample from the distribution and want a decent solution even when there isn’t much data. This focus on the entire time horizon limits the algorithms we can use.

What I like about this notion of regret is that it can demonstrate a trade-off between learning and good policy. If we don’t have a probabilistic model for our uncertainty, can we learn the model as we execute a policy? How closely can we track the performance of the counterfactual where we did know the model? The answer is we can do pretty well! However, I find the solutions to be a bit unsatisfying. Let’s work through a simple example to see how regret minimization is a powerful but limited concept.

The simplest regret minimization problem, where we have binary actions, is called the *multiarmed bandit*. The multiarmed bandit is essentially an online randomized experiment. We have two treatments, and we’d like to know which is best. We do this by trying the treatment on units individually. We are judged by comparing ourselves to knowing the best treatment from the very beginning.

There’s a simple enough strategy here: run an experiment on N units and look at which treatment is best on this first batch. Then use this estimated best on the remainder of the units. This is called *Explore-Then-Commit*. The phase where we are using samples to find the best option is called *exploration*. The phase where we are sticking with our estimate of the best is called *exploitation*. For the multiarmed bandit, Explore-Then-Commit is not the worst thing in the world. With no assumptions, its regret is T^{⅔}, which is better than T.

We can do better! Suppose we run a trial with N units but are unsure which treatment is best. Well, we could then run an experiment with 2N units to see if that better informs our decision. If we’re still unsure after 2N, we can do an even bigger experiment with 4N units. If we keep doubling, we’ll eventually decide one treatment is better than the other. For the people out there reading this who like RCTs, this algorithm is the same as running an RCT of some size and then, if the null is not rejected, doubling the sample size in a second experiment. Eventually, we’ll reject the null! This method of doubling turns out to be optimal.

There’s an important lesson lurking in the simple multi-armed bandit problem. It’s optimal here to learn a model and then run with a model as if it were true. This is called the *principle of* *certainty equivalence*. *How* you learn the model can be sophisticated. In the multiarmed bandit, we would need to repeat experiments until we were very sure we had found the best option. Nonetheless, we’re applying the principle of certainty equivalence. We follow a pure exploration phase with a pure exploitation phase.

Is this the best solution in general? Can we always just muck around for a while and then commit to what we think is best from experiments? It really depends on the context in which we’re building a policy. In the multi-armed bandit, we can experiment forever because our experiments have no impact on the world. The only thing we worry about in the regret model is the value of the optimization problem. But what if your experiments *do* impact the world? What if this impact changes the sorts of samples you see in the future?

It’s at this point I confess I need to add another branch to my phylogenetic tree of policy optimization. As Sarah Dean pointed out yesterday, there is a vast difference between dynamic models where the policy impacts future outcomes versus those that simply passively observe. Actions with impact are different than actions that passively observe. This distinction is critical.

Incorporating a notion of impact is tricky in bandit algorithms. Problems where the environment changes as we interact look very different from problems where we passively gather information and reward. Bandits tell us something interesting about how much model identification is needed when seeking to maximize reward. It *is* telling us something about dynamical policy optimization. The online setting restricts the sorts of algorithms we can use, but it can’t address the hard questions of adaptivity, feedback, and recourse. We’ll grapple with those next week.