Optimization is a frustrating field of a million forking paths. At its core, it seems so simple: find the maximum value subject to constraints. This sounds simple enough. You merely need to find some change that improves your current policy and make the change if it doesn’t violate constraints. Why then do we have a mess of optimization methods that is too much for anyone to handle?

The problem is that every small change in the formulation of an optimization problem leads to a novel analysis. Different methods apply if your variables are continuous or discrete. Different methods apply if your cost function is linear or quadratic. Different methods apply if your constraint sets are of different geometries. The next thing we know, we get a gnarly tree of methods, illustrated at the Neos Guide

And one hundred flowers bloom.

Even in this short course focused on policy optimization, I am trapped organizing the course around toggles. The first distinction was about how we could access a probabilistic model of uncertainty. When we knew the probabilistic model and could compute likelihoods, we reduced policy optimization to hypothesis testing. When we could only access the model through random sampling, we proposed randomized experiments as a solution. Though connected in spirit, the analyses and frameworks were completely different! Machine learning distinguishes these two modes of model access as “model-based” versus “model-free.”

As we move on, we’ll need two more distinctions. First, we have so far only covered problems with binary actions. We could extend these methods to problems with a few actions (like 3-5), but we’ll need different strategies to solve problems with many actions or a continuum of possible actions. Second, since we never make a decision in isolation, we’ll want to understand the consequences of decisions made sequentially. Cascading impacts of decisions pose an even more challenging problem of understanding feedback and dynamics.

Today, I thought I’d just map out where we’ve been and where we’re going in a sort of taxonomy. Though I can break down this taxonomy into ever finer granularity, the basic shape is captured by the three distinctions I’ve laid out so far:

Few actions vs. many actions

Model-based vs. model-free

Single decision vs. sequential decisions

Along each of the eight combinations of these toggles, we get a lecture in the course.

**Case 1. (Binary actions, Model-based, Single Decision):** We reduced policy optimization problems in this case to hypothesis testing.

**Case 2. (Binary actions, Model-free, Single Decision):** We approached this case with randomized experiments. But note that when we were in this “model-free” case, we needed to be able to sample from some population to guide the decision. The opposite of “model-based” should probably be “sample-based” not “model-free.” But I’ll stick with the machine learning lingo today.

**Case 3. (Complex actions, Model-based, Single Decision):** This is a huge space typically called “Stochastic Programming.” I could (and would like to) spend a whole semester diving deep into stochastic optimization. I was trying to figure out how to work this into a machine learning course, but it’s impossible. There are important and powerful methods here where people cleverly manipulate expected values to find tractable formulations. And just like in optimization, there are 31 flavors of stochastic optimization. You just have to string your favorite catchphrases together. “Distributionally robust stochastic mixed integer programming.” Oh, this is a thing. The reason I’m not going to spend time on stochastic optimization is it would require additional mathematical modeling and algorithm description that would bring us way outside what we’ve covered in this class. Machine learning only has one hammer, stochastic gradient descent, and this can’t solve all of the world’s policy optimization problems. But, if we could sample from our model, perhaps we could pull out our hammer…

**Case 4. (Complex actions, Model-free, Single Decision):** See, I told you we’d get to SGD. We’ll focus on Case 4 today. Problems in this class are the subclass of stochastic optimization solved by sampling approximations. These problems could be model-free, or they could just be hard to solve analytically but easy to sample from. In either case, tools that we use in machine learning typically apply here. We’ll show today that SGD solves many optimization problems. In fact, we can apply the *exact same* analysis we used to prove that online linear classifiers had low regret. Another popular method in machine learning, where we replace the expected costs with an average over samples also applies to this case. SGD-based methods are called *stochastic approximation,* whereas the methods that look like empirical risk minimization are called *sample average approximation.* Which approximation you choose is mostly a matter of taste.

**Case 5. (Few actions, Model-free, Sequential)**: Once we move into the realm of sequential decision-making, everything gets more complex. The simplest problem in sequential decision-making is the case with a few actions, known as the multi-armed bandit. It’s an illuminating starting point that we’ll get to on Thursday. Even this simplest of problems poses endless non-trivial algorithm and analysis questions. Over 150 pages of this excellent book by Tor Lattimore and Csaba Szepesvari are dedicated to the problem of sequential decision-making with two actions.

**Case 6. (Few actions, Model-based, Sequential):** Now we’re starting to cook with gas. This is the realm of Markov Decision Processes.

**Case 7. (Many actions, Model-based, Sequential):** At this point, we’re pushing the limits of complexity. Case 7 covers multiple active research fields: Optimal Control, Dynamic Programming, Stochastic Optimization with Recourse. Solving problems in this space is daunting, and the general problems are all intractable. In class, we’ll focus on the problems we can solve with simple methods. Since there are only a few problems we can solve in this case, engineers try to force all of their problems to be one of the tractable ones. Rather than optimizing solutions for new problems, we’re better served forcing our system to be in the regime solvable by simple methods.

**Case 8. (Many actions, model-free, sequential): **Ah, and finally, we are at the pinnacle. Reinforcement Learning. I worry we pretend that these problems are solvable in practice. We couldn’t even solve the problems in Case 7, and all of the solutions we know there are themselves incredibly fragile. What hope do we have in this world of sequential decision-making from streaming data alone? I’ll present some methods, and you can decide for yourself. Having spent nearly a decade studying reinforcement learning, my main conclusion is you can’t do anything beyond board games and simple demos once you’ve reached this level of complexity.

And that will bring us to the end of the semester. Each of those cases could be a semester-long course. It’s only because we narrowly focus on the problems with the simplest solutions that we can cover them all in half a semester. Does this suggest that most papers we see at NeurIPS and ICML only solve the simplest of policy optimization problems? I’d say decidedly so.

What other parts of the policy optimization taxonomy did I miss? Let me know in the comments which other axes would be useful to exhibit.

One axis that isn't clearly included here is online vs. offline data collection. In many cases this might not be particularly important, but I think it's crucial for distinguishing bandits from "full" RL. I wouldn't identify few vs. many actions as the difference between bandits and RL -- there are bandit problems with complex action spaces, and RL problems with small action spaces.

I think "sequential" is hiding two distinct phenomena: data collection vs. dynamics/impacts. In bandit settings, the only long term effects of actions are due to online data collection, while the underlying environment remains unaffected. Something I struggle with is succinctly naming this second meaning of sequential, in which actions impact the decision environment. "Dynamics" isn't quite right, since an environment can be non-stationary but uncontrollable. I often use the word "impact" but I'm not sure it gets the point across. It basically boils down to whether the optimal (model-based) policy is greedy (H=1) or not.

Nice post! I had a thought / reaction which is very similar to Sarah's. I usually distinguish between "information state" MDPs and "normal" MDPs. In the former, the environment is unchanged by the agent's actions, but their belief states about the reward function changes. (For finite number of actions, this gives contextual bandits, for continuous action spaces, this gives something related to BayesOpt - see this nice tutorial by Marc Toussaint for the connnections, https://www.youtube.com/watch?v=5rev-zVx1Ps). In the "normal" MDP case, you need to model dynamics of the environment and dynamics of the belief state .

I think you can unify both if you define a joint model like this:

p(bel(t), env(t), act(t), obs(t) | bel(t-1), env(t-1))

= p(act(t) | bel(t-1)) // policy

* p(obs(t) | env(t-1), act(t)) // nature's "reward" model

* p(bel(t) | bel(t-1), act(t), obs(t)) // deterministic Bayes rule (or some other estimator)

* p(env(t) | env(t-1), act(t)) // for bandits, actions have no impact