In our complicated cartography of decision making, I wondered whether bandit problems were the easiest possible setting. When you can act quickly without impacting the environment, simple algorithms with minimal uncertainty quantification sufficed for nearly optimal policies. But perhaps there was no need for uncertainty quantification precisely because bandit actions had no external impact. Today, let’s see if the story changes as we move along the x-axis of the map.
The simplest model I know where actions have consequences is the optimal control of linear systems. My first stint of regular blogging in 2018 was all about the interplay between learning and control, and I will reprise many of those themes this week. I think control offers many insights into why this scatter plot looks the way it does, and I’ll try to explain them as simply as I can. Control theory is the study of feedback. And every time I think about feedback I learn something new and surprising.
The decision problem is set up similarly to last week’s. We’ll keep the optimization goal the same: maximizing the total aggregate return. We’ll assume that at any given time, the return depends on the current state of the world and on the action. But now the action also affects the next state. We model this by saying that the next state is a linear function of the current state, the current action, and an unknown noise process. We don’t know which linear function, but we assume it’s linear. An algorithm can use the past stream of returns, actions, and state observations to make the next decision:
state, action, return, adaptation, state, action, return, adaptation, state, action, …
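To make that loop concrete, here’s a minimal sketch in Python. The particular matrices, the quadratic return, and the fixed policy are placeholders of my own choosing, not anything specified above; the point is just the shape of the interaction.

```python
import numpy as np

# A sketch of the state-action-return-adaptation loop with linear dynamics.
# A, B, Q, R, and the fixed policy below are made-up placeholders for illustration.
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])        # true dynamics, unknown to the algorithm
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)                   # quadratic return: -(x'Qx + u'Ru)

x = np.zeros(2)
total_return = 0.0
for t in range(100):
    u = -0.5 * x[1:]                          # action chosen from the current state
    total_return += -(x @ Q @ x + u @ R @ u)  # return depends on state and action
    w = 0.01 * rng.standard_normal(2)         # unknown noise process
    x = A @ x + B @ u + w                     # the action affects the next state
    # an adaptive algorithm would update its policy here from the observed stream
```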
Let me give a toy example to illustrate linear control and the challenges that immediately arise. Suppose the state is the amount of product on a store’s shelves, the action is bringing out more product from a warehouse, and the noise process is sales of the product. I can write this as an equation:
supply[today] = supply[yesterday] - sales[yesterday] + delivery[yesterday]
Today’s supply is a linear combination of yesterday’s supply, sales, and deliveries.
We already run into an issue. Since restocking affects the supply, our actions can drive the system into an untenable situation. Suppose I set a policy that orders half the amount of product currently on the shelves. What happens if the store hits a bad stretch and sales drop to zero? Each day the policy adds half of an already swollen supply, so the inventory grows by fifty percent per day, and the store quickly ends up with too much product and no place to put it. This is a cartoon example of an “unstable feedback loop.” Acting without a model of the world is challenging and dangerous even in the simplest examples of feedback systems.
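Here’s that runaway loop in a few lines of Python. The starting supply and the run of zero-sales days are numbers I made up just to show the geometric growth.

```python
# Toy simulation of the restocking example: order half of what's on the shelves.
supply = 100.0
for day in range(10):
    sales = 0.0                      # a bad stretch: nothing sells
    delivery = 0.5 * supply          # policy: order half of the current supply
    supply = supply - sales + delivery
    print(day, supply)               # 150.0, 225.0, 337.5, ... grows by 1.5x per day
```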
Since it’s perilous to experiment at random with a control system, we need to do some modeling and testing before deployment. So let’s slightly modify the question we’ve been asking. Instead of trying to maximize returns from scratch, let’s ask how much testing is required to design a policy that maximizes returns. Let’s imagine we can do some initial experiments and modeling before deploying a final policy. What is the best thing to do if we have a fixed time budget in which to gather observations, build models, quantify uncertainty, and design policies?
My group studied this question for several years in the context of robotics, control, and reinforcement learning. Specifically, in our first paper with Sarah Dean, Horia Mania, Stephen Tu, and Nik Matni, we spent a lot of time thinking about the uncertainty quantification of linear systems, and we built an elegant robust controller using data-driven error bars.
But there was a simpler approach that engineers had used since the 1950s: just fit a model of the dynamics, and then solve the linear quadratic control problem as if this fitted model were true. We’ve seen this idea before. It’s called certainty equivalence. If certainty equivalence was good enough for widespread engineering practice, would our provably robust method be even better? I think you can guess the answer.
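Before getting to the answer, here’s a minimal sketch of what that 1950s recipe looks like in code. The data layout (logged states, actions, and next states from the initial experiments) and the helper name are my own assumptions for illustration, not code from the papers mentioned above.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def certainty_equivalent_lqr(X, U, Xn, Q, R):
    """Fit linear dynamics by least squares, then design LQR as if the fit were exact.

    X, U, Xn: arrays of logged states, actions, and next states (one row per time step).
    Q, R: the quadratic cost matrices defining the control objective.
    """
    # Step 1: least-squares fit of x_next ~ A x + B u from the experimental data.
    Z = np.hstack([X, U])
    theta, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
    n = X.shape[1]
    A_hat, B_hat = theta[:n].T, theta[n:].T
    # Step 2: solve the Riccati equation for the fitted model, ignoring its uncertainty.
    P = solve_discrete_are(A_hat, B_hat, Q, R)
    K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
    return A_hat, B_hat, K   # certainty equivalent policy: u = -K @ x
```

The only uncertainty quantification here is whatever least squares does implicitly; the Riccati step treats the fitted model as the truth.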
Indeed, Horia and Stephen showed that, even in theory land, certainty equivalence worked just fine. In experiments, certainty equivalent control yielded better returns and was as robust as the uncertainty-quantified approach. The robust controller only performed better when the experiment time was unrealistically limited. We could construct contrived examples where the certainty equivalent controller was unstable and the robust controller was not. However, even in these cases, the robust controller was still pretty bad. There are likely scenarios where a poorly performing robust policy is acceptable, but most engineers would rather think about how to change the problem itself to get better performance and robustness. One of the keys to engineering is that the problems are invented, not given.
I discussed last week how the situation with contextual bandits was similar. Greedy algorithms are certainty equivalent methods, and they were, for most intents and purposes, just as good as methods that leverage careful uncertainty quantification. What ties the bandit and control examples together? It seems that as action speed increases, the need for a precise predictor decreases. But in the next blog, I’m going to revisit my favorite example, showing how even a perfect prediction can ruin your day when your actions have consequences.