Greetings Professor Falken
The strange game of policy optimization on long time horizons
In 1950, Claude Shannon published a paper on how to program a computer to play chess. Building on recent work in game theory, he remarked:
“With chess it is possible, in principle, to play a perfect game or construct a machine to do so as follows: One considers in a given position all possible moves, then all moves for the opponent, etc., to the end of the game (in each variation). The end must occur, by the rules of the games after a finite number of moves (remembering the 50 move drawing rule). Each of these variations ends in win, loss or draw. By working backward from the end one can determine whether there is a forced win, the position is a draw or is lost.”
Though this was written in the earliest days of electronic computing, Shannon was already thinking about computational complexity. He sketched out some calculations and realized it would probably never be possible to solve chess in this perfect way. He continues:
“It is easy to show, however, even with the high computing speed available in electronic calculators this computation is impractical. In typical chess positions there will be of the order of 30 legal moves. The number holds fairly constant until the game is nearly finished... Thus a move for White and then one for Black gives about 10³ possibilities. A typical game lasts about 40 moves to resignation of one party. This is conservative for our calculation since the machine would calculate out to checkmate, not resignation. However, even at this figure there will be 10¹²⁰ variations to be calculated from the initial position.”
Shannon’s proposed solution was to plan only out to a certain depth. Planning only two or three moves into the future seemed potentially tractable with the technology of the day. At this shallow depth of search, it would likely not be possible to know whether a player had won, lost, or drawn. Most games don’t end in 2 or 3 moves. Instead, Shannon proposed evaluating a function that approximated the outcome. This function would combine heuristic measures of board strength into a score that predicts the outcome. Shannon’s proposal thus combined a heuristic assessment of board value with a shallow search through the tree of moves. 47 years later, this strategy would lead to the defeat of humans in chess. With some clever tricks for tree searching, the same strategy would conquer Go 19 years after that.
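Shannon’s depth-limited scheme is easy to sketch in code. The game tree and heuristic scores below are invented toy stand-ins for chess; only the recursion itself mirrors his proposal:

```python
def minimax(state, depth, maximizing, children, evaluate):
    """Depth-limited minimax: search `depth` plies ahead, then fall
    back on the heuristic `evaluate`, as in Shannon's proposal."""
    moves = children(state)
    if depth == 0 or not moves:
        return evaluate(state)
    values = (minimax(s, depth - 1, not maximizing, children, evaluate)
              for s in moves)
    return max(values) if maximizing else min(values)

# Invented toy game tree and heuristic leaf scores, standing in for chess.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
score = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}

value = minimax("root", depth=2, maximizing=True,
                children=lambda s: tree.get(s, []),
                evaluate=lambda s: score[s])  # best value for the maximizer
```

Real chess engines add many refinements (alpha-beta pruning, move ordering), but the skeleton is the same: search shallowly, then trust the evaluation function.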
We can think of Shannon’s two algorithms—the perfect and the approximate—as Dynamic Programming and Model Predictive Control. Everything Shannon describes about chess applies to our stochastic optimization master problem. Recall our long-standing goal of finding policies that map states x to actions a. The states change over time depending on the current state and the actions we choose. We have been assuming the associated reward and dynamics are random. Putting this all together gives us the master problem.
Dynamic Programming gives a master solution to this master problem. By considering all possible actions, there is a collection of potential end states. Each end state has some associated reward. Working backwards, we can compute, at each time step, the action that achieves the best attainable value. This is exactly the same as in Shannon’s proposed chess algorithm. And just like in chess, the computational cost of Dynamic Programming can be prohibitive.
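For a small tabular problem, the backward pass can be written in a few lines. The rewards and transition probabilities below are made-up numbers; the backward loop is the point:

```python
import numpy as np

# Invented toy problem: 3 states, 2 actions, horizon T.
# r[s, a] is the immediate reward; P[a, s, s'] the transition probability.
T = 5
r = np.array([[0.0, 1.0],
              [2.0, 0.0],
              [0.0, 0.5]])
P = np.array([[[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.7]],
              [[0.5, 0.5, 0.0], [0.0, 0.6, 0.4], [0.1, 0.1, 0.8]]])

V = np.zeros(3)                      # value of the end states
policy = np.zeros((T, 3), dtype=int)
for t in reversed(range(T)):         # work backward from the end
    Q = r + (P @ V).T                # Q[s, a]: reward now + expected value-to-go
    policy[t] = Q.argmax(axis=1)     # best action in each state at time t
    V = Q.max(axis=1)                # value of each state at time t
```

The table of state-action values is tiny here, but it grows multiplicatively with the number of states and actions, which is exactly why the method becomes prohibitive.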
In some exceptional cases, we can efficiently compute this backward search. Notably, in problems with a small number of discrete states and actions, we only have to maintain a table of values of state-action pairs. Our computer memory will dictate how large this can be. Also, if we have linear dynamics and quadratic rewards, Dynamic Programming is wonderfully simple. But if we have more complex reward functions or dynamics, we’re left with nearly intractable problems.
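In the linear-quadratic case, the backward pass collapses to iterating a Riccati equation on a single matrix. A minimal sketch, with toy double-integrator dynamics chosen purely for illustration:

```python
import numpy as np

# Linear dynamics x' = A x + B u with quadratic cost x'Qx + u'Ru.
# Toy double-integrator numbers; the recursion is the point, not the model.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

T = 20
P = Q.copy()                  # terminal cost-to-go matrix
gains = []
for _ in range(T):            # Riccati recursion, run backward in time
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()               # the optimal action at time t is u = -gains[t] @ x
```

No table of state-action values is needed: the quadratic cost-to-go is summarized entirely by the matrix P, which is what makes this case wonderfully simple.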
This leads us to ask how to approximate Dynamic Programming. There are two meta-solutions that are popular here. One meta-solution writes down the nitty gritty details of Dynamic Programming and attempts to approximately solve the Dynamic Programming iterations. These methods include policy iteration, value iteration, temporal differencing methods, and Q-learning (Consider yourself fortunate if you’ve never heard of these).
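To give a flavor of one of these methods, here is tabular Q-learning on a made-up two-state problem. The problem itself is invented; the temporal-difference update in the loop is the standard one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented two-state, two-action problem: step(s, a) -> (next state, reward).
def step(s, a):
    if s == 0:
        return (1, 1.0) if a == 1 else (0, 0.0)
    return (0, 2.0) if a == 0 else (1, 0.0)

Qtab = np.zeros((2, 2))
gamma, alpha, eps = 0.9, 0.1, 0.2
s = 0
for _ in range(20000):
    # epsilon-greedy exploration, then a temporal-difference update
    a = int(rng.integers(2)) if rng.random() < eps else int(Qtab[s].argmax())
    s2, reward = step(s, a)
    Qtab[s, a] += alpha * (reward + gamma * Qtab[s2].max() - Qtab[s, a])
    s = s2
```

Rather than solving the backward recursion exactly, the update nudges each table entry toward a one-step bootstrap estimate of the value, using only sampled transitions.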
Another approach is to run Shannon’s algorithm:
1. Plan on a tractable, short time horizon.
2. Approximate the terminal cost in some reasonable way.
3. Take a step, gather feedback, and replan (goto 1).
This three-step procedure is what we now call “Model Predictive Control” (MPC). After PID control, MPC is by far the most implemented control method in industrial systems. It works because, even though we only plan approximately into the future, we get feedback every time we act and can correct course when we replan.
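The three steps can be sketched as a receding-horizon loop. The scalar system, the candidate action set, and the terminal weight below are all toy assumptions, not a recipe for a real plant:

```python
import itertools

# Toy scalar system x' = x + u with quadratic stage cost; the weight
# q_term on the end state crudely approximates the true cost-to-go.
def plan(x, horizon, actions=(-1.0, 0.0, 1.0), q_term=10.0):
    """Steps 1 and 2: search every short action sequence, score the end
    state with the heuristic terminal cost, return the best first action."""
    best_u, best_cost = 0.0, float("inf")
    for seq in itertools.product(actions, repeat=horizon):
        xt, cost = x, 0.0
        for u in seq:
            cost += xt**2 + 0.1 * u**2
            xt = xt + u
        cost += q_term * xt**2
        if cost < best_cost:
            best_u, best_cost = seq[0], cost
    return best_u

# Step 3: apply only the first action, observe the new state, replan.
x = 4.0
for _ in range(10):
    u = plan(x, horizon=3)
    x = x + u         # in practice this is the real system, noise and all
```

Note that the controller commits to only the first action of each plan; the rest of the plan is thrown away and recomputed once feedback arrives.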
There is an art to MPC that must be learned through experience. Too short a planning horizon or a poor choice of terminal cost can lead to catastrophically bad decision making. Doing justice to MPC would require a semester course. We offer one at Berkeley (ME231A) if you’re interested.
In previous blogs, I have described how our uncertainty about the future makes policy optimization challenging. But even when we have a meticulous model of how our actions impact the future, Dynamic Programming shows us how computing policies over long time horizons is computationally daunting. Yet there are sweet spots where we have just the right amounts of computational power, predictive capability, and ability to impact the world. And in these sweet spots, we engineer the impossible.