Spurious Solutions to the Bellman Equation

Reinforcement learning algorithms often work by finding functions that satisfy the Bellman equation. This yields an optimal solution for prediction with Markov chains and for controlling a Markov decision process (MDP) with a finite number of states and actions. This approach is also frequently applied to Markov chains and MDPs with infinitely many states. We show that, in this case, the Bellman equation may have multiple solutions, many of which lead to erroneous predictions and policies (Baird, 1996). Algorithms and conditions are presented that guarantee a single, optimal solution to the Bellman equation.

1 REINFORCEMENT LEARNING AND DYNAMIC PROGRAMMING

1.1 THE BELLMAN EQUATION

Reinforcement learning algorithms often work by using some form of dynamic programming to find functions that satisfy the Bellman equation. For example, in a pure prediction problem, the true, optimal value of a state, V*(x_t), is defined as equation (1), where < > represents the expected value, taken over all possible sequences of states after time t, γ is a discount factor between zero and one exclusive, and R is the reinforcement received on each time step.

V^*(x_t) = \left\langle R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots \right\rangle    (1)

It is clear from equation (1) that there is a simple relationship between successive states. This relationship is given in equation (2), and is referred to as the Bellman equation for this problem.

V^*(x_t) = \left\langle R_t + \gamma V^*(x_{t+1}) \right\rangle    (2)

Bellman equations can be derived similarly for other algorithms such as Q-learning (Watkins, 1989) or advantage learning (Baird, 1993; Harmon and Baird, 1996).

1.2 UNIQUE SOLUTIONS

A learning system will maintain an approximation V to the true answer V*, and the difference between the two can be called the error e, defined in equation (3). Equation (4) shows why dynamic programming works. If the learning system can find a function V that satisfies equation (2) for all states, then equation (4) will also hold for all states.

V(x_t) = V^*(x_t) + e(x_t)    (3)

V^*(x_t) + e(x_t) = \left\langle R_t + \gamma \left( V^*(x_{t+1}) + e(x_{t+1}) \right) \right\rangle
V^*(x_t) + e(x_t) = \left\langle R_t + \gamma V^*(x_{t+1}) \right\rangle + \gamma \left\langle e(x_{t+1}) \right\rangle
e(x_t) = \gamma \left\langle e(x_{t+1}) \right\rangle    (4)

Suppose there are a finite number of states, and let x_t be the state whose error has the largest magnitude. The discount factor γ is a positive number less than 1, so equation (4) says that this largest error is equal to only a fraction of a weighted average of all the errors. The only way this could happen is if all the errors were zero. Thus, for a finite number of states, the Bellman equation has a unique solution, and that solution is optimal.

On the basis of this result, reinforcement learning systems have been created that simply try to find a function V that satisfies the Bellman equation (e.g., Tesauro, 1994; Crites and Barto, 1995). But will such a V be optimal, even when there are an infinite number of states? Can we assume that the finite-state results will also apply to the infinite-state case?

2 SPURIOUS SOLUTIONS

It would be useful to determine under what conditions dynamic programming is guaranteed to find not only a value function that satisfies the Bellman equation (2), but also a value function whose value error, defined in equation (5), is zero for all x.

e(x) = V^*(x) - V(x)    (5)

One solution to equation (2) is the optimal value function V*. However, in many cases there might exist more than a single, unique solution to the Bellman equation (Baird, 1996). If there is a finite number of states, then there does exist a unique solution to equation (2).
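The following minimal sketch (not part of the original text; the three-state chain, transition matrix P, reinforcement vector R, and γ = 0.9 are made-up illustrative values) makes the finite-state argument concrete. It solves equation (2) as a linear system and shows numerically that repeated application of equation (4) drives any error vector to zero.

```python
import numpy as np

# Illustrative sketch, not from the paper: a 3-state Markov chain with
# made-up transition matrix P and reinforcements R, discount gamma = 0.9.
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])   # deterministic cycle, chosen arbitrarily
R = np.array([1.0, 0.0, 2.0])     # reinforcement received on leaving each state

# Equation (2) written for every state is the linear system V = R + gamma * P V,
# i.e. (I - gamma * P) V = R.  For a finite chain this matrix is invertible,
# so the Bellman equation has exactly one solution.
V_star = np.linalg.solve(np.eye(3) - gamma * P, R)
print("unique solution V* =", V_star)

# Equation (4): any solution's error satisfies e = gamma * P e.  The largest
# |e| would then be at most gamma times a weighted average of the |e| values,
# which forces e = 0.  Numerically, repeatedly applying e -> gamma * P e
# shrinks any starting error toward the zero vector:
e = np.random.randn(3)
for _ in range(200):
    e = gamma * P @ e
print("error after repeated application of equation (4):", e)  # ~ 0
```

In matrix form the uniqueness argument is simply that (I - γP) is invertible whenever γ < 1 and P is a finite stochastic matrix.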
If there is an infinite number of states, then there may exist an infinite number of solutions to the Bellman equation, including some with a suboptimal value function or policy.

2.1 THE INFINITE-HALL PROBLEM

Consider the simple case of a Markov chain with countably infinite states, named 0, 1, 2, ..., and with a reinforcement of zero on every transition (Figure 1).

Figure 1: Infinite Markov chain

On each time step, the state number is increased by one. The Bellman equation, error relationship, and general solution for this Markov chain are given in equations (6), (7), and (8) respectively.

V(x_t) = \gamma V(x_{t+1})    (6)

e(x_t) = \gamma e(x_{t+1})    (7)
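As a numerical illustration (a sketch under stated assumptions, not the paper's own code: the constant c and γ = 0.9 are arbitrary choices), any function of the form V(x) = c·γ^(-x) satisfies equation (6) at every state, even though the true values are V*(x) = 0 because every reinforcement is zero.

```python
import numpy as np

# Illustrative sketch, not from the paper: the infinite-hall chain of
# Figure 1 has zero reinforcement everywhere, so the optimal values are
# V*(x) = 0 for every state.  Yet any function V(x) = c * gamma**(-x)
# also satisfies the Bellman equation (6), V(x) = gamma * V(x + 1),
# so the equation alone does not pin down V*.
gamma = 0.9
c = 5.0                       # arbitrary constant; c = 0 recovers V*

def V(x):
    return c * gamma ** (-x)

for x in range(5):
    lhs = V(x)
    rhs = gamma * V(x + 1)
    print(f"x={x}:  V(x)={lhs:.4f}  gamma*V(x+1)={rhs:.4f}")
    assert np.isclose(lhs, rhs)  # equation (6) holds, yet V(x) != 0 = V*(x)
```

Setting c = 0 recovers the optimal value function; every other choice of c gives a spurious solution whose value error grows without bound along the chain.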