Paper Information - Exploiting Multiple Secondary Reinforcers in Policy Gradient Reinforcement Learning

Exploiting Multiple Secondary Reinforcers in Policy Gradient Reinforcement Learning

Most formulations of Reinforcement Learning depend on a single reinforcement reward value to guide the search for the optimal policy solution. If observation of this reward is rare or expensive, converging to a solution can be impractically slow. One way to exploit additional domain knowledge is to use more readily available but related quantities as secondary reinforcers to guide the search through the space of all policies. We propose a method to augment Policy Gradient Reinforcement Learning algorithms by using prior domain knowledge to estimate desired relative levels of a set of secondary reinforcement quantities. RL can then be applied to determine a policy which will establish these levels. The primary reinforcement reward is then sampled to calculate a gradient for each secondary reinforcer, in the direction of increased primary reward. These gradients are used to improve the estimate of relative secondary values, and the process iterates until the primary reward is maximized. We prove that the algorithm converges to a local optimum in secondary reward space, and that the rate of convergence of the performance gradient estimate in secondary reward space is independent of the size of the state space. Experimental results demonstrate that the algorithm can converge many orders of magnitude faster than standard policy gradient formulations.
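The iteration described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm; it only mirrors the loop structure: fix relative weights for the secondary reinforcers, obtain a policy for those weights, measure the primary reward, and move the weights along the estimated gradient. Everything here is a hypothetical assumption made for illustration: the three-action toy problem, the `SECONDARY` and `PRIMARY` tables, all function names, the closed-form softmax policy that stands in for the inner RL step, the exact expected primary reward that stands in for sampling, and the finite-difference gradient that stands in for the paper's estimator.

```python
import numpy as np

# Hypothetical toy problem: 3 actions, 2 secondary reinforcers.
# Row a holds the secondary reinforcement values produced by action a.
SECONDARY = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.5, 0.5]])
# Primary reward per action (assumed rare or expensive to observe in practice).
PRIMARY = np.array([0.1, 1.0, 0.4])


def solve_policy(weights, temperature=0.5):
    """Stand-in for the inner RL step: softmax over weighted secondary reward."""
    scores = SECONDARY @ weights / temperature
    scores -= scores.max()  # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()


def primary_return(weights):
    """Expected primary reward of the policy induced by the secondary weights.

    Computed exactly here; the abstract describes sampling it instead.
    """
    return float(solve_policy(weights) @ PRIMARY)


def primary_gradient(weights, eps=1e-4):
    """Central finite-difference gradient of primary reward, one entry per weight."""
    grad = np.zeros_like(weights)
    for i in range(len(weights)):
        step = np.zeros_like(weights)
        step[i] = eps
        up = primary_return(weights + step)
        down = primary_return(weights - step)
        grad[i] = (up - down) / (2 * eps)
    return grad


def optimize_weights(weights, lr=0.5, iterations=200):
    """Outer loop: gradient ascent on the secondary weights."""
    weights = np.asarray(weights, dtype=float).copy()
    for _ in range(iterations):
        weights += lr * primary_gradient(weights)
    return weights
```

Starting from equal weights, for example `optimize_weights(np.array([1.0, 1.0]))`, the weight on the second secondary reinforcer grows relative to the first, because in this toy setup the action that produces it also carries the highest primary reward. Note that the search runs over two weights rather than over the policy parameters directly, which is the kind of dimensionality reduction the abstract attributes to working in secondary reward space.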

Gregory Z. Grudic | Lyle H. Ungar
