Gradient-based Reinforcement Planning in Policy-Search Methods

Abstract

We introduce a learning method called "gradient-based reinforcement planning" (GREP). Unlike traditional DP methods that improve their policy backwards in time, GREP is a gradient-based method that plans ahead and improves its policy before it actually acts in the environment. We derive formulas for the exact policy gradient that maximizes the expected future reward and confirm our ideas with numerical experiments.

1 Introduction

It has been shown that planning can dramatically improve convergence in reinforcement learning (RL) (?; ?). However, most RL methods that explicitly use planning are value-based (or Q-value-based) methods, such as Dyna-Q or prioritized sweeping.

Recently, much attention has been directed to so-called policy-gradient methods, which improve their policy directly by computing the derivative of the expected future reward with respect to the policy parameters. Gradient-based methods are believed to have an advantage over value-function-based methods in very large state spaces and in POMDP settings. Probably the first gradient-based RL formulation is the class of REINFORCE algorithms of Williams (?). Other, more recent methods include, e.g., (?; ?; ?). Our approach to deriving the gradient has the flavor of (?), who derive the gradient using future state probabilities.

Our novel contribution in this paper is to combine gradient-based learning with explicit planning. We introduce "gradient-based reinforcement planning" (GREP), which plans ahead and improves a policy before acting in the environment.
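To make the policy-gradient setting referred to above concrete, the following is a minimal sketch of the objective and its standard score-function (REINFORCE-style) gradient. The notation (policy \pi_\theta, discount \gamma, reward r_t, action a_t, state s_t) is introduced here for illustration only; the exact-gradient derivation of this paper, which uses future state probabilities, may take a different form.

    J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right]

    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t' \ge t} \gamma^{t'} r_{t'} \right]

In this standard form the gradient is estimated from sampled trajectories; a planning method such as GREP instead aims to evaluate or improve this quantity before interacting with the environment.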