Gradient-based Reinforcement Planning in Policy-Search Methods

Abstract

We introduce a learning method called "gradient-based reinforcement planning" (GREP). Unlike traditional DP methods that improve their policy backwards in time, GREP is a gradient-based method that plans ahead and improves its policy before it actually acts in the environment. We derive formulas for the exact policy gradient that maximizes the expected future reward and confirm our ideas with numerical experiments.

1 Introduction

It has been shown that planning can dramatically improve convergence in reinforcement learning (RL) (?; ?). However, most RL methods that explicitly use planning are value-based (or Q-value-based) methods, such as Dyna-Q or prioritized sweeping.

Recently, much attention has been directed to so-called policy-gradient methods, which improve their policy directly by computing the derivative of the expected future reward with respect to the policy parameters. Gradient-based methods are believed to have an advantage over value-function-based methods in very large state spaces and in POMDP settings. Probably the first gradient-based RL formulation is the class of REINFORCE algorithms of Williams (?). Other, more recent methods include, e.g., (?; ?; ?). Our approach to deriving the gradient has the flavor of (?), who derive the gradient using future state probabilities.

Our novel contribution in this paper is to combine gradient-based learning with explicit planning. We introduce "gradient-based reinforcement planning" (GREP), which plans ahead and improves a policy before acting in the environment.
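To make the policy-gradient setting referred to above concrete, the following is a minimal sketch of the objective and its standard score-function (REINFORCE-style) gradient. The notation (policy \pi_\theta, discount \gamma, reward r_t, action a_t, state s_t) is introduced here for illustration only; the exact-gradient derivation of this paper, which uses future state probabilities, may take a different form.

    J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right]

    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t' \ge t} \gamma^{t'} r_{t'} \right]

In this standard form the gradient is estimated from sampled trajectories; a planning method such as GREP instead aims to evaluate or improve this quantity before interacting with the environment.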