An Analysis of Actor/Critic Algorithms Using Eligibility Traces: Reinforcement Learning with Imperfect Value Function

We present an analysis of actor/critic algorithms in which the actor updates its policy using eligibility traces of the policy parameters. Most theoretical results on eligibility traces concern only the critic's value iteration algorithms; this paper investigates what the actor's eligibility trace does. The results show that the algorithm is an extension of Williams' REINFORCE algorithms to infinite-horizon reinforcement learning tasks, with the critic providing an appropriate reinforcement baseline for the actor. Thanks to its eligibility trace, the actor improves its policy using a gradient of the actual return rather than a gradient of the return estimated by the critic. This enables the agent to learn a fairly good policy even when the value function approximated by the critic is too inaccurate for conventional actor/critic algorithms to use. Moreover, when the critic estimates an accurate value function, the actor's learning is dramatically accelerated in our test cases. The behavior of the algorithm is demonstrated through simulations of a linear quadratic control problem and a pole balancing problem.
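To make the idea of an actor-side eligibility trace concrete, the following is a minimal sketch and not the algorithm as specified in the paper. Everything in it is an assumption chosen for illustration: the two-state toy task, the tabular TD(0) critic, the per-state logistic policy, the function name `train`, and all hyperparameter values. The actor keeps a decaying sum of the gradients of log pi(a|s) and multiplies it by the critic's TD error, so a reward that arrives one step after the decisive action can still credit that action.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(steps=5000, actor_lr=0.1, critic_lr=0.1, gamma=0.9, beta=0.9, seed=0):
    """Actor/critic with an actor eligibility trace on a hypothetical 2-state task.

    State 0: action 1 moves to state 1, action 0 stays in state 0 (reward 0 either way).
    State 1: any action returns to state 0 with reward 1.
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # one logit per state: P(action 1 | s) = sigmoid(theta[s])
    trace = [0.0, 0.0]  # actor eligibility trace, one entry per policy parameter
    value = [0.0, 0.0]  # tabular critic estimate V(s)
    s = 0
    for _ in range(steps):
        p = sigmoid(theta[s])
        a = 1 if rng.random() < p else 0

        # Toy dynamics (an assumption, not one of the paper's test problems).
        if s == 0:
            s_next, r = (1, 0.0) if a == 1 else (0, 0.0)
        else:
            s_next, r = 0, 1.0

        # Decay the whole trace, then add d/d theta[s] of log pi(a | s) = a - p.
        trace = [beta * e for e in trace]
        trace[s] += a - p

        # TD error from the critic; V(s) acts as a baseline for the reward signal.
        delta = r + gamma * value[s_next] - value[s]
        value[s] += critic_lr * delta

        # Actor step: TD error times the accumulated trace.
        for i in range(len(theta)):
            theta[i] += actor_lr * delta * trace[i]

        s = s_next
    return sigmoid(theta[0]), value


if __name__ == "__main__":
    p_go, value = train()
    print("P(action 1 | state 0):", p_go)
    print("critic values:", value)
```

Calling `train(critic_lr=0.0)` freezes the critic at zero, a crude stand-in for an inaccurate value function. In that setting the TD error in state 0 is always zero, so the trace is the only path by which the delayed reward reaches `theta[0]`.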
