Natural-Gradient Actor-Critic Algorithms

We prove the convergence of four new reinforcement learning algorithms based on the actorcritic architecture, on function approximation, and on natural gradients. Reinforcement learning is a class of methods for solving Markov decision processes from sample trajectories under lack of model information. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient ascent/descent. Reinforcement learning methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle infinite or large state spaces, and the use of temporal-difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the estimates. Natural or Fisher-information versions of policy gradient algorithms are of interest because they can produce better conditioned parameterizations and have been shown to have further reduced variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural-gradient actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. We present empirical results verifying the convergence of our algorithms. A limitation of our results is that they do not address the use of least-squares methods and eligibility traces.

[1]  Harold J. Kushner,et al.  wchastic. approximation methods for constrained and unconstrained systems , 1978 .

[2]  Gerald Tesauro,et al.  Temporal Difference Learning and TD-Gammon , 1995, J. Int. Comput. Games Assoc..

[3]  Peter L. Bartlett,et al.  Infinite-Horizon Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[4]  Andrew W. Moore,et al.  Generalization in Reinforcement Learning: Safely Approximating the Value Function , 1994, NIPS.

[5]  Geoffrey J. Gordon Stable Function Approximation in Dynamic Programming , 1995, ICML.

[6]  Peter L. Bartlett,et al.  Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning , 2001, J. Mach. Learn. Res..

[7]  Richard S. Sutton,et al.  Temporal credit assignment in reinforcement learning , 1984 .

[8]  John N. Tsitsiklis,et al.  Average cost temporal-difference learning , 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[9]  M. Narasimha Murty,et al.  Information theoretic justification of Boltzmann selection and its generalization to Tsallis case , 2005, 2005 IEEE Congress on Evolutionary Computation.

[10]  Peter Stone,et al.  Policy gradient reinforcement learning for fast quadrupedal locomotion , 2004, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004.

[11]  Steven J. Bradtke,et al.  Linear Least-Squares algorithms for temporal difference learning , 2004, Machine Learning.

[12]  K. I. M. McKinnon,et al.  On the Generation of Markov Decision Processes , 1995 .

[13]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[14]  Peter Dayan,et al.  Analytical Mean Squared Error Curves for Temporal Difference Learning , 1996, Machine Learning.

[15]  Solomon Lefschetz,et al.  Stability by Liapunov's Direct Method With Applications , 1962 .

[16]  John N. Tsitsiklis,et al.  Simulation-based optimization of Markov reward processes , 2001, IEEE Trans. Autom. Control..

[17]  Justin A. Boyan,et al.  Least-Squares Temporal Difference Learning , 1999, ICML.

[18]  Shun-ichi Amari,et al.  Natural Gradient Works Efficiently in Learning , 1998, Neural Computation.

[19]  V. Borkar Stochastic approximation with two time scales , 1997 .

[20]  Jeff G. Schneider,et al.  Covariant policy search , 2003, IJCAI 2003.

[21]  R. Bellman,et al.  FUNCTIONAL APPROXIMATIONS AND DYNAMIC PROGRAMMING , 1959 .

[22]  John N. Tsitsiklis,et al.  Asynchronous Stochastic Approximation and Q-Learning , 1994, Machine Learning.

[23]  John Rust Numerical dynamic programming in economics , 1996 .

[24]  Morris W. Hirsch,et al.  Convergent activation dynamics in continuous time networks , 1989, Neural Networks.

[25]  Carlos S. Kubrusly,et al.  Stochastic approximation algorithms and applications , 1973, CDC 1973.

[26]  Michail G. Lagoudakis,et al.  Model-Free Least-Squares Policy Iteration , 2001, NIPS.

[27]  Mahesan Niranjan,et al.  On-line Q-learning using connectionist systems , 1994 .

[28]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[29]  S. Andradóttir,et al.  A Simulated Annealing Algorithm with Constant Temperature for Discrete Stochastic Optimization , 1999 .

[30]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[31]  Sean P. Meyn,et al.  The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning , 2000, SIAM J. Control. Optim..

[32]  Pierre Priouret,et al.  Adaptive Algorithms and Stochastic Approximations , 1990, Applications of Mathematics.

[33]  Vladislav Tadic,et al.  On the Convergence of Temporal-Difference Learning with Linear Function Approximation , 2001, Machine Learning.

[34]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[35]  J. Tsitsiklis,et al.  An optimal one-way multigrid algorithm for discrete-time stochastic control , 1991 .

[36]  Peter L. Bartlett,et al.  Experiments with Infinite-Horizon, Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[37]  S. Thomas Alexander,et al.  Adaptive Signal Processing , 1986, Texts and Monographs in Computer Science.

[38]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[39]  Vijay R. Konda,et al.  Actor-Critic Algorithms , 1999, NIPS.

[40]  Jonathan Baxter KnightCap : A chess program that learns by combining TD ( ) with game-tree search , 1998 .

[41]  Sham M. Kakade,et al.  A Natural Policy Gradient , 2001, NIPS.

[42]  Alborz Geramifard,et al.  Incremental Least-Squares Temporal Difference Learning , 2006, AAAI.

[43]  D. J. White,et al.  A Survey of Applications of Markov Decision Processes , 1993 .

[44]  Shalabh Bhatnagar,et al.  A simultaneous perturbation stochastic approximation-based actor-critic algorithm for Markov decision processes , 2004, IEEE Transactions on Automatic Control.

[45]  V. Borkar Recursive self-tuning control of finite Markov chains , 1997 .

[46]  Richard S. Sutton,et al.  Generalization in ReinforcementLearning : Successful Examples UsingSparse Coarse , 1996 .

[47]  John N. Tsitsiklis,et al.  Analysis of temporal-difference learning with function approximation , 1996, NIPS 1996.

[48]  James W. Daniel,et al.  Splines and efficiency in dynamic programming , 1976 .

[49]  Andrew G. Barto,et al.  Elevator Group Control Using Multiple Reinforcement Learning Agents , 1998, Machine Learning.

[50]  Abraham Thomas,et al.  LEARNING ALGORITHMS FOR MARKOV DECISION PROCESSES , 2009 .

[51]  John N. Tsitsiklis,et al.  Parallel and distributed computation , 1989 .

[52]  Stefan Schaal,et al.  Natural Actor-Critic , 2003, Neurocomputing.

[53]  Leemon C. Baird,et al.  Residual Algorithms: Reinforcement Learning with Function Approximation , 1995, ICML.

[54]  Xi-Ren Cao,et al.  Perturbation realization, potentials, and sensitivity analysis of Markov processes , 1997, IEEE Trans. Autom. Control..

[55]  Shalabh Bhatnagar,et al.  Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes , 2007, Discret. Event Dyn. Syst..