An Actor-Critic Algorithm With Second-Order Actor and Critic

Actor-critic algorithms solve dynamic decision-making problems by optimizing a performance metric of interest over a user-specified parametric class of policies. They combine an actor, which makes policy improvement steps, with a critic, which computes policy improvement directions. Many existing algorithms improve the policy by steepest ascent, which is known to converge slowly on ill-conditioned problems. In this paper, we first develop an estimate of the Hessian matrix of second derivatives of the performance metric with respect to the policy parameters. Using this estimate, we introduce a new second-order policy improvement method and couple it with a critic that uses a second-order learning method. We establish almost sure convergence of the new method to a neighborhood of a stationary point of the policy parameters. We compare the new algorithm with existing algorithms in two applications and demonstrate that it converges significantly faster.
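To make the contrast with steepest ascent concrete, the following is a minimal Python/NumPy sketch of a damped Newton-type actor update driven by gradient and Hessian estimates such as a critic might supply. The function name, its parameters, and the toy quadratic objective are illustrative assumptions, not the paper's actual estimators or convergence conditions.

```python
import numpy as np

def second_order_actor_step(theta, grad_est, hess_est, step=1.0, damping=1e-2):
    """One damped Newton-type ascent step for the actor (illustrative sketch).

    theta    : current policy parameter vector
    grad_est : estimated gradient of the performance metric w.r.t. theta
    hess_est : estimated Hessian of the performance metric w.r.t. theta
    """
    d = theta.size
    # Near a maximum the Hessian is negative (semi)definite, so -hess_est plus a
    # small damping term is positive definite and yields an ascent direction.
    precond = -hess_est + damping * np.eye(d)
    direction = np.linalg.solve(precond, grad_est)
    return theta + step * direction

# Toy illustration on an ill-conditioned quadratic J(theta) = -0.5 * theta' A theta,
# with the exact gradient and Hessian standing in for the critic's estimates.
A = np.diag([1.0, 100.0])
theta = np.array([1.0, 1.0])
for _ in range(5):
    grad = -A @ theta   # gradient of J at theta
    hess = -A           # Hessian of J (constant for a quadratic)
    theta = second_order_actor_step(theta, grad, hess)
print(theta)  # close to the optimizer [0, 0] after a few steps
```

On this quadratic, a fixed-step steepest ascent must use a step small enough for the stiff coordinate and therefore shrinks the other coordinate very slowly; preconditioning by the (damped, negated) Hessian removes that dependence on the conditioning, which is the effect the abstract describes.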
