Implicit incremental natural actor critic algorithm

Natural policy gradient (NPG) methods are promising approaches to finding locally optimal policy parameters. The NPG approach works well in optimizing complex policies with high-dimensional parameters, and its effectiveness has been demonstrated in many fields. However, incremental estimation of the NPG is computationally unstable owing to its high sensitivity to the step-size values, especially to the one used to update the NPG estimate. In this study, we propose a new incremental and stable algorithm for NPG estimation. The proposed algorithm, called the implicit incremental natural actor critic (I2NAC), is based on the idea of implicit updates. A convergence analysis for I2NAC is provided. The theoretical results indicate the stability of I2NAC and the instability of conventional incremental NPG methods. Numerical experiments show that I2NAC is less sensitive to the values of the meta-parameters, including the step-size for the NPG update, than an existing incremental NPG method.
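
To illustrate the implicit-update idea in this setting, the following minimal sketch contrasts a conventional explicit incremental update of an NPG estimate with an implicit (fixed-point) variant. This is an illustration under assumed notation, not the authors' exact I2NAC update: psi stands in for the compatible features (the policy's score function), delta for the TD error, w for the NPG estimate, and beta for the step-size; the implicit step is solved in closed form because the update is rank-one.

# Minimal sketch (assumed notation, not the authors' exact I2NAC update):
# contrast an explicit incremental update of the NPG estimate w with an
# implicit (fixed-point) variant that admits a closed-form solution.
import numpy as np

def explicit_npg_step(w, psi, delta, beta):
    # Explicit update: w <- w + beta * (delta - psi . w) * psi.
    # Unstable when beta * ||psi||^2 is large.
    return w + beta * (delta - psi @ w) * psi

def implicit_npg_step(w, psi, delta, beta):
    # Implicit update: w_new satisfies w_new = w + beta * (delta - psi . w_new) * psi.
    # Since the update is rank-one, the Sherman-Morrison identity gives the
    # closed form below; the effective step-size shrinks as beta * ||psi||^2
    # grows, which is the source of the added stability.
    return w + (beta / (1.0 + beta * (psi @ psi))) * (delta - psi @ w) * psi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, beta, w_true = 5, 2.0, np.ones(5)
    w_exp, w_imp = np.zeros(dim), np.zeros(dim)
    for _ in range(50):
        psi = rng.normal(size=dim)                 # stand-in for compatible features
        delta = psi @ w_true + 0.1 * rng.normal()  # synthetic "TD error" signal
        w_exp = explicit_npg_step(w_exp, psi, delta, beta)
        w_imp = implicit_npg_step(w_imp, psi, delta, beta)
    print("explicit estimate norm:", np.linalg.norm(w_exp))  # grows without bound here
    print("implicit estimate norm:", np.linalg.norm(w_imp))  # remains bounded, near ||w_true||

In this toy regime the explicit iteration diverges because the per-step multiplier along psi exceeds one in magnitude, while the implicit step's multiplier always lies in (0, 1), which is the intuition behind the reduced step-size sensitivity described above.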
