Projected Natural Actor-Critic

Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.
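To make the idea concrete, here is a minimal toy sketch of projected natural gradient descent, not the paper's algorithm: the update preconditions the gradient with the inverse of a metric matrix (standing in for the Fisher information) and then projects the iterate back into a known safe region. The quadratic objective, the box constraint, and the metric `G` are all illustrative assumptions; for simplicity the projection here is a Euclidean clip onto the box, whereas a principled treatment would project under the natural metric.

```python
import numpy as np

# Hypothetical toy problem: minimize f(theta) = 0.5*(theta-target)^T A (theta-target)
# while keeping theta inside the box [0, 1]^2 (a stand-in "safe region").
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
target = np.array([1.5, -0.5])  # unconstrained optimum lies outside the safe box

def grad(theta):
    return A @ (theta - target)

# Stand-in for the Fisher information matrix that defines the natural metric.
G = np.array([[2.0, 0.0],
              [0.0, 0.5]])
G_inv = np.linalg.inv(G)

theta = np.array([0.5, 0.5])  # start inside the safe region
alpha = 0.1
for _ in range(500):
    step = G_inv @ grad(theta)                        # natural-gradient direction
    theta = np.clip(theta - alpha * step, 0.0, 1.0)   # project back into the box

print(theta)
```

Every iterate stays inside the box, and the fixed point of the projected update is the constrained optimum at (1, 0) rather than the unsafe unconstrained solution. The paper's contribution is the principled version of this projection step for policy parameters, derived via the connection to mirror descent.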
