Projected Natural Actor-Critic

Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.
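To make the idea concrete, here is a minimal toy sketch of projected natural gradient descent, not the paper's algorithm: the update preconditions the gradient with the inverse of a metric matrix (standing in for the Fisher information) and then projects the iterate back into a known safe region. The quadratic objective, the box constraint, and the metric `G` are all illustrative assumptions; for simplicity the projection here is a Euclidean clip onto the box, whereas a principled treatment would project under the natural metric.

```python
import numpy as np

# Hypothetical toy problem: minimize f(theta) = 0.5*(theta-target)^T A (theta-target)
# while keeping theta inside the box [0, 1]^2 (a stand-in "safe region").
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
target = np.array([1.5, -0.5])  # unconstrained optimum lies outside the safe box

def grad(theta):
    return A @ (theta - target)

# Stand-in for the Fisher information matrix that defines the natural metric.
G = np.array([[2.0, 0.0],
              [0.0, 0.5]])
G_inv = np.linalg.inv(G)

theta = np.array([0.5, 0.5])  # start inside the safe region
alpha = 0.1
for _ in range(500):
    step = G_inv @ grad(theta)                        # natural-gradient direction
    theta = np.clip(theta - alpha * step, 0.0, 1.0)   # project back into the box

print(theta)
```

Every iterate stays inside the box, and the fixed point of the projected update is the constrained optimum at (1, 0) rather than the unsafe unconstrained solution. The paper's contribution is the principled version of this projection step for policy parameters, derived via the connection to mirror descent.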
