Exploiting the Sign of the Advantage Function to Learn Deterministic Policies in Continuous Domains

In the context of learning deterministic policies in continuous domains, we revisit an approach first proposed in the Continuous Actor Critic Learning Automaton (CACLA) and later extended in the Neural Fitted Actor Critic (NFAC). This approach relies on a policy update that differs from that of the deterministic policy gradient (DPG). Previous work has observed its excellent empirical performance, but a theoretical justification has been lacking. To fill this gap, we provide a theoretical explanation that motivates this unorthodox policy update by relating it to another update whose objective function we make explicit. We further discuss the properties of these updates in depth to gain a deeper understanding of the overall approach. In addition, we extend the approach and propose a new trust-region algorithm, Penalized NFAC (PeNFAC). Finally, we demonstrate experimentally on several classic control problems that it surpasses state-of-the-art algorithms for learning deterministic policies.
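To make the sign-of-the-advantage idea concrete, below is a minimal sketch (not the paper's exact algorithm) of a CACLA-style update: the critic is trained by TD(0), and the deterministic actor is moved toward an explored action only when the TD error, used as a sample estimate of the advantage, is positive. The linear approximators, step sizes, and helper names are illustrative assumptions.

```python
# Sketch of a CACLA-style update (illustrative; hyperparameters and
# function approximators are assumptions, not taken from the paper).
# Unlike a DPG-style update, actions with a negative advantage estimate
# are simply ignored rather than pushed away from.

import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM = 4, 1
ALPHA_ACTOR, ALPHA_CRITIC, GAMMA, SIGMA = 1e-3, 1e-2, 0.99, 0.2

# Linear function approximators, chosen here for simplicity.
actor_w = rng.normal(scale=0.1, size=(STATE_DIM, ACTION_DIM))
critic_w = rng.normal(scale=0.1, size=STATE_DIM)

def policy(s):
    return s @ actor_w          # deterministic action pi(s)

def value(s):
    return s @ critic_w         # state value V(s)

def cacla_update(s, a, r, s_next, done):
    """Process one transition: TD(0) critic step; actor step only if the TD error is positive."""
    global actor_w, critic_w
    td_error = r + (0.0 if done else GAMMA * value(s_next)) - value(s)

    # Critic: standard semi-gradient TD(0) update.
    critic_w += ALPHA_CRITIC * td_error * s

    # Actor: move pi(s) toward the explored action a only when the
    # advantage estimate (here, the TD error) is positive.
    if td_error > 0:
        actor_w += ALPHA_ACTOR * np.outer(s, a - policy(s))

# Usage: Gaussian exploration around the deterministic policy.
s = rng.normal(size=STATE_DIM)
a = policy(s) + SIGMA * rng.normal(size=ACTION_DIM)   # explored action
r, s_next, done = 1.0, rng.normal(size=STATE_DIM), False
cacla_update(s, a, r, s_next, done)
```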
