Efficient Continuous Control with Double Actors and Regularized Critics

How to obtain good value estimates is one of the key problems in Reinforcement Learning (RL). Current value estimation methods, such as DDPG and TD3, suffer from unnecessary over- or underestimation bias. In this paper, we explore the potential of double actors, which has long been neglected, for better value function estimation in the continuous control setting. First, we uncover and demonstrate the bias-alleviation property of double actors by building them on top of a single critic and of double critics, to handle the overestimation bias in DDPG and the underestimation bias in TD3, respectively. Next, we find that, interestingly, double actors also improve the exploration ability of the agent. Finally, to mitigate the uncertainty of the value estimates produced by double critics, we further propose to regularize the critic networks under the double-actor architecture, which gives rise to the Double Actors Regularized Critics (DARC) algorithm. Extensive experiments on challenging continuous control tasks show that DARC significantly outperforms state-of-the-art methods with higher sample efficiency.
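
To make the architecture concrete, below is a minimal PyTorch sketch of how double actors, double critics, and a cross-critic regularizer could fit together in the critic update. The specific target construction (per-actor minimum across critics, then maximum across actors) and the regularization weight `nu` are illustrative assumptions inferred from the abstract, not the paper's verbatim update rule; the network classes and hyperparameters are placeholders.

```python
# Hedged sketch: double actors + double critics with a cross-critic regularizer.
# Target construction and the regularizer form are assumptions for illustration.
import torch
import torch.nn.functional as F


def critic_targets(batch, actor1_tgt, actor2_tgt, critic1_tgt, critic2_tgt,
                   gamma=0.99):
    """Bootstrap targets that combine the value estimates induced by two actors."""
    state, action, reward, next_state, done = batch
    with torch.no_grad():
        # Each target actor proposes its own next action.
        a1 = actor1_tgt(next_state)
        a2 = actor2_tgt(next_state)
        # Pessimistic (min over critics) estimate per proposed action
        # to curb overestimation, as in clipped double Q-learning.
        q_a1 = torch.min(critic1_tgt(next_state, a1), critic2_tgt(next_state, a1))
        q_a2 = torch.min(critic1_tgt(next_state, a2), critic2_tgt(next_state, a2))
        # Optimistic (max over actors) choice between the two proposals,
        # counteracting the underestimation of always taking the min.
        next_q = torch.max(q_a1, q_a2)
        target = reward + gamma * (1.0 - done) * next_q
    return state, action, target


def critic_loss(batch, actors_tgt, critics, critics_tgt, nu=0.1):
    """TD loss for both critics plus a term that keeps their estimates close."""
    state, action, target = critic_targets(batch, *actors_tgt, *critics_tgt)
    q1, q2 = critics[0](state, action), critics[1](state, action)
    td_loss = F.mse_loss(q1, target) + F.mse_loss(q2, target)
    # Regularize the critics toward each other to reduce the variance
    # (uncertainty) of their value estimates.
    reg = F.mse_loss(q1, q2)
    return td_loss + nu * reg
```

The design intent mirrored here: the min over critics guards against the overestimation seen in DDPG, the max over the two actors' proposals offsets the excessive pessimism of TD3-style clipping, and the `nu`-weighted regularizer keeps the two critics' estimates consistent so that the combined target is less noisy.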
