Continuous Control With Ensemble Deep Deterministic Policy Gradients

The growth of deep reinforcement learning (RL) has brought multiple exciting tools and methods to the field. This rapid expansion makes it important to understand the interplay between individual elements of the RL toolbox. We approach this task from an empirical perspective by conducting a study in the continuous control setting. We present several insights of a fundamental nature, including: averaging multiple actors trained on the same data boosts performance; existing methods are unstable across training runs, epochs of training, and evaluation runs; the commonly used additive action noise is not required for effective training; a strategy based on posterior sampling explores better than approximate UCB combined with the weighted Bellman backup; the weighted Bellman backup alone cannot replace clipped double Q-learning; and the critics' initialization plays the major role in ensemble-based actor-critic exploration. In conclusion, we show how existing tools can be brought together in a novel way to yield the Ensemble Deep Deterministic Policy Gradients (ED2) method, which achieves state-of-the-art results on continuous control tasks from OpenAI Gym MuJoCo. On the practical side, ED2 is conceptually straightforward, easy to code, and requires no knowledge beyond the existing RL toolbox.
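
To make the ensemble recipe concrete, the sketch below illustrates two of the mechanisms named above: actors trained on the same data whose outputs are averaged at evaluation time, and posterior-sampling-style exploration read as committing to one randomly drawn ensemble member per episode, with no additive action noise. This is a minimal illustration under those assumptions, not the authors' implementation; the names Actor, EnsemblePolicy, obs_dim, act_dim, and ensemble_size are hypothetical placeholders.

```python
# Minimal sketch of ED2-style ensemble action selection (illustrative only).
import random
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic policy: maps an observation to an action in [-1, 1]."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class EnsemblePolicy:
    """Ensemble of actors trained on the same data but initialized differently."""

    def __init__(self, obs_dim: int, act_dim: int, ensemble_size: int = 5):
        self.actors = [Actor(obs_dim, act_dim) for _ in range(ensemble_size)]
        self.active = 0  # index of the actor used for the current episode

    def start_episode(self) -> None:
        # Posterior-sampling-style exploration: commit to one randomly chosen
        # ensemble member for the whole episode; no additive action noise.
        self.active = random.randrange(len(self.actors))

    @torch.no_grad()
    def explore_action(self, obs: torch.Tensor) -> torch.Tensor:
        # Act deterministically with the member selected for this episode.
        return self.actors[self.active](obs)

    @torch.no_grad()
    def evaluate_action(self, obs: torch.Tensor) -> torch.Tensor:
        # Evaluation: average the deterministic actions of all ensemble members.
        return torch.stack([actor(obs) for actor in self.actors]).mean(dim=0)
```

In such a setup, each actor would be paired with its own critics and trained with clipped double Q-learning on batches drawn from a shared replay buffer; only the per-episode member selection and the action averaging at evaluation would distinguish it from a plain ensemble of deterministic actor-critic agents.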
