Striving for Simplicity and Performance in Off-Policy DRL: Output Normalization and Non-Uniform Sampling

We aim to develop off-policy DRL algorithms that not only exceed state-of-the-art performance but are also simple and minimalistic. For standard continuous control benchmarks, Soft Actor-Critic (SAC), which employs entropy maximization, currently provides state-of-the-art performance. We first demonstrate that the entropy term in SAC addresses action saturation caused by the bounded nature of the action spaces. With this insight, we propose a streamlined algorithm with a simple normalization scheme or with inverted gradients. We show that both approaches can match SAC's sample efficiency without the need for entropy maximization. We then propose a simple non-uniform sampling method for selecting transitions from the replay buffer during training. Extensive experimental results demonstrate that our proposed sampling scheme leads to state-of-the-art sample efficiency on challenging continuous control tasks. We combine all of our findings into one simple algorithm, which we call Streamlined Off-Policy with Emphasizing Recent Experience, for which we provide robust public-domain code.
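
The output normalization scheme admits a compact sketch. The PyTorch snippet below shows one plausible instantiation, assuming the variant most often described for this approach: the pre-tanh action means are rescaled whenever their per-sample average magnitude exceeds one, which keeps the tanh squashing out of its saturated region. The network shapes and the fixed exploration-noise scale are illustrative assumptions, not the paper's exact code.

```python
import torch

def normalize_pre_tanh_mean(mu: torch.Tensor) -> torch.Tensor:
    """Rescale pre-tanh action means so their per-sample average
    magnitude is at most 1. Means that are already small pass
    through unchanged: we only ever shrink, never amplify."""
    g = mu.abs().mean(dim=-1, keepdim=True)   # average |mu_i| per sample
    scale = torch.clamp(g, min=1.0)           # divide only when g > 1
    return mu / scale

# Illustrative use inside a squashed-Gaussian actor
# (dimensions and the 0.29 noise scale are assumptions):
obs_batch = torch.randn(32, 17)               # e.g. HalfCheetah observations
policy_net = torch.nn.Linear(17, 6)           # stand-in for the actor network
mu = normalize_pre_tanh_mean(policy_net(obs_batch))
action = torch.tanh(mu + 0.29 * torch.randn_like(mu))
```

Dividing only when the average magnitude exceeds one reins in runaway pre-activation means, so the tanh stays responsive and the exploration noise retains its effect, without penalizing means that are already well scaled.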
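
The non-uniform sampling scheme, Emphasizing Recent Experience, can be sketched just as briefly. Assuming the commonly reported form of the scheme, the k-th of K mini-batch updates after an episode samples uniformly from only the c_k most recent transitions, with c_k = max(N * eta^(k * 1000 / K), c_min) shrinking as k grows so that later updates concentrate on newer data; eta and c_min below are assumed, tunable constants.

```python
import random

def ere_sample_indices(buffer_len, k, num_updates, batch_size,
                       eta=0.996, c_min=5000):
    """For the k-th of num_updates mini-batch updates, sample
    uniformly from only the c_k most recent transitions; c_k
    decays geometrically in k, emphasizing recent experience."""
    c_k = max(int(buffer_len * eta ** (k * 1000.0 / num_updates)), c_min)
    c_k = min(c_k, buffer_len)                 # never exceed what is stored
    lo = buffer_len - c_k                      # oldest index still eligible
    return [random.randint(lo, buffer_len - 1) for _ in range(batch_size)]

# Illustrative use: 50 updates against a buffer holding 100k transitions,
# where index buffer_len - 1 is the newest transition.
for k in range(1, 51):
    idx = ere_sample_indices(buffer_len=100_000, k=k, num_updates=50,
                             batch_size=256)
    # ...fetch the transitions at idx and run one gradient step...
```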
