Towards Characterizing Divergence in Deep Q-Learning

Deep Q-Learning (DQL), a family of temporal-difference algorithms for control, employs three techniques collectively known in reinforcement learning as the 'deadly triad': bootstrapping, off-policy learning, and function approximation. Prior work has demonstrated that, together, these can lead to divergence in Q-learning algorithms, but the conditions under which divergence occurs are not well understood. In this note, we give a simple analysis based on a linear approximation to the Q-value updates, which we believe provides insight into divergence under the deadly triad. The central point of our analysis is to consider when the leading-order approximation to the deep-Q update is or is not a contraction in the sup norm. Based on this analysis, we develop an algorithm that permits stable deep Q-learning for continuous control without any of the tricks conventionally used (such as target networks, adaptive gradient optimizers, or multiple Q-functions). We demonstrate that our algorithm performs near or above the state of the art on standard MuJoCo benchmarks from the OpenAI Gym.
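
To make the phrase "leading-order approximation" concrete, the following is a minimal sketch of the kind of expansion the abstract refers to. The notation here (step size \alpha, sampling distribution \rho, Bellman optimality backup \mathcal{T}^{*}, and kernel k_\theta) is introduced for illustration and is not quoted from the paper. Starting from one semi-gradient step

\[
\theta' = \theta + \alpha \sum_{(s,a)} \rho(s,a)\,\big(\mathcal{T}^{*} Q_{\theta}(s,a) - Q_{\theta}(s,a)\big)\,\nabla_{\theta} Q_{\theta}(s,a),
\]

a first-order Taylor expansion of Q_{\theta'} around \theta gives, for any query pair (\bar{s},\bar{a}),

\[
Q_{\theta'}(\bar{s},\bar{a}) \approx Q_{\theta}(\bar{s},\bar{a}) + \alpha \sum_{(s,a)} \rho(s,a)\, k_{\theta}\big((\bar{s},\bar{a}),(s,a)\big)\,\big(\mathcal{T}^{*} Q_{\theta}(s,a) - Q_{\theta}(s,a)\big),
\qquad
k_{\theta}\big((\bar{s},\bar{a}),(s,a)\big) = \nabla_{\theta} Q_{\theta}(\bar{s},\bar{a})^{\top} \nabla_{\theta} Q_{\theta}(s,a).
\]

Viewed as a map on Q-values, this leading-order update is affine, with weights given by the neural-tangent-kernel-style Gram matrix k_\theta. The question posed in the abstract is when this map is a contraction in the sup norm; roughly speaking, contraction can fail when the off-diagonal kernel entries (generalization across state-action pairs) are large relative to the diagonal entries, so that updates at sampled pairs move the Q-values of other pairs by more than the local TD correction.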
