Deep Residual Reinforcement Learning (Extended Abstract)

We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG on commonly used benchmarks. Moreover, we find that residual algorithms are an effective approach to the distribution mismatch problem in model-based planning. Compared with the existing TD(k) method, our residual-based method makes weaker assumptions about the model and yields a greater performance boost.
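To make the stabilization idea concrete, below is a minimal sketch, assuming a PyTorch DDPG-style critic, of a residual critic loss in which each direction of the Bellman error bootstraps from a frozen target network (a "bidirectional" use of the target). The names `Critic` and `residual_critic_loss`, the mixing coefficient `eta`, and all hyperparameters are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Simple state-action value network Q(s, a)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def residual_critic_loss(critic, critic_target, batch, gamma=0.99, eta=0.5):
    """Residual-style loss mixing a forward (standard TD) term with a backward
    term; each direction bootstraps from the frozen target network so that
    neither side chases a moving online estimate."""
    # r and done are assumed to have shape (batch, 1).
    s, a, r, s_next, a_next, done = batch

    q = critic(s, a)                 # Q(s, a), gradients flow
    q_next = critic(s_next, a_next)  # Q(s', a'), gradients flow

    with torch.no_grad():
        q_tgt = critic_target(s, a)                 # frozen Q(s, a)
        q_next_tgt = critic_target(s_next, a_next)  # frozen Q(s', a')

    mask = 1.0 - done
    # Forward direction: pull Q(s, a) toward r + gamma * Q_target(s', a').
    forward_loss = F.mse_loss(q, r + gamma * mask * q_next_tgt)
    # Backward direction: pull gamma * Q(s', a') toward Q_target(s, a) - r.
    backward_loss = F.mse_loss(mask * gamma * q_next, mask * (q_tgt - r))
    # eta = 0 recovers the usual semi-gradient TD update; eta > 0 adds the
    # residual-style backward correction.
    return (1.0 - eta) * forward_loss + eta * backward_loss


# The target critic is kept as a frozen copy (e.g. copy.deepcopy(critic))
# and soft-updated outside this function, as in standard DDPG.
```

In a DDPG-style training loop, the actor and the soft-updated target networks would be handled as usual; only the standard semi-gradient critic loss is replaced by a residual loss of this form.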
