Breaking the Deadly Triad with a Target Network

The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously. In this paper, we investigate the target network as a tool for breaking the deadly triad, providing theoretical support for the conventional wisdom that a target network stabilizes training. We first propose and analyze a novel target network update rule that augments the commonly used Polyak-averaging style update with two projections. We then apply the target network and ridge regularization in several divergent algorithms and show their convergence to regularized TD fixed points. These algorithms are off-policy with linear function approximation and bootstrapping, spanning both policy evaluation and control, as well as both discounted and average-reward settings. In particular, we provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
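To make the two ingredients concrete, the following is a minimal sketch, not the paper's exact algorithm: a Polyak-averaging style target update followed by a single l2-ball projection (the paper augments the average with two projections, whose exact form is not reproduced here), and a ridge-regularized semi-gradient TD(0) step with linear features that bootstraps on the target weights. The function names and the hyperparameters `beta`, `radius`, `alpha`, `gamma`, and `eta` are illustrative assumptions.

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Euclidean projection onto the l2 ball of the given radius."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def target_update(theta_target, theta, beta, radius):
    """Polyak-averaging style target update followed by a ball projection.

    Illustrative only: the paper uses two projections; here a single
    l2-ball projection stands in to keep the target weights bounded.
    """
    mixed = theta_target + beta * (theta - theta_target)
    return project_l2_ball(mixed, radius)

def ridge_td_step(theta, theta_target, phi_s, phi_s_next, reward,
                  gamma=0.99, alpha=0.01, eta=0.1):
    """One semi-gradient TD(0) update with linear features, bootstrapping
    on the target weights and adding a ridge (l2) penalty on theta."""
    td_error = reward + gamma * phi_s_next @ theta_target - phi_s @ theta
    return theta + alpha * (td_error * phi_s - eta * theta)
```

In a training loop, one would call `ridge_td_step` on each transition and `target_update` either every step with a small `beta` or periodically; the ridge term `eta * theta` corresponds to the regularization that defines the regularized TD fixed points mentioned above.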
