TD Convergence: An Optimization Perspective

We study the convergence behavior of the celebrated temporal-difference (TD) learning algorithm. Viewing the algorithm through the lens of optimization, we first argue that TD can be seen as an iterative optimization algorithm in which the function to be minimized changes from iteration to iteration. By carefully investigating the divergence displayed by TD on a classical counterexample, we identify two forces that determine the convergent or divergent behavior of the algorithm. We then formalize this observation in the linear TD setting with quadratic loss and prove that convergence of TD hinges on the interplay between these two forces. We extend this optimization perspective to prove convergence of TD in a much broader setting than linear approximation and quadratic loss. Our results provide a theoretical explanation for the successful application of TD in reinforcement learning.
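To make the "changing objective" view concrete, the following is a minimal sketch (not the paper's code) of linear TD(0) on a single transition, written as a gradient step on a per-iteration quadratic whose bootstrap target is frozen at the current parameters. All names (td_step, surrogate_loss, phi, theta, gamma, alpha) are illustrative assumptions introduced here, not taken from the paper.

```python
import numpy as np

# Sketch of the optimization view of TD: linear TD(0) on one transition is a
# gradient step on a quadratic whose bootstrap target is frozen at the current
# parameters, so the objective being minimized changes at every iteration.

def td_step(theta, phi, phi_next, reward, gamma, alpha):
    """One linear TD(0) update for a transition with features phi -> phi_next."""
    target = reward + gamma * phi_next @ theta   # bootstrap target (theta frozen)
    td_error = target - phi @ theta
    return theta + alpha * td_error * phi        # semi-gradient step

def surrogate_loss(theta, theta_frozen, phi, phi_next, reward, gamma):
    """Per-iteration quadratic f_t(theta) = 0.5 * (phi'theta - (r + gamma * phi_next'theta_frozen))^2.
    Its minimizer moves whenever theta_frozen moves, which is the
    'function to be minimized changes per iteration' view in the abstract."""
    target = reward + gamma * phi_next @ theta_frozen
    return 0.5 * (phi @ theta - target) ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.normal(size=3)
    phi, phi_next = rng.normal(size=3), rng.normal(size=3)
    r, gamma, alpha = 1.0, 0.99, 0.1

    # The TD step coincides with a gradient step on the surrogate loss with the
    # target frozen at the current theta (the gradient ignores the target's theta).
    grad = (phi @ theta - (r + gamma * phi_next @ theta)) * phi
    assert np.allclose(td_step(theta, phi, phi_next, r, gamma, alpha),
                       theta - alpha * grad)
    print("TD(0) step == gradient step on the frozen-target quadratic")
```

Whether iterating such steps converges then depends on how the frozen-target objective shifts between iterations, which is the interplay of forces the abstract alludes to.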
