Zap Q-Learning With Nonlinear Function Approximation

The Zap stochastic approximation (SA) algorithm was introduced recently as a means to accelerate convergence in reinforcement learning algorithms. While the numerical results were impressive, stability (in the sense of boundedness of parameter estimates) was established in only a few special cases. This class of algorithms is generalized in this paper, and stability is established under very general conditions; the result applies to a wide range of algorithms found in reinforcement learning. Two classes are considered here. (i) The natural generalization of Watkins' algorithm is not always stable in function-approximation settings: parameter estimates may diverge to infinity even with linear function approximation on a simple finite state-action MDP. Under mild conditions, the Zap SA algorithm provides a stable algorithm, even in the case of nonlinear function approximation. (ii) The GQ algorithm of Maei et al. (2010) is designed to address this stability challenge. Analysis is provided to explain why, despite this, the algorithm may be very slow to converge in practice. The new Zap GQ algorithm is stable even for nonlinear function approximation.
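For readers unfamiliar with the recursion behind these claims, the sketch below illustrates the structure of Zap Q-learning in the linear function approximation case, along the lines of the algorithm in [44] and [34]: a matrix gain is estimated on a fast time scale and used for a Newton-Raphson-like parameter update on a slow time scale. This is only a minimal illustration under stated assumptions, not the authors' implementation: the environment interface env.reset()/env.step(), the feature map phi (assumed to return NumPy vectors), the uniform exploration policy, and the step-size exponent rho are placeholders chosen for the example, and the pseudo-inverse is a pragmatic numerical guard rather than part of the algorithm's specification.

import numpy as np

def zap_q_learning(env, phi, n_features, n_actions, gamma=0.99,
                   n_steps=10_000, rho=0.85, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)                 # Q-function parameters: Q_theta(x, u) = theta @ phi(x, u)
    A_hat = np.zeros((n_features, n_features))   # running estimate of the mean-field Jacobian
    x = env.reset()
    u = int(rng.integers(n_actions))
    for n in range(1, n_steps + 1):
        alpha = 1.0 / n            # slower step size, drives the parameter update
        beta = 1.0 / n ** rho      # faster step size, drives the matrix-gain update
        x_next, r = env.step(u)
        q_next = [theta @ phi(x_next, a) for a in range(n_actions)]
        u_greedy = int(np.argmax(q_next))
        zeta = phi(x, u)                                    # eligibility (feature) vector
        d = r + gamma * q_next[u_greedy] - theta @ zeta     # temporal-difference error
        # Faster time scale: track the Jacobian of the mean field at the current parameters.
        A_n = np.outer(zeta, gamma * phi(x_next, u_greedy) - zeta)
        A_hat = A_hat + beta * (A_n - A_hat)
        # Slower time scale: Newton-Raphson-like step using the matrix gain.
        # The pseudo-inverse guards against the (rank-deficient) early estimates of A_hat.
        theta = theta - alpha * np.linalg.pinv(A_hat) @ (zeta * d)
        x = x_next
        u = int(rng.integers(n_actions))    # uniformly random exploration (placeholder behavior policy)
    return theta

The faster step size beta_n for the matrix estimate, relative to the slower alpha_n for theta, reflects the two-time-scale structure on which the stability analysis of this class of algorithms rests.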

[1]  Shalabh Bhatnagar,et al.  Fast gradient-descent methods for temporal-difference learning with linear function approximation , 2009, ICML '09.

[2]  Michael I. Jordan,et al.  MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES , 1996 .

[3]  H. Robbins A Stochastic Approximation Method , 1951 .

[4]  V. Borkar Stochastic Approximation: A Dynamical Systems Viewpoint , 2008 .

[5]  Michael I. Jordan,et al.  Provably Efficient Reinforcement Learning with Linear Function Approximation , 2019, COLT.

[6]  Leemon C. Baird,et al.  Residual Algorithms: Reinforcement Learning with Function Approximation , 1995, ICML.

[7]  Richard S. Sutton,et al.  Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding , 1996 .

[8]  S. Smale  Convergent process of price adjustment and global Newton methods , 1976 .

[9]  Geoffrey J. Gordon Reinforcement Learning with Function Approximation Converges to a Region , 2000, NIPS.

[10]  Shalabh Bhatnagar,et al.  Stability of Stochastic Approximations With “Controlled Markov” Noise and Temporal Difference Learning , 2015, IEEE Transactions on Automatic Control.

[11]  Richard L. Tweedie,et al.  Markov Chains and Stochastic Stability , 1993, Communications and Control Engineering Series.

[12]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[13]  S. Liberty,et al.  Linear Systems , 2010, Scientific Parallel Computing.

[14]  John N. Tsitsiklis,et al.  Feature-based methods for large scale dynamic programming , 2004, Machine Learning.

[15]  Sean P. Meyn,et al.  An analysis of reinforcement learning with function approximation , 2008, ICML '08.

[16]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Marek Petrik,et al.  Finite-Sample Analysis of Proximal Gradient TD Algorithms , 2015, UAI.

[18]  Adithya M. Devraj,et al.  Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning , 2020, ArXiv.

[19]  John N. Tsitsiklis,et al.  Asynchronous Stochastic Approximation and Q-Learning , 1994, Machine Learning.

[20]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[21]  Hilbert J. Kappen,et al.  Speedy Q-Learning , 2011, NIPS.

[22]  M. Metivier,et al.  Applications of a Kushner and Clark lemma to general classes of stochastic algorithms , 1984, IEEE Trans. Inf. Theory.

[23]  Le Song,et al.  SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation , 2017, ICML.

[24]  Ben J. A. Kröse,et al.  Learning from delayed rewards , 1995, Robotics Auton. Syst.

[25]  Csaba Szepesvári,et al.  Algorithms for Reinforcement Learning , 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[26]  D. Ruppert,et al.  Efficient Estimations from a Slowly Convergent Robbins-Monro Process , 1988 .

[27]  Yishay Mansour,et al.  Learning Rates for Q-learning , 2004, J. Mach. Learn. Res..

[28]  O. Nelles,et al.  An Introduction to Optimization , 1996, IEEE Antennas and Propagation Magazine.

[29]  R. Srikant,et al.  Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning , 2019, COLT.

[30]  Carlos S. Kubrusly,et al.  Stochastic approximation algorithms and applications , 1973, CDC 1973.

[31]  G. Fort,et al.  Convergence of Markovian Stochastic Approximation with Discontinuous Dynamics , 2014, SIAM J. Control. Optim..

[32]  Pierre Priouret,et al.  Adaptive Algorithms and Stochastic Approximations , 1990, Applications of Mathematics.

[33]  Shalabh Bhatnagar,et al.  Toward Off-Policy Learning Control with Function Approximation , 2010, ICML.

[34]  Sean P. Meyn,et al.  Zap Q-Learning - A User's Guide , 2019, 2019 Fifth Indian Control Conference (ICC).

[35]  Bo Liu,et al.  Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces , 2014, ArXiv.

[36]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[37]  Sean P. Meyn,et al.  A Liapounov bound for solutions of the Poisson equation , 1996 .

[38]  Sean P. Meyn,et al.  The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning , 2000, SIAM J. Control. Optim..

[39]  Lawrence C. Evans,et al.  Weak convergence methods for nonlinear partial differential equations , 1990 .

[40]  V. Borkar,et al.  A Concentration Bound for Stochastic Approximation via Alekseev’s Formula , 2015, Stochastic Systems.

[41]  Wray L. Buntine,et al.  Computing second derivatives in feed-forward networks: a review , 1994, IEEE Trans. Neural Networks.

[42]  Yurii Nesterov,et al.  Lectures on Convex Optimization , 2018 .

[43]  Shalabh Bhatnagar,et al.  A Generalization of the Borkar-Meyn Theorem for Stochastic Recursive Inclusions , 2015, Math. Oper. Res..

[44]  Sean P. Meyn,et al.  Zap Q-Learning , 2017, NIPS.

[45]  László Gerencsér,et al.  Convergence rate of moments in stochastic approximation with simultaneous perturbation gradient approximation and resetting , 1999, IEEE Trans. Autom. Control..

[46]  Shalabh Bhatnagar,et al.  Dynamics of stochastic approximation with Markov iterate-dependent noise with the stability of the iterates not ensured , 2016 .

[47]  Boris Polyak Some methods of speeding up the convergence of iteration methods , 1964 .

[48]  R. Sutton,et al.  A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation , 2008, NIPS.

[49]  John N. Tsitsiklis,et al.  Actor-Critic Algorithms , 1999, NIPS.

[50]  F. Clarke Functional Analysis, Calculus of Variations and Optimal Control , 2013 .

[51]  Shalabh Bhatnagar,et al.  Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation , 2009, NIPS.

[52]  D. Ruppert A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure , 1985 .

[53]  Michael I. Jordan,et al.  Acceleration via Symplectic Discretization of High-Resolution Differential Equations , 2019, NeurIPS.

[54]  Thinh T. Doan,et al.  Performance of Q-learning with Linear Function Approximation: Stability and Finite-Time Analysis , 2019 .

[55]  Shalabh Bhatnagar,et al.  Two Timescale Stochastic Approximation with Controlled Markov noise , 2015, Math. Oper. Res..

[56]  Yoram Singer,et al.  Second Order Optimization Made Practical , 2020, ArXiv.

[57]  Santiago Zazo,et al.  Diffusion gradient temporal difference for cooperative reinforcement learning with linear function approximation , 2012, 2012 3rd International Workshop on Cognitive Information Processing (CIP).

[58]  John N. Tsitsiklis,et al.  Analysis of temporal-difference learning with function approximation , 1996, NIPS 1996.

[59]  Charles R. Johnson,et al.  Matrix analysis , 1985, Statistical Inference for Engineers and Data Scientists.

[60]  Csaba Szepesvári,et al.  The Asymptotic Convergence-Rate of Q-learning , 1997, NIPS.

[61]  Stephen P. Boyd,et al.  A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights , 2014, J. Mach. Learn. Res..

[62]  Vivek S. Borkar,et al.  Actor-Critic - Type Learning Algorithms for Markov Decision Processes , 1999, SIAM J. Control. Optim..

[63]  J. Tsitsiklis,et al.  Convergence rate of linear two-time-scale stochastic approximation , 2004, math/0405287.

[64]  P. Olver Nonlinear Systems , 2013 .

[65]  Shie Mannor,et al.  Concentration Bounds for Two Timescale Stochastic Approximation with Applications to Reinforcement Learning , 2017, ArXiv.

[66]  Dimitri P. Bertsekas,et al.  Error Bounds for Approximations from Projected Linear Equations , 2010, Math. Oper. Res..

[67]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[68]  Magnus Egerstedt,et al.  Performance regulation and tracking via lookahead simulation: Preliminary results and validation , 2017, 2017 IEEE 56th Annual Conference on Decision and Control (CDC).

[69]  Ana Busic,et al.  Explicit Mean-Square Error Bounds for Monte-Carlo and Linear Stochastic Approximation , 2020, AISTATS.

[70]  Magnus Egerstedt,et al.  Tracking Control by the Newton-Raphson Flow: Applications to Autonomous Vehicles , 2019, 2019 18th European Control Conference (ECC).

[71]  Shalabh Bhatnagar,et al.  A stability criterion for two timescale stochastic approximation schemes , 2017, Autom..

[72]  Eric Moulines,et al.  Stability of Stochastic Approximation under Verifiable Conditions , 2005, Proceedings of the 44th IEEE Conference on Decision and Control.

[73]  Vivek S. Borkar,et al.  Concentration bounds for two time scale stochastic approximation , 2018, 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[74]  Sean P. Meyn,et al.  Fastest Convergence for Q-learning , 2017, ArXiv.