Accelerated Gradient Temporal Difference Learning

The family of temporal difference (TD) methods spans a spectrum from computationally frugal linear methods, such as TD(λ), to data-efficient least-squares methods. Least-squares methods make the best use of available data by directly computing the TD solution, and thus do not require tuning a typically highly sensitive learning-rate parameter, but they require quadratic computation and storage. Recent algorithmic developments have yielded several sub-quadratic methods that use an approximation to the least-squares TD solution, but incur bias. In this paper, we propose a new family of accelerated gradient TD (ATD) methods that (1) provide data-efficiency benefits similar to least-squares methods at a fraction of the computation and storage, (2) significantly reduce parameter sensitivity compared to linear TD methods, and (3) are asymptotically unbiased. We illustrate these claims with a proof of convergence in expectation and with experiments on several benchmark domains and a large-scale industrial energy allocation domain.
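
To make the trade-off described above concrete, here is a minimal sketch contrasting the two ends of the spectrum: an incremental linear TD(0) update (linear per-step cost, but sensitive to the step size) versus a batch least-squares TD solve (no step size, but quadratic storage and a d×d linear system). The feature matrices, step size, regularizer, and function names below are illustrative assumptions for this sketch; this is not the paper's ATD algorithm.

```python
import numpy as np

def td0_update(w, phi, phi_next, reward, gamma=0.99, alpha=0.05):
    """One linear TD(0) step: O(d) compute and storage, but alpha must be tuned."""
    delta = reward + gamma * (phi_next @ w) - (phi @ w)  # TD error
    return w + alpha * delta * phi

def lstd_solution(Phi, Phi_next, rewards, gamma=0.99, reg=1e-3):
    """Batch LSTD: solves A w = b directly (no step size), but A is d x d,
    so computation and storage grow quadratically with the number of features."""
    A = Phi.T @ (Phi - gamma * Phi_next) + reg * np.eye(Phi.shape[1])
    b = Phi.T @ rewards
    return np.linalg.solve(A, b)

# Tiny synthetic example: d = 4 random features, T = 100 transitions.
rng = np.random.default_rng(0)
d, T = 4, 100
Phi, Phi_next = rng.normal(size=(T, d)), rng.normal(size=(T, d))
rewards = rng.normal(size=T)

w_td = np.zeros(d)
for t in range(T):
    w_td = td0_update(w_td, Phi[t], Phi_next[t], rewards[t])

w_lstd = lstd_solution(Phi, Phi_next, rewards)
```

The sub-quadratic methods the abstract refers to approximate the LSTD solution to avoid forming the full d×d system, which is where the bias arises.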
