An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD(λ)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD(λ), and GQ(λ). Compared to these methods, our emphatic TD(λ) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
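
To make the abstract's claims concrete, the sketch below shows a single emphatic TD(λ) update with linear function approximation in Python/NumPy. The update equations (followon trace F, emphasis M, eligibility trace e) follow the paper; the function name, argument layout, and initialization values are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def emphatic_td_step(theta, e, F, rho_prev, rho, phi, phi_next, reward,
                     gamma, gamma_next, lam, interest, alpha):
    """One step of emphatic TD(lambda) with linear function approximation.

    The method carries a single weight vector `theta`, one eligibility-trace
    vector `e`, and one scalar followon trace `F` across steps; `alpha` is the
    single step-size parameter.  `gamma`, `lam`, and `interest` may depend on
    the current state, `gamma_next` on the next state, and `rho` / `rho_prev`
    are the importance-sampling ratios pi(A|S)/mu(A|S) for the current and
    previous actions (ratios are 1 under on-policy training).
    """
    # Followon trace: discounted, importance-weighted accumulation of interest.
    F = rho_prev * gamma * F + interest
    # Emphasis: mixes the state's own interest with the followon trace.
    M = lam * interest + (1.0 - lam) * F
    # Eligibility trace, scaled by the emphasis and the importance ratio.
    e = rho * (gamma * lam * e + M * phi)
    # Ordinary linear TD error.
    delta = reward + gamma_next * theta.dot(phi_next) - theta.dot(phi)
    # Update of the single learned parameter vector.
    theta = theta + alpha * delta * e
    return theta, e, F

# Illustrative initialization for n features before the first step
# (F starts at 0, so the value of rho_prev on the first call is irrelevant):
# theta, e, F, rho_prev = np.zeros(n), np.zeros(n), 0.0, 1.0
```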
