Q(λ) with Off-Policy Corrections

We propose and analyze an alternative approach to off-policy multi-step temporal-difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence in both policy evaluation and control, provided certain conditions hold. These conditions relate the distance between the target and behavior policies, the eligibility-trace parameter, and the discount factor, and they formalize an underlying tradeoff in off-policy TD(λ). We illustrate this theoretical relationship empirically on a continuous-state control task.
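
As a rough sketch of the correction the abstract describes (our notation and an illustrative form of the result, not a verbatim statement from the paper): the multi-step return bootstraps on the current estimate \(Q\) through expected TD errors under the target policy \(\pi\), with no importance-sampling ratios,

\[
G_t \;=\; Q(x_t, a_t) \;+\; \sum_{n \ge 0} (\gamma\lambda)^n \, \delta_{t+n},
\qquad
\delta_k \;=\; r_k + \gamma \, \mathbb{E}_{a \sim \pi(\cdot \mid x_{k+1})} Q(x_{k+1}, a) - Q(x_k, a_k),
\]

and a condition of the advertised form bounds \(\lambda\) by the mismatch between the target policy \(\pi\) and the behavior policy \(\mu\), e.g. \(\lambda \gamma \epsilon \le 1 - \gamma\) with \(\epsilon \ge \max_x \lVert \pi(\cdot \mid x) - \mu(\cdot \mid x) \rVert_1\): the larger the policy gap, the shorter the usable traces.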
