Multi-step Off-policy Learning Without Importance Sampling Ratios

To estimate the value functions of policies from exploratory data, most model-free off-policy algorithms rely on importance sampling, where the use of importance sampling ratios often leads to high-variance estimates. It is thus desirable to learn off-policy without using the ratios. However, no such algorithm exists for multi-step learning with function approximation. In this paper, we introduce the first such algorithm based on temporal-difference (TD) learning updates. We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner. Our new algorithm achieves stability using a two-timescale gradient-based TD update. A prior lookup-table algorithm, Tree Backup, can also be recovered through action-dependent bootstrapping and becomes a special case of our algorithm. In two challenging off-policy tasks, we demonstrate that our algorithm is stable, effectively avoids the high-variance issue, and can perform substantially better than its state-of-the-art counterpart.
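The following is a minimal sketch, not the paper's actual algorithm, of how action-dependent bootstrapping can eliminate importance sampling ratios from a per-decision eligibility-trace update. The function names and the particular choice of bootstrapping parameter (lambda proportional to the behavior probability mu(a|s), scaled by an assumed tuning constant zeta) are illustrative assumptions; the point is only that the product lambda * rho then depends on the target probability pi(a|s) alone, so the ratio pi/mu is never formed.

```python
import numpy as np

def trace_update_with_ratio(e, phi, gamma, lam, pi_a, mu_a):
    """Standard per-decision off-policy trace: forms the ratio pi/mu,
    which can be very large when the behavior probability mu_a is small."""
    rho = pi_a / mu_a
    return gamma * lam * rho * e + phi

def trace_update_action_dependent(e, phi, gamma, zeta, pi_a):
    """Illustrative action-dependent bootstrapping (an assumption, not the
    paper's exact update): choosing lam_t = zeta * mu_a makes
    lam_t * rho_t = zeta * pi_a, so the update never divides by mu_a."""
    return gamma * zeta * pi_a * e + phi

# Tiny check that the two updates coincide when lam = zeta * mu_a.
rng = np.random.default_rng(0)
e, phi = rng.normal(size=4), rng.normal(size=4)
gamma, zeta, pi_a, mu_a = 0.99, 0.9, 0.3, 0.05
lhs = trace_update_with_ratio(e, phi, gamma, zeta * mu_a, pi_a, mu_a)
rhs = trace_update_action_dependent(e, phi, gamma, zeta, pi_a)
assert np.allclose(lhs, rhs)
```

In this illustrative setting, taking the bootstrapping parameter to be a constant multiple of the target probability instead recovers a Tree-Backup-style trace, which is consistent with the abstract's remark that Tree Backup arises as a special case of action-dependent bootstrapping.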
