Multi-step Off-policy Learning Without Importance Sampling Ratios

To estimate the value functions of policies from exploratory data, most model-free off-policy algorithms rely on importance sampling, where the use of importance sampling ratios often leads to high-variance estimates. It is thus desirable to learn off-policy without using the ratios. However, no such algorithm exists for multi-step learning with function approximation. In this paper, we introduce the first such algorithm based on temporal-difference (TD) learning updates. We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner. Our new algorithm achieves stability using a two-timescale gradient-based TD update. A prior lookup-table algorithm, Tree Backup, can also be recovered through action-dependent bootstrapping and becomes a special case of our algorithm. In two challenging off-policy tasks, we demonstrate that our algorithm is stable, effectively avoids the high-variance issue, and can perform substantially better than its state-of-the-art counterpart.
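The following is a minimal sketch, not the paper's actual algorithm, of how action-dependent bootstrapping can eliminate importance sampling ratios from a per-decision eligibility-trace update. The function names and the particular choice of bootstrapping parameter (lambda proportional to the behavior probability mu(a|s), scaled by an assumed tuning constant zeta) are illustrative assumptions; the point is only that the product lambda * rho then depends on the target probability pi(a|s) alone, so the ratio pi/mu is never formed.

```python
import numpy as np

def trace_update_with_ratio(e, phi, gamma, lam, pi_a, mu_a):
    """Standard per-decision off-policy trace: forms the ratio pi/mu,
    which can be very large when the behavior probability mu_a is small."""
    rho = pi_a / mu_a
    return gamma * lam * rho * e + phi

def trace_update_action_dependent(e, phi, gamma, zeta, pi_a):
    """Illustrative action-dependent bootstrapping (an assumption, not the
    paper's exact update): choosing lam_t = zeta * mu_a makes
    lam_t * rho_t = zeta * pi_a, so the update never divides by mu_a."""
    return gamma * zeta * pi_a * e + phi

# Tiny check that the two updates coincide when lam = zeta * mu_a.
rng = np.random.default_rng(0)
e, phi = rng.normal(size=4), rng.normal(size=4)
gamma, zeta, pi_a, mu_a = 0.99, 0.9, 0.3, 0.05
lhs = trace_update_with_ratio(e, phi, gamma, zeta * mu_a, pi_a, mu_a)
rhs = trace_update_action_dependent(e, phi, gamma, zeta, pi_a)
assert np.allclose(lhs, rhs)
```

In this illustrative setting, taking the bootstrapping parameter to be a constant multiple of the target probability instead recovers a Tree-Backup-style trace, which is consistent with the abstract's remark that Tree Backup arises as a special case of action-dependent bootstrapping.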
