Safe and Efficient Off-Policy Reinforcement Learning

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace($\lambda$), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient, as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contraction properties of the underlying operator in both the off-policy policy-evaluation and control settings and derive online, sample-based algorithms. We believe this is the first return-based off-policy control algorithm that converges almost surely to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q($\lambda$), which had been an open problem since 1989. We illustrate the benefits of Retrace($\lambda$) on a standard suite of Atari 2600 games.
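For concreteness, here is a minimal sketch of the return-based off-policy operator with Retrace's truncated trace coefficients, written in the paper's notation ($Q$ is the current action-value estimate, $\pi$ the target policy, $\mu$ the behaviour policy, $\gamma$ the discount factor); it is illustrative only and is not a substitute for the paper's precise definitions and assumptions:

$$
\mathcal{R}Q(x,a) \;=\; Q(x,a) \;+\; \mathbb{E}_{\mu}\!\left[\sum_{t\ge 0}\gamma^{t}\Big(\prod_{s=1}^{t} c_{s}\Big)\Big(r_{t}+\gamma\,\mathbb{E}_{\pi}Q(x_{t+1},\cdot)-Q(x_{t},a_{t})\Big)\right],
\qquad
c_{s}=\lambda\,\min\!\Big(1,\,\frac{\pi(a_{s}\mid x_{s})}{\mu(a_{s}\mid x_{s})}\Big).
$$

Truncating the importance ratios at $1$ keeps the product of traces, and hence the variance, bounded regardless of how far $\mu$ is from $\pi$ (safety), while the traces remain close to $\lambda$ when $\mu$ is near $\pi$, so near on-policy samples are used efficiently.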
