Multi-step off-policy reinforcement learning has achieved great success. However, existing multi-step methods usually impose a fixed prior on the bootstrap steps, while off-policy methods often require additional correction, suffering from certain undesired effects. In this paper, we propose a novel bootstrapping method, which greedily takes the maximum value among the bootstrapping values with varying steps. The new method has two desired properties: 1) it can flexibly adjust the bootstrap step based on the quality of the data and the learned value function; 2) it can safely and robustly utilize data from an arbitrary behavior policy without additional correction, regardless of its quality or "off-policyness". We analyze the theoretical properties of the related operator, showing that it converges to the globally optimal value function at a faster rate than the traditional Bellman Optimality Operator. Furthermore, based on this new operator, we derive new model-free RL algorithms named Greedy Multi-step Q Learning (and Greedy Multi-step DQN). Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance on a series of standard benchmark tasks.
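For illustration, here is a minimal sketch of the greedy bootstrap target described above: along a sampled trajectory, one computes the n-step bootstrapped return for every horizon n up to some maximum and greedily picks the largest as the learning target. The function and argument names (greedy_multistep_target, rewards, next_q_values, max_steps) are illustrative assumptions, not the paper's implementation.

```python
def greedy_multistep_target(rewards, next_q_values, gamma, max_steps):
    """Illustrative greedy multi-step bootstrap target.

    rewards       : list of rewards r_0, ..., r_{N-1} along a sampled trajectory
    next_q_values : list of bootstrap values max_a Q(s_n, a) for n = 1, ..., N
    gamma         : discount factor in [0, 1)
    max_steps     : largest bootstrap horizon N considered
    Returns the maximum over the n-step bootstrapped returns,
    i.e. the greedy choice among the candidate targets.
    """
    candidates = []
    g = 0.0  # running discounted sum of rewards
    for n in range(1, max_steps + 1):
        g += (gamma ** (n - 1)) * rewards[n - 1]
        candidates.append(g + (gamma ** n) * next_q_values[n - 1])
    return max(candidates)
```

Because the target is a maximum over candidate returns rather than an importance-weighted mixture, no off-policy correction terms are needed, which is the property the abstract emphasizes.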