Multi-step off-policy reinforcement learning has achieved great success. However, existing multi-step methods usually impose a fixed prior on the bootstrap steps, while off-policy methods often require additional correction, suffering from certain undesired effects. In this paper, we propose a novel bootstrapping method, which greedily takes the maximum value among the bootstrapping values with varying steps. The new method has two desired properties: 1) it can flexibly adjust the bootstrap step based on the quality of the data and the learned value function; 2) it can safely and robustly utilize data from an arbitrary behavior policy without additional correction, regardless of its quality or "off-policyness". We analyze the theoretical properties of the related operator, showing that it converges to the globally optimal value function at a faster rate than the traditional Bellman Optimality Operator. Furthermore, based on this new operator, we derive new model-free RL algorithms named Greedy Multi-step Q Learning (and Greedy Multi-step DQN). Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance on a series of standard benchmark tasks.
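For illustration, here is a minimal sketch of the greedy bootstrap target described above: along a sampled trajectory, one computes the n-step bootstrapped return for every horizon n up to some maximum and greedily picks the largest as the learning target. The function and argument names (greedy_multistep_target, rewards, next_q_values, max_steps) are illustrative assumptions, not the paper's implementation.

```python
def greedy_multistep_target(rewards, next_q_values, gamma, max_steps):
    """Illustrative greedy multi-step bootstrap target.

    rewards       : list of rewards r_0, ..., r_{N-1} along a sampled trajectory
    next_q_values : list of bootstrap values max_a Q(s_n, a) for n = 1, ..., N
    gamma         : discount factor in [0, 1)
    max_steps     : largest bootstrap horizon N considered
    Returns the maximum over the n-step bootstrapped returns,
    i.e. the greedy choice among the candidate targets.
    """
    candidates = []
    g = 0.0  # running discounted sum of rewards
    for n in range(1, max_steps + 1):
        g += (gamma ** (n - 1)) * rewards[n - 1]
        candidates.append(g + (gamma ** n) * next_q_values[n - 1])
    return max(candidates)
```

Because the target is a maximum over candidate returns rather than an importance-weighted mixture, no off-policy correction terms are needed, which is the property the abstract emphasizes.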