Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

We propose Episodic Backward Update (EBU), a novel deep reinforcement learning algorithm with direct value propagation. In contrast to the conventional use of experience replay with uniform random sampling, our agent samples a whole episode and successively propagates the value of each state back to its preceding states. Our computationally efficient recursive algorithm allows sparse and delayed rewards to propagate directly through all transitions of the sampled episode. We theoretically prove the convergence of the EBU method and experimentally demonstrate its performance in both deterministic and stochastic environments. In particular, across 49 games of the Atari 2600 domain, EBU matches the mean and median human-normalized performance of DQN using only 5% and 10% of the samples, respectively.
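
To make the backward propagation concrete, below is a minimal sketch of how per-episode TD targets could be computed in a single backward sweep, in the spirit of the abstract. The episode layout as (state, action, reward) tuples, the `q_target` callable standing in for a target network, and the blending coefficient `beta` are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def episodic_backward_targets(episode, q_target, gamma=0.99, beta=0.5):
    """Sketch: compute TD targets for one sampled episode by sweeping backward.

    episode:  list of (state, action, reward) tuples in time order;
              the last transition is assumed terminal.
    q_target: callable mapping a state to a vector of Q-values (one per action),
              standing in for a target network (assumed interface).
    beta:     hypothetical blending coefficient between the freshly propagated
              return and the bootstrapped estimate.
    """
    T = len(episode)
    # Temporary Q-table over the episode's states, seeded from the target network.
    q_tilde = np.stack([q_target(s) for (s, _, _) in episode])  # shape (T, n_actions)

    targets = np.empty(T)
    targets[T - 1] = episode[T - 1][2]  # terminal transition: target is just the reward
    for k in range(T - 2, -1, -1):      # recurse backward through the episode
        _, a_next, _ = episode[k + 1]
        # Write the newly computed target back into the temporary table so that
        # a sparse terminal reward propagates through every earlier transition.
        q_tilde[k + 1, a_next] = (
            beta * targets[k + 1] + (1 - beta) * q_tilde[k + 1, a_next]
        )
        r_k = episode[k][2]
        targets[k] = r_k + gamma * q_tilde[k + 1].max()
    return targets
```

Because each target is formed from the already-updated value of the next state, a reward received only at the end of the episode reaches the first transition after a single sweep, rather than after many uniformly sampled replay updates.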
