Off-policy Multi-step Q-learning

In the past few years, off-policy reinforcement learning methods have shown promising results for robot control. Deep Q-learning, however, still suffers from poor data-efficiency, which limits its use in real-world applications. We follow the idea of multi-step TD-learning to improve data-efficiency while remaining off-policy, proposing two novel Temporal-Difference formulations: (1) Truncated Q-functions, which represent the return of the first n steps of a policy rollout, and (2) Shifted Q-functions, which represent the far-sighted return after this truncated rollout. We prove that the combination of these short- and long-term predictions represents the full return, leading to the Composite Q-learning algorithm. We show the efficacy of Composite Q-learning in the tabular case and compare our approach in the function-approximation setting with TD3, Model-based Value Expansion, and TD3(Delta), which we introduce as an off-policy variant of TD(Delta). On three simulated robot tasks, Composite TD3 outperforms TD3 as well as state-of-the-art off-policy multi-step approaches in terms of data-efficiency.
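To make the decomposition described above concrete, the following is a minimal tabular sketch of how truncated and shifted Q-tables could be updated from a single off-policy transition and recomposed into a full-return estimate. The function name composite_q_update, the table layout, and the exact bootstrapping order are illustrative assumptions made for this summary, not the authors' reference implementation.

```python
import numpy as np

def composite_q_update(Q_trunc, Q_shift, s, a, r, s_next, gamma=0.99, alpha=0.1):
    """One update of truncated and shifted Q-tables from a single transition.

    Assumed semantics (a plausible reading of the abstract, not the paper's exact rules):
      Q_trunc[i][s, a] ~ discounted reward of the first i+1 steps of the greedy rollout
      Q_shift[i][s, a] ~ gamma^(i+1)-discounted value obtained after those i+1 steps
    so Q_trunc[-1] + Q_shift[-1] recomposes an estimate of the full return.
    """
    n = len(Q_trunc)  # truncation horizon
    # Greedy bootstrap action w.r.t. the composite (full-horizon) estimate.
    a_next = int(np.argmax(Q_trunc[n - 1][s_next] + Q_shift[n - 1][s_next]))

    # Truncated Q-functions: each horizon bootstraps from the next-shorter one.
    targets_trunc = [r]  # the 1-step truncated return is just the immediate reward
    for i in range(1, n):
        targets_trunc.append(r + gamma * Q_trunc[i - 1][s_next, a_next])

    # Shifted Q-functions: push the far-sighted value one more step into the future.
    composite_next = Q_trunc[n - 1][s_next, a_next] + Q_shift[n - 1][s_next, a_next]
    targets_shift = [gamma * composite_next]  # value remaining after skipping 1 step
    for i in range(1, n):
        targets_shift.append(gamma * Q_shift[i - 1][s_next, a_next])

    for i in range(n):
        Q_trunc[i][s, a] += alpha * (targets_trunc[i] - Q_trunc[i][s, a])
        Q_shift[i][s, a] += alpha * (targets_shift[i] - Q_shift[i][s, a])

# Example setup for a hypothetical gridworld with 25 states and 4 actions:
n_steps, S, A = 3, 25, 4
Q_trunc = [np.zeros((S, A)) for _ in range(n_steps)]
Q_shift = [np.zeros((S, A)) for _ in range(n_steps)]
# Composite value estimate for acting or evaluation: Q_trunc[-1] + Q_shift[-1]
```

Because the bootstrap action is chosen greedily with respect to the composite estimate rather than the behavior policy, transitions from any behavior policy can be reused, which is the off-policy property the abstract emphasizes.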

[1] Sanjoy Dasgupta et al. Off-Policy Temporal Difference Learning with Function Approximation, 2001, ICML.

[2] Sergey Levine et al. Learning to Walk via Deep Reinforcement Learning, 2018, Robotics: Science and Systems.

[3] Richard S. Sutton et al. Multi-step Reinforcement Learning: A Unifying Algorithm, 2017, AAAI.

[4] Martin A. Riedmiller et al. Learning by Playing - Solving Sparse Reward Tasks from Scratch, 2018, ICML.

[5] Shane Legg et al. Human-level control through deep reinforcement learning, 2015, Nature.

[6] Guy Lever et al. Deterministic Policy Gradient Algorithms, 2014, ICML.

[7] Richard S. Sutton et al. Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target, 2019, arXiv.

[8] Doina Precup et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.

[9] Michael I. Jordan et al. Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning, 2018, arXiv.

[10] Yuval Tassa et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[11] Yuval Tassa et al. MuJoCo: A physics engine for model-based control, 2012, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[12] Tom Schaul et al. Rainbow: Combining Improvements in Deep Reinforcement Learning, 2017, AAAI.

[13] Susan A. Murphy et al. A Generalization Error for Q-Learning, 2005, J. Mach. Learn. Res.

[14] Demis Hassabis et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, 2018, Science.

[15] Matthew W. Hoffman et al. Distributed Distributional Deterministic Policy Gradients, 2018, ICLR.

[16] Romain Laroche et al. Hybrid Reward Architecture for Reinforcement Learning, 2017, NIPS.

[17] Honglak Lee et al. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion, 2018, NeurIPS.

[18] Joelle Pineau et al. Separating value functions across time-scales, 2019, ICML.

[19] Michael Kearns et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.

[20] David Silver et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.

[21] Michael I. Jordan et al. Is Q-learning Provably Efficient?, 2018, NeurIPS.

[22] Herke van Hoof et al. Addressing Function Approximation Error in Actor-Critic Methods, 2018, ICML.