Off-policy Multi-step Q-learning

In the past few years, off-policy reinforcement learning methods have shown promising results for robot control. Deep Q-learning, however, still suffers from poor data-efficiency, which limits its use in real-world applications. We follow the idea of multi-step TD-learning to improve data-efficiency while remaining off-policy, proposing two novel Temporal-Difference formulations: (1) Truncated Q-functions, which represent the return of the first n steps of a policy rollout, and (2) Shifted Q-functions, which represent the far-sighted return after this truncated rollout. We prove that the combination of these short- and long-term predictions represents the full return, leading to the Composite Q-learning algorithm. We show the efficacy of Composite Q-learning in the tabular case and compare our approach in the function-approximation setting with TD3, Model-based Value Expansion, and TD3(Delta), which we introduce as an off-policy variant of TD(Delta). On three simulated robot tasks, Composite TD3 outperforms TD3 as well as state-of-the-art off-policy multi-step approaches in terms of data-efficiency.
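To make the decomposition described above concrete, the following is a minimal tabular sketch of how truncated and shifted Q-tables could be updated from a single off-policy transition and recomposed into a full-return estimate. The function name composite_q_update, the table layout, and the exact bootstrapping order are illustrative assumptions made for this summary, not the authors' reference implementation.

```python
import numpy as np

def composite_q_update(Q_trunc, Q_shift, s, a, r, s_next, gamma=0.99, alpha=0.1):
    """One update of truncated and shifted Q-tables from a single transition.

    Assumed semantics (a plausible reading of the abstract, not the paper's exact rules):
      Q_trunc[i][s, a] ~ discounted reward of the first i+1 steps of the greedy rollout
      Q_shift[i][s, a] ~ gamma^(i+1)-discounted value obtained after those i+1 steps
    so Q_trunc[-1] + Q_shift[-1] recomposes an estimate of the full return.
    """
    n = len(Q_trunc)  # truncation horizon
    # Greedy bootstrap action w.r.t. the composite (full-horizon) estimate.
    a_next = int(np.argmax(Q_trunc[n - 1][s_next] + Q_shift[n - 1][s_next]))

    # Truncated Q-functions: each horizon bootstraps from the next-shorter one.
    targets_trunc = [r]  # the 1-step truncated return is just the immediate reward
    for i in range(1, n):
        targets_trunc.append(r + gamma * Q_trunc[i - 1][s_next, a_next])

    # Shifted Q-functions: push the far-sighted value one more step into the future.
    composite_next = Q_trunc[n - 1][s_next, a_next] + Q_shift[n - 1][s_next, a_next]
    targets_shift = [gamma * composite_next]  # value remaining after skipping 1 step
    for i in range(1, n):
        targets_shift.append(gamma * Q_shift[i - 1][s_next, a_next])

    for i in range(n):
        Q_trunc[i][s, a] += alpha * (targets_trunc[i] - Q_trunc[i][s, a])
        Q_shift[i][s, a] += alpha * (targets_shift[i] - Q_shift[i][s, a])

# Example setup for a hypothetical gridworld with 25 states and 4 actions:
n_steps, S, A = 3, 25, 4
Q_trunc = [np.zeros((S, A)) for _ in range(n_steps)]
Q_shift = [np.zeros((S, A)) for _ in range(n_steps)]
# Composite value estimate for acting or evaluation: Q_trunc[-1] + Q_shift[-1]
```

Because the bootstrap action is chosen greedily with respect to the composite estimate rather than the behavior policy, transitions from any behavior policy can be reused, which is the off-policy property the abstract emphasizes.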

[1] Sanjoy Dasgupta et al. Off-Policy Temporal Difference Learning with Function Approximation, 2001, ICML.

[2] Sergey Levine et al. Learning to Walk via Deep Reinforcement Learning, 2018, Robotics: Science and Systems.

[3] Richard S. Sutton et al. Multi-step Reinforcement Learning: A Unifying Algorithm, 2017, AAAI.

[4] Martin A. Riedmiller et al. Learning by Playing - Solving Sparse Reward Tasks from Scratch, 2018, ICML.

[5] Shane Legg et al. Human-level control through deep reinforcement learning, 2015, Nature.

[6] Guy Lever et al. Deterministic Policy Gradient Algorithms, 2014, ICML.

[7] Richard S. Sutton et al. Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target, 2019, arXiv.

[8] Doina Precup et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.

[9] Michael I. Jordan et al. Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning, 2018, arXiv.

[10] Yuval Tassa et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[11] Yuval Tassa et al. MuJoCo: A physics engine for model-based control, 2012, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[12] Tom Schaul et al. Rainbow: Combining Improvements in Deep Reinforcement Learning, 2017, AAAI.

[13] Susan A. Murphy et al. A Generalization Error for Q-Learning, 2005, J. Mach. Learn. Res.

[14] Demis Hassabis et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, 2018, Science.

[15] Matthew W. Hoffman et al. Distributed Distributional Deterministic Policy Gradients, 2018, ICLR.

[16] Romain Laroche et al. Hybrid Reward Architecture for Reinforcement Learning, 2017, NIPS.

[17] Honglak Lee et al. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion, 2018, NeurIPS.

[18] Joelle Pineau et al. Separating value functions across time-scales, 2019, ICML.

[19] Michael Kearns et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.

[20] David Silver et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.

[21] Michael I. Jordan et al. Is Q-learning Provably Efficient?, 2018, NeurIPS.

[22] Herke van Hoof et al. Addressing Function Approximation Error in Actor-Critic Methods, 2018, ICML.