Composite Q-learning: Multi-scale Q-function Decomposition and Separable Optimization

In recent years, off-policy reinforcement learning methods have shown promising results when applied to robot control. Deep Q-learning, however, still suffers from poor data-efficiency and is susceptible to stochasticity in the environment or in the reward function, which limits its use in real-world applications. We alleviate these problems by proposing two novel off-policy Temporal-Difference formulations: (1) Truncated Q-functions, which represent the return of the first n steps of a target-policy rollout with respect to the full action-value, and (2) Shifted Q-functions, which act as the farsighted return after this truncated rollout. This decomposition allows us to optimize both parts with individual learning rates, yielding a significant speedup in learning. We prove that the combination of these short- and long-term predictions represents the full return, leading to the Composite Q-learning algorithm. We demonstrate the efficacy of Composite Q-learning in the tabular case and compare Deep Composite Q-learning with TD3 and with TD3(Delta), which we introduce as an off-policy variant of TD(Delta). Moreover, we show that Composite TD3 significantly outperforms TD3 as well as state-of-the-art compositional Q-learning approaches in terms of data-efficiency on multiple simulated robot tasks, and that Composite Q-learning is robust to stochastic environments and reward functions.
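
To make the decomposition concrete, below is a minimal tabular sketch of how truncated and shifted Q-functions can be combined into a composite action-value and updated with separate learning rates. The toy problem sizes, the hyperparameters, and the specific single-step TD targets (a chain of n truncated and n shifted estimates that bootstrap on each other and on the composite value) are illustrative assumptions consistent with the definitions in the abstract, not necessarily the exact update rules of the paper.

```python
import numpy as np

# Toy problem sizes and hyperparameters (assumptions, for illustration only).
n_states, n_actions = 5, 2
n = 3                                  # truncation horizon of the rollout
gamma = 0.99
alpha_trunc, alpha_shift = 0.5, 0.1    # separate learning rates per component

# Q_trunc[i] estimates the discounted return of the first i steps of a
# target-policy rollout (Q_trunc[0] is identically zero).
# Q_shift[i] estimates the discounted return accrued after those i steps;
# index 0 is a placeholder, since the one-step bootstrap uses the composite
# value directly.
Q_trunc = np.zeros((n + 1, n_states, n_actions))
Q_shift = np.zeros((n + 1, n_states, n_actions))

def composite_q(s, a):
    """Full action-value: short-term (truncated) plus long-term (shifted) part."""
    return Q_trunc[n, s, a] + Q_shift[n, s, a]

def greedy_action(s):
    """Target policy: greedy with respect to the composite Q-function."""
    return int(np.argmax(Q_trunc[n, s] + Q_shift[n, s]))

def update(s, a, r, s_next):
    """One off-policy TD update from a single transition (s, a, r, s_next)."""
    a_next = greedy_action(s_next)
    # Truncated Q-functions bootstrap on the (i-1)-step truncated value,
    # so Q_trunc[i] approximates the discounted i-step return.
    for i in range(1, n + 1):
        target = r + gamma * Q_trunc[i - 1, s_next, a_next]
        Q_trunc[i, s, a] += alpha_trunc * (target - Q_trunc[i, s, a])
    # Shifted Q-functions push the composite value one step further back each
    # level, so Q_shift[n] approximates gamma^n times the value reached after
    # the truncated rollout.
    for i in range(1, n + 1):
        if i == 1:
            bootstrap = composite_q(s_next, a_next)
        else:
            bootstrap = Q_shift[i - 1, s_next, a_next]
        Q_shift[i, s, a] += alpha_shift * (gamma * bootstrap - Q_shift[i, s, a])
```

Because the truncated and shifted estimates are updated independently, alpha_trunc and alpha_shift can be tuned separately; this is the separable optimization of the short- and long-term parts that the abstract refers to.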
