Accelerating Reinforcement Learning with Suboptimal Guidance

Reinforcement learning in domains with sparse rewards is a difficult problem, and a large part of the training process is often spent searching the state space in a more or less random fashion for any learning signal. For control problems, a controller is often readily available which, while suboptimal, nevertheless solves the problem to some degree. This controller can be used to guide the initial exploration phase of the learning agent towards reward-yielding states, reducing the time before refinement of a viable policy can begin. In our work, the agent is guided through an auxiliary behaviour-cloning loss that is made conditional on a Q-filter, i.e. it is only applied in situations where the critic deems the guiding controller to be better than the agent. The Q-filter provides a natural way to adjust the guidance throughout the training process, allowing the agent to exceed the guiding controller in a manner that adapts to the task at hand and to the proficiency of the guiding controller. The contribution of this paper lies in identifying shortcomings in previously proposed implementations of the Q-filter concept and in suggesting ways these issues can be mitigated. These modifications are evaluated on the OpenAI Gym Fetch environments, showing clear improvements in adaptivity and increased performance in all robotic environments tested.
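The Q-filtered behaviour-cloning objective described above can be sketched as follows. This is a minimal illustration only, assuming a DDPG-style deterministic actor and critic implemented in PyTorch; the names (`actor`, `critic`, `guide_actions`, `bc_weight`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch


def actor_loss_with_q_filter(actor, critic, obs, guide_actions, bc_weight=1.0):
    """Sketch of a DDPG-style actor loss with a Q-filtered behaviour-cloning term.

    The behaviour-cloning term is applied only in states where the critic
    values the guiding controller's action above the agent's own action
    (the Q-filter). Shapes and module interfaces are assumptions.
    """
    agent_actions = actor(obs)  # [B, action_dim]

    # Standard deterministic policy-gradient term: maximise Q(s, pi(s)).
    dpg_loss = -critic(obs, agent_actions).mean()

    # Q-filter: 1 where the critic deems the guide's action better than the agent's.
    with torch.no_grad():
        q_guide = critic(obs, guide_actions).view(-1)   # [B]
        q_agent = critic(obs, agent_actions).view(-1)   # [B]
        mask = (q_guide > q_agent).float()              # [B]

    # Behaviour-cloning loss, masked so it only pulls the agent towards the
    # guide in states where the guide currently looks better.
    bc_per_sample = (agent_actions - guide_actions).pow(2).sum(dim=-1)  # [B]
    bc_loss = (mask * bc_per_sample).mean()

    return dpg_loss + bc_weight * bc_loss
```

Because the mask is recomputed from the current critic at every update, the amount of guidance shrinks automatically as the agent's own actions start to score higher than the guiding controller's.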
