Active deep Q-learning with demonstration

Reinforcement learning (RL) is a machine learning technique that aims to learn how to take actions in an environment so as to maximize some notion of reward. Recent research has shown that although the learning efficiency of RL can be improved with expert demonstrations, obtaining enough demonstrations usually takes considerable effort, which makes it impractical to train decent RL agents with expert demonstrations. In this work, we propose Active Reinforcement Learning with Demonstration, a new framework that streamlines RL in terms of demonstration effort by allowing the RL agent to actively query for demonstrations during training. Under this framework, we propose Active Deep Q-Network, a novel query strategy based on a classical RL algorithm, the deep Q-network (DQN). The proposed algorithm dynamically estimates the uncertainty of recent states and utilizes the queried demonstration data by optimizing a supervised loss in addition to the usual DQN loss. We propose two methods of estimating this uncertainty based on two state-of-the-art DQN variants, namely the divergence of bootstrapped DQN and the variance of noisy DQN. The empirical results validate that both methods not only learn faster than other passive expert-demonstration methods given the same amount of demonstration data, but also reach super-expert levels of performance across four different tasks.
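
The abstract does not spell out an implementation, so the following is only a rough sketch of the general idea, not the authors' code: a disagreement-based query test over bootstrapped Q-heads, and a training loss that adds a supervised term on queried demonstration actions to the usual DQN TD loss. All names and hyperparameters here (BootstrappedQNet, QUERY_THRESHOLD, MARGIN, the large-margin form of the supervised loss) are illustrative assumptions; the paper's actual uncertainty measures are the divergence of bootstrapped DQN and the variance of noisy DQN, which this sketch only approximates with a variance across bootstrap heads.

```python
# Illustrative sketch of an uncertainty-gated demonstration query for a bootstrapped
# Q-network, plus a combined TD + supervised loss. Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_HEADS, N_ACTIONS, STATE_DIM = 5, 3, 8
QUERY_THRESHOLD = 0.1   # assumed: query the expert when head disagreement exceeds this
MARGIN = 0.8            # assumed: margin used by the supervised demonstration loss
GAMMA = 0.99

class BootstrappedQNet(nn.Module):
    """Shared trunk with K independent Q-value heads (one per bootstrap mask)."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(64, N_ACTIONS) for _ in range(N_HEADS)])

    def forward(self, s):                        # (batch, STATE_DIM) -> (batch, K, actions)
        h = self.trunk(s)
        return torch.stack([head(h) for head in self.heads], dim=1)

def disagreement(q_heads):
    """Uncertainty proxy: variance across heads of the greedy action's Q-value."""
    greedy = q_heads.mean(dim=1).argmax(dim=-1)                                  # (batch,)
    q_greedy = q_heads.gather(2, greedy[:, None, None].expand(-1, N_HEADS, 1))   # (batch, K, 1)
    return q_greedy.squeeze(-1).var(dim=1)                                       # (batch,)

def combined_loss(q_net, batch, demo_actions=None):
    """Usual DQN TD loss, plus a large-margin supervised term on demonstrated actions."""
    s, a, r, s2, done = batch
    q_heads = q_net(s)                                                           # (B, K, A)
    q_sa = q_heads.gather(2, a[:, None, None].expand(-1, N_HEADS, 1)).squeeze(-1)
    with torch.no_grad():
        target = r[:, None] + GAMMA * (1.0 - done[:, None]) * q_net(s2).max(dim=2).values
    td_loss = F.smooth_l1_loss(q_sa, target)
    if demo_actions is None:
        return td_loss
    # Supervised term: the demonstrated action should beat every other action by MARGIN.
    q_mean = q_heads.mean(dim=1)                                                 # (B, A)
    margin = torch.full_like(q_mean, MARGIN)
    margin.scatter_(1, demo_actions[:, None], 0.0)
    sup_loss = ((q_mean + margin).max(dim=1).values
                - q_mean.gather(1, demo_actions[:, None]).squeeze(1)).mean()
    return td_loss + sup_loss

# Query step: ask the expert only when the agent is uncertain about the current state.
q_net = BootstrappedQNet()
state = torch.randn(1, STATE_DIM)
if disagreement(q_net(state)).item() > QUERY_THRESHOLD:
    demo_action = torch.randint(N_ACTIONS, (1,))     # stand-in for the queried expert action

# Training step on a toy batch, with demonstration actions for the queried states.
batch = (torch.randn(4, STATE_DIM), torch.randint(N_ACTIONS, (4,)),
         torch.randn(4), torch.randn(4, STATE_DIM), torch.zeros(4))
loss = combined_loss(q_net, batch, demo_actions=torch.randint(N_ACTIONS, (4,)))
loss.backward()
```

The design choice to gate queries on an uncertainty estimate, rather than querying at fixed intervals, is what distinguishes the active setting from passive learning-from-demonstration: the expert is consulted only where the agent's heads disagree most.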
