Deep Q-learning From Demonstrations

Deep reinforcement learning (RL) has achieved several high-profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages even relatively small sets of demonstration data to massively accelerate the learning process, and that automatically assesses the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN): it starts with better scores on the first million steps on 41 of 42 games, and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to outperform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
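The combination of temporal difference updates with supervised classification of the demonstrator's actions can be illustrated with a minimal sketch. The abstract names only these two components, so everything else here is an assumption made for illustration: the function names, the squared 1-step TD error, the large-margin form of the classification loss, and the values of `margin`, `gamma`, and `sup_weight`. It is not the paper's implementation, and it leaves out prioritized replay and any other terms the full method may use.

```python
import numpy as np

def td_loss(q_sa, reward, q_next, done, gamma=0.99):
    """Squared 1-step TD error for a single transition (assumed form)."""
    bootstrap = 0.0 if done else gamma * float(np.max(q_next))
    target = reward + bootstrap
    return 0.5 * (target - q_sa) ** 2

def margin_classification_loss(q_values, demo_action, margin=0.8):
    """Large-margin loss: max_a [Q(s,a) + margin * (a != a_demo)] - Q(s, a_demo).

    It is zero only when the demonstrated action beats every other
    action by at least `margin` (an illustrative value).
    """
    q_values = np.asarray(q_values, dtype=float)
    penalties = np.full(q_values.shape, margin)
    penalties[demo_action] = 0.0
    return float(np.max(q_values + penalties) - q_values[demo_action])

def combined_loss(q_values, action, reward, q_next, done, is_demo,
                  gamma=0.99, sup_weight=1.0):
    """TD loss on every transition, plus the supervised term on demo data only."""
    q_values = np.asarray(q_values, dtype=float)
    loss = td_loss(q_values[action], reward, q_next, done, gamma)
    if is_demo:
        loss += sup_weight * margin_classification_loss(q_values, action)
    return float(loss)
```

For a demonstration transition both terms contribute, while a transition generated by the agent itself receives only the TD term, so the supervised signal pulls the value estimates toward the demonstrator's choices without overriding reward-driven learning.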

Tom Schaul | Andrew Sendonaris | Joel Z. Leibo | Olivier Pietquin | Gabriel Dulac-Arnold | Ian Osband | Todd Hester | Marc Lanctot | Bilal Piot | Dan Horgan | Matej Vecerík | John Quan | John Agapiou | Audrunas Gruslys
