Backplay: "Man muss immer umkehren" ("one must always invert")

Model-free reinforcement learning (RL) requires a large number of trials to learn a good policy, especially in environments with sparse rewards. We explore a method for improving sample efficiency when demonstrations are available. Our approach, Backplay, uses a single demonstration to construct a curriculum for a given task. Rather than starting each training episode in the environment's fixed initial state, we start the agent near the end of the demonstration and move the starting point backwards over the course of training until it reaches the initial state. Our contributions are that we analytically characterize the types of environments where Backplay can improve training speed, demonstrate its effectiveness both in large grid worlds and in a complex four-player zero-sum game (Pommerman), and show that Backplay compares favorably to competing methods known to improve sample efficiency, including reward shaping, behavioral cloning, and reverse curriculum generation.
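To make the curriculum concrete, below is a minimal Python sketch of a Backplay-style start-state sampler. It assumes the demonstration's state sequence is available and that the environment can be reset to an arbitrary state; the helper `window_for_episode`, the uniform sampling within a window, and the `linear_window` schedule are illustrative assumptions, not the paper's exact procedure or hyperparameters.

```python
import random

def backplay_start_state(demo_states, episode, window_for_episode):
    """Sample the start state for one training episode.

    `demo_states` is the state sequence of a single demonstration.
    `window_for_episode` (an assumed helper) maps the episode index to
    (lo, hi) offsets measured backwards from the demonstration's final
    state. Early in training the window sits near the end of the
    demonstration; it slides backwards until episodes begin from the
    environment's true initial state, demo_states[0].
    """
    lo, hi = window_for_episode(episode)
    hi = min(hi, len(demo_states) - 1)   # never step past the initial state
    lo = min(lo, hi)
    offset = random.randint(lo, hi)      # uniform over the current window
    return demo_states[len(demo_states) - 1 - offset]

# An assumed linear schedule: slide the window 4 steps further back
# every 1,000 episodes until it covers the initial state.
def linear_window(episode, step=1000, width=4):
    lo = (episode // step) * width
    return lo, lo + width
```

Once the window reaches `demo_states[0]`, every episode starts from the environment's fixed initial state, so training reduces to the standard RL setup and the demonstration is no longer needed.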
