Efficient Exploration with Self-Imitation Learning via Trajectory-Conditioned Policy

This paper proposes a method for learning a trajectory-conditioned policy to imitate diverse demonstrations from the agent’s own past experiences. We demonstrate that such self-imitation drives exploration in diverse directions and increases the chance of finding a globally optimal solution in reinforcement learning problems, especially when the reward is sparse and deceptive. Our method significantly outperforms existing self-imitation learning and count-based exploration methods on various sparse-reward reinforcement learning tasks with local optima. In particular, we report a state-of-the-art score of more than 25,000 points on Montezuma’s Revenge without using expert demonstrations or resetting to arbitrary states.
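
The abstract leaves the mechanics implicit, so here is a rough, hypothetical sketch of the kind of loop it describes, not the paper's implementation: keep a buffer of diverse past trajectories keyed by where they ended, sample one as a demonstration at the start of each episode, condition the agent's behavior on it (follow the demonstration while it lasts, then explore beyond its endpoint), and store any new or better trajectory back into the buffer. `ToyEnv`, `embed`, `TrajectoryBuffer`, and `trajectory_conditioned_action` are invented placeholders; in the actual method the trajectory-conditioned policy would be a learned model and the imitation signal a shaped reward rather than hard-coded following.

```python
import random

# Hypothetical toy environment: a 1-D chain where only the far-right state gives reward.
class ToyEnv:
    def __init__(self, size=12):
        self.size = size

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = max(0, min(self.size - 1, self.pos + (1 if action == 1 else -1)))
        done = self.pos == self.size - 1
        return self.pos, (1.0 if done else 0.0), done


def embed(obs):
    # Placeholder for a learned or abstracted state embedding.
    return obs


class TrajectoryBuffer:
    """Keeps, per distinct end-state embedding, the best trajectory that reached it."""

    def __init__(self):
        self.best = {}  # embedding of final state -> (return, trajectory)

    def add(self, trajectory, ret):
        key = embed(trajectory[-1])
        if key not in self.best or ret > self.best[key][0]:
            self.best[key] = (ret, list(trajectory))

    def sample_demo(self):
        # Uniform over distinct end states, so rarely visited directions get revisited too.
        return random.choice(list(self.best.values()))[1] if self.best else None


def trajectory_conditioned_action(obs, demo, step):
    # Stand-in for the learned trajectory-conditioned policy: imitate the sampled
    # demonstration while it lasts, then act randomly to explore past its endpoint.
    if demo is not None and step + 1 < len(demo):
        return 1 if demo[step + 1] > obs else 0
    return random.choice([0, 1])


env, buf = ToyEnv(), TrajectoryBuffer()
for episode in range(200):
    demo = buf.sample_demo()
    obs = env.reset()
    trajectory, ret, done, step = [obs], 0.0, False, 0
    while not done and step < 30:
        action = trajectory_conditioned_action(obs, demo, step)
        obs, reward, done = env.step(action)
        # A real implementation would also add an imitation bonus for matching demo states.
        trajectory.append(obs)
        ret += reward
        step += 1
    buf.add(trajectory, ret)

print("distinct end states discovered:", len(buf.best))
```

Even in this toy chain, sampling diverse past endpoints and pushing exploration beyond them steadily extends the buffer's frontier toward the distant rewarding state, which mirrors the intuition the abstract appeals to for sparse, deceptive rewards.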
