BeBold: Exploration Beyond the Boundary of Explored Regions

Efficient exploration under sparse rewards remains a key challenge in deep reinforcement learning. To guide exploration, previous work makes extensive use of intrinsic rewards (IR), with heuristics that include visitation counts, curiosity, and state differences. In this paper, we analyze the pros and cons of each method and propose the regulated difference of inverse visitation counts as a simple but effective criterion for IR. The criterion helps the agent explore Beyond the Boundary of explored regions and mitigates common issues in count-based methods, such as short-sightedness and detachment. The resulting method, BeBold, solves the 12 most challenging procedurally-generated tasks in MiniGrid with just 120M environment steps and without any curriculum learning; in comparison, the previous SoTA solves only 50% of the tasks. BeBold also achieves SoTA on multiple tasks in NetHack, a popular rogue-like game that provides even more challenging procedurally-generated environments.
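The criterion is compact enough to sketch directly. Below is a minimal, illustrative Python implementation of a clipped difference of inverse visitation counts; the tabular counts keyed by a hashable state representation, the episodic first-visit gate, and all names used here (InverseCountDifferenceIR, obs_to_key, beta) are assumptions made for this sketch rather than the paper's exact implementation.

```python
from collections import defaultdict

# A minimal, illustrative sketch of a clipped ("regulated") difference of
# inverse visitation counts. Counts are tabular and keyed by any hashable
# state representation; the episodic first-visit gate and the names below
# are assumptions for illustration, not the paper's exact implementation.
class InverseCountDifferenceIR:
    def __init__(self):
        self.lifelong_counts = defaultdict(int)  # N(s): visits over all of training
        self.episode_counts = defaultdict(int)   # N_e(s): visits within the current episode

    def reset_episode(self, initial_state):
        """Call at the start of each episode so the initial state is counted once."""
        self.episode_counts.clear()
        self.lifelong_counts[initial_state] += 1
        self.episode_counts[initial_state] += 1

    def reward(self, state, next_state):
        """Intrinsic reward for the transition state -> next_state."""
        self.lifelong_counts[next_state] += 1
        self.episode_counts[next_state] += 1

        n_next = self.lifelong_counts[next_state]
        n_curr = max(self.lifelong_counts[state], 1)  # guard against an uncounted state

        # Positive only when moving from a better-explored state to a
        # less-explored one, i.e. when the agent pushes beyond the boundary
        # of the explored region.
        ir = max(1.0 / n_next - 1.0 / n_curr, 0.0)

        # Reward only the first visit within the episode, which discourages
        # dithering back and forth across the boundary.
        return ir if self.episode_counts[next_state] == 1 else 0.0


# Usage sketch: add the intrinsic bonus to the extrinsic reward during rollouts.
# obs_to_key (hashing an observation to a key) and the scale beta are hypothetical.
#
#   ir_module = InverseCountDifferenceIR()
#   ir_module.reset_episode(obs_to_key(s0))
#   r_total = r_ext + beta * ir_module.reward(obs_to_key(s), obs_to_key(s_next))
```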
