Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills

Acquiring abilities in the absence of a task-oriented reward function is at the frontier of reinforcement learning research. This problem has been studied through the lens of empowerment, which draws a connection between option discovery and information theory. Information-theoretic skill discovery methods have garnered much interest from the community, but little research has been devoted to understanding their limitations. Through theoretical analysis and empirical evidence, we show that existing algorithms suffer from a common limitation: they discover options that provide poor coverage of the state space. In light of this, we propose 'Explore, Discover and Learn' (EDL), an alternative approach to information-theoretic skill discovery. Crucially, EDL optimizes the same information-theoretic objective derived from the empowerment literature, but addresses the optimization problem with different machinery. We perform an extensive evaluation of skill discovery methods on controlled environments and show that EDL offers significant advantages, such as overcoming the coverage problem, reducing the dependence of learned skills on the initial state, and allowing the user to define a prior over which behaviors should be learned. Code is publicly available at this https URL.
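
For context, the objective the abstract refers to is the mutual information between visited states S and latent skills Z. The sketch below states this objective in the conventional notation of the empowerment and skill-discovery literature; it is a standard formulation, not an excerpt from the paper itself.

\[
  I(S; Z) \;=\; \mathcal{H}(Z) - \mathcal{H}(Z \mid S) \;=\; \mathcal{H}(S) - \mathcal{H}(S \mid Z)
\]

The first decomposition (keep skills distinguishable from the states they reach) is the one most prior methods optimize via a variational lower bound with a learned skill discriminator; the second makes state coverage an explicit term, which is one way to read the abstract's claim that EDL tackles the same objective with different machinery. The paper itself should be consulted for EDL's exact treatment.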
