Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning

Efficient exploration remains a challenging problem in reinforcement learning, especially in tasks where extrinsic rewards from the environment are sparse or absent altogether. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on conditional variational inference to capture this multimodality and stochasticity. We treat the environmental state-action transition as a conditional generative process, generating the next-state prediction conditioned on the current state, the action, and a latent variable. We derive an upper bound on the negative log-likelihood of the environmental transition and use this upper bound as the intrinsic reward for exploration, which allows the agent to learn skills through self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulation task. Our method outperforms several state-of-the-art environment-model-based exploration approaches.
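To make the idea concrete, here is a minimal sketch of such a conditional-VAE dynamics model, written in PyTorch under stated assumptions: low-dimensional state vectors rather than images, a unit-variance Gaussian decoder, and illustrative names and network sizes (`VariationalDynamics`, `neg_elbo`) that are ours, not the authors'. The negative ELBO it computes upper-bounds -log p(s' | s, a) and can therefore stand in for the intrinsic reward described above.

```python
# Sketch (not the authors' code): a conditional VAE over transitions whose
# negative ELBO upper-bounds -log p(s' | s, a) and serves as intrinsic reward.
import torch
import torch.nn as nn


class VariationalDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        # Posterior q(z | s, a, s') and conditional prior p(z | s, a),
        # each emitting a Gaussian mean and log-variance.
        self.posterior = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        self.prior = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        # Decoder p(s' | s, a, z) predicting the next-state mean.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def neg_elbo(self, s, a, s_next):
        """Per-transition upper bound on -log p(s' | s, a)."""
        mu_q, logvar_q = self.posterior(
            torch.cat([s, a, s_next], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(torch.cat([s, a], -1)).chunk(2, -1)
        # Reparameterized sample z ~ q(z | s, a, s').
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        recon = self.decoder(torch.cat([s, a, z], -1))
        # Gaussian reconstruction term (unit variance assumed).
        rec_loss = 0.5 * ((recon - s_next) ** 2).sum(-1)
        # KL( q(z | s, a, s') || p(z | s, a) ), both diagonal Gaussians.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1.0).sum(-1)
        return rec_loss + kl  # negative ELBO per transition
```

In this reading, one quantity drives both sides of the method: minimizing `neg_elbo` over observed transitions trains the dynamic model, while `model.neg_elbo(s, a, s_next).detach()` supplies the intrinsic reward to the policy, so poorly modeled (novel) transitions are rewarded more.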
