Dynamic Bottleneck for Robust Self-Supervised Exploration

Exploration methods based on pseudo-counts of transitions or curiosity about dynamics have achieved promising results in solving reinforcement learning tasks with sparse rewards. However, such methods are usually sensitive to information that is irrelevant to the environment dynamics, e.g., white noise. To handle such dynamics-irrelevant information, we propose a Dynamic Bottleneck (DB) model, which learns a dynamics-relevant representation based on the information-bottleneck principle. Building on the DB model, we further propose the DB-bonus, which encourages the agent to explore state-action pairs with high information gain. We establish theoretical connections between the proposed DB-bonus, the upper confidence bound (UCB) in the linear case, and the visitation count in the tabular case. We evaluate the proposed method on the Atari suite with dynamics-irrelevant noise. Our experiments show that exploration with the DB-bonus outperforms several state-of-the-art exploration methods in noisy environments.
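To make the intrinsic-reward idea concrete, below is a minimal sketch (in Python with PyTorch) of an information-gain style exploration bonus computed under an information-bottleneck latent model of the dynamics. This is not the authors' implementation: the names DBEncoder and db_bonus, the network sizes, the Gaussian latent, and the standard-normal prior are all illustrative assumptions, since the abstract does not specify the architecture.

# Minimal, illustrative sketch of an information-gain style exploration bonus
# under an information-bottleneck latent dynamics model. All names and
# architectural choices here are hypothetical, not the paper's implementation.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class DBEncoder(nn.Module):
    """Maps a (state, action) pair to a Gaussian over a dynamics-relevant latent."""

    def __init__(self, state_dim: int, action_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),  # outputs mean and log-std
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> Normal:
        out = self.net(torch.cat([state, action], dim=-1))
        mean, log_std = out.chunk(2, dim=-1)
        return Normal(mean, log_std.clamp(-5.0, 2.0).exp())


def db_bonus(encoder: DBEncoder, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward: KL between the encoded posterior and a standard-normal
    prior, one simple way to instantiate an information-gain bonus."""
    posterior = encoder(state, action)
    prior = Normal(torch.zeros_like(posterior.mean), torch.ones_like(posterior.stddev))
    return kl_divergence(posterior, prior).sum(dim=-1)


if __name__ == "__main__":
    enc = DBEncoder(state_dim=8, action_dim=2, latent_dim=16)
    s, a = torch.randn(4, 8), torch.randn(4, 2)
    print(db_bonus(enc, s, a))  # one bonus value per transition in the batch

In practice the bonus would be added to the environment reward and the encoder trained with the DB objective; this sketch only shows how an information-gain quantity can be turned into a per-transition exploration signal.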
