InfoBot: Transfer and Exploration via the Information Bottleneck

A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out decision states. These states lie at critical junctions in the state space, from which the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually leverages the goal state. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.
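
The mechanism described above can be illustrated with a small sketch. Below is a minimal PyTorch example of a goal-conditioned policy with a variational information bottleneck on the goal, in the spirit of the approach: an encoder compresses (state, goal) into a latent, and the KL divergence from a goal-independent prior measures how much goal information the policy uses at each state. The network sizes, the unit-Gaussian prior, the REINFORCE-style surrogate, and the names GoalConditionedIBPolicy and infobot_loss are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a goal-conditioned policy with an information bottleneck.
# Architecture, prior, and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalConditionedIBPolicy(nn.Module):
    """Policy pi(a | s, g) with a bottleneck on goal information.

    An encoder maps (state, goal) to a Gaussian latent z; a decoder maps
    (state, z) to action logits. The KL between p(z | s, g) and a
    goal-independent prior limits how much goal information reaches the
    decoder, so states with large KL are candidate decision states.
    """

    def __init__(self, state_dim, goal_dim, n_actions, latent_dim=32, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),       # mean and log-variance of z
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, goal):
        mu, logvar = self.encoder(torch.cat([state, goal], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterised sample
        logits = self.decoder(torch.cat([state, z], dim=-1))
        # KL( N(mu, sigma^2) || N(0, I) ): how strongly the policy relies on the goal here.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1)
        return logits, kl

def infobot_loss(logits, actions, returns, kl, beta=1e-3):
    """REINFORCE-style surrogate: maximise return while penalising goal information."""
    logp = F.log_softmax(logits, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(logp * returns).mean() + beta * kl.mean()
```

Under the same assumptions, the per-state KL term from a trained encoder can be reused in a new environment as an exploration bonus added to the task reward, steering the agent toward candidate decision states.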
