Eigenoption Discovery through the Deep Successor Representation

Options in reinforcement learning allow agents to hierarchically decompose a task into subtasks, which has the potential to speed up learning and planning. However, autonomously learning effective sets of options is still a major challenge in the field. In this paper we focus on the recently introduced idea of using representation learning methods to guide the option discovery process. Specifically, we look at eigenoptions: options obtained from representations that encode diffusive information flow in the environment. We extend the existing algorithms for eigenoption discovery to settings with stochastic transitions and in which handcrafted features are not available. We propose an algorithm that discovers eigenoptions while learning non-linear state representations from raw pixels. It exploits recent successes in the deep reinforcement learning literature and the equivalence between proto-value functions and the successor representation. We use traditional tabular domains to provide intuition about our approach and Atari 2600 games to demonstrate its potential.
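To make the core idea concrete, below is a minimal tabular sketch, not the paper's implementation: the successor representation (SR) is estimated with TD(0) from random-walk transitions, its top eigenvectors are taken as "eigenpurposes," and each eigenpurpose defines an intrinsic reward whose optimal policy would be an eigenoption. Function names (`learn_sr`, `eigenpurposes`, `intrinsic_reward`) and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def learn_sr(transitions, n_states, gamma=0.9, alpha=0.1):
    """TD(0) estimate of the SR matrix Psi, where Psi[s, s'] approximates the
    expected discounted number of future visits to s' when starting from s."""
    psi = np.zeros((n_states, n_states))
    for s, s_next in transitions:  # (state, next state) pairs from a random walk
        one_hot = np.zeros(n_states)
        one_hot[s] = 1.0
        psi[s] += alpha * (one_hot + gamma * psi[s_next] - psi[s])
    return psi

def eigenpurposes(psi, k=4):
    """Top-k eigenvectors of the symmetrized SR (largest eigenvalues first);
    each row is one eigenpurpose, playing the role of a proto-value function."""
    sym = (psi + psi.T) / 2.0               # symmetrize before eigendecomposition
    vals, vecs = np.linalg.eigh(sym)        # ascending eigenvalues
    order = np.argsort(vals)[::-1]          # reorder to descending
    return vecs[:, order[:k]].T

def intrinsic_reward(e, s, s_next):
    """Intrinsic reward induced by eigenpurpose e for a transition s -> s':
    the change of the purpose along the transition (tabular, one-hot features)."""
    return e[s_next] - e[s]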
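In this sketch an eigenoption would then be obtained by solving the induced intrinsic-reward task (e.g., with Q-learning) and terminating wherever the learned action values are non-positive; in the deep setting described in the abstract, the one-hot features are replaced by representations learned from raw pixels.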
