Successor Feature Sets: Generalizing Successor Representations Across Policies

Successor-style representations have many advantages for reinforcement learning: for example, they can help an agent generalize from past experience to new goals, and they have been proposed as explanations of behavioral and neural data from human and animal learners. They also form a natural bridge between model-based and model-free RL methods: like the former, they make predictions about future experiences, and like the latter, they allow efficient prediction of total discounted rewards. However, successor-style representations are not optimized to generalize across policies: typically, we maintain a limited-length list of policies and share information among them through representation learning or generalized policy improvement (GPI). Successor-style representations also typically make no provision for gathering information or for reasoning about latent variables. To address these limitations, we bring together ideas from predictive state representations, belief-space value iteration, successor features, and convex analysis: we develop a new, general successor-style representation, together with a Bellman equation that connects multiple sources of information within this representation, including different latent states, policies, and reward functions. The new representation is highly expressive: for example, it lets us efficiently read off an optimal policy for a new reward function, or a policy that imitates a new demonstration. In this paper, we focus on exact computation of the new representation in small, known environments, since even this restricted setting offers plenty of interesting questions. Our implementation does not scale to large, unknown environments, nor would we expect it to, since it generalizes POMDP value iteration, which is itself difficult to scale. However, we believe that future work will allow us to extend our ideas to approximate reasoning in large, unknown environments. We conduct experiments to explore which of the potential barriers to scaling are most pressing.
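
For concreteness, the following equations sketch the standard successor-features and GPI setup from the prior literature that the abstract refers to; this is background, not the paper's new representation, and the notation (feature map \(\phi\), successor features \(\psi\), reward weights \(w\)) is the conventional one rather than the paper's own.

\[
\psi^{\pi}(s,a) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\,\phi(s_t,a_t) \;\middle|\; s_0=s,\; a_0=a,\; a_{t>0}\sim\pi \right],
\qquad
Q^{\pi}_{w}(s,a) \;=\; \psi^{\pi}(s,a)^{\top} w,
\]
so that for any reward of the form \(r(s,a)=\phi(s,a)^{\top} w\), the successor features satisfy the Bellman equation
\[
\psi^{\pi}(s,a) \;=\; \phi(s,a) \;+\; \gamma\,\mathbb{E}_{s'}\!\left[\psi^{\pi}\bigl(s',\pi(s')\bigr)\right],
\]
and GPI over a stored set of policies \(\pi_1,\dots,\pi_k\) acts greedily with respect to the best of their value estimates,
\[
\pi(s) \;\in\; \arg\max_{a}\; \max_{i}\; \psi^{\pi_i}(s,a)^{\top} w .
\]
The limitation highlighted in the abstract is visible here: information is shared only across the finite list \(\pi_1,\dots,\pi_k\), rather than across the full space of policies.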
