Reward-predictive representations generalize across tasks in reinforcement learning

In computer science, reinforcement learning is a powerful framework with which artificial agents can learn to maximize their performance for any given Markov decision process (MDP). Over the last decade, advances that combine reinforcement learning with deep neural networks have surpassed human performance in many difficult task settings. However, such frameworks perform far less favorably when evaluated on their ability to generalize or transfer representations across different tasks. Existing algorithms that facilitate transfer are typically limited to cases in which the transition function or the optimal policy is portable to new contexts, but achieving the "deep transfer" characteristic of human behavior has been elusive. Such transfer typically requires the discovery of abstractions that permit analogical reuse of previously learned representations in superficially distinct tasks. Here, we demonstrate that abstractions that minimize error in predictions of reward outcomes generalize across tasks with different transition and reward functions. Such reward-predictive representations compress the state space of a task into a lower dimensional representation by combining states that are equivalent in terms of both the transition and reward functions. Because only state equivalences are considered, the resulting state representation is not tied to the transition and reward functions themselves and thus generalizes across tasks with different reward and transition functions. These results contrast with those obtained using abstractions that myopically maximize reward in any given MDP, and they motivate further experiments in humans and animals to investigate whether neural and cognitive systems involved in state representation perform abstractions that facilitate such equivalence relations.

Author summary

Humans are capable of transferring abstract knowledge from one task to another. For example, a driver who learned to drive a car with the steering wheel on the left operates the gear shift with the right arm; when switching to a car with the steering wheel on the right, that driver can simply shift with the left arm instead of re-learning how to drive. Although the two tasks require different motor coordination, they are the same in an abstract sense: in both, a car is operated and the gears progress in the same order from 1st to 2nd and so on. We study distinct algorithms by which a reinforcement learning agent can discover state representations that encode knowledge about a particular task, and we evaluate how well these representations generalize. Through a sequence of simulation results, we show that state abstractions that minimize errors in predictions of future reward outcomes generalize across tasks, even tasks that differ in both their goals (rewards) and their transitions from one state to the next. This work motivates biological studies to determine whether distinct neural circuits are adapted to maximize reward versus to discover useful state representations.
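To make the core idea concrete, the following Python sketch illustrates how a reward-predictive abstraction compresses a state space by grouping states that are interchangeable with respect to both rewards and transitions, and why such a grouping can be reused in a task with different transition and reward functions. The toy MDPs, the abstraction phi, and the helper rollout_rewards are illustrative assumptions for this example, not the paper's models or code.

```python
# Minimal sketch (assumed example, not the paper's implementation): a
# reward-predictive state abstraction on a toy deterministic MDP. States 0-3
# are compressed to two abstract states by phi; the same phi is then reused on
# a second task whose transition and reward functions differ but share the
# same equivalence structure.
import numpy as np

n_states, n_actions = 4, 2

def rollout_rewards(T, R, phi, start, actions):
    """Predict the reward sequence for an action sequence using only the
    abstract model induced by phi, and return it next to the true rewards."""
    n_abs = phi.max() + 1
    # Build abstract transition and reward models by averaging over the
    # ground states that phi maps to the same abstract state.
    T_abs = np.zeros((n_actions, n_abs, n_abs))
    R_abs = np.zeros((n_actions, n_abs))
    for a in range(n_actions):
        for z in range(n_abs):
            members = np.where(phi == z)[0]
            for s in members:
                T_abs[a, z, phi[T[a, s]]] += 1.0 / len(members)
            R_abs[a, z] = R[a, members].mean()
    true_r, pred_r = [], []
    s, b = start, np.eye(n_abs)[phi[start]]   # b: abstract state distribution
    for a in actions:
        true_r.append(R[a, s]); s = T[a, s]
        pred_r.append(b @ R_abs[a]); b = b @ T_abs[a]
    return np.array(true_r), np.array(pred_r)

# Task A: deterministic transitions T[a, s] -> next state, rewards R[a, s].
T_A = np.array([[1, 0, 3, 2],    # action 0
                [2, 3, 0, 1]])   # action 1
R_A = np.array([[0., 0., 1., 1.],
                [1., 1., 0., 0.]])
# States {0,1} and {2,3} are equivalent in rewards and abstract transitions,
# so phi compresses four states into two.
phi = np.array([0, 0, 1, 1])

# Task B: different transitions and rewards, same equivalence structure.
T_B = np.array([[3, 2, 1, 0],
                [0, 1, 2, 3]])
R_B = np.array([[1., 1., 0., 0.],
                [0., 0., 1., 1.]])

actions = [0, 1, 1, 0, 1]
for name, T, R in [("Task A", T_A, R_A), ("Task B", T_B, R_B)]:
    true_r, pred_r = rollout_rewards(T, R, phi, start=0, actions=actions)
    print(name, "max reward-prediction error:", np.abs(true_r - pred_r).max())
```

Because phi only records which states are interchangeable, nothing about Task A's particular transition or reward values is carried over; the same two-class abstraction predicts reward sequences exactly in Task B as well, which is the sense in which reward-predictive representations generalize while reward-maximizing (value- or policy-based) abstractions need not.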
