Predictive representations can link model-based reinforcement learning to model-free mechanisms

Humans and animals can evaluate actions by considering their long-run future rewards, a process described by model-based reinforcement learning (RL) algorithms. The mechanisms by which neural circuits perform the computations prescribed by model-based RL remain largely unknown; however, multiple lines of evidence suggest that the neural circuits supporting model-based behavior are structurally homologous to, and overlapping with, those thought to carry out model-free temporal difference (TD) learning. Here, we lay out a family of approaches by which model-based computation may be built upon a core of TD learning. The foundation of this framework is the successor representation, a predictive state representation that, when combined with TD learning of value predictions, can produce a subset of the behaviors associated with model-based learning at a fraction of the computational cost. Using simulations, we delineate the precise behavioral capabilities enabled by evaluating actions with this approach and compare them to those demonstrated by biological organisms. We then introduce two new algorithms that build upon the successor representation while progressively mitigating its limitations. Because this framework can account for the full range of observed, putatively model-based behaviors while still relying on a core of TD learning, we suggest that it represents a neurally plausible family of mechanisms for model-based evaluation.

Author Summary

According to standard models, when confronted with a choice, animals and humans rely on two distinct processes to reach a decision. One process deliberatively evaluates the consequences of each candidate action and is thought to underlie the ability to flexibly devise novel plans. The other process gradually increases the propensity to repeat behaviors that were previously successful and is thought to underlie automatically executed, habitual reflexes. Although computational principles and animal behavior support this dichotomy, there is little evidence at the neural level for a clean segregation. For instance, although dopamine, famously implicated in drug addiction and Parkinson's disease, currently has a well-defined role only in the automatic process, evidence suggests that it also contributes to the deliberative process. In this work, we present a computational framework for resolving this mismatch. We show that the types of behavior associated with either process could result from a common learning mechanism applied to different strategies for how populations of neurons represent candidate actions. In addition to demonstrating that this account can produce the full range of flexible behavior observed in the empirical literature, we suggest experiments that could distinguish among the various approaches within this framework.

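The central computational claim above, that a predictive state representation learned with the same TD machinery used for reward prediction can stand in for part of model-based evaluation, can be illustrated compactly. The Python sketch below shows tabular successor-representation (SR) learning on a small deterministic chain task; the environment, parameter values, and update conventions are illustrative assumptions made here for exposition and are not the simulations reported in the paper.

# Minimal, illustrative sketch of SR learning with TD updates.
# The chain task, parameters, and reward convention (reward delivered on
# arriving in a state) are assumptions for exposition only.
import numpy as np

n_states = 5      # small chain: state 0 ... state 4 (goal)
gamma = 0.9       # discount factor
alpha = 0.1       # learning rate

M = np.eye(n_states)      # SR matrix: expected discounted future state occupancies
w = np.zeros(n_states)    # learned per-state reward estimates

def sr_td_update(s, s_next, r):
    """One TD-style update of the occupancy predictions M and reward weights w."""
    onehot = np.eye(n_states)[s]
    # Vector-valued TD error on predicted future state occupancies
    delta_M = onehot + gamma * M[s_next] - M[s]
    M[s] += alpha * delta_M
    # Delta-rule update of the expected reward on arriving in s_next
    w[s_next] += alpha * (r - w[s_next])

def value(s):
    """State value is read out by combining the predictive map with reward estimates."""
    return M[s] @ w

# Example: deterministic walk along the chain, reward only at the final state
for episode in range(500):
    s = 0
    while s < n_states - 1:
        s_next = s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        sr_td_update(s, s_next, r)
        s = s_next

print(np.round([value(s) for s in range(n_states)], 3))  # values rise toward the goal

Because the same delta-rule machinery updates both the occupancy predictions M and the reward weights w, the readout M[s] @ w can adjust as soon as the reward estimates change, which captures one of the behaviors usually credited to model-based evaluation while remaining within a TD learning core.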