Predictive representations can link model-based reinforcement learning to model-free mechanisms

Humans and animals can evaluate actions by considering their long-run future rewards, a process described by model-based reinforcement learning (RL) algorithms. The mechanisms by which neural circuits perform the computations prescribed by model-based RL remain largely unknown; however, multiple lines of evidence suggest that the neural circuits supporting model-based behavior are structurally homologous to, and overlapping with, those thought to carry out model-free temporal difference (TD) learning. Here, we lay out a family of approaches by which model-based computation may be built upon a core of TD learning. The foundation of this framework is the successor representation, a predictive state representation that, when combined with TD learning of value predictions, can produce a subset of the behaviors associated with model-based learning at a fraction of the computational cost. Using simulations, we delineate the precise behavioral capabilities enabled by evaluating actions with this approach and compare them to those demonstrated by biological organisms. We then introduce two new algorithms that build upon the successor representation while progressively mitigating its limitations. Because this framework can account for the full range of observed, putatively model-based behaviors while still relying on a core of TD learning, we suggest that it represents a neurally plausible family of mechanisms for model-based evaluation.

Author Summary

According to standard models, when confronted with a choice, animals and humans rely on two distinct processes to reach a decision. One process deliberatively evaluates the consequences of each candidate action and is thought to underlie the ability to flexibly devise novel plans. The other process gradually increases the propensity to repeat behaviors that were previously successful and is thought to underlie automatically executed, habitual reflexes. Although computational principles and animal behavior support this dichotomy, there is little evidence at the neural level for a clean segregation. For instance, although dopamine, famously implicated in drug addiction and Parkinson's disease, currently has a well-defined role only in the automatic process, evidence suggests that it also contributes to the deliberative process. In this work, we present a computational framework for resolving this mismatch. We show that the types of behavior associated with either process could result from a common learning mechanism applied to different strategies for how populations of neurons represent candidate actions. In addition to demonstrating that this account can produce the full range of flexible behavior observed in the empirical literature, we suggest experiments that could distinguish among the various approaches within this framework.

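The central computational claim above, that a predictive state representation learned with the same TD machinery used for reward prediction can stand in for part of model-based evaluation, can be illustrated compactly. The Python sketch below shows tabular successor-representation (SR) learning on a small deterministic chain task; the environment, parameter values, and update conventions are illustrative assumptions made here for exposition and are not the simulations reported in the paper.

# Minimal, illustrative sketch of SR learning with TD updates.
# The chain task, parameters, and reward convention (reward delivered on
# arriving in a state) are assumptions for exposition only.
import numpy as np

n_states = 5      # small chain: state 0 ... state 4 (goal)
gamma = 0.9       # discount factor
alpha = 0.1       # learning rate

M = np.eye(n_states)      # SR matrix: expected discounted future state occupancies
w = np.zeros(n_states)    # learned per-state reward estimates

def sr_td_update(s, s_next, r):
    """One TD-style update of the occupancy predictions M and reward weights w."""
    onehot = np.eye(n_states)[s]
    # Vector-valued TD error on predicted future state occupancies
    delta_M = onehot + gamma * M[s_next] - M[s]
    M[s] += alpha * delta_M
    # Delta-rule update of the expected reward on arriving in s_next
    w[s_next] += alpha * (r - w[s_next])

def value(s):
    """State value is read out by combining the predictive map with reward estimates."""
    return M[s] @ w

# Example: deterministic walk along the chain, reward only at the final state
for episode in range(500):
    s = 0
    while s < n_states - 1:
        s_next = s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        sr_td_update(s, s_next, r)
        s = s_next

print(np.round([value(s) for s in range(n_states)], 3))  # values rise toward the goal

Because the same delta-rule machinery updates both the occupancy predictions M and the reward weights w, the readout M[s] @ w can adjust as soon as the reward estimates change, which captures one of the behaviors usually credited to model-based evaluation while remaining within a TD learning core.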