Combined model-free and model-sensitive reinforcement learning in non-human primates

Contemporary reinforcement learning (RL) theory suggests that potential choices can be evaluated by strategies that may or may not be sensitive to the computational structure of tasks. A paradigmatic model-free (MF) strategy simply repeats actions that have been rewarded in the past; by contrast, model-sensitive (MS) strategies exploit the richer information afforded by knowledge of task dynamics. MF and MS strategies should typically be combined, because they have complementary statistical and computational strengths; however, this tradeoff has so far been demonstrated almost exclusively in humans, often with only modest numbers of trials. We trained rhesus monkeys to perform a two-stage decision task designed to elicit and discriminate the use of MF and MS methods. A descriptive analysis of choice behaviour showed directly that the structure of the task (relevant to MS strategies) and the reward history (relevant to both MF and MS strategies) significantly influenced both choice and response vigour. A detailed, trial-by-trial computational analysis confirmed that choices were made according to a combination of strategies, with a dominant influence of a particular form of model sensitivity that persisted over weeks of testing. The residuals from this model motivated the development of a new combined RL model that incorporates a particular credit-assignment weighting procedure. Finally, response vigour exhibited a subtly different collection of MF and MS influences. These results shed new light on RL behavioural processes in non-human primates.
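To make the MF/MS distinction concrete, below is a minimal sketch of a hybrid learner for a two-step task, in the spirit of the standard hybrid model from the two-step task literature: model-free values are updated by temporal-difference learning, model-sensitive values are computed by combining a known transition model with learned second-stage values, and choice follows a softmax over a weighted mixture of the two. All names, parameter values, the fixed transition matrix, and the constant reward probability are illustrative assumptions, not the paper's actual model (which, per the abstract, additionally incorporates a credit-assignment weighting procedure).

```python
import numpy as np

# Minimal sketch of a hybrid model-free (MF) / model-sensitive (MS) learner
# for a two-step task. Parameter values, the fixed transition matrix, and the
# constant reward probability are illustrative assumptions only.

rng = np.random.default_rng(0)

n_actions, n_states = 2, 2   # first-stage actions, second-stage states
alpha = 0.3                  # learning rate
beta = 5.0                   # softmax inverse temperature
w = 0.5                      # MS/MF mixing weight (w = 1: purely model-sensitive)

# Assumed known transition model: T[a, s] = P(second-stage state s | action a)
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])

q_mf = np.zeros(n_actions)   # model-free first-stage action values
v_s2 = np.zeros(n_states)    # learned second-stage state values

def softmax_choice(q):
    """Sample an action from a softmax over action values."""
    z = beta * q - np.max(beta * q)   # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(q), p=p)

for trial in range(200):
    # Model-sensitive values: expected second-stage value under the model
    q_ms = T @ v_s2
    # Hybrid valuation: weighted mixture of MS and MF estimates
    q_net = w * q_ms + (1.0 - w) * q_mf

    a = softmax_choice(q_net)
    s2 = rng.choice(n_states, p=T[a])    # environment samples second stage
    r = float(rng.random() < 0.6)        # stand-in for drifting reward probs

    # Model-free TD updates (SARSA with lambda = 1, for brevity): the
    # first stage is updated toward second-stage value, then toward reward
    delta1 = v_s2[s2] - q_mf[a]
    delta2 = r - v_s2[s2]
    q_mf[a] += alpha * (delta1 + delta2)
    v_s2[s2] += alpha * delta2
```

With w near 1, choices track the transition structure (the MS signature); with w near 0, they simply repeat previously rewarded actions (the MF signature), which is the dissociation the two-stage task is designed to expose.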
