Environmental statistics and the trade-off between model-based and TD learning in humans

There is substantial evidence that humans and other animals use a combination of model-based and model-free reinforcement learning (RL) methods. Although it has been proposed that each system may dominate behavior according to its relative statistical efficiency in different circumstances, there is little specific evidence, especially in humans, about the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches in situations where the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free temporal-difference (TD) learning is at its greatest relative disadvantage when volatility is high and noise is low. We present data from a decision-making experiment that manipulates these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies has traditionally been conceived in terms of rule-based vs. incremental learning.
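
The link between volatility, noise, and learning rate in the final claim can be illustrated with a standard Kalman-filter calculation. The sketch below is an illustration under stated assumptions, not the model used in this study: it assumes the mean reward follows a Gaussian random walk with per-trial drift variance q (standing in for volatility) and that each reward is observed with Gaussian noise of variance r (standing in for noise). Under these assumptions the optimal tracker's steady-state gain acts as the effective learning rate of a delta rule. The function name steady_state_gain and the example values of q and r are hypothetical.

```python
import math


def steady_state_gain(q: float, r: float) -> float:
    """Steady-state Kalman gain for tracking a drifting reward mean.

    Illustrative assumption (not from the paper): the mean follows a
    Gaussian random walk with drift variance q (volatility), and each
    reward is observed with Gaussian noise of variance r (noise).
    The returned gain plays the role of a learning rate in a delta rule.
    """
    if q <= 0 or r <= 0:
        raise ValueError("q and r must be positive")
    # At steady state the predicted (prior) variance P satisfies
    #   P = P*r / (P + r) + q   ->   P**2 - q*P - q*r = 0,
    # so we take the positive root of the quadratic.
    prior_var = (q + math.sqrt(q * q + 4.0 * q * r)) / 2.0
    # Kalman gain: fraction of each prediction error used to update the mean.
    return prior_var / (prior_var + r)


if __name__ == "__main__":
    # Compare low/high volatility (q) against low/high noise (r).
    for q in (0.01, 1.0):
        for r in (0.01, 1.0):
            print(f"q={q:<5} r={r:<5} gain={steady_state_gain(q, r):.3f}")
```

With these assumed values the gain is about 0.99 when volatility is high and noise is low (q = 1, r = 0.01) and about 0.095 in the opposite case (q = 0.01, r = 1). Because the gain depends only on the ratio q/r, it rises toward 1 whenever volatility dominates noise, which is the regime in which a high learning rate is appropriate.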
