Model-based reinforcement learning under concurrent schedules of reinforcement in rodents.

Reinforcement learning theories postulate that actions are chosen to maximize the long-term sum of positive outcomes based on value functions, which are subjective estimates of future rewards. In simple reinforcement learning algorithms, value functions are updated only by trial and error, whereas in model-based reinforcement learning algorithms they are updated according to the decision-maker's knowledge, or model, of the environment. To investigate how animals update value functions, we trained rats on two different free-choice tasks. In one task, the reward probability of the unchosen target remained unchanged; in the other, it increased with the time elapsed since that target was last chosen. Goal choice probability increased as a function of the number of consecutive alternative choices in the latter, but not the former, task, indicating that the animals were aware of the time-dependent increase in arming probability and used this information in choosing goals. In addition, choice behavior in the latter task was better accounted for by a model-based reinforcement learning algorithm. These results show that rats adopt a decision-making process that cannot be captured by simple reinforcement learning models, even in a relatively simple binary choice task, suggesting that rats can readily improve their decision-making strategy by exploiting knowledge of their environment.
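
To make the distinction concrete, the following is a minimal Python sketch, not the authors' fitted models: a model-free learner that updates the chosen target's value only from the reward prediction error, and a model-based learner that assumes each target is armed with a fixed probability on every trial and, once armed, holds its reward until that target is next chosen, so its expected reward grows with the number of trials since the last choice. The arming probabilities, learning rate, inverse temperature, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(values, beta):
    """Softmax choice rule with inverse temperature beta."""
    z = beta * (values - np.max(values))
    e = np.exp(z)
    return e / e.sum()

def model_free_update(q, choice, reward, alpha=0.2):
    """Simple (model-free) RL: update the chosen value from the reward prediction error only."""
    q = q.copy()
    q[choice] += alpha * (reward - q[choice])
    return q

def model_based_values(p_arm, trials_since_chosen):
    """Model-based values: if a target arms with probability p_arm on every trial and an
    armed reward is held until that target is next chosen, the probability that it is
    currently armed is 1 - (1 - p_arm)^(n + 1), where n is trials since it was last chosen."""
    return 1.0 - (1.0 - np.asarray(p_arm)) ** (np.asarray(trials_since_chosen) + 1)

# Illustrative (assumed) per-trial arming probabilities for the two targets.
p_arm = np.array([0.2, 0.1])
q_mf = np.zeros(2)                   # model-free value estimates
since = np.zeros(2, dtype=int)       # trials since each target was last chosen
armed = np.zeros(2, dtype=bool)      # latent armed state of each target

for t in range(200):
    armed |= rng.random(2) < p_arm               # each target may arm at the start of a trial
    q_mb = model_based_values(p_arm, since)      # model-based expected reward of each target
    choice = rng.choice(2, p=softmax(q_mb, beta=5.0))

    reward = float(armed[choice])                # reward delivered if the chosen target was armed
    armed[choice] = False                        # choosing the target resets its armed state

    q_mf = model_free_update(q_mf, choice, reward)
    since += 1
    since[choice] = 0
```

In a task where the unchosen target's reward probability does not change over time, the model-based expectation reduces to a constant, so the two schemes make diverging predictions only when arming probability is time dependent.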
