A model-based reinforcement learning: a computational model and an fMRI study

In this paper, we discuss optimal decision making in an unknown environment from the standpoints of both machine learning and learning in the brain. We present a model-based reinforcement learning (RL) method in which the environment is directly estimated. Our RL scheme selects actions based on the detection of environmental changes and on the current value function. In a partially observable situation, where the environment includes unobservable state variables, the scheme also estimates those unobservable variables. We propose a possible functional model of this RL scheme, focusing on the prefrontal cortex and the anterior cingulate cortex. To test the model, we conducted a human fMRI study during a sequential learning task and found significant activations in the dorsolateral prefrontal cortex and the anterior cingulate cortex during learning. Comparing the mean activations in the earlier and later learning phases, we suggest that the dorsolateral prefrontal cortex maintains and manipulates the environmental model, while the anterior cingulate cortex is related to the uncertainty of action selection. These experimental results are consistent with our model.
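To make the model-based RL idea concrete, the following is a minimal illustrative sketch (not the authors' algorithm, and omitting their change detection and partial-observability components): a tabular agent that directly estimates the environment's transition and reward model from experience, derives a value function from that model by value iteration, and selects actions from the planned values. All names here (`ModelBasedAgent`, `observe`, `plan`, `act`) are hypothetical.

```python
import numpy as np

class ModelBasedAgent:
    """Sketch of model-based RL: estimate the environment, then plan on it."""

    def __init__(self, n_states, n_actions, gamma=0.95):
        self.n_states = n_states
        self.n_actions = n_actions
        self.gamma = gamma
        # Transition counts N(s, a, s') with a Laplace prior of 1,
        # and running sums for the mean reward estimate R(s, a).
        self.counts = np.ones((n_states, n_actions, n_states))
        self.reward_sum = np.zeros((n_states, n_actions))
        self.reward_n = np.zeros((n_states, n_actions))
        self.V = np.zeros(n_states)

    def observe(self, s, a, r, s_next):
        """Update the estimated environment model with one transition."""
        self.counts[s, a, s_next] += 1
        self.reward_sum[s, a] += r
        self.reward_n[s, a] += 1

    def plan(self, n_sweeps=100):
        """Run value iteration on the current model estimate; return Q(s, a)."""
        P = self.counts / self.counts.sum(axis=2, keepdims=True)
        R = self.reward_sum / np.maximum(self.reward_n, 1)
        for _ in range(n_sweeps):
            Q = R + self.gamma * P @ self.V  # shape (n_states, n_actions)
            self.V = Q.max(axis=1)
        return Q

    def act(self, s, epsilon=0.1):
        """Pick a greedy action from the planned values, with some exploration."""
        Q = self.plan()
        if np.random.rand() < epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(Q[s]))
```

Because the value function is computed from the estimated model rather than from sampled returns alone, a change in the estimated transition or reward structure propagates to the policy in a single planning sweep, which is the property the paper's change-detection mechanism exploits.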
