Prefrontal cortex as a meta-reinforcement learning system

Over the past two decades, neuroscience research on reward-based learning has converged on a canonical model in which the neurotransmitter dopamine 'stamps in' associations between situations, actions and rewards by modulating the strength of synaptic connections between neurons. However, a growing number of recent findings have placed this standard model under strain. Here we draw on recent advances in artificial intelligence to introduce a new theory of reward-based learning, under which the dopamine system trains another part of the brain, the prefrontal cortex, to operate as its own free-standing learning system. This new perspective accommodates the findings that motivated the standard model, but it also deals gracefully with a wider range of observations, providing a fresh foundation for future research.

Humans and other mammals are prodigious learners, in part because they also 'learn how to learn'. Wang and colleagues present a new theory showing how learning to learn may arise from interactions between the prefrontal cortex and the dopamine system.
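The two-timescale structure at the heart of the theory, a slow, dopamine-like outer loop that shapes a fast inner learner, can be caricatured in a few lines of code. The sketch below is an illustrative assumption, not the authors' model: the recurrent prefrontal dynamics of the inner loop are replaced by an explicit delta-rule value tracker, and the dopamine-driven outer loop by a crude search over that tracker's learning rate across a distribution of bandit tasks. All function names and parameters (`run_bandit_episode`, `meta_train`, the candidate learning rates) are hypothetical choices for this sketch.

```python
import random

def run_bandit_episode(alpha, n_trials=100, rng=None):
    """Inner loop: fast learning within a single task.

    A two-armed bandit whose better arm pays reward with p = 0.8.
    The agent keeps a running value estimate per arm (standing in for
    recurrent-activation dynamics) and acts epsilon-greedily.
    Returns the mean reward earned over the episode.
    """
    rng = rng or random.Random()
    probs = [0.2, 0.8]
    rng.shuffle(probs)                     # arm identities change every task
    values = [0.5, 0.5]
    total = 0.0
    for _ in range(n_trials):
        if rng.random() < 0.1:             # occasional exploration
            arm = rng.randrange(2)
        else:                              # otherwise act greedily
            arm = 0 if values[0] >= values[1] else 1
        r = 1.0 if rng.random() < probs[arm] else 0.0
        values[arm] += alpha * (r - values[arm])   # delta-rule update
        total += r
    return total / n_trials

def meta_train(candidate_alphas, n_tasks=200, seed=0):
    """Outer loop: slow learning across many tasks.

    Scores each inner-loop learning rate by its average reward over a
    distribution of bandit tasks and keeps the best one; the system
    thereby 'learns how to learn' within new tasks.
    """
    rng = random.Random(seed)
    scores = {
        a: sum(run_bandit_episode(a, rng=rng) for _ in range(n_tasks)) / n_tasks
        for a in candidate_alphas
    }
    best = max(scores, key=scores.get)
    return best, scores

best_alpha, scores = meta_train([0.01, 0.1, 0.5])
print(best_alpha, scores)
```

In the paper's terms, what the outer loop tunes is not a single meta-parameter but the weights of a recurrent network, whose within-episode activation dynamics then implement the inner learning algorithm; the sketch only preserves the separation of timescales.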
