Reward-based training of recurrent neural networks for cognitive and value-based tasks

Trained neural network models, which exhibit features of neural activity recorded from behaving animals, may provide insights into the circuit mechanisms of cognitive functions through systematic analysis of network activity and connectivity. However, in contrast to the graded error signals commonly used to train networks through supervised learning, animals learn from reward feedback on definite actions through reinforcement learning. Reward maximization is particularly relevant when optimal behavior depends on an animal’s internal judgment of confidence or subjective preferences. Here, we implement reward-based training of recurrent neural networks in which a value network guides learning by using the activity of the decision network to predict future reward. We show that such models capture behavioral and electrophysiological findings from well-known experimental paradigms. Our work provides a unified framework for investigating diverse cognitive and value-based computations, and predicts a role for value representation that is essential for learning, but not executing, a task.

DOI: http://dx.doi.org/10.7554/eLife.21492.001
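To make the training scheme concrete, below is a minimal sketch of the actor-critic setup the abstract describes: a recurrent decision (policy) network selects actions and is updated with a REINFORCE-style policy gradient, while a separate value network reads the decision network's activity to predict future reward and supplies the baseline for learning. This is an illustrative reconstruction, not the authors' code (which used Theano); the toy perceptual-decision task, the PyTorch implementation, and all network sizes and learning rates are assumptions.

```python
# Illustrative actor-critic sketch (assumed: PyTorch, toy task, hyperparameters).
import torch
import torch.nn as nn

N_IN, N_REC, N_ACT, T = 2, 64, 2, 20   # inputs, recurrent units, actions, trial length

class DecisionNet(nn.Module):
    """Policy network: recurrent dynamics plus a softmax action readout."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_IN, N_REC, batch_first=True)
        self.readout = nn.Linear(N_REC, N_ACT)

    def forward(self, obs):                        # obs: (batch, T, N_IN)
        h, _ = self.rnn(obs)                       # recurrent activity: (batch, T, N_REC)
        return torch.log_softmax(self.readout(h), dim=-1), h

class ValueNet(nn.Module):
    """Value network: predicts expected future reward from the policy's activity."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_REC, 32, batch_first=True)
        self.readout = nn.Linear(32, 1)

    def forward(self, h_policy):
        v, _ = self.rnn(h_policy)
        return self.readout(v).squeeze(-1)         # (batch, T)

policy, critic = DecisionNet(), ValueNet()
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_v = torch.optim.Adam(critic.parameters(), lr=1e-3)

for trial in range(2000):
    # Toy perceptual decision: integrate a noisy signed stimulus; reward +1
    # for reporting its sign (action 0 vs. 1) on the final time step.
    batch = 16
    coh = (torch.randint(0, 2, (batch,)).float() * 2 - 1) * 0.5
    obs = coh[:, None, None] + 0.5 * torch.randn(batch, T, N_IN)

    logp, h = policy(obs)
    actions = torch.distributions.Categorical(logits=logp).sample()   # (batch, T)
    reward = (actions[:, -1] == (coh > 0).long()).float()             # terminal reward only

    ret = reward[:, None].expand(-1, T)            # undiscounted return at each step
    baseline = critic(h.detach())                  # value net sees activity, not gradients

    # REINFORCE with a learned baseline: reinforce chosen actions in
    # proportion to the reward-prediction error (return - predicted value).
    chosen_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    advantage = (ret - baseline).detach()
    loss_p = -(advantage * chosen_logp).mean()
    loss_v = ((baseline - ret) ** 2).mean()

    opt_p.zero_grad()
    loss_p.backward()
    opt_p.step()
    opt_v.zero_grad()
    loss_v.backward()
    opt_v.step()
```

The architectural point from the abstract shows up in `critic(h.detach())`: in this sketch the value network learns to predict reward from the decision network's activity, but reward information reaches the decision network only through the scalar advantage signal, not through backpropagated value errors.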
