Reward-based training of recurrent neural networks for cognitive and value-based tasks

Trained neural network models, which exhibit many features observed in neural recordings from behaving animals and whose activity and connectivity can be fully analyzed, may provide insights into neural mechanisms. In contrast to commonly used methods for supervised learning from graded error signals, however, animals learn from reward feedback on definite actions through reinforcement learning. Reward maximization is particularly relevant when the optimal behavior depends on an animal’s internal judgment of confidence or subjective preferences. Here, we describe reward-based training of recurrent neural networks in which a value network guides learning by using the selected actions and activity of the policy network to predict future reward. We show that such models capture both behavioral and electrophysiological findings from well-known experimental paradigms. Our results provide a unified framework for investigating diverse cognitive and value-based computations, including a role for value representation that is essential for learning, but not executing, a task.
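
The scheme described above is an actor-critic architecture: a recurrent policy network selects actions, and a separate recurrent value network reads the policy network's activity together with the selected actions to predict future reward, providing the baseline for a REINFORCE-style policy gradient. Below is a minimal sketch of that idea, not the authors' implementation (which used Theano), written in PyTorch. The toy two-alternative evidence task, the network sizes, the GRU units, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): an RNN policy trained
# by REINFORCE, with a separate RNN value network that predicts reward from
# the policy's activity plus the one-hot selected action and serves as the
# baseline for the policy gradient.
import torch
import torch.nn as nn

N_IN, N_HID, N_ACT, T, BATCH = 2, 64, 2, 20, 32  # illustrative sizes

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_IN, N_HID, batch_first=True)
        self.readout = nn.Linear(N_HID, N_ACT)
    def forward(self, x):
        h, _ = self.rnn(x)            # (B, T, N_HID) recurrent activity
        return h, self.readout(h)     # activity and per-step action logits

class ValueNet(nn.Module):
    """Predicts expected reward from policy activity and selected actions."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_HID + N_ACT, N_HID, batch_first=True)
        self.readout = nn.Linear(N_HID, 1)
    def forward(self, policy_h, actions_onehot):
        v, _ = self.rnn(torch.cat([policy_h, actions_onehot], dim=-1))
        return self.readout(v).squeeze(-1)   # (B, T) value estimates

policy, critic = PolicyNet(), ValueNet()
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_v = torch.optim.Adam(critic.parameters(), lr=1e-3)

for step in range(1000):
    # Toy evidence-accumulation trial: two noisy channels; the correct
    # action is the channel with the higher mean input.
    coh = 0.5 * torch.randn(BATCH, 1, 1)
    x = torch.cat([0.5 + coh, 0.5 - coh], dim=-1) + 0.3 * torch.randn(BATCH, T, N_IN)
    target = (coh.squeeze() < 0).long()      # index of the stronger channel

    h, logits = policy(x)
    dist = torch.distributions.Categorical(logits=logits[:, -1])
    action = dist.sample()                   # decision at the final time step
    reward = (action == target).float()      # 1 if correct, else 0

    a_onehot = torch.zeros(BATCH, T, N_ACT)
    a_onehot[:, -1] = nn.functional.one_hot(action, N_ACT).float()
    values = critic(h.detach(), a_onehot)    # critic sees activity + action
    baseline = values[:, -1]

    # REINFORCE with the learned value prediction as baseline.
    advantage = (reward - baseline).detach()
    loss_p = -(dist.log_prob(action) * advantage).mean()
    loss_v = ((reward - baseline) ** 2).mean()

    opt_p.zero_grad(); loss_p.backward(); opt_p.step()
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()
```

Detaching the policy activity before it enters the value network keeps the critic's gradients from shaping the policy itself; in this sketch the value network is needed only to reduce the variance of learning, echoing the abstract's point that value representation is essential for learning, but not for executing, the task.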
