Human Reinforcement Learning Subdivides Structured Action Spaces by Learning Effector-Specific Values

Humans and animals are endowed with a large number of effectors. Although this enables great behavioral flexibility, it presents an equally formidable reinforcement learning problem of discovering which actions are most valuable because of the high dimensionality of the action space. An unresolved question is how neural systems for reinforcement learning—such as prediction error signals for action valuation associated with dopamine and the striatum—can cope with this “curse of dimensionality.” We propose a reinforcement learning framework that allows for learned action valuations to be decomposed into effector-specific components when appropriate to a task, and test it by studying to what extent human behavior and blood oxygen level-dependent (BOLD) activity can exploit such a decomposition in a multieffector choice task. Subjects made simultaneous decisions with their left and right hands and received separate reward feedback for each hand movement. We found that choice behavior was better described by a learning model that decomposed the values of bimanual movements into separate values for each effector, rather than a traditional model that treated the bimanual actions as unitary with a single value. A decomposition of value into effector-specific components was also observed in value-related BOLD signaling, in the form of lateralized biases in striatal correlates of prediction error and anticipatory value correlates in the intraparietal sulcus. These results suggest that the human brain can use decomposed value representations to “divide and conquer” reinforcement learning over high-dimensional action spaces.
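To make the contrast between the two model classes concrete, below is a minimal sketch in Python. It is not the paper's implementation; the learning rate, softmax temperature, two-target choice structure, and per-hand reward probabilities are all illustrative assumptions. It contrasts a factored learner, which keeps a separate value table per hand and updates each with its own prediction error, against a unitary learner that assigns a single value to every joint bimanual action.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.2, 3.0   # learning rate and softmax inverse temperature (assumed values)
n_targets = 2            # choice options available to each hand (illustrative)

# Factored model: one value table per effector (left hand, right hand).
Q_left = np.zeros(n_targets)
Q_right = np.zeros(n_targets)

# Unitary model: one value per joint bimanual action, for comparison.
Q_joint = np.zeros((n_targets, n_targets))

def softmax_choice(q):
    """Sample an action with probability proportional to exp(beta * Q)."""
    p = np.exp(beta * q - np.max(beta * q))  # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(q), p=p)

# Hypothetical reward probabilities, independent for each hand.
p_reward_left = np.array([0.8, 0.2])
p_reward_right = np.array([0.3, 0.7])

for trial in range(500):
    # Under the factored model, each hand chooses on the basis of its own values.
    a_left = softmax_choice(Q_left)
    a_right = softmax_choice(Q_right)

    # Separate reward feedback for each hand, as in the task.
    r_left = float(rng.random() < p_reward_left[a_left])
    r_right = float(rng.random() < p_reward_right[a_right])

    # Effector-specific prediction errors drive effector-specific updates.
    Q_left[a_left] += alpha * (r_left - Q_left[a_left])
    Q_right[a_right] += alpha * (r_right - Q_right[a_right])

    # The unitary model instead learns over all n_targets**2 joint actions
    # from the summed payoff, so credit is never assigned per effector.
    Q_joint[a_left, a_right] += alpha * ((r_left + r_right) - Q_joint[a_left, a_right])
```

The sketch also makes the dimensionality argument explicit: with N options per hand, the factored learner maintains 2N values, whereas the unitary learner must estimate N² joint-action values from aggregate feedback, which is the sense in which decomposed value representations "divide and conquer" the action space.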
