Computational noise in reward-guided learning drives behavioral variability in volatile environments

When learning the value of actions in volatile environments, humans often make seemingly irrational decisions that fail to maximize expected value. We reasoned that these ‘non-greedy’ decisions, instead of reflecting information seeking during choice, may be caused by computational noise in the learning of action values. Here, using reinforcement learning models of behavior and multimodal neurophysiological data, we show that the majority of non-greedy decisions stem from this learning noise. The trial-to-trial variability of sequential learning steps and their impact on behavior could be predicted both by blood oxygen level-dependent responses to obtained rewards in the dorsal anterior cingulate cortex and by phasic pupillary dilation, suggestive of neuromodulatory fluctuations driven by the locus coeruleus–norepinephrine system. Together, these findings indicate that most behavioral variability, rather than reflecting human exploration, is due to the limited computational precision of reward-guided learning.

Findling, Skvortsova et al. find that a large fraction of the non-greedy decisions that humans make in volatile environments stem not from exploration but from the limited precision of learning, and further identify the neurophysiological correlates of this learning noise.
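The central idea, that noise in learning alone can produce choices that look non-greedy, can be illustrated with a toy simulation. The sketch below is not the authors' fitted model. It assumes a simple delta-rule learner on a two-armed bandit with periodic reversals, a purely greedy (argmax) choice rule, and Gaussian noise on each value update whose standard deviation scales with the absolute prediction error. The function name `simulate`, the parameters `zeta` and `alpha`, the reward probabilities, and the reversal schedule are all illustrative assumptions.

```python
import random


def simulate(zeta, alpha=0.5, n_trials=500, seed=0):
    """Fraction of choices that look non-greedy to a noise-free observer.

    zeta  : hypothetical noise scale (SD of update noise = zeta * |prediction error|)
    alpha : hypothetical learning rate of the delta rule
    """
    rng = random.Random(seed)
    q_noisy = [0.5, 0.5]   # values the agent holds, corrupted by learning noise
    q_exact = [0.5, 0.5]   # values an exact learner would hold given the same history
    p_reward = [0.8, 0.2]  # assumed reward probabilities of the two actions
    non_greedy = 0

    for t in range(n_trials):
        # Volatility: swap the reward probabilities every 50 trials (assumption).
        if t > 0 and t % 50 == 0:
            p_reward.reverse()

        # The agent always picks the action it currently values most (no choice noise).
        choice = 0 if q_noisy[0] >= q_noisy[1] else 1
        # What an exact learner with the same history would have picked.
        greedy = 0 if q_exact[0] >= q_exact[1] else 1
        non_greedy += choice != greedy

        reward = 1.0 if rng.random() < p_reward[choice] else 0.0

        # Exact delta-rule update on the chosen action.
        q_exact[choice] += alpha * (reward - q_exact[choice])

        # Noisy update: same rule plus noise scaled by the size of the prediction error.
        delta = reward - q_noisy[choice]
        q_noisy[choice] += alpha * delta + rng.gauss(0.0, zeta * abs(delta))

    return non_greedy / n_trials


if __name__ == "__main__":
    for zeta in (0.0, 0.25, 0.5):
        print(f"zeta={zeta}: apparent non-greedy fraction = {simulate(zeta):.3f}")
```

With `zeta=0` the two value tracks coincide, so every choice is greedy by construction. With `zeta>0` the agent still chooses greedily with respect to its own values, yet some of its choices deviate from what exact learning would predict, which is the kind of variability the abstract attributes to learning noise rather than to exploration.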
