Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration

Many large and small decisions we make in our daily lives (which ice cream to choose, what research projects to pursue, which partner to marry) require an exploration of alternatives before committing to and exploiting the benefits of a particular choice. Furthermore, many decisions require re-evaluation, and further exploration of alternatives, in the face of changing needs or circumstances. That is, our decisions often depend on a higher-level choice: whether to exploit well-known but possibly suboptimal alternatives or to explore risky but potentially more profitable ones. How adaptive agents choose between exploitation and exploration remains an important and open question that has received relatively limited attention in the behavioural and brain sciences. The choice could depend on a number of factors, including the familiarity of the environment, how quickly the environment is likely to change, and the relative value of exploiting known sources of reward versus the cost of reducing uncertainty through exploration. There is no known generally optimal solution to the exploration versus exploitation problem, and a solution to the general case may indeed not be possible. However, there have been formal analyses of the optimal policy under constrained circumstances. There have also been specific suggestions of how humans and animals may respond to this problem under particular experimental conditions, as well as proposals about the brain mechanisms involved. Here, we provide a brief review of this work, discuss how exploration and exploitation may be mediated in the brain, and highlight some promising future directions for research.
