Sources of suboptimality in a minimalistic explore–exploit task

People often choose between sticking with an available good option (exploitation) and trying out a new option that is uncertain but potentially more rewarding (exploration)1,2. Laboratory studies on explore–exploit decisions often contain real-world complexities such as non-stationary environments, stochasticity under exploitation and unknown reward distributions3–7. However, such factors might limit the researcher’s ability to understand the essence of people’s explore–exploit decisions. For this reason, we introduce a minimalistic task in which the optimal policy is to start off exploring and to switch to exploitation at most once in each sequence of decisions. The behaviour of 49 laboratory and 143 online participants deviated both qualitatively and quantitatively from the optimal policy, even when allowing for bias and decision noise. Instead, people seem to follow a suboptimal rule in which they switch from exploration to exploitation when the highest reward so far exceeds a certain threshold. Moreover, we show that this threshold decreases approximately linearly with the proportion of the sequence that remains, suggesting a temporal ratio law. Finally, we find evidence for ‘sequence-level’ variability that is shared across all decisions in the same sequence. Our results emphasize the importance of examining sequence-level strategies and their variability when studying sequential decision-making.

How good are people at choosing between exploration and exploitation? In a task that captures the essence of such decisions, Song et al. found systematic deviations from optimality that were associated with the sequence of decisions participants can make.
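As an illustration of the threshold rule described in the abstract, the following Python sketch switches from exploration to exploitation once the highest reward so far exceeds a threshold that is linear in the proportion of the sequence that remains. It is a minimal sketch rather than the authors' fitted model: the function names, the `intercept` and `slope` parameters, the reward scale and the numerical values are all hypothetical, and the negative slope is chosen only to mirror the wording of the abstract.

```python
def threshold(proportion_remaining, intercept, slope):
    """Linear threshold in the proportion of the sequence that remains.

    `intercept` and `slope` are hypothetical free parameters, not values
    reported in the source.
    """
    return intercept + slope * proportion_remaining


def choose(best_so_far, trials_remaining, sequence_length, intercept, slope):
    """Return 'exploit' if the highest reward so far exceeds the threshold."""
    proportion_remaining = trials_remaining / sequence_length
    if best_so_far > threshold(proportion_remaining, intercept, slope):
        return "exploit"
    return "explore"


if __name__ == "__main__":
    # Assumed values for illustration only (rewards on an arbitrary 0-1 scale).
    intercept, slope = 0.9, -0.4
    for trials_remaining in (8, 5, 2):
        decision = choose(0.6, trials_remaining, 10, intercept, slope)
        print(trials_remaining, decision)
```

Because the rule depends on the remaining trials only through their ratio to the sequence length, sequences of different lengths share one threshold function, which is the sense in which the abstract speaks of a ratio law.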

[1]  Masataka Watanabe Reward expectancy in primate prefrontal neurons , 1996, Nature.

[2]  Ke Sang,et al.  Modeling exploration/exploitation behavior and the effect of individual differences , 2017 .

[3]  Gordon D. A. Brown,et al.  A temporal ratio model of memory. , 2007, Psychological review.

[4]  J. Cavanaugh Unifying the derivations for the Akaike and corrected Akaike information criteria , 1997 .

[5]  N. Daw,et al.  Learning the opportunity cost of time in a patch-foraging task , 2015, Cognitive, Affective, & Behavioral Neuroscience.

[6]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[7]  T. Ormerod,et al.  Human performance on the traveling salesman problem , 1996, Perception & psychophysics.

[8]  Sean R Eddy,et al.  What is dynamic programming? , 2004, Nature Biotechnology.

[9]  Emmanuel Procyk,et al.  Specific frontal neural dynamics contribute to decisions to check , 2016, Nature Communications.

[10]  Catalin V. Buhusi,et al.  What makes us tick? Functional and neural mechanisms of interval timing , 2005, Nature Reviews Neuroscience.

[11]  Robert L. Goldstone,et al.  Learning near-optimal search in a minimal explore/exploit task , 2011, CogSci.

[12]  Darryl A. Seale,et al.  Optimal stopping behavior with relative ranks: the secretary problem with unknown population size , 2000 .

[13]  P. Dayan,et al.  Cortical substrates for exploratory decisions in humans , 2006, Nature.

[14]  Gerd Gigerenzer,et al.  Heuristic decision making. , 2011, Annual review of psychology.

[15]  Jonathan D. Cohen,et al.  Humans use directed and random exploration to solve the explore-exploit dilemma. , 2014, Journal of experimental psychology. General.

[16]  Ben R. Newell,et al.  Unpacking the Exploration–Exploitation Tradeoff: A Synthesis of Human and Animal Literatures , 2015 .

[17]  H. Robbins Some aspects of the sequential design of experiments , 1952 .

[18]  H. Simon,et al.  Rational choice and the structure of the environment. , 1956, Psychological review.

[19]  Ben R. Newell,et al.  Learning and choosing in an uncertain world: An investigation of the explore–exploit dilemma in static and dynamic environments , 2016, Cognitive Psychology.

[20]  J. Gibbon Scalar expectancy theory and Weber's law in animal timing. , 1977 .

[21]  Marco K. Wittmann,et al.  Multiple Neural Mechanisms of Decision Making and Their Competition under Changing Risk Pressure , 2014, Neuron.

[22]  Todd M. Gureckis,et al.  Exploratory Choice Reflects the Future Value of Information , 2018, Decision.

[23]  N. Chater,et al.  Simplicity: a unifying principle in cognitive science? , 2003, Trends in Cognitive Sciences.

[24]  Christopher T. Kello,et al.  Scaling laws in cognitive sciences , 2010, Trends in Cognitive Sciences.

[25]  M. Lee,et al.  A Bayesian analysis of human decision-making on bandit problems , 2009 .

[26]  Darryl A. Seale,et al.  Sequential Decision Making with Relative Ranks: An Experimental Investigation of the "Secretary Problem" , 1997 .

[27]  P. Stone,et al.  The Nature of Belief-Directed Exploratory Choice in Human Decision-Making , 2011, Front. Psychology.

[28]  Timothy Edward John Behrens,et al.  How Green Is the Grass on the Other Side? Frontopolar Cortex and the Evidence in Favor of Alternative Courses of Action , 2009, Neuron.

[29]  Karl J. Friston,et al.  Bayesian model selection for group studies , 2009, NeuroImage.

[30]  Angela J. Yu,et al.  Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration , 2007, Philosophical Transactions of the Royal Society B: Biological Sciences.

[31]  K. Doya,et al.  Validation of Decision-Making Models and Analysis of Decision Variables in the Rat Basal Ganglia , 2009, The Journal of Neuroscience.

[32]  E. Charnov Optimal foraging, the marginal value theorem. , 1976, Theoretical population biology.

[33]  Jessica B. Hamrick,et al.  psiTurk: An open-source framework for conducting replicable behavioral experiments online , 2016, Behavior research methods.

[34]  E. Miller,et al.  Neuronal activity in primate dorsolateral and orbital prefrontal cortex during performance of a reward preference task , 2003, The European journal of neuroscience.

[35]  Y. C. Hsiao An experimental investigation of the secretary problem: factors affecting sequential search behaviour. , 2018 .

[36]  P. Glimcher,et al.  Dynamic response-by-response models of matching behavior in rhesus monkeys , 2005, Journal of the Experimental Analysis of Behavior.

[37]  E. Miller,et al.  An integrative theory of prefrontal cortex function. , 2001, Annual review of neuroscience.

[38]  Wei Ji Ma,et al.  A computational model for decision tree search , 2017, CogSci.

[39]  Nicole M. Long,et al.  Supplemental Figure , 2013 .

[40]  Sharon M. Gray Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior. , 2003 .

[41]  Thomas T. Hills,et al.  The central executive as a search process: priming exploration and exploitation across domains. , 2010, Journal of experimental psychology. General.

[42]  Paul R. Schrater,et al.  Bayesian modeling of human sequential decision-making on the multi-armed bandit problem , 2008 .

[43]  D. Barraclough,et al.  Prefrontal cortex and decision making in a mixed-strategy game , 2004, Nature Neuroscience.

[44]  Karl J. Friston,et al.  Bayesian model selection for group studies — Revisited , 2014, NeuroImage.

[45]  Michael D. Lee,et al.  Psychological models of human and optimal performance in bandit problems , 2011, Cognitive Systems Research.

[46]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[47]  H. Akaike A new look at the statistical model identification , 1974 .