Psychological models of human and optimal performance in bandit problems

In bandit problems, a decision-maker must choose between a set of alternatives, each of which has a fixed but unknown rate of reward, in order to maximize the total number of rewards obtained over a sequence of trials. Performing well in these problems requires balancing the need to search for highly rewarding alternatives with the need to capitalize on those alternatives already known to be reasonably good. Consistent with this motivation, we develop a new psychological model that relies on switching between latent exploration and exploitation states. We test the model over a range of two-alternative bandit problems, against both human and optimal decision-making data, comparing it to benchmark models from the reinforcement learning literature. By making inferences about the latent states from optimal decision-making behavior, we characterize how people should switch between exploration and exploitation. By making inferences from human data, we begin to characterize how people actually do switch. We discuss the implications of these findings for understanding and measuring the competing demands of exploration and exploitation in sequential decision-making.
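The latent-state idea described in the abstract can be illustrated with a minimal simulation. The sketch below is a hypothetical rendering, not the paper's fitted model: it assumes a two-alternative Bernoulli bandit, a latent state that flips between "explore" and "exploit" with a fixed switch probability `p_switch`, uniform random choice while exploring, and greedy choice of the arm with the highest smoothed observed success rate while exploiting. All parameter names and the transition rule are illustrative assumptions.

```python
import random

def run_bandit(reward_rates, n_trials=100, p_switch=0.1, seed=0):
    """Simulate a latent explore/exploit-state agent on a two-arm
    Bernoulli bandit and return the total reward earned.

    Illustrative sketch only: the state flips with probability
    p_switch after each trial, which is an assumption, not the
    model inferred in the paper.
    """
    rng = random.Random(seed)
    successes = [0, 0]
    failures = [0, 0]
    state = "explore"
    total_reward = 0
    for _ in range(n_trials):
        if state == "explore":
            # Exploration: sample an arm uniformly at random.
            arm = rng.randrange(2)
        else:
            # Exploitation: pick the arm with the highest observed
            # success rate (+1/+2 smoothing avoids division by zero).
            rates = [(successes[a] + 1) / (successes[a] + failures[a] + 2)
                     for a in range(2)]
            arm = 0 if rates[0] >= rates[1] else 1
        # Each arm pays off with a fixed but (to the agent) unknown rate.
        if rng.random() < reward_rates[arm]:
            successes[arm] += 1
            total_reward += 1
        else:
            failures[arm] += 1
        # Latent-state transition between trials.
        if rng.random() < p_switch:
            state = "exploit" if state == "explore" else "explore"
    return total_reward

print(run_bandit([0.8, 0.2]))
```

Fitting such a model to data, as the paper does, would instead treat the per-trial state sequence as latent and infer it (e.g., in a Bayesian graphical model) from observed choices.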
