Uncertainty and Exploration in a Restless Bandit Problem

Decision making in noisy and changing environments requires a fine balance between exploiting knowledge about good courses of action and exploring the environment in order to improve upon this knowledge. We present an experiment on a restless bandit task in which participants made repeated choices between options whose average rewards changed over time. Comparing a number of computational models of participants' behavior in this task, we find evidence that a substantial proportion of participants balanced exploration and exploitation by considering the probability that an option offers the maximum reward of all the available options.
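The strategy the abstract describes corresponds to probability-of-maximizing choice, better known as Thompson sampling: pick each option with probability equal to the posterior probability that it currently has the highest mean reward. The sketch below illustrates this idea in Python for a restless bandit in which each arm's drifting mean is tracked with a per-arm Kalman filter; this is a minimal illustration, and all parameter values (number of arms, trials, noise standard deviations) are assumptions for the example, not values from the experiment.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative restless-bandit parameters (not taken from the paper):
K = 4            # number of arms
T = 200          # number of trials
sigma_obs = 4.0  # SD of reward noise around each arm's mean
sigma_walk = 1.0 # SD of the random walk driving each arm's mean

# True arm means follow independent Gaussian random walks.
mu_true = rng.normal(0.0, 10.0, K)

# Kalman filter state: posterior mean and variance for each arm.
m = np.zeros(K)        # posterior means
v = np.full(K, 100.0)  # posterior variances (diffuse prior)

for t in range(T):
    # Prediction step: because the means drift, uncertainty about
    # every arm grows on every trial, even unchosen ones.
    v = v + sigma_walk**2

    # Thompson sampling: draw one value from each arm's posterior and
    # choose the arm with the largest draw. Under the posterior, each
    # arm is chosen with exactly the probability that it currently
    # offers the maximum mean reward.
    draws = rng.normal(m, np.sqrt(v))
    a = int(np.argmax(draws))

    # Observe a noisy reward from the chosen arm.
    r = rng.normal(mu_true[a], sigma_obs)

    # Kalman update for the chosen arm only.
    gain = v[a] / (v[a] + sigma_obs**2)
    m[a] += gain * (r - m[a])
    v[a] *= 1.0 - gain

    # The environment drifts: this is the "restless" part of the task.
    mu_true += rng.normal(0.0, sigma_walk, K)

Note how this scheme balances exploration and exploitation without any explicit exploration parameter: an arm that has not been chosen recently accumulates variance, so its posterior occasionally produces the largest draw and the arm is resampled.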
