Structure Learning in Human Sequential Decision-Making

Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that has perfect knowledge of the model of how rewards and events are generated in the environment. We argue that, rather than being suboptimal, humans face a more complex learning problem, one that also involves learning the structure of reward generation in the environment. We formulate the problem of structure learning in sequential decision tasks using Bayesian reinforcement learning, and show that learning the generative model for rewards qualitatively changes the behavior of an optimal learning agent. To test whether people exhibit structure learning, we performed experiments involving a mixture of one-armed and two-armed bandit reward models, where structure learning produces many of the qualitative behaviors deemed suboptimal in previous studies. Our results demonstrate that humans can perform structure learning in a near-optimal manner.
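To make the notion of structure learning concrete, the sketch below (Python) shows a Beta-Bernoulli agent that maintains a posterior over two candidate reward-generation structures for a two-arm task, rather than assuming one fixed generative model. This is a minimal illustration, not the paper's model: the two hypotheses, the uniform priors, and the function names are assumptions introduced here for exposition.

```python
from math import exp
from scipy.special import betaln

# Minimal sketch (illustrative, not the paper's exact model): an agent that
# tracks a posterior over two candidate reward structures for a two-arm task.
#   H_indep   : each arm has its own unknown Bernoulli reward rate
#   H_coupled : the arms share one rate p, with arm 1 paying at p and
#               arm 2 at 1 - p (an illustrative "coupled" structure)

def beta_bernoulli_evidence(successes, failures, a=1.0, b=1.0):
    """Marginal likelihood of an ordered Bernoulli sequence under a Beta(a, b) prior."""
    return exp(betaln(a + successes, b + failures) - betaln(a, b))

def structure_posterior(counts, prior_indep=0.5):
    """counts[arm] = (successes, failures) observed on each of the two arms."""
    (s1, f1), (s2, f2) = counts
    # Independent structure: a separate rate for each arm.
    ev_indep = beta_bernoulli_evidence(s1, f1) * beta_bernoulli_evidence(s2, f2)
    # Coupled structure: a success on arm 2 is evidence for a *low* shared rate,
    # so arm 2's successes count against the shared parameter.
    ev_coupled = beta_bernoulli_evidence(s1 + f2, f1 + s2)
    post_indep = prior_indep * ev_indep
    post_coupled = (1.0 - prior_indep) * ev_coupled
    z = post_indep + post_coupled
    return post_indep / z, post_coupled / z

# Example: arm 1 pays often and arm 2 rarely, which favors the coupled structure.
counts = [(8, 2), (1, 9)]
print(structure_posterior(counts))
```

An optimal learner of this kind behaves differently from a fixed-model agent: early choices are partly aimed at disambiguating the structure itself, which can look like suboptimal exploration when judged against an actor that already knows the true reward model.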
