Strategic exploration in human adaptive control

How do people explore to gain rewards in uncertain dynamical systems? In a reinforcement learning framework, control involves a trade-off between exploration (trying out actions to learn more about the system) and exploitation (using current knowledge of the system to maximize reward). We study a novel control task in which participants must steer a boat on a grid, and we assess whether participants explore strategically in order to earn higher rewards later on. We find that participants explore strategically yet conservatively: they explore more when mistakes are less costly, and they practice actions that will be needed later.
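The exploration-exploitation trade-off described above can be illustrated with a minimal sketch (not the authors' task or model): an epsilon-greedy agent on a two-armed bandit, which takes a random action with probability epsilon (exploration) and otherwise picks the action with the highest estimated value (exploitation). All names and parameter values here are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

random.seed(0)
pay = [0.3, 0.7]      # true reward probabilities of the two arms
q = [0.0, 0.0]        # running value estimates
counts = [0, 0]       # how often each arm was chosen

for _ in range(1000):
    a = epsilon_greedy(q, epsilon=0.1)
    r = 1.0 if random.random() < pay[a] else 0.0
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]  # incremental mean update
```

After enough trials the agent's estimates approach the true payoff rates and it mostly exploits the better arm, while the residual epsilon fraction of random choices keeps the estimate of the worse arm from going stale.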
