Cheap but Clever: Human Active Learning in a Bandit Setting

How people achieve long-term goals in an imperfectly known environment, through repeated tries and noisy outcomes, is an important problem in cognitive science. There are two interrelated questions: how humans represent information, both what has been learned and what can still be learned, and how they choose actions, in particular how they negotiate the tension between exploration and exploitation. In this work, we examine human behavioral data in a multi-armed bandit setting, in which the subject chooses one of four “arms” to pull on each trial and receives a binary outcome (win/lose). We implement both the Bayes-optimal policy, which maximizes the expected cumulative reward in this finite-horizon bandit environment, and a variety of heuristic policies that vary in the complexity of their information representation and decision policy. We find that the knowledge gradient algorithm, which combines exact Bayesian learning with a decision policy that maximizes a combination of immediate reward gain and long-term knowledge gain, captures subjects’ trial-by-trial choices best among all the models considered; it also provides the best approximation to the computationally intensive optimal policy among all the heuristic policies.
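
The knowledge gradient rule described above is concrete enough to sketch. The following is a minimal illustration, not the authors' code: it assumes Beta-Bernoulli posteriors over each arm's win rate and the standard finite-horizon ("online") form of the rule, in which the expected one-step knowledge gain is weighted by the number of trials remaining; the function name kg_choose and all parameter values are hypothetical.

```python
import numpy as np

def kg_choose(alpha, beta, remaining):
    """Knowledge-gradient choice for a finite-horizon Bernoulli bandit.

    alpha, beta : arrays of Beta posterior parameters, one entry per arm.
    remaining   : number of trials left after the current one.
    Returns the index of the arm to pull.
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    mu = alpha / (alpha + beta)          # posterior mean win rate per arm
    best = mu.max()                      # current best posterior mean

    kg = np.empty_like(mu)
    for k in range(len(mu)):
        # Posterior means of the other arms are unchanged by pulling arm k.
        others = np.delete(mu, k).max() if len(mu) > 1 else -np.inf
        # Posterior mean of arm k after observing a win or a loss.
        mu_win = (alpha[k] + 1) / (alpha[k] + beta[k] + 1)
        mu_loss = alpha[k] / (alpha[k] + beta[k] + 1)
        # Expected best posterior mean after one pull of arm k,
        # averaging over the two outcomes with predictive probability mu[k].
        exp_best = mu[k] * max(mu_win, others) + (1 - mu[k]) * max(mu_loss, others)
        kg[k] = exp_best - best          # expected knowledge gain

    # Online KG: immediate reward plus knowledge gain scaled by the horizon.
    return int(np.argmax(mu + remaining * kg))

# Hypothetical usage: one 15-trial game with four arms (win rates illustrative).
rng = np.random.default_rng(0)
true_p = np.array([0.2, 0.4, 0.6, 0.8])  # unknown true win rates
a, b = np.ones(4), np.ones(4)            # Beta(1, 1) priors
T = 15
for t in range(T):
    k = kg_choose(a, b, remaining=T - t - 1)
    win = rng.random() < true_p[k]
    a[k] += win                          # exact Bayesian update
    b[k] += 1 - win
```

Unlike the Bayes-optimal policy, which requires dynamic programming over the full space of posterior belief states, this rule needs only a one-step lookahead per arm, which is what makes it computationally cheap.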