Learning near-optimal search in a minimal explore/exploit task

Ke Sang (kesang@indiana.edu), Peter M. Todd (pmtodd), Robert L. Goldstone (rgoldsto)
Cognitive Science Program and Department of Psychological and Brain Sciences, Indiana University
1101 E. 10th Street, Bloomington, IN 47405 USA

Abstract

How well do people search an environment for non-depleting resources of different quality, where it is necessary to switch between exploring for new resources and exploiting those already found? Employing a simple card selection task to study exploitation and exploration, we find that the total resources accrued, the number of switches between exploring and exploiting, and the number of trials until stable exploitation become more similar to those of the optimal strategy as experience increases across searches. Subjects learned to adjust their effective (implicit) thresholds for exploitation toward the optimal threshold over 30 searches. Those implicit thresholds decrease over turns within each search, just as the optimal threshold does, but subjects’ explicitly stated exploitation threshold increases over turns. Nonetheless, both the explicit and learned implicit thresholds produced performance close to optimal.

Keywords: exploration; exploitation; explore/exploit tradeoff; optimal search; threshold strategy.

Introduction

Search is a ubiquitous requirement of everyday life. Scientists search for information to help their research; web users rely on search engines such as Google to find what interests them on the internet; companies search for the best candidates for their job openings; consumers facing hundreds of brands of candy in a supermarket have to decide whether they have found one that is good enough or whether they should explore further to find something even tastier. In many real-life situations, whether to keep searching (explore) or to stop searching (and exploit the fruits of the search) is a key decision.
Organisms have to make tradeoffs between exploration and exploitation to improve their success in the environment. Consider a honeybee searching for nectar in flowers. Suppose the honeybee has visited a particular plant and found most of the nectar in its flowers. The bee must decide whether it is worth spending more time to find still more nectar on this plant, exploiting it further, or whether it would be better off leaving this plant and exploring to look for another. Staying too long on this plant is wasteful, because the bee could move to another plant with a higher initial rate of nectar supply; however, leaving the current plant too early is also suboptimal, because travelling between resource patches costs time and energy, and there is uncertainty about the resource levels of flowers that have not yet been visited. To maximize its intake of nectar, the bee needs a decision rule that balances exploration of new resource sites with exploitation of known ones (Charnov, 1976).

The same tradeoff between exploiting what you already have and exploring further to find something preferable applies to humans. For instance, should you take the parking space you have just found or keep driving closer to your destination, hoping to find a better one? Should you stick with your current job, or partner, or brand of coffee, or explore further to see if there are better options to be found?

Many researchers have studied aspects of exploration versus exploitation. Optimal decision mechanisms and heuristic rules of thumb have been proposed to model when animals leave patches to find new ones (Charnov, 1976; Bell, 1991; Livoreil & Giraldeau, 1997; Wajnberg, Fauvergue, & Pons, 2000). Mathematicians have studied optimal stopping problems, in which the task is to decide when to stop the exploration phase of search and exploit a particular chosen option; Ferguson (1989) reviews work on one well-known form of this task, the so-called Secretary Problem.
Todd and Miller (1999) applied this kind of framework to the problem of searching for a mate, studying simple heuristics that could work well for stopping exploratory search once an appropriate partner was encountered, and Beckage, Todd, Penke, and Asendorpf (2009) found evidence that people searching for mates at speed-dating events use such rules. Lee (2006) developed hierarchical Bayesian models to account for human decision making on an optimal stopping problem.

Different resource types and environmental structures call for different search strategies. Thus, how well humans perform in experiments involving the exploration/exploitation tradeoff depends on the task details, which influence not only the optimal search strategies but also the strategies that subjects actually employ. In this paper we focus on search behavior in a resource-accumulation setting, in which individuals make a series of decisions about whether to explore to find a new resource or to exploit a previously encountered one, accumulating value from both newly found resources and previously discovered, currently exploited resources as they search.

Search Task

In the experiment, individuals had to accrue as many points from cards as possible over a 20-turn game. On each turn, a subject could either explore, by flipping over a card with unknown points from a card deck, or exploit a card already uncovered, by selecting it on the computer screen. Because resources (here, points) accumulate during both exploration and exploitation, and because previously found items can be revisited, this search task resembles a non-competitive foraging task with non-depleting resources.
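
To make the structure of the task concrete, the following is a minimal simulation sketch of one 20-turn search. It is illustrative only and is not the experiment software. Several details are assumptions that do not come from this section: the card values are drawn uniformly from 1 to 99, the function name `play_search` is hypothetical, and the player follows a simple fixed-threshold rule (explore until the best uncovered card reaches the threshold, then exploit that card for the remaining turns).

```python
import random

N_TURNS = 20                 # game length stated in the task description
CARD_MIN, CARD_MAX = 1, 99   # ASSUMPTION: the point range is not given in this section


def play_search(threshold, rng, n_turns=N_TURNS):
    """Simulate one search under a fixed-threshold rule (a hypothetical policy).

    The player explores (flips a new card) until the best uncovered card is
    worth at least `threshold`, then exploits that best card on every
    remaining turn. Points are collected on every turn, whether the player
    explores or exploits, and exploited cards do not deplete.
    """
    best = None    # value of the best card uncovered so far
    total = 0      # points accumulated over the search
    actions = []   # record of "explore" / "exploit" choices
    for _ in range(n_turns):
        if best is not None and best >= threshold:
            total += best                            # exploit a known card
            actions.append("exploit")
        else:
            card = rng.randint(CARD_MIN, CARD_MAX)   # explore: flip a new card
            total += card
            best = card if best is None else max(best, card)
            actions.append("explore")
    return total, actions


if __name__ == "__main__":
    rng = random.Random(0)
    for threshold in (30, 60, 90):
        scores = [play_search(threshold, rng)[0] for _ in range(10_000)]
        print(threshold, sum(scores) / len(scores))
```

Comparing mean scores across thresholds in this way shows the tradeoff: a low threshold locks in a mediocre card early, whereas a high threshold spends many turns on cards of average value before settling. A threshold that varies with the turn number, as discussed in the abstract, would require replacing the fixed `threshold` argument with a function of the current turn.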