论文信息 - A Change Detection Model for Non-Stationary k-Armed Bandit Problems

A Change Detection Model for Non-Stationary k-Armed Bandit Problems

Our presentation deals with the question “Is cognitive science relevant to AI problems?”. We support the position that cross-fertilization between the two fields is not only possible, but necessary. We provide a concrete example where inspiration from cognitive science led to concrete lines of inquiry in reinforcement learning. Learning in a non-stationary environment requires quick and accurate estimation of the world's current state, as well as fast adaptation to change. Psychologists have described a kind of behavior in animals known as matching (Herrnstein, 1961). Matching behavior is the tendency of subjects to match the relative time and effort they invest in various foraging options so that the investment proportions match the income proportions. This behavior has been consistently observed in experiments with mice and other species, and it has been shown that subjects robustly and quickly adapt their behavior to match abrupt changes in income distribution. In this paper we investigate the use of change-detection models proposed by Gallistel et al (2001) to explain animal behavior, in the context of non-stationary k-armed bandits. We apply this change-detection scheme to the problem of reacting to changes in channel availability in wireless networks. We show that payoff prediction can be improved by reestimation of distribution parameters after an abrupt change has been detected. We compare two strategies for a nonstationary two-armed bandit. Each arm, A1 and A2, when pulled, pays a certain amount according to a Poisson distribution with parameters λ1 and λ2 respectively. After a random number of trials, these parameters change abruptly, and the optimal policy switches from one arm to the other. The first strategy is standard epsilon-greedy. After pulling arm Ai at step t, its expected payoff E(Ai) is updated according to the rule: Et(Ai) = (1-α) Et-1(Ai) + α rt, where rt is the immediate payoff just obtained. The second strategy uses a change-detection model to identify an inflection point in the payoff rates. Once a change has been detected at point b, the current estimate is reset to the immediate payoff at that point, Et(Ai)=rb We show how this model can significantly improve performance on general non-stationary k-armed bandits, and present real-world experiments on the problem of choosing among a set of k possible packet sizes to transmit over a wireless network. Compilation copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Michael L. Littman | Carlos Diuk