Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems

Abstract Multi-armed bandit tasks have been extensively used to model the problem of balancing exploitation and exploration. One of the most challenging variants of the multi-armed bandit problem (MABP) is the non-stationary bandit problem, where the agent faces the additional complexity of detecting changes in its environment. In this paper we examine a non-stationary, discrete-time, finite-horizon bandit problem with a finite number of arms and Gaussian rewards. An important family of ad hoc methods exists that is suitable for non-stationary bandit tasks. These learning algorithms, which offer intuition-based solutions to the exploitation–exploration trade-off, have the advantage of not relying on strong theoretical assumptions while at the same time being tunable to produce near-optimal results. An entirely different approach to the non-stationary multi-armed bandit problem is offered by evolutionary algorithms. We present an evolutionary algorithm implemented to solve the non-stationary bandit problem, along with ad hoc solution algorithms, namely action-value methods with ε-greedy and softmax action selection rules, the probability matching method, and the adaptive pursuit method. We conducted a number of simulation-based experiments and discuss the methods' performance on the basis of the numerical results obtained.
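As a minimal sketch of the kind of ad hoc method the abstract refers to, the snippet below implements a constant-step-size action-value method with ε-greedy selection on a non-stationary bandit with Gaussian rewards. The random-walk drift of the arm means, the parameter values, and the function name are illustrative assumptions, not the paper's exact experimental setup.

```python
import random

def run_eps_greedy(n_arms, horizon, init_means, drift=0.01, epsilon=0.1, alpha=0.1, seed=0):
    """Constant-step-size action-value method with epsilon-greedy selection
    on a non-stationary Gaussian bandit (illustrative sketch: arm means drift
    by an independent Gaussian random walk; parameters are assumptions)."""
    rng = random.Random(seed)
    q = [0.0] * n_arms          # action-value estimates
    means = list(init_means)    # current true means, changing over time
    total_reward = 0.0
    for _ in range(horizon):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)
        else:
            a = max(range(n_arms), key=lambda i: q[i])
        # Gaussian reward drawn from the current (drifting) mean
        r = rng.gauss(means[a], 1.0)
        total_reward += r
        # constant step size gives a recency-weighted average,
        # which lets the estimate track a non-stationary mean
        q[a] += alpha * (r - q[a])
        # environment change: each arm's mean takes a random-walk step
        for i in range(n_arms):
            means[i] += rng.gauss(0.0, drift)
    return total_reward

# usage: 10 arms, horizon of 10,000 steps, all means starting at 0
print(run_eps_greedy(n_arms=10, horizon=10_000, init_means=[0.0] * 10))
```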
