Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems

Abstract Multi-armed bandit tasks have been extensively used to model the problem of balancing exploitation and exploration. One of the most challenging variants of the multi-armed bandit problem (MABP) is the non-stationary bandit problem, where the agent faces the additional complexity of detecting changes in its environment. In this paper we examine a non-stationary, discrete-time, finite-horizon bandit problem with a finite number of arms and Gaussian rewards. An important family of ad hoc methods exists that is suitable for non-stationary bandit tasks. These learning algorithms, which offer intuition-based solutions to the exploitation–exploration trade-off, have the advantage of not relying on strong theoretical assumptions while at the same time being tunable to produce near-optimal results. An entirely different approach to the non-stationary multi-armed bandit problem is offered by evolutionary algorithms. We present an evolutionary algorithm implemented to solve the non-stationary bandit problem, along with ad hoc solution algorithms, namely action-value methods with ε-greedy and softmax action selection rules, the probability matching method, and the adaptive pursuit method. We conducted a number of simulation-based experiments and discuss the methods' performance on the basis of the numerical results obtained.
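As a minimal sketch of the kind of ad hoc method the abstract refers to, the snippet below implements a constant-step-size action-value method with ε-greedy selection on a non-stationary bandit with Gaussian rewards. The random-walk drift of the arm means, the parameter values, and the function name are illustrative assumptions, not the paper's exact experimental setup.

```python
import random

def run_eps_greedy(n_arms, horizon, init_means, drift=0.01, epsilon=0.1, alpha=0.1, seed=0):
    """Constant-step-size action-value method with epsilon-greedy selection
    on a non-stationary Gaussian bandit (illustrative sketch: arm means drift
    by an independent Gaussian random walk; parameters are assumptions)."""
    rng = random.Random(seed)
    q = [0.0] * n_arms          # action-value estimates
    means = list(init_means)    # current true means, changing over time
    total_reward = 0.0
    for _ in range(horizon):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)
        else:
            a = max(range(n_arms), key=lambda i: q[i])
        # Gaussian reward drawn from the current (drifting) mean
        r = rng.gauss(means[a], 1.0)
        total_reward += r
        # constant step size gives a recency-weighted average,
        # which lets the estimate track a non-stationary mean
        q[a] += alpha * (r - q[a])
        # environment change: each arm's mean takes a random-walk step
        for i in range(n_arms):
            means[i] += rng.gauss(0.0, drift)
    return total_reward

# usage: 10 arms, horizon of 10,000 steps, all means starting at 0
print(run_eps_greedy(n_arms=10, horizon=10_000, init_means=[0.0] * 10))
```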
