An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits

In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to a regret that is logarithmic with respect to the number of arm pulls.
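The abstract does not give the update equations, so the following is only a minimal sketch, in Python, of the general mechanism it describes: a soft-max (Gibbs) policy over empirical mean rewards whose exploration parameter is increased each round on a simulated-annealing-style schedule, shifting the search from exploration toward exploitation. The function name voi_bandit, the geometric schedule beta *= growth, and all parameter values are illustrative assumptions, not taken from the paper.

    import numpy as np

    def voi_bandit(pull, n_arms, n_rounds, beta0=0.1, growth=1.05, rng=None):
        # Sketch only: a Gibbs/soft-max policy over empirical mean rewards
        # with an annealed exploration parameter (an assumption about the
        # method's general form, not the paper's exact criterion).
        rng = np.random.default_rng() if rng is None else rng
        counts = np.zeros(n_arms)
        means = np.zeros(n_arms)
        beta = beta0
        total_reward = 0.0
        for _ in range(n_rounds):
            # Soft-max policy over the current reward estimates.
            logits = beta * means
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            arm = int(rng.choice(n_arms, p=probs))

            reward = pull(arm)                                  # stochastic reward
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]   # running mean
            total_reward += reward

            beta *= growth   # anneal: larger beta favors exploitation
        return means, total_reward

    # Example usage with three Bernoulli arms (success probabilities 0.2, 0.5, 0.8).
    arm_probs = [0.2, 0.5, 0.8]
    rng = np.random.default_rng(0)
    means, total = voi_bandit(lambda a: float(rng.random() < arm_probs[a]),
                              n_arms=3, n_rounds=5000, rng=rng)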
