论文信息 - Multi-Bandit Best Arm Identification

Multi-Bandit Best Arm Identification

We study the problem of identifying the best arm in each of the bandits in a multi-bandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.

[1] H. Robbins. Some aspects of the sequential design of experiments , 1952 .

[2] Andrew W. Moore,et al. Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation , 1993, NIPS.

[3] Michail G. Lagoudakis,et al. Reinforcement Learning as Classification: Leveraging Modern Classifiers , 2003, ICML.

[4] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[5] Shie Mannor,et al. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems , 2006, J. Mach. Learn. Res..

[6] Csaba Szepesvári,et al. Tuning Bandit Algorithms in Stochastic Environments , 2007, ALT.

[7] Csaba Szepesvári,et al. Empirical Bernstein stopping , 2008, ICML '08.

[8] Christos Dimitrakakis,et al. Rollout sampling approximate policy iteration , 2008, Machine Learning.

[9] Rémi Munos,et al. Pure Exploration for Multi-Armed Bandit Problems , 2008, ArXiv.

[10] Massimiliano Pontil,et al. Empirical Bernstein Bounds and Sample-Variance Penalization , 2009, COLT.

[11] Rémi Munos,et al. Pure Exploration in Multi-armed Bandits Problems , 2009, ALT.

[12] Dominik D. Freydenberger,et al. Can We Learn to Gamble Efficiently? , 2010, COLT.

[13] Joelle Pineau,et al. Active learning for personalizing treatment , 2011, 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).