The Max $K$-Armed Bandit: PAC Lower Bounds and Efficient Algorithms

We consider the Max $K$-Armed Bandit problem, in which a learning agent faces several stochastic arms, each a source of i.i.d. rewards with an unknown distribution. At each time step the agent chooses an arm and observes the reward of the obtained sample. Each sample is treated as a separate item whose value is its reward, and the goal is to find an item with the highest possible value. Our basic assumption is a known lower bound on the {\em tail function} of the reward distributions. Within the PAC framework, we provide a lower bound on the sample complexity of any $(\epsilon,\delta)$-correct algorithm, and propose an algorithm that attains this bound up to logarithmic factors. We analyze the robustness of the proposed algorithm, and further compare its performance to that of a variant in which the arms are indistinguishable to the agent and one is chosen uniformly at random at each stage. Interestingly, when the maximal rewards of the arms happen to be similar, the latter approach may perform better.
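
To make the setting concrete, below is a minimal simulation sketch in Python. All parameters are illustrative assumptions (heavy-tailed Pareto rewards, made-up per-arm scales, an explore-then-commit rule with a hypothetical exploration fraction), and the arm-aware strategy is only a naive stand-in, not the paper's algorithm. It contrasts arm-aware sampling with the random-arm variant described in the abstract.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: K arms with heavy-tailed rewards. The paper assumes
# a known lower bound on each arm's tail function P(reward > x); here we
# simply simulate rewards directly for illustration.
K = 5
scales = rng.uniform(0.5, 1.5, size=K)  # illustrative per-arm scale parameters

def pull(k):
    """Draw one i.i.d. reward from arm k (Pareto-tailed, for illustration)."""
    return scales[k] * rng.pareto(2.5)

def max_random_arms(budget):
    """Random-arm variant: ignore arm identities, sample an arm uniformly
    at random at each step, and return the largest reward seen."""
    return max(pull(rng.integers(K)) for _ in range(budget))

def max_greedy_arms(budget, explore=0.2):
    """Naive arm-aware stand-in: explore arms round-robin for part of the
    budget, then commit to the arm with the largest observed maximum.
    This is NOT the paper's algorithm, just a simple point of comparison."""
    n_explore = max(K, int(explore * budget))
    best = np.full(K, -np.inf)
    for t in range(n_explore):
        k = t % K
        best[k] = max(best[k], pull(k))
    k_star = int(np.argmax(best))  # commit to the empirically best arm
    overall = best.max()
    for _ in range(budget - n_explore):
        overall = max(overall, pull(k_star))
    return overall

budget = 2000
print("random arms:", np.mean([max_random_arms(budget) for _ in range(50)]))
print("greedy arms:", np.mean([max_greedy_arms(budget) for _ in range(50)]))

Consistent with the abstract's observation, when the arms' maximal rewards are close (e.g., near-equal scales), the arm-aware strategy's exploration budget buys little, and uniformly random arm selection can match or beat it.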
