Ballooning Multi-Armed Bandits

In this paper, we introduce Ballooning Multi-Armed Bandits (BL-MAB), a novel extension of the classical stochastic MAB model in which the set of available arms grows (or balloons) over time. In contrast to the classical MAB setting, where regret is computed with respect to the single best arm overall, regret in the BL-MAB setting is computed with respect to the best arm available at each time step. We first observe that existing MAB algorithms are not regret-optimal for the BL-MAB model. We then show that if the best arm is equally likely to arrive at any time, sub-linear regret cannot be achieved, irrespective of the arrivals of the other arms. However, if the best arm is more likely to arrive in the early rounds, sub-linear regret is achievable. Our proposed algorithm determines (1) the fraction of the time horizon for which newly arriving arms should be explored and (2) the sequence of arm pulls during the exploitation phase, drawn from the explored arms. Under reasonable assumptions on the arrival distribution of the best arm, stated in terms of the thinness of the distribution's tail, we prove that the proposed algorithm achieves sub-linear instance-independent regret. We further quantify the explicit dependence of the regret on the parameters of the arrival distribution, and we reinforce our theoretical findings with extensive simulation results.
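To make the explore-then-exploit structure concrete, the following is a minimal, self-contained sketch, not the algorithm or constants derived in the paper: the function run_bl_mab, the fixed exploration fraction alpha = 0.3, and the plain UCB1 index used in the exploitation phase are illustrative assumptions; the paper instead derives the exploration fraction from the arrival distribution's tail parameters.

```python
import math
import random

def run_bl_mab(T, arrival_times, true_means, alpha=0.3, seed=0):
    """Illustrative BL-MAB-style strategy over a horizon of T rounds.

    arrival_times[i] is the round (1-indexed) at which arm i becomes available;
    true_means[i] is its Bernoulli reward mean. Assumes at least one arm is
    available from round 1. Only arms that arrive within the first
    tau = ceil(alpha * T) rounds are ever explored; the rest of the horizon
    exploits among the explored arms with a UCB-style index. Returns the
    cumulative regret measured against the best arm available at each round.
    """
    rng = random.Random(seed)
    tau = math.ceil(alpha * T)          # exploration window for new arrivals
    counts = [0] * len(true_means)      # number of pulls per arm
    sums = [0.0] * len(true_means)      # cumulative reward per arm
    regret = 0.0

    for t in range(1, T + 1):
        available = [i for i, a in enumerate(arrival_times) if a <= t]
        explored = [i for i in available if arrival_times[i] <= tau]
        pool = explored if explored else available  # fallback if nothing arrived by tau

        # Pull any explored-but-untried arm once; otherwise pick by UCB1 index.
        untried = [i for i in pool if counts[i] == 0]
        if untried:
            arm = untried[0]
        else:
            arm = max(pool, key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))

        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

        # Regret is against the best *available* arm at round t, per the BL-MAB definition.
        regret += max(true_means[i] for i in available) - true_means[arm]
    return regret

if __name__ == "__main__":
    # Hypothetical example: five arms arriving at rounds 1, 2, 50, 400, 900 over T = 1000.
    # The best arm (mean 0.9) arrives too late to be explored under alpha = 0.3,
    # illustrating why the arrival distribution of the best arm governs the regret.
    total = run_bl_mab(T=1000,
                       arrival_times=[1, 2, 50, 400, 900],
                       true_means=[0.40, 0.60, 0.70, 0.50, 0.90])
    print(f"cumulative regret: {total:.1f}")
```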
