Multi-Player Bandits Models Revisited

Multi-player Multi-Armed Bandits (MAB) have been extensively studied in the literature, motivated by applications to Cognitive Radio systems. Driven by the same applications, we motivate the introduction of several levels of feedback for multi-player MAB algorithms. Most existing work assumes that sensing information is available to the algorithm. Under this assumption, we improve the state-of-the-art lower bound on the regret of any decentralized algorithm and introduce two algorithms, RandTopM and MCTopM, which we show empirically outperform existing algorithms. Moreover, we provide strong theoretical guarantees for these algorithms, including a notion of asymptotic optimality in terms of the number of selections of bad arms. We then introduce a promising heuristic, called Selfish, that can operate without sensing information, which is crucial for emerging applications to Internet of Things networks. We investigate the empirical performance of this algorithm and provide initial theoretical elements toward understanding its behavior.
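To make the setting concrete, below is a minimal simulation sketch of the sensing-free Selfish heuristic described above: each player independently runs a UCB1 index on the reward it actually receives, where a collision (two or more players on the same arm) yields zero reward, so collisions are absorbed directly into the reward statistics. The arm means, horizon, player count, collision model, and use of plain UCB1 are illustrative assumptions for this sketch, not the paper's exact protocol or experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.1, 0.3, 0.5, 0.7, 0.9])  # assumed Bernoulli arm means
K, M, T = len(means), 3, 10_000               # arms, players, horizon (illustrative)

counts = np.zeros((M, K))   # number of pulls per (player, arm)
rewards = np.zeros((M, K))  # cumulative *received* reward per (player, arm)

for t in range(1, T + 1):
    choices = np.empty(M, dtype=int)
    for j in range(M):
        untried = np.flatnonzero(counts[j] == 0)
        if untried.size:                 # initialization: play each arm once
            choices[j] = untried[0]
        else:                            # UCB1 index on received rewards only
            ucb = rewards[j] / counts[j] + np.sqrt(2 * np.log(t) / counts[j])
            choices[j] = np.argmax(ucb)
    # Assumed collision model: an arm chosen by >= 2 players pays 0 to all of them.
    picked, n_picked = np.unique(choices, return_counts=True)
    collided = set(picked[n_picked >= 2])
    for j in range(M):
        k = choices[j]
        r = 0.0 if k in collided else rng.binomial(1, means[k])
        counts[j, k] += 1
        rewards[j, k] += r  # Selfish: no sensing, collisions folded into reward

print("per-player average reward:", rewards.sum(axis=1) / T)
```

Running this, the players typically spread out over the top M arms after an initial period of collisions, illustrating why the heuristic is promising despite each player ignoring the others' existence.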
