Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without

We consider the non-stochastic version of the (cooperative) multi-player multi-armed bandit problem. The model assumes no communication between the players, and when two or more players select the same action, the collision results in the maximal loss. We prove the first $\sqrt{T}$-type regret guarantee for this problem under the feedback model where collisions are announced to the colliding players. Such a bound was not known even for the simpler stochastic version. We also prove the first sublinear guarantee for the feedback model where collision information is not available, namely $T^{1-\frac{1}{2m}}$, where $m$ is the number of players.
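
To make the collision rule and the two feedback models concrete, here is a minimal sketch of a single round. It is an illustration only, not the paper's algorithm: the function names are hypothetical, and it assumes losses lie in $[0, 1]$ so that the maximal loss is 1.

```python
from collections import Counter


def realized_losses(actions, losses):
    """Loss suffered by each player in one round.

    actions[i] is the arm chosen by player i, and losses[a] is the
    adversary's loss for arm a (assumed to lie in [0, 1]). A player that
    shares its arm with another player suffers the maximal loss, here 1.0.
    """
    counts = Counter(actions)
    return [1.0 if counts[a] > 1 else losses[a] for a in actions]


def collision_flags(actions):
    """Collision feedback: whether each player collided in this round.

    Available to the players only in the feedback model where
    collisions are announced.
    """
    counts = Counter(actions)
    return [counts[a] > 1 for a in actions]


def no_collision_info_exponent(m):
    """Exponent of T in the T^(1 - 1/(2m)) guarantee for m players."""
    return 1 - 1 / (2 * m)


# Three players, three arms: players 1 and 2 both pick arm 2 and collide.
print(realized_losses([0, 2, 2], [0.1, 0.5, 0.3]))
print(collision_flags([0, 2, 2]))
print([no_collision_info_exponent(m) for m in (1, 2, 4)])
```

Without collision information, a player sees only its own realized loss, so under this sketch a loss of 1 caused by a collision cannot be told apart from an arm whose loss is 1. The exponent $1 - \frac{1}{2m}$ equals $\frac{1}{2}$ for $m = 1$ and approaches 1 as $m$ grows.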
