论文信息 - Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without

Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without

We consider the non-stochastic version of the (cooperative) multi-player multi-armed bandit problem. The model assumes no communication at all between the players, and furthermore when two (or more) players select the same action this results in a maximal loss. We prove the first $\sqrt{T}$-type regret guarantee for this problem, under the feedback model where collisions are announced to the colliding players. Such a bound was not known even for the simpler stochastic version. We also prove the first sublinear guarantee for the feedback model where collision information is not available, namely $T^{1-\frac{1}{2m}}$ where $m$ is the number of players.

Yuval Peres | Yuanzhi Li | Sébastien Bubeck | Mark Sellke

[1] J. Walrand,et al. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part II: Markovian rewards , 1987 .

[2] Berthold Vöcking,et al. Regret Minimization for Online Buffering Problems Using the Weighted Majority Algorithm , 2010, Electron. Colloquium Comput. Complex..

[3] Vianney Perchet,et al. SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits , 2018, NeurIPS.

[4] Qing Zhao,et al. Distributed Learning in Multi-Armed Bandit With Multiple Players , 2009, IEEE Transactions on Signal Processing.

[5] Atsuyoshi Nakamura,et al. Algorithms for Adversarial Bandit Problems with Multiple Plays , 2010, ALT.

[6] Shie Mannor,et al. Concurrent Bandits and Cognitive Radio Networks , 2014, ECML/PKDD.

[7] Yuval Peres,et al. Bandits with switching costs: T2/3 regret , 2013, STOC.

[8] Ohad Shamir,et al. Multi-player bandits: a musical chairs approach , 2016, ICML 2016.

[9] Jacques Palicot,et al. Multi-Armed Bandit Learning in IoT Networks: Learning Helps Even in Non-stationary Settings , 2017, CrownCom.

[10] Oded Goldreich,et al. A Primer on Pseudorandom Generators , 2010 .

[11] Nicolò Cesa-Bianchi,et al. Combinatorial Bandits , 2012, COLT.

[12] H. Robbins. Some aspects of the sequential design of experiments , 1952 .

[13] Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences , 2005 .

[14] Hai Jiang,et al. Medium access in cognitive radio networks: A competitive multi-armed bandit framework , 2008, 2008 42nd Asilomar Conference on Signals, Systems and Computers.

[15] Sébastien Bubeck,et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems , 2012, Found. Trends Mach. Learn..

[16] Yishay Mansour,et al. From External to Internal Regret , 2005, J. Mach. Learn. Res..

[17] Gábor Lugosi,et al. Regret in Online Combinatorial Optimization , 2012, Math. Oper. Res..

[18] Ananthram Swami,et al. Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret , 2010, IEEE Journal on Selected Areas in Communications.

[19] Gábor Lugosi,et al. Multiplayer bandits without observing collision information , 2018, Math. Oper. Res..

[20] Robert D. Kleinberg,et al. Regret bounds for sleeping experts and bandits , 2010, Machine Learning.

[21] Y. Freund,et al. The non-stochastic multi-armed bandit problem , 2001 .

[22] Andreas Krause,et al. Multi-Player Bandits: The Adversarial Case , 2019, J. Mach. Learn. Res..