An Instance-Dependent Analysis for the Cooperative Multi-Player Multi-Armed Bandit

We study the problem of information sharing and cooperation in Multi-Player Multi-Armed Bandits. We propose the first algorithm that achieves logarithmic regret for this problem. Our results are based on two innovations. First, we show that a simple modification of a successive elimination strategy allows the players to estimate their suboptimality gaps, up to constant factors, in the absence of collisions. Second, we leverage the first result to design a communication protocol that successfully uses the small reward of collisions to coordinate among players, while preserving meaningful instance-dependent logarithmic regret guarantees.
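To make the first ingredient concrete, below is a minimal single-player sketch of successive elimination with gap estimation. The function name, the Bernoulli reward simulation, and the specific confidence radius are illustrative assumptions rather than the paper's exact construction; the point is only that the confidence radius at elimination time gives a constant-factor proxy for an arm's suboptimality gap.

```python
import math
import random

def successive_elimination(means, horizon, delta=0.05):
    """Sketch: eliminate arms whose UCB falls below the best LCB, and read
    off a gap estimate from the confidence radius at elimination time.
    `means` is used only to simulate Bernoulli rewards for this demo."""
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    active = set(range(K))
    gap_estimates = [0.0] * K
    t = 0

    def radius(a):
        # Anytime Hoeffding-style confidence radius (one common choice).
        return math.sqrt(2.0 * math.log(4.0 * K * counts[a] ** 2 / delta) / counts[a])

    while t < horizon and len(active) > 1:
        # One round: pull every active arm once.
        for a in list(active):
            reward = 1.0 if random.random() < means[a] else 0.0
            counts[a] += 1
            sums[a] += reward
            t += 1
        lcb = {a: sums[a] / counts[a] - radius(a) for a in active}
        ucb = {a: sums[a] / counts[a] + radius(a) for a in active}
        best_lcb = max(lcb.values())
        for a in list(active):
            if ucb[a] < best_lcb:
                # At elimination time, the radius is within a constant
                # factor of arm a's true suboptimality gap (w.h.p.).
                gap_estimates[a] = 2.0 * radius(a)
                active.remove(a)
    return active, gap_estimates
```

The second ingredient treats collisions as a low-rate signaling channel: a sender can deliberately pull the receiver's arm (causing a collision, and hence a depressed reward for the receiver) to encode a 1, or stay on its own arm to encode a 0. A minimal sketch, assuming a zero-reward-on-collision model and hypothetical helper names:

```python
def send_bits(bits, my_arm, receiver_arm):
    """Sender side: yield the arm to pull in each signaling round.
    Colliding on the receiver's arm encodes a 1; staying put encodes a 0."""
    for b in bits:
        yield receiver_arm if b else my_arm

def decode_bit(observed_reward, threshold=0.0):
    """Receiver side: a collision depresses the observed reward, so a
    reward at or below `threshold` decodes as a 1 (threshold=0 suffices
    in the zero-reward-on-collision model)."""
    return 1 if observed_reward <= threshold else 0
```

Quantized gap estimates of the kind produced by the first sketch could then be transmitted bit by bit through such a channel, which is roughly the coordinating role the paper's communication protocol plays.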
