Heterogeneous Multi-player Multi-armed Bandits: Closing the Gap and Generalization

Despite significant interest and progress in decentralized multi-player multi-armed bandit (MP-MAB) problems in recent years, closing the regret gap to the natural centralized lower bound in the heterogeneous MP-MAB setting has remained an open problem. In this paper, we propose BEACON (Batched Exploration with Adaptive COmmunicatioN), which closes this gap. BEACON achieves this with novel contributions in implicit communication and efficient exploration. For the former, we propose an adaptive differential communication (ADC) design that significantly improves implicit communication efficiency. For the latter, we develop a carefully crafted batched exploration scheme that enables incorporation of the combinatorial upper confidence bound (CUCB) principle. We then generalize existing linear-reward MP-MAB problems, in which the system reward is always the sum of individually collected rewards, to a new MP-MAB problem in which the system reward is a general (nonlinear) function of the individual rewards. We extend BEACON to solve this problem and prove a logarithmic regret bound. BEACON bridges the algorithm design and regret analysis of combinatorial MAB (CMAB) and MP-MAB, two largely disjoint areas of the MAB literature, and the results in this paper suggest that this previously overlooked connection merits further investigation.
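To make the CUCB principle mentioned above concrete, the following is a minimal illustrative sketch of CUCB applied to a toy heterogeneous linear-reward MP-MAB instance: each (player, arm) pair keeps an empirical mean and a confidence bonus, and an offline oracle selects the collision-free assignment maximizing the sum of UCB indices. This is not BEACON itself (no batching, no ADC, and a brute-force oracle feasible only for tiny instances); the function name, the exploration constant 1.5, and the Bernoulli reward model are all illustrative assumptions.

```python
import itertools
import math
import random

def cucb_assignment(mu, n_players, horizon, seed=0):
    """Toy CUCB for a heterogeneous linear-reward MP-MAB (illustrative only).
    mu[p][k] = true mean reward of arm k for player p, unknown to the learner."""
    rng = random.Random(seed)
    n_arms = len(mu[0])
    counts = [[0] * n_arms for _ in range(n_players)]   # pulls per (player, arm)
    means = [[0.0] * n_arms for _ in range(n_players)]  # empirical means
    total = 0.0
    for t in range(1, horizon + 1):
        # UCB index per (player, arm); unexplored pairs get an infinite index
        # so every pair is tried at least once.
        def idx(p, k):
            if counts[p][k] == 0:
                return float("inf")
            return means[p][k] + math.sqrt(1.5 * math.log(t) / counts[p][k])
        # Offline oracle: brute-force the best collision-free assignment
        # (one distinct arm per player) under the optimistic indices.
        best = max(itertools.permutations(range(n_arms), n_players),
                   key=lambda a: sum(idx(p, a[p]) for p in range(n_players)))
        for p, k in enumerate(best):
            r = 1.0 if rng.random() < mu[p][k] else 0.0  # Bernoulli reward draw
            counts[p][k] += 1
            means[p][k] += (r - means[p][k]) / counts[p][k]
            total += r
    return total / horizon  # average per-round system reward
```

On a 2-player, 2-arm instance where each player strongly prefers a different arm, the average system reward approaches the optimal value (here 1.8) as the horizon grows, since suboptimal assignments are chosen only logarithmically often.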
