Paper information - Decentralized learning for multi-player multi-armed bandits

Decentralized learning for multi-player multi-armed bandits

We consider the problem of distributed online learning with multiple players in multi-armed bandit models. Each player can pick among multiple arms. When a player picks an arm, it receives a reward drawn from an unknown distribution with an unknown mean. The arms give different rewards to different players. If two players pick the same arm, there is a “collision”, and neither of them gets any reward. There is no dedicated control channel for coordination or communication among the players. Any other communication between the users is costly and adds to the regret. We propose an online index-based learning policy, called the dUCB4 algorithm, that trades off exploration versus exploitation in the right way and achieves expected regret that grows at most as near-O(log^2 T). The motivation comes from opportunistic spectrum access by multiple secondary users in cognitive radio networks, wherein they must pick among various wireless channels that look different to different users.
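The collision model and the benchmark against which regret is measured can be illustrated with a small sketch. This is not the dUCB4 algorithm, whose details are not given here; it only shows, under assumed numbers, how player-specific means and collisions determine the expected reward of a joint choice. The matrix `means` and the helper names `expected_reward` and `best_matching` are hypothetical and chosen for illustration.

```python
from itertools import permutations


def expected_reward(means, choices):
    """Expected total reward when player i picks arm choices[i].

    means[i][k] is the (assumed) mean reward of arm k for player i.
    Players that collide on the same arm earn nothing.
    """
    counts = {}
    for arm in choices:
        counts[arm] = counts.get(arm, 0) + 1
    return sum(means[i][arm] for i, arm in enumerate(choices) if counts[arm] == 1)


def best_matching(means):
    """Brute-force the collision-free assignment with the highest expected reward."""
    n_players, n_arms = len(means), len(means[0])
    return max(
        permutations(range(n_arms), n_players),
        key=lambda choices: expected_reward(means, choices),
    )


# Hypothetical means: 2 players, 3 arms; the arms look different to each player.
means = [
    [0.9, 0.5, 0.1],
    [0.8, 0.2, 0.3],
]

optimal = best_matching(means)  # benchmark assignment known only to an oracle
greedy = (0, 0)                 # both players pick their individually best arm and collide
per_round_regret = expected_reward(means, optimal) - expected_reward(means, greedy)
```

In this toy instance both players prefer arm 0, so acting selfishly produces a collision and zero reward, while the best matching sends player 0 to arm 1 and player 1 to arm 0. A decentralized policy has to learn the means and reach such a matching without a control channel; the brute-force search is only practical for very small instances.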
