Decentralized multi-armed bandit with multiple distributed players

We formulate and study a decentralized multi-armed bandit (MAB) problem in which M distributed players compete for N independent arms with unknown reward statistics. At each time, each player chooses one arm to play without exchanging information with the other players. Players choosing the same arm collide, and, depending on the collision model, either no one receives reward or the colliding players share the reward in an arbitrary way. We show that the minimum system regret of the decentralized MAB grows with time at the same logarithmic order as in the centralized counterpart, where players act collectively as a single entity by exchanging observations and making decisions jointly. A general framework for constructing fair and order-optimal decentralized policies is established based on a Time Division Fair Sharing (TDFS) of the M best arms. A lower bound on the system regret growth rate is established for a general class of decentralized policies, to which all TDFS policies belong. We further develop several fair and order-optimal decentralized policies within the TDFS framework and study their performance in different applications, including cognitive radio networks, multi-channel communications in unknown fading environments, target collecting in multi-agent systems, and web search and advertising.
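The TDFS idea described above can be illustrated with a small simulation sketch. The sketch below is an assumption-laden simplification, not the paper's exact policy: each player runs its own UCB1 index over the arms and, at time t, targets the arm it locally ranks ((t + player_id) mod M)-th best, so that players with accurate estimates rotate through the M best arms without colliding. The collision model here awards zero system reward on a collided arm, and, as a further simplification, each player is assumed to observe its chosen arm's reward for learning purposes even when a collision occurs (as in channel sensing).

```python
import math
import random


def tdfs_simulation(means, M, horizon, seed=0):
    """Illustrative TDFS-style decentralized bandit simulation.

    means   : Bernoulli success probability of each of the N arms
    M       : number of distributed players
    horizon : number of time slots to simulate
    Returns the total system reward collected (collided slots yield 0).
    """
    rng = random.Random(seed)
    N = len(means)
    counts = [[0] * N for _ in range(M)]   # per-player play counts
    sums = [[0.0] * N for _ in range(M)]   # per-player observed reward sums
    total_reward = 0.0

    for t in range(horizon):
        choices = []
        for j in range(M):
            if 0 in counts[j]:
                # Initialization: play each arm once before using UCB1.
                arm = counts[j].index(0)
            else:
                # Local UCB1 indices (no information exchange).
                ucb = [sums[j][a] / counts[j][a]
                       + math.sqrt(2 * math.log(t + 1) / counts[j][a])
                       for a in range(N)]
                ranked = sorted(range(N), key=lambda a: -ucb[a])
                # Time-division offset: player j takes rank (t + j) mod M,
                # so players share the M best arms fairly over time.
                arm = ranked[(t + j) % M]
            choices.append(arm)

        for j, arm in enumerate(choices):
            r = 1.0 if rng.random() < means[arm] else 0.0
            counts[j][arm] += 1
            sums[j][arm] += r              # player learns from its own play
            if choices.count(arm) == 1:    # system reward only if no collision
                total_reward += r
    return total_reward
```

For example, with two players and arm means [0.9, 0.8, 0.2, 0.1], the players should converge to alternating between the two best arms, approaching an average system reward of 1.7 per slot while each player receives each of the two best arms equally often.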
