Coordinated Versus Decentralized Exploration In Multi-Agent Multi-Armed Bandits

In this paper, we introduce a multi-agent multi-armed bandit model for ad hoc teamwork with expensive communication. The team's goal is to maximize the total reward gained from pulling arms of a bandit over a number of epochs. In each epoch, each agent decides whether to pull an arm, or to broadcast the reward it obtained in the previous epoch to the team and forgo pulling an arm. These decisions must be made solely on the basis of the agent's private information and the public information broadcast prior to that epoch. We first benchmark the achievable utility by analyzing an idealized version of this problem in which a central authority has complete knowledge of the rewards acquired from all arms in all epochs and uses a multiplicative weights update algorithm to allocate arms to agents. We then introduce an algorithm for the decentralized setting that uses a value-of-information based communication strategy and an exploration-exploitation strategy derived from the centralized algorithm, and show experimentally that it converges rapidly to the performance of the centralized method.
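
To make the two ingredients concrete, the sketch below is a minimal, hypothetical illustration rather than the paper's exact algorithm: a multiplicative-weights allocator that a central authority with full reward knowledge could use to assign arms to agents, and a simple value-of-information test an agent could apply when deciding whether to broadcast instead of pulling. The class name `CentralizedMWU`, the helper `should_broadcast`, and parameters such as `eta` and `voi_threshold` are assumptions introduced only for illustration.

```python
import numpy as np


class CentralizedMWU:
    """Illustrative multiplicative-weights arm allocator (sketch only).

    The central authority is assumed to observe every reward in every
    epoch. It keeps one weight per arm and, each epoch, assigns arms to
    the team by sampling in proportion to the current weights.
    """

    def __init__(self, n_arms, n_agents, eta=0.1):
        self.weights = np.ones(n_arms)   # one weight per arm
        self.n_agents = n_agents
        self.eta = eta                   # learning rate

    def allocate(self, rng):
        # Sample n_agents distinct arms with probability proportional to weight.
        probs = self.weights / self.weights.sum()
        return rng.choice(len(self.weights), size=self.n_agents,
                          replace=False, p=probs)

    def update(self, arms, rewards):
        # Multiplicative update: arms that paid well gain weight.
        for arm, r in zip(arms, rewards):
            self.weights[arm] *= np.exp(self.eta * r)


def should_broadcast(own_mean, team_mean, n_own_obs, voi_threshold=0.05):
    """Hypothetical value-of-information test for a decentralized agent.

    Broadcast (and forgo a pull this epoch) only if the agent's private
    reward estimate differs from the team's public estimate by enough,
    and is backed by enough private observations, that sharing it is
    likely to change teammates' future choices.
    """
    expected_gain = abs(own_mean - team_mean) * (1 - 1 / (1 + n_own_obs))
    return expected_gain > voi_threshold
```

Under these assumptions, the centralized allocator serves as the benchmark, while each decentralized agent would call `should_broadcast` at the start of an epoch and otherwise pull an arm chosen by its own exploration-exploitation rule.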
