DCOPs and bandits: exploration and exploitation in decentralised coordination

Real-life coordination problems are characterised by stochasticity and by a lack of a priori knowledge about the interactions between agents. However, the decentralised constraint optimisation problem (DCOP), a widely adopted framework for modelling decentralised coordination tasks, assumes perfect knowledge of these factors, which limits its practical applicability. To address this shortcoming, we introduce the MAB-DCOP, in which the interactions between agents are modelled as multi-armed bandits (MABs). Unlike a canonical DCOP, a MAB-DCOP is not a single-shot optimisation problem. Rather, it is a sequential one in which agents must coordinate to strike a balance between acquiring knowledge about the a priori unknown and stochastic interactions (exploration) and taking the currently believed optimal joint action (exploitation), so as to maximise the cumulative global utility over a finite time horizon. We propose Heist, the first asymptotically optimal algorithm for coordination under stochasticity and a lack of prior knowledge. Heist solves MAB-DCOPs in a decentralised fashion, using a generalised distributive law (GDL) message-passing phase to find the joint action with the highest upper confidence bound (UCB) on global utility. In our experiments, Heist outperforms state-of-the-art techniques from the MAB and DCOP literature by up to 1.5 orders of magnitude on MAB-DCOPs.
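
To make the exploration/exploitation trade-off concrete, below is a minimal Python sketch of a standard UCB1 index policy for a single multi-armed bandit. It is background only, not Heist itself: in a MAB-DCOP, each inter-agent interaction is such a bandit, and Heist maximises a UCB on the global utility of a joint action via GDL message passing rather than treating each bandit in isolation. The class name and reward model are hypothetical.

    import math
    import random

    class UCB1:
        # Standard UCB1 index policy for one multi-armed bandit.
        # Illustrative background for the MAB-DCOP setting; Heist
        # combines confidence bounds across interactions globally.

        def __init__(self, n_arms):
            self.counts = [0] * n_arms    # times each arm was pulled
            self.means = [0.0] * n_arms   # empirical mean reward per arm

        def select(self, t):
            # Pull every arm once before applying the index.
            for arm, n in enumerate(self.counts):
                if n == 0:
                    return arm
            # UCB1 index: empirical mean plus a confidence radius that
            # shrinks as an arm is sampled more often (exploitation)
            # and grows with the total time t (exploration).
            return max(range(len(self.counts)),
                       key=lambda a: self.means[a]
                       + math.sqrt(2.0 * math.log(t) / self.counts[a]))

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    # Usage on a toy 3-armed bandit with hypothetical Gaussian rewards.
    bandit = UCB1(n_arms=3)
    true_means = [0.2, 0.5, 0.8]
    for t in range(1, 1001):
        arm = bandit.select(t)
        bandit.update(arm, random.gauss(true_means[arm], 0.1))

The key difference in the paper's setting is that an "arm" is a joint action over many agents, so the index cannot be computed per agent in isolation; the GDL message-passing phase lets the agents jointly identify the assignment with the highest global UCB in a decentralised way.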
