Distributed Cooperative Decision Making in Multi-agent Multi-armed Bandits

We study a distributed decision-making problem in which multiple agents face the same multi-armed bandit (MAB), and each agent makes sequential choices among arms to maximize its own reward. The agents cooperate by sharing their estimates over a fixed communication graph. We consider two reward models: an unconstrained model, in which two or more agents can choose the same arm and collect independent rewards, and a constrained model, in which agents that choose the same arm at the same time receive no reward. We design a dynamic, consensus-based, distributed estimation algorithm for cooperative estimation of the mean reward at each arm. Leveraging these estimates, we develop two distributed algorithms, coop-UCB2 and coop-UCB2-selective-learning, for the unconstrained and constrained reward models, respectively. We show that both algorithms achieve group performance close to that of a centralized fusion center. We further investigate the influence of the communication graph structure on performance: we propose a novel graph explore-exploit index that predicts the relative performance of groups as a function of the communication graph, and a novel nodal explore-exploit centrality index that predicts the relative performance of agents as a function of their locations in the graph.
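To make the consensus-plus-UCB idea concrete, here is a minimal Python sketch; it is not the paper's coop-UCB2, only an illustration of the general scheme the abstract describes. Each agent keeps running estimates of per-arm reward totals and pull counts, mixes them with its neighbors' estimates through a doubly stochastic matrix over a communication graph, and selects arms using a UCB1-style index. The ring graph, Gaussian reward noise, and all numerical parameters below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 4, 3, 2000                     # agents, arms, horizon (assumed values)
mu = np.array([0.2, 0.5, 0.8])           # true mean rewards (assumed)

# Ring communication graph -> doubly stochastic mixing matrix P
P = np.zeros((N, N))
for i in range(N):
    P[i, i] = 0.5
    P[i, (i + 1) % N] = 0.25
    P[i, (i - 1) % N] = 0.25

s_hat = np.zeros((N, M))                 # running estimate of total reward per arm
n_hat = np.full((N, M), 1e-3)            # running estimate of total pulls per arm

for t in range(1, T + 1):
    rewards = np.zeros((N, M))
    counts = np.zeros((N, M))
    for i in range(N):
        mean = s_hat[i] / n_hat[i]
        ucb = mean + np.sqrt(2 * np.log(t) / n_hat[i])       # UCB1-style index
        k = int(np.argmax(ucb))
        # Unconstrained reward model: simultaneous pulls of the same arm
        # yield independent rewards.
        rewards[i, k] = mu[k] + 0.1 * rng.standard_normal()
        counts[i, k] = 1.0
    # Running consensus: average neighbors' estimates, fold in new observations
    s_hat = P @ (s_hat + rewards)
    n_hat = P @ (n_hat + counts)

print("per-agent mean-reward estimates:\n", s_hat / n_hat)
```

Because P is doubly stochastic, the group-wide totals of rewards and pull counts are preserved by each mixing step, so every agent's ratio s_hat/n_hat tracks a group-level estimate of the mean reward at each arm rather than only its own observations.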
