Distributed Cooperative Decision Making in Multi-agent Multi-armed Bandits

We study a distributed decision-making problem in which multiple agents face the same multi-armed bandit (MAB), and each agent makes sequential choices among arms to maximize its own reward. The agents cooperate by sharing their estimates over a fixed communication graph. We consider two reward models: an unconstrained model, in which two or more agents can choose the same arm and collect independent rewards, and a constrained model, in which agents that choose the same arm at the same time receive no reward. We design a dynamic, consensus-based, distributed estimation algorithm for cooperative estimation of the mean reward at each arm. Leveraging these estimates, we develop two distributed algorithms, coop-UCB2 and coop-UCB2-selective-learning, for the unconstrained and constrained reward models, respectively. We show that both algorithms achieve group performance close to that of a centralized fusion center. We further investigate the influence of the communication graph structure on performance: we propose a novel graph explore-exploit index that predicts the relative performance of groups as a function of the communication graph, and a novel nodal explore-exploit centrality index that predicts the relative performance of agents as a function of their locations in the graph.
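To make the consensus-plus-UCB idea concrete, here is a minimal Python sketch; it is not the paper's coop-UCB2, only an illustration of the general scheme the abstract describes. Each agent keeps running estimates of per-arm reward totals and pull counts, mixes them with its neighbors' estimates through a doubly stochastic matrix over a communication graph, and selects arms using a UCB1-style index. The ring graph, Gaussian reward noise, and all numerical parameters below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 4, 3, 2000                     # agents, arms, horizon (assumed values)
mu = np.array([0.2, 0.5, 0.8])           # true mean rewards (assumed)

# Ring communication graph -> doubly stochastic mixing matrix P
P = np.zeros((N, N))
for i in range(N):
    P[i, i] = 0.5
    P[i, (i + 1) % N] = 0.25
    P[i, (i - 1) % N] = 0.25

s_hat = np.zeros((N, M))                 # running estimate of total reward per arm
n_hat = np.full((N, M), 1e-3)            # running estimate of total pulls per arm

for t in range(1, T + 1):
    rewards = np.zeros((N, M))
    counts = np.zeros((N, M))
    for i in range(N):
        mean = s_hat[i] / n_hat[i]
        ucb = mean + np.sqrt(2 * np.log(t) / n_hat[i])       # UCB1-style index
        k = int(np.argmax(ucb))
        # Unconstrained reward model: simultaneous pulls of the same arm
        # yield independent rewards.
        rewards[i, k] = mu[k] + 0.1 * rng.standard_normal()
        counts[i, k] = 1.0
    # Running consensus: average neighbors' estimates, fold in new observations
    s_hat = P @ (s_hat + rewards)
    n_hat = P @ (n_hat + counts)

print("per-agent mean-reward estimates:\n", s_hat / n_hat)
```

Because P is doubly stochastic, the group-wide totals of rewards and pull counts are preserved by each mixing step, so every agent's ratio s_hat/n_hat tracks a group-level estimate of the mean reward at each arm rather than only its own observations.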
