On distributed cooperative decision-making in multiarmed bandits

We study the explore-exploit tradeoff in distributed cooperative decision-making in the context of the multiarmed bandit (MAB) problem. For the distributed cooperative MAB problem, we design the cooperative UCB algorithm, which comprises two interleaved distributed processes: (i) running consensus algorithms for estimating rewards, and (ii) upper-confidence-bound-based heuristics for selecting arms. We rigorously analyze the performance of the cooperative UCB algorithm and characterize how the structure of the communication graph influences the decision-making performance of the group.
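The two interleaved processes can be illustrated with a minimal Python sketch. This is not the paper's exact algorithm: the UCB1-style bonus, the Gaussian reward model, the round-robin initialization, and all names (`cooperative_ucb_sketch`, `ucb_index`, `mix`, the weight matrix `P`) are assumptions made for illustration. Each agent keeps running-consensus estimates of per-arm pull counts and reward totals, picks the arm with the largest index, and then averages its estimates with those of its neighbors.

```python
import math
import random


def ucb_index(s, n, t, sigma):
    """Empirical mean plus a UCB1-style exploration bonus (assumed form)."""
    return s / n + sigma * math.sqrt(2.0 * math.log(t) / n)


def mix(P, X):
    """One consensus step: agent k averages all rows of X with weights P[k]."""
    agents, arms = len(P), len(X[0])
    return [[sum(P[k][j] * X[j][i] for j in range(agents))
             for i in range(arms)] for k in range(agents)]


def cooperative_ucb_sketch(P, means, horizon, sigma=1.0, seed=0):
    """Simplified cooperative UCB loop (illustrative only).

    P       : row-stochastic consensus weights over the communication graph
    means   : hypothetical true mean reward of each arm
    horizon : number of decision rounds
    Returns pulls[k][i], the number of times agent k pulled arm i.
    """
    rng = random.Random(seed)
    agents, arms = len(P), len(means)
    n_hat = [[0.0] * arms for _ in range(agents)]  # estimated pull counts
    s_hat = [[0.0] * arms for _ in range(agents)]  # estimated reward totals
    pulls = [[0] * arms for _ in range(agents)]

    for t in range(1, horizon + 1):
        for k in range(agents):
            if t <= arms:
                arm = t - 1  # initialization: every agent tries each arm once
            else:
                # (ii) arm selection from agent k's own consensus estimates
                arm = max(range(arms), key=lambda i: ucb_index(
                    s_hat[k][i], n_hat[k][i], t, sigma))
            reward = rng.gauss(means[arm], sigma)
            pulls[k][arm] += 1
            n_hat[k][arm] += 1.0
            s_hat[k][arm] += reward
        # (i) running consensus: fold in new data, then average with neighbors
        n_hat = mix(P, n_hat)
        s_hat = mix(P, s_hat)
    return pulls


if __name__ == "__main__":
    # Three agents on a path graph with hypothetical symmetric weights.
    P = [[2 / 3, 1 / 3, 0.0],
         [1 / 3, 1 / 3, 1 / 3],
         [0.0, 1 / 3, 2 / 3]]
    print(cooperative_ucb_sketch(P, means=[0.0, 5.0], horizon=200))
```

Because the count and reward estimates pass through the same linear mixing, their ratio stays a weighted average of the rewards observed for that arm across the network. The weights in `P` are where the communication graph enters, so swapping in a different graph is one way to experiment with how the network structure affects the group's performance.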
