Social Learning in Multi-Agent Multi-Armed Bandits

Motivated by the emerging need for learning algorithms in large-scale networked and decentralized systems, we introduce a distributed version of the classical stochastic Multi-Armed Bandit (MAB) problem. Our setting consists of a large number n of agents that collaboratively and simultaneously solve the same instance of a K-armed MAB to minimize the average cumulative regret over all agents. The agents can communicate and collaborate with each other only through a pairwise asynchronous gossip-based protocol that exchanges a limited number of bits. In our model, each agent at each time step decides (i) which arm to play, (ii) whether to communicate, and if so, (iii) what to communicate and to whom. Agents in our model are decentralized, namely their actions depend only on their own observed history. We develop a novel algorithm in which agents, whenever they choose to communicate, send only arm-ids and not samples, to another agent chosen uniformly and independently at random. The per-agent regret achieved by our algorithm scales as $O\left( \frac{\lceil K/n \rceil + \log(n)}{\Delta} \log(T) + \frac{\log^3(n) \log\log(n)}{\Delta^2} \right)$, where $\Delta$ is the gap between the mean rewards of the best and the second-best arm. Furthermore, any agent in our algorithm communicates (an arm-id to a uniformly and independently chosen agent) only $\Theta(\log(T))$ times over a time horizon of $T$. We compare our results to two benchmarks: one where there is no communication among agents, and one corresponding to complete interaction, where an agent has access to the entire system history of arms played and rewards obtained by all agents. We show, both theoretically and empirically, that our algorithm achieves a significant reduction in per-agent regret compared to the case where agents do not collaborate and each agent plays the standard MAB problem (where regret scales linearly in K), and a significant reduction in communication complexity compared to the full-interaction setting, which requires T communication attempts per agent over T arm pulls. Our result thus demonstrates that even a minimal level of collaboration among the different agents enables a significant reduction in per-agent regret.
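To make the communication model concrete, below is a minimal, self-contained Python sketch of the setting, not the paper's exact algorithm (which additionally organizes play into phases and prunes arms): n agents each run UCB1 over a small active arm set, and at exponentially spaced gossip times each agent sends only its current best arm-id to a uniformly random peer, which adds that arm to its own active set. All parameter names and the specific gossip schedule here are illustrative assumptions.

```python
# A minimal, illustrative sketch of the communication model described above,
# NOT the paper's exact algorithm. The exponential gossip schedule and all
# parameter values below are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)

n, K, T = 20, 50, 5000                       # agents, arms, horizon
means = rng.uniform(0.0, 1.0, size=K)        # unknown Bernoulli arm means

# Each agent starts with a small, roughly ceil(K/n)-sized slice of the arms.
active = [set(range(i, K, n)) for i in range(n)]
counts = np.zeros((n, K))                    # pulls per (agent, arm)
sums = np.zeros((n, K))                      # cumulative reward per (agent, arm)

def ucb_pick(a, t):
    """UCB1 restricted to agent a's currently active arm set."""
    best_idx, best_val = None, -np.inf
    for k in active[a]:
        if counts[a, k] == 0:
            return k                         # play each active arm once first
        val = sums[a, k] / counts[a, k] + np.sqrt(2.0 * np.log(t) / counts[a, k])
        if val > best_val:
            best_idx, best_val = k, val
    return best_idx

# Gossip at times 2, 4, 8, ...: only Theta(log T) messages per agent overall.
comm_times = {2 ** j for j in range(1, int(np.log2(T)) + 1)}

regret = 0.0
for t in range(1, T + 1):
    for a in range(n):
        k = ucb_pick(a, t)
        r = float(rng.random() < means[k])   # Bernoulli reward
        counts[a, k] += 1
        sums[a, k] += r
        regret += means.max() - means[k]
    if t in comm_times:
        for a in range(n):
            # Send only an arm-id (a few bits), never raw samples, to a
            # uniformly and independently chosen recipient.
            emp = {k: sums[a, k] / counts[a, k] for k in active[a] if counts[a, k] > 0}
            best_arm = max(emp, key=emp.get)
            peer = int(rng.integers(n))
            active[peer].add(best_arm)

print(f"average per-agent regret after T={T}: {regret / n:.1f}")
```

Even this crude version exhibits the trade-off described above: each agent explores only about ⌈K/n⌉ arms plus the few arm-ids it receives, while sending O(log T) messages of a few bits each, rather than the T communication attempts of the full-interaction benchmark.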
