Delay and Cooperation in Nonstochastic Bandits

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce EXP3-COOP, a cooperative version of the EXP3 algorithm, and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d + 1 + \tfrac{K}{N}\,\alpha_d\bigr)\,T \ln K}$, where $\alpha_d$ is the independence number of the $d$-th power of the communication graph $G$. We then show that, for any connected graph, the choice $d = \sqrt{K}$ yields the regret bound $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full-information minimax regret $\sqrt{T \ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of EXP3-COOP, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
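To make the role of $\alpha_d$ concrete, the following self-contained Python sketch (an illustration of ours, not code from the paper) builds the $d$-th power of a cycle communication graph by BFS and greedily extracts an independent set, whose size lower-bounds $\alpha_d$. Computing the independence number exactly is NP-hard in general, but the greedy value already shows how $\alpha_d$, and with it the regret bound above, shrinks as the delay parameter $d$ grows.

```python
# Illustration of alpha_d: the independence number of the d-th power of the
# communication graph G. Vertices are adjacent in the d-th power iff their
# hop distance in G is between 1 and d. This is a sketch, not paper code.

def kth_power_neighbors(adj, d):
    """Adjacency of the d-th graph power, computed by BFS up to depth d."""
    power = {u: set() for u in adj}
    for src in adj:
        frontier, seen = {src}, {src}
        for _ in range(d):
            frontier = {w for v in frontier for w in adj[v]} - seen
            seen |= frontier
            power[src] |= frontier
    return power

def greedy_independent_set(adj):
    """Greedy independent set; its size lower-bounds the independence number."""
    chosen, blocked = set(), set()
    for u in sorted(adj, key=lambda v: len(adj[v])):  # low-degree vertices first
        if u not in blocked:
            chosen.add(u)
            blocked |= adj[u] | {u}
    return chosen

# Communication graph G: a cycle on N = 24 agents.
N = 24
cycle = {i: {(i - 1) % N, (i + 1) % N} for i in range(N)}

for d in (1, 2, 4, 8):
    size = len(greedy_independent_set(kth_power_neighbors(cycle, d)))
    print(f"d = {d}: greedy independent set size in the {d}-th power = {size}")
```

On the cycle $C_{24}$ the printed sizes are 12, 8, 4, 2 for $d = 1, 2, 4, 8$, matching $\lfloor N/(d+1) \rfloor$: larger delay tolerance makes the graph power denser, so $\alpha_d$, and hence the $\tfrac{K}{N}\,\alpha_d$ term in the regret bound, drops.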
