Delay and Cooperation in Nonstochastic Bandits

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce EXP3-COOP, a cooperative version of the EXP3 algorithm, and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d + 1 + \tfrac{K}{N}\,\alpha_d\bigr)\,T \ln K}$, where $\alpha_d$ is the independence number of the $d$-th power of the communication graph $G$. We then show that, for any connected graph, the choice $d = \sqrt{K}$ yields the regret bound $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full-information minimax regret $\sqrt{T \ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of EXP3-COOP, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
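To make the role of $\alpha_d$ concrete, the following self-contained Python sketch (an illustration of ours, not code from the paper) builds the $d$-th power of a cycle communication graph by BFS and greedily extracts an independent set, whose size lower-bounds $\alpha_d$. Computing the independence number exactly is NP-hard in general, but the greedy value already shows how $\alpha_d$, and with it the regret bound above, shrinks as the delay parameter $d$ grows.

```python
# Illustration of alpha_d: the independence number of the d-th power of the
# communication graph G. Vertices are adjacent in the d-th power iff their
# hop distance in G is between 1 and d. This is a sketch, not paper code.

def kth_power_neighbors(adj, d):
    """Adjacency of the d-th graph power, computed by BFS up to depth d."""
    power = {u: set() for u in adj}
    for src in adj:
        frontier, seen = {src}, {src}
        for _ in range(d):
            frontier = {w for v in frontier for w in adj[v]} - seen
            seen |= frontier
            power[src] |= frontier
    return power

def greedy_independent_set(adj):
    """Greedy independent set; its size lower-bounds the independence number."""
    chosen, blocked = set(), set()
    for u in sorted(adj, key=lambda v: len(adj[v])):  # low-degree vertices first
        if u not in blocked:
            chosen.add(u)
            blocked |= adj[u] | {u}
    return chosen

# Communication graph G: a cycle on N = 24 agents.
N = 24
cycle = {i: {(i - 1) % N, (i + 1) % N} for i in range(N)}

for d in (1, 2, 4, 8):
    size = len(greedy_independent_set(kth_power_neighbors(cycle, d)))
    print(f"d = {d}: greedy independent set size in the {d}-th power = {size}")
```

On the cycle $C_{24}$ the printed sizes are 12, 8, 4, 2 for $d = 1, 2, 4, 8$, matching $\lfloor N/(d+1) \rfloor$: larger delay tolerance makes the graph power denser, so $\alpha_d$, and hence the $\tfrac{K}{N}\,\alpha_d$ term in the regret bound, drops.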
