Decentralized Multi-Agent Linear Bandits with Safety Constraints

We study decentralized stochastic linear bandits, in which a network of $N$ agents acts cooperatively to efficiently solve a linear bandit optimization problem over a $d$-dimensional space. For this problem, we propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network. At each round, each agent chooses its action following an upper confidence bound (UCB) strategy, and agents share information with their immediate neighbors through a carefully designed consensus procedure that repeats over cycles. Our analysis adjusts the duration of these communication cycles to ensure near-optimal regret of $\mathcal{O}(d\log(NT)\sqrt{NT})$ at a communication rate of $\mathcal{O}(dN^2)$ per round. The structure of the network affects the regret only through a small additive term, coined the regret of delay, that depends on the spectral gap of the underlying graph. Notably, our results apply to arbitrary network topologies and do not require a dedicated agent acting as a server. For settings where communication is costly, we propose RC-DLUCB: a modification of DLUCB with rare communication among agents. The new algorithm trades regret performance for a significantly reduced total communication cost of $\mathcal{O}(d^3N^{2.5})$ over all $T$ rounds. Finally, we show that our ideas extend naturally to the emerging, albeit more challenging, setting of safe bandits. For the recently studied problem of linear bandits with unknown linear safety constraints, we propose the first safe decentralized algorithm. Our study contributes towards applying bandit techniques in safety-critical distributed systems that repeatedly deal with unknown stochastic environments. We present numerical simulations over various network topologies that corroborate our theoretical findings.
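To make the interplay between the UCB strategy and the consensus procedure concrete, the following is a minimal illustrative sketch, not the paper's exact algorithm: each agent keeps local least-squares statistics (a Gram matrix $V$ and a reward vector $b$), picks actions optimistically, and a communication cycle repeatedly averages these statistics with immediate neighbors via a gossip matrix. The names (`ucb_action`, `consensus_cycle`, `W`, `beta`, `iters`) are assumptions for exposition.

```python
import numpy as np

def ucb_action(V, b, actions, beta):
    """Optimistic (UCB) action choice from local statistics (V, b).
    `beta` is an assumed confidence-radius parameter."""
    theta_hat = np.linalg.solve(V, b)       # regularized least-squares estimate
    V_inv = np.linalg.inv(V)
    scores = [x @ theta_hat + beta * np.sqrt(x @ V_inv @ x) for x in actions]
    return actions[int(np.argmax(scores))]

def consensus_cycle(V_all, b_all, W, iters):
    """One communication cycle: agents repeatedly average their Gram matrices
    and reward vectors with immediate neighbors via the gossip matrix W
    (doubly stochastic, supported on the graph's edges). The spectral gap of W
    governs how many iterations a cycle needs to mix well."""
    V_all, b_all = np.stack(V_all), np.stack(b_all)   # shapes (N, d, d), (N, d)
    for _ in range(iters):
        V_all = np.einsum('ij,jkl->ikl', W, V_all)    # agent i mixes neighbors' V_j
        b_all = W @ b_all                              # agent i mixes neighbors' b_j
    return list(V_all), list(b_all)
```

In a full round, agent $i$ would play `ucb_action(V_all[i], b_all[i], actions, beta)`, fold the observed reward into its local statistics, and then run `consensus_cycle` before the next round; a slower-mixing graph (smaller spectral gap) forces longer cycles, which is what produces the additive regret-of-delay term mentioned above.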
