Thompson Sampling for Factored Multi-Agent Bandits

Multi-agent coordination is prevalent in many real-world applications, but it is challenging due to its combinatorial nature. A key observation is that agents in the real world often directly affect only a limited set of neighboring agents. Leveraging such loose couplings among agents is key to making coordination in multi-agent systems feasible. In this work, we focus on learning to coordinate. Specifically, we consider the multi-agent multi-armed bandit framework, in which fully cooperative, loosely coupled agents must learn to coordinate their decisions to optimize a common objective. Unlike in the planning setting, establishing theoretical guarantees for learning methods is challenging. We propose multi-agent Thompson sampling (MATS), a new Bayesian exploration-exploitation algorithm that exploits loose couplings. We provide a regret bound that is sublinear in time and, for sparse coordination graphs, low-order polynomial in the largest number of actions available to a single agent. Finally, we empirically show that MATS outperforms the state-of-the-art algorithm, MAUCE, on two synthetic benchmarks, a realistic wind farm control task, and a novel benchmark with Poisson-distributed rewards.
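To make the algorithmic idea concrete, the following is a minimal sketch of one MATS round under assumptions not stated in the abstract: Gaussian local rewards with unit observation noise, a standard-normal conjugate prior on each local mean, and brute-force maximization over joint actions (the algorithm itself uses variable elimination on the coordination graph to exploit sparsity). The problem instance and the names `groups`, `mats_step`, and `pull` are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance (illustrative, not from the paper): 3 agents with
# 2 actions each, and two local reward factors over overlapping groups.
n_actions = [2, 2, 2]
groups = [(0, 1), (1, 2)]  # factor f couples the agents in groups[f]

def local_actions(g):
    """All local joint actions of the agents in group g."""
    return itertools.product(*[range(n_actions[i]) for i in g])

# Conjugate Gaussian model per local joint action: N(0, 1) prior on the
# mean, unit observation noise. After n pulls with reward sum s, the
# posterior over the mean is N(s / (n + 1), 1 / (n + 1)).
counts = [{a: 0 for a in local_actions(g)} for g in groups]
sums = [{a: 0.0 for a in local_actions(g)} for g in groups]

def mats_step(pull):
    """One MATS round; `pull(joint)` returns one noisy reward per factor."""
    # 1. Thompson step: sample a mean for every local joint action
    #    from its current posterior.
    theta = [{a: rng.normal(sums[f][a] / (counts[f][a] + 1),
                            1.0 / np.sqrt(counts[f][a] + 1))
              for a in counts[f]}
             for f in range(len(groups))]
    # 2. Coordination step: maximize the sampled total reward. Brute
    #    force here; the paper uses variable elimination over the
    #    coordination graph to keep this polynomial in local action spaces.
    best, best_val = None, -np.inf
    for joint in itertools.product(*[range(n) for n in n_actions]):
        val = sum(theta[f][tuple(joint[i] for i in g)]
                  for f, g in enumerate(groups))
        if val > best_val:
            best, best_val = joint, val
    # 3. Observe the local rewards and update only the posteriors of the
    #    local joint actions that were actually played.
    rewards = pull(best)
    for f, g in enumerate(groups):
        a = tuple(best[i] for i in g)
        counts[f][a] += 1
        sums[f][a] += rewards[f]
    return best
```

A toy environment with fixed (hypothetical) local means exercises the loop:

```python
true = [{a: rng.normal() for a in counts[f]} for f in range(len(groups))]

def pull(joint):
    return [rng.normal(true[f][tuple(joint[i] for i in g)], 1.0)
            for f, g in enumerate(groups)]

for _ in range(500):
    mats_step(pull)
```

Note how only local statistics are ever stored: the posteriors live over each factor's small local action space, never over the exponential joint action space, which is what loose couplings buy.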
