Thompson Sampling for Factored Multi-Agent Bandits

Multi-agent coordination is prevalent in many real-world applications, but it is challenging due to its combinatorial nature. A key observation is that agents in the real world often directly affect only a limited set of neighboring agents. Leveraging such loose couplings among agents is key to making coordination in multi-agent systems feasible. In this work, we focus on learning to coordinate. Specifically, we consider the multi-agent multi-armed bandit framework, in which fully cooperative, loosely coupled agents must learn to coordinate their decisions to optimize a common objective. Unlike in the planning setting, establishing theoretical guarantees for learning methods is challenging. We propose multi-agent Thompson sampling (MATS), a new Bayesian exploration-exploitation algorithm that exploits loose couplings. We provide a regret bound that is sublinear in time and, for sparse coordination graphs, low-order polynomial in the largest number of actions available to a single agent. Finally, we empirically show that MATS outperforms the state-of-the-art algorithm, MAUCE, on two synthetic benchmarks, a realistic wind farm control task, and a novel benchmark with Poisson-distributed rewards.
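To make the algorithmic idea concrete, the following is a minimal sketch of one MATS round under assumptions not stated in the abstract: Gaussian local rewards with unit observation noise, a standard-normal conjugate prior on each local mean, and brute-force maximization over joint actions (the algorithm itself uses variable elimination on the coordination graph to exploit sparsity). The problem instance and the names `groups`, `mats_step`, and `pull` are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance (illustrative, not from the paper): 3 agents with
# 2 actions each, and two local reward factors over overlapping groups.
n_actions = [2, 2, 2]
groups = [(0, 1), (1, 2)]  # factor f couples the agents in groups[f]

def local_actions(g):
    """All local joint actions of the agents in group g."""
    return itertools.product(*[range(n_actions[i]) for i in g])

# Conjugate Gaussian model per local joint action: N(0, 1) prior on the
# mean, unit observation noise. After n pulls with reward sum s, the
# posterior over the mean is N(s / (n + 1), 1 / (n + 1)).
counts = [{a: 0 for a in local_actions(g)} for g in groups]
sums = [{a: 0.0 for a in local_actions(g)} for g in groups]

def mats_step(pull):
    """One MATS round; `pull(joint)` returns one noisy reward per factor."""
    # 1. Thompson step: sample a mean for every local joint action
    #    from its current posterior.
    theta = [{a: rng.normal(sums[f][a] / (counts[f][a] + 1),
                            1.0 / np.sqrt(counts[f][a] + 1))
              for a in counts[f]}
             for f in range(len(groups))]
    # 2. Coordination step: maximize the sampled total reward. Brute
    #    force here; the paper uses variable elimination over the
    #    coordination graph to keep this polynomial in local action spaces.
    best, best_val = None, -np.inf
    for joint in itertools.product(*[range(n) for n in n_actions]):
        val = sum(theta[f][tuple(joint[i] for i in g)]
                  for f, g in enumerate(groups))
        if val > best_val:
            best, best_val = joint, val
    # 3. Observe the local rewards and update only the posteriors of the
    #    local joint actions that were actually played.
    rewards = pull(best)
    for f, g in enumerate(groups):
        a = tuple(best[i] for i in g)
        counts[f][a] += 1
        sums[f][a] += rewards[f]
    return best
```

A toy environment with fixed (hypothetical) local means exercises the loop:

```python
true = [{a: rng.normal() for a in counts[f]} for f in range(len(groups))]

def pull(joint):
    return [rng.normal(true[f][tuple(joint[i] for i in g)], 1.0)
            for f, g in enumerate(groups)]

for _ in range(500):
    mats_step(pull)
```

Note how only local statistics are ever stored: the posteriors live over each factor's small local action space, never over the exponential joint action space, which is what loose couplings buy.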
