We study a cooperative multi-agent multi-armed bandit problem with M agents and K arms, where the agents aim to minimize the cumulative regret. We first adapt the classical Thompson Sampling algorithm to the distributed setting and observe that, since the agents are able to communicate, communication may further reduce the regret upper bound of a distributed Thompson Sampling approach. To improve on distributed Thompson Sampling, we propose a distributed elimination-based Thompson Sampling algorithm that allows the agents to learn collaboratively. We analyze the algorithm under Bernoulli rewards and derive a problem-dependent upper bound on the cumulative regret.
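To make the Beta-Bernoulli setting concrete, below is a minimal Python sketch of Thompson Sampling run by M agents that pool their posterior counts after every round. This is only an illustration of the sampling mechanics under a full-communication assumption; the paper's actual communication schedule and elimination rule are not reproduced, and all names here (e.g. `distributed_thompson_sampling`, `true_means`) are hypothetical.

```python
import numpy as np

def distributed_thompson_sampling(true_means, n_agents, horizon, seed=0):
    """Sketch: M agents run Beta-Bernoulli Thompson Sampling and share
    their success/failure counts each round (full communication assumed;
    the paper's protocol and elimination rule are not modeled here)."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    successes = np.ones(K)  # shared Beta(1, 1) prior over each arm
    failures = np.ones(K)
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        for _ in range(n_agents):
            # Each agent samples a mean estimate from the shared posterior
            # and pulls the arm with the largest sample.
            theta = rng.beta(successes, failures)
            arm = int(np.argmax(theta))
            reward = rng.binomial(1, true_means[arm])
            # Posterior update is pooled across agents.
            successes[arm] += reward
            failures[arm] += 1 - reward
            regret += best - true_means[arm]
    return regret

if __name__ == "__main__":
    # Example: 4 agents, 3 Bernoulli arms, horizon of 1000 rounds.
    print(distributed_thompson_sampling([0.2, 0.5, 0.7], n_agents=4, horizon=1000))
```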