Multi-policy posterior sampling for restless Markov bandits

This paper considers the restless multi-armed bandit problem, in which each arm yields time-varying rewards generated by an unknown two-state discrete-time Markov process. Each chain is assumed to be irreducible, aperiodic, and unaffected by the agent's actions. No optimal solution or constant-factor approximation exists for all instances of the restless bandit problem; in fact, the problem has been proven intractable even when all transition parameters are deterministic. A polynomial-time algorithm is proposed that learns the transition parameters of each arm and selects the perceived optimal policy from a set of predefined policies using beliefs, i.e., probability distributions over those parameters. More precisely, the proposed algorithm compares the mean reward of consistently playing the best perceived arm with the mean reward of a myopically selected combination of arms, using randomized probability matching, better known as Thompson Sampling. Empirical evaluations presented at the end of the paper show improved performance over existing algorithms on all instances of the problem, except for a small set of instances in which the arms are similar and bursty.
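
To make the policy-selection step concrete, the following is a minimal sketch of how Thompson Sampling might arbitrate between the two candidate policies described above (stay on the best perceived arm vs. switch myopically), assuming Beta posteriors over each arm's two transition probabilities; all names, the belief-propagation step, and the horizon are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch, not the paper's implementation: Thompson Sampling
# over two candidate policies for two-state restless Markov arms.
import numpy as np

rng = np.random.default_rng(0)

class Arm:
    def __init__(self):
        # Beta posteriors over P(next=1 | current=0) and P(next=1 | current=1)
        self.alpha = np.ones(2)   # counts of transitions into state 1
        self.beta = np.ones(2)    # counts of transitions into state 0
        self.belief = 0.5         # current belief that the arm is in state 1

    def sample_params(self):
        # One posterior draw of the arm's transition probabilities [p01, p11]
        return rng.beta(self.alpha, self.beta)

    def update(self, prev_state, new_state):
        # Update the posterior with an observed transition of this arm
        self.alpha[prev_state] += new_state
        self.beta[prev_state] += 1 - new_state
        self.belief = float(new_state)

def stationary_mean(p01, p11):
    # Long-run fraction of time a two-state chain spends in the good state 1
    return p01 / (p01 + 1.0 - p11)

def myopic_value(params, beliefs, horizon=50):
    # Rough estimate of the myopic policy's mean reward under the sampled
    # parameters: at each step play the arm most likely to be in state 1,
    # while all arms' beliefs keep evolving (the arms are restless).
    b = np.array(beliefs, dtype=float)
    total = 0.0
    for _ in range(horizon):
        next_prob = b * params[:, 1] + (1.0 - b) * params[:, 0]
        total += next_prob.max()
        b = next_prob
    return total / horizon

def choose_policy(arms):
    # Thompson Sampling step: draw one parameter sample per arm and pick
    # whichever candidate policy looks better under that sample.
    params = np.array([a.sample_params() for a in arms])
    stay_value = max(stationary_mean(p01, p11) for p01, p11 in params)
    beliefs = [a.belief for a in arms]
    return "stay" if stay_value >= myopic_value(params, beliefs) else "myopic"
```

Under this reading, the randomness of the posterior draws is what balances exploration and exploitation: policies that look good only under uncertain parameter estimates are still selected occasionally, which refines the Beta posteriors that both value estimates depend on.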