On the Combinatorial Multi-Armed Bandit Problem with Markovian Rewards

We consider a combinatorial generalization of the classical multi-armed bandit problem, defined as follows. There is a given bipartite graph of M users and N ≥ M resources. For each user-resource pair (i, j), there is an associated state that evolves as an aperiodic, irreducible, finite-state Markov chain with unknown parameters, with transitions occurring each time user i is allocated resource j. User i receives a reward that depends on the corresponding state each time it is allocated resource j. The system objective is to learn the best matching of users to resources so that the long-term sum of the rewards received by all users is maximized. This corresponds to minimizing regret, defined here as the gap between the expected total reward obtainable by the best static matching and the expected total reward achieved by a given algorithm. We present a matching-learning algorithm for this problem with polynomial storage and polynomial per-step computational complexity, and show that it achieves a regret that is uniformly arbitrarily close to logarithmic in time and polynomial in the number of users and resources. This formulation is broadly applicable to scheduling and switching problems in communication networks, including cognitive radio networks, and significantly extends prior results in the area.
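
As a concrete illustration of the matching-learning idea described above, the sketch below maintains a UCB-style index for every (user, resource) pair and plays the maximum-weight matching of those indices at each step. This is a minimal sketch under stated assumptions, not the paper's exact algorithm: the index form, the exploration constant, and the use of scipy's linear_sum_assignment (a Hungarian-style solver) for the matching step are all illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


class MatchingUCBSketch:
    """Illustrative index-plus-matching learner (hypothetical, not the paper's algorithm)."""

    def __init__(self, num_users, num_resources, explore_const=2.0):
        assert num_resources >= num_users, "the model assumes N >= M"
        self.M, self.N = num_users, num_resources
        self.c = explore_const                     # exploration weight (illustrative choice)
        self.counts = np.zeros((self.M, self.N))   # times each (user, resource) pair was played
        self.means = np.zeros((self.M, self.N))    # sample-mean reward of each pair
        self.t = 0                                 # total number of decision steps

    def select_matching(self):
        """Pick the matching of users to resources with the largest total index."""
        self.t += 1
        # Optimistic index: sample mean plus an exploration bonus that shrinks with plays.
        bonus = np.sqrt(self.c * np.log(self.t) / np.maximum(self.counts, 1.0))
        index = np.where(self.counts > 0, self.means + bonus, 1e9)  # unplayed pairs go first
        # linear_sum_assignment minimizes total cost, so negate the index to maximize it.
        users, resources = linear_sum_assignment(-index)
        return list(zip(users, resources))

    def update(self, matching, rewards):
        """Incorporate the rewards observed for the pairs that were just played."""
        for (i, j), r in zip(matching, rewards):
            self.counts[i, j] += 1
            self.means[i, j] += (r - self.means[i, j]) / self.counts[i, j]
```

A simulation loop would call select_matching(), draw one reward per matched pair from the underlying Markov chains, and pass both back to update().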
