On Optimality of Myopic Policy for Restless Multi-Armed Bandit Problem: An Axiomatic Approach

Due to its application in numerous engineering problems, the restless multi-armed bandit (RMAB) problem is of fundamental importance in stochastic decision theory. However, solving the RMAB problem is well known to be PSPACE-hard, and the optimal policy is generally intractable due to its exponential computational complexity. A natural alternative is to seek simple myopic policies that are easy to implement. This paper presents a generic study on the optimality of the myopic policy for the RMAB problem. More specifically, we develop three axioms characterizing a family of generic and practically important functions termed regular functions. By performing a mathematical analysis based on the developed axioms, we establish closed-form conditions under which the myopic policy is guaranteed to be optimal. The axiomatic analysis also illuminates important engineering implications of the myopic policy, including the intrinsic tradeoff between exploration and exploitation. A case study is then presented to illustrate the application of the derived results in analyzing a class of RMAB problems arising in multi-channel opportunistic access.
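To make the multi-channel opportunistic access setting concrete, the following is a minimal sketch of a myopic (greedy) sensing policy, assuming each channel evolves as an independent two-state Gilbert-Elliott Markov chain with transition probabilities p11 (good-to-good) and p01 (bad-to-good), as is standard in this line of work. The function names and parameter values are illustrative, not taken from the paper; the sketch only shows the belief-update and greedy-selection mechanics that the myopic policy relies on.

```python
import numpy as np

# Myopic sensing sketch for multi-channel opportunistic access.
# Each channel i carries a belief omega_i = P(channel i is in the good state).

def belief_update(omega, observed, state, p11, p01):
    """One-step belief update for a single channel."""
    if observed:
        # Sensed channel: belief resets according to the observed state.
        return p11 if state == 1 else p01
    # Unobserved channel: propagate the belief through the Markov chain.
    return omega * p11 + (1.0 - omega) * p01

def myopic_policy(beliefs, k):
    """Select the k channels with the largest immediate expected reward."""
    return np.argsort(beliefs)[::-1][:k]

# Toy usage: 4 channels, sense 1 per slot (hypothetical parameters).
rng = np.random.default_rng(0)
p11, p01 = 0.8, 0.2                           # positively correlated channel
beliefs = np.full(4, p01 / (p01 + 1 - p11))   # start from the stationary belief
true_state = rng.random(4) < beliefs

for t in range(5):
    chosen = myopic_policy(beliefs, k=1)
    for i in range(4):
        if i in chosen:
            reward = int(true_state[i])       # reward = 1 iff channel is good
            beliefs[i] = belief_update(None, True, reward, p11, p01)
        else:
            beliefs[i] = belief_update(beliefs[i], False, None, p11, p01)
    # Evolve the true channel states for the next slot.
    true_state = np.where(true_state,
                          rng.random(4) < p11,
                          rng.random(4) < p01)
```

The greedy selection maximizes only the immediate expected reward; whether this is optimal over a horizon is exactly the question the axiomatic analysis addresses.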
