Indexability and Rollout Policy for Multi-State Partially Observable Restless Bandits

Restless multi-armed bandits with partially observable states have applications in communication systems, age-of-information problems, and recommendation systems. In this paper, we study multi-state partially observable restless bandit models. We consider three models that differ in the information observable to the decision maker: 1) no state information is observable under either action on a bandit; 2) perfect state information is observable only under one action, and there is a fixed restart state, i.e., the bandit transitions to that state from all other states; 3) perfect state information is available to the decision maker under both actions, with a separate restart state for each action. We develop structural properties of these models. We also show a threshold-type policy and indexability for models 2 and 3. We present a Monte Carlo (MC) rollout policy and use it for Whittle index computation in model 2, and we obtain a concentration bound on the value-function estimate in terms of the horizon length and the number of trajectories for the MC rollout policy. We derive an explicit index formula for model 3. Finally, we describe the Monte Carlo rollout policy for model 1, where indexability is difficult to establish. We present numerical examples using the myopic policy, the Monte Carlo rollout policy, and the Whittle index policy, and observe that the Monte Carlo rollout policy is competitive with the myopic policy.
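The Monte Carlo rollout idea described above can be sketched as follows: starting from a belief state, take a candidate action, then follow a fixed base policy for a finite horizon, and average the discounted returns over several simulated trajectories. This is a minimal illustrative sketch, not the paper's implementation; `simulate_step`, `rollout_policy`, and all parameter names here are hypothetical placeholders for a user-supplied bandit simulator.

```python
import random

def mc_rollout_value(simulate_step, belief, action, horizon, num_traj,
                     discount=0.95,
                     rollout_policy=lambda b: random.randint(0, 1)):
    """Monte Carlo estimate of the value of taking `action` at `belief`,
    then following `rollout_policy` for the remaining horizon.

    `simulate_step(belief, action)` is an assumed user-supplied simulator
    returning (next_belief, reward); all names are illustrative."""
    total = 0.0
    for _ in range(num_traj):
        b, a = belief, action          # start each trajectory fresh
        ret, gamma = 0.0, 1.0
        for _ in range(horizon):
            b, r = simulate_step(b, a) # one simulated transition
            ret += gamma * r           # accumulate discounted reward
            gamma *= discount
            a = rollout_policy(b)      # base policy for later steps
        total += ret
    return total / num_traj            # average over trajectories
```

For index computation, one would compare such estimates for the active and passive actions as a function of a subsidy for passivity; the concentration of this estimate improves with the number of trajectories and is controlled by the horizon length, in the spirit of the bound discussed in the abstract.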
