Lazy Restless Bandits for Decision Making with Limited Observation Capability: Applications in Wireless Networks

In this work we formulate the problem of restless multi-armed bandits with cumulative feedback and partially observable states. We call these bandits lazy restless bandits (LRBs) because they are slow to act: they allow multiple system state transitions during each decision interval. Rewards for each action are state dependent, and the states of the arms are hidden from the decision maker. The goal of the decision maker is to choose one of the $M$ arms at the beginning of each decision interval so that the long-term cumulative reward is maximized. This work is motivated by applications in wireless networks such as relay selection, opportunistic channel access, and downlink scheduling under evolving channel conditions. We analyze the Whittle index policy for the LRB problem and, in the course of doing so, prove various structural properties of the value functions. Closed-form index expressions are derived for two sets of special cases; for the general case, an algorithm for index computation is provided. A comparative study based on extensive numerical simulations is presented, in which the Whittle index policy and the myopic policy are compared with other policies such as uniform random, non-uniform random, and round-robin.
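The LRB setting above can be sketched in simulation. The following is a minimal illustrative sketch, not the paper's method: each arm is a hidden two-state (bad/good) Markov chain whose per-arm transition matrices and good-state rewards are hypothetical values chosen here for illustration; the chain makes `k` hidden transitions per decision interval (the "lazy" property); beliefs are propagated through the `k`-step transition matrix; and, as a simplifying assumption, the chosen arm's state is revealed at the end of its interval in place of the paper's cumulative-feedback belief update. The policy shown is the simple myopic (immediate-expected-reward) rule, not the Whittle index.

```python
import numpy as np

def belief_update(belief, P, k):
    """Propagate the probability that an arm is in the 'good' state
    through k hidden Markov transitions (one lazy decision interval)."""
    b = np.array([1.0 - belief, belief])          # [P(bad), P(good)]
    b = b @ np.linalg.matrix_power(P, k)          # k-step transition
    return float(b[1])

def myopic_policy(beliefs, r_good):
    """Pick the arm with the highest immediate expected reward."""
    return int(np.argmax([b * r for b, r in zip(beliefs, r_good)]))

def simulate(T=50, k=4, seed=0):
    rng = np.random.default_rng(seed)
    M = 3
    # Hypothetical per-arm transition matrices (rows: from bad/good)
    # and good-state rewards -- illustrative values only.
    Ps = [np.array([[0.9, 0.1], [0.2, 0.8]]) for _ in range(M)]
    r_good = [1.0, 0.8, 1.2]
    beliefs = [0.5] * M
    states = [int(rng.integers(2)) for _ in range(M)]
    total = 0.0
    for _ in range(T):
        a = myopic_policy(beliefs, r_good)
        # Reward accrues on the chosen arm over k sub-slots while
        # ALL arms keep evolving (restlessness).
        for _ in range(k):
            total += r_good[a] * states[a]
            states = [int(rng.random() < Ps[i][states[i], 1])
                      for i in range(M)]
        # Unobserved arms: propagate beliefs through k transitions.
        beliefs = [belief_update(beliefs[i], Ps[i], k) for i in range(M)]
        # Simplifying assumption: chosen arm's end-of-interval state
        # is revealed (stand-in for the cumulative-feedback update).
        beliefs[a] = float(states[a])
    return total, beliefs

if __name__ == "__main__":
    total, beliefs = simulate()
    print(total, beliefs)
```

Replacing `myopic_policy` with a Whittle-index rule only changes the arm-selection line; the belief propagation over `k` hidden transitions is what distinguishes the lazy setting from standard per-slot restless bandits.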