Sequential Decision Making With Limited Observation Capability: Application to Wireless Networks

This paper studies a generalized class of restless multi-armed bandits with hidden states in which feedback is cumulative rather than instantaneous. We call them lazy restless bandits (LRBs) because decision-making events are sparser than state-transition events; the feedback after each decision event is therefore the cumulative effect of the intervening state transitions. The states of the arms are hidden from the decision maker, and the rewards for actions are state dependent. The decision maker must choose one arm in each decision interval so as to maximize the long-term cumulative reward. Since the states are hidden, the decision maker maintains and updates a belief about them. It is shown that LRBs admit an optimal policy with a threshold structure in the belief space. The Whittle-index policy for solving the LRB problem is analyzed, and the indexability of LRBs is established. Closed-form index expressions are provided for two sets of special cases; for more general cases, an algorithm for index computation is given. An extensive simulation study compares the Whittle-index, modified Whittle-index, and myopic policies. The Lagrangian relaxation of the problem yields an upper bound on the optimal value function, which is used to assess the degree of sub-optimality of the various policies.
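The mechanics described above can be sketched in a minimal illustrative example. The snippet below assumes a hypothetical two-state hidden Markov arm (all numbers and function names are illustrative, not taken from the paper): the belief is propagated through the several unobserved state transitions of a decision interval, updated via Bayes' rule once the cumulative feedback is observed, and an action is taken by comparing the belief against a threshold, which is the policy structure the paper proves optimal.

```python
import numpy as np

# Hypothetical two-state arm (state 0 = "bad", state 1 = "good").
# Transition matrix of the hidden Markov chain; values are illustrative.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def propagate(belief_good, k):
    """Propagate P(state = good) through k unobserved state transitions.

    With cumulative (lazy) feedback, several transitions occur between
    decision events, so the belief is pushed through P^k, not just P.
    """
    b = np.array([1.0 - belief_good, belief_good])
    return (b @ np.linalg.matrix_power(P, k))[1]

def bayes_update(belief_good, lik_good, lik_bad):
    """Bayes update of P(good) after observing the cumulative feedback.

    lik_good / lik_bad are the likelihoods of the observed feedback
    conditioned on the terminal state being good / bad (assumed given
    by the observation model).
    """
    num = belief_good * lik_good
    den = num + (1.0 - belief_good) * lik_bad
    return num / den

def threshold_policy(belief_good, threshold):
    """Play the arm iff the belief exceeds the threshold -- the
    threshold structure in belief space described in the abstract."""
    return belief_good > threshold
```

For instance, a belief of 0.5 combined with feedback that is nine times more likely under the good state (`bayes_update(0.5, 0.9, 0.1)`) sharpens the belief to 0.9, while repeated propagation without observations drives the belief toward the chain's stationary distribution.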
