Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits

We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken. The goal is to sequentially choose actions for the arms so as to maximize the expected cumulative reward. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy, which we call the Occupancy-Measured-Reward Index Policy. Our policy is well-defined even if the underlying MDPs are not indexable. We prove that it is asymptotically optimal when the activation budget and the number of arms are scaled up while their ratio is held constant. For the case where the system parameters are unknown, we develop a learning algorithm, which we call R(MA)^2B-UCB. It applies the principle of optimism in the face of uncertainty and further exploits a generative model to fully leverage the structure of the Occupancy-Measured-Reward Index Policy. Compared with existing algorithms, R(MA)^2B-UCB performs close to the offline optimal policy and achieves sub-linear regret with low computational complexity. Experimental results show that R(MA)^2B-UCB outperforms existing algorithms in both regret and running time.
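
To make the occupancy-measure idea concrete, the sketch below solves the standard linear-programming relaxation of a finite-horizon restless bandit over per-arm occupancy measures, with the hard per-period activation budget relaxed to a budget in expectation. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy instance, the binary active/passive action split, and all names (P, r, mu0, idx, ...) are made up for the example.

```python
import numpy as np
from scipy.optimize import linprog

# Minimal sketch: occupancy-measure LP relaxation of a finite-horizon
# restless bandit. Toy sizes and random instance are illustrative only.
N, S, A, T, B = 3, 2, 2, 4, 1  # arms, states, actions (0 = passive), horizon, budget

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(N, S, A))  # P[i, s, a] = dist. over next states
r = rng.random((N, S, A))                      # r[i, s, a] = reward of action a in state s
mu0 = np.full((N, S), 1.0 / S)                 # initial state distribution of each arm

# Decision variables: occupancy measures mu[i, t, s, a], flattened in C order.
def idx(i, t, s, a):
    return ((i * T + t) * S + s) * A + a

n_var = N * T * S * A
c = np.zeros(n_var)
for i in range(N):
    for t in range(T):
        for s in range(S):
            for a in range(A):
                c[idx(i, t, s, a)] = -r[i, s, a]  # linprog minimizes, so negate

A_eq, b_eq = [], []
# Initial condition: sum_a mu[i, 0, s, a] = mu0[i, s].
for i in range(N):
    for s in range(S):
        row = np.zeros(n_var)
        for a in range(A):
            row[idx(i, 0, s, a)] = 1.0
        A_eq.append(row); b_eq.append(mu0[i, s])
# Flow balance: sum_a mu[i, t+1, s', a] = sum_{s,a} mu[i, t, s, a] * P[i, s, a, s'].
for i in range(N):
    for t in range(T - 1):
        for s2 in range(S):
            row = np.zeros(n_var)
            for a in range(A):
                row[idx(i, t + 1, s2, a)] = 1.0
            for s in range(S):
                for a in range(A):
                    row[idx(i, t, s, a)] -= P[i, s, a, s2]
            A_eq.append(row); b_eq.append(0.0)

A_ub, b_ub = [], []
# Relaxed budget: expected number of active arms per period is at most B.
for t in range(T):
    row = np.zeros(n_var)
    for i in range(N):
        for s in range(S):
            for a in range(1, A):  # actions a >= 1 count as "active"
                row[idx(i, t, s, a)] = 1.0
    A_ub.append(row); b_ub.append(B)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None), method="highs")
assert res.success
mu = res.x.reshape(N, T, S, A)
print("LP optimal value (upper bound on the optimal reward):", -res.fun)
```

Given the optimal occupancy measure mu, an index policy in the spirit of the abstract would, at each period, score each arm at its current state by an occupancy-measured reward derived from mu and activate the highest-scoring arms within the hard budget B; an optimistic learning variant would rerun the same LP with empirically estimated transitions and rewards inflated by confidence bonuses built from generative-model samples.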
