NeurWIN: Neural Whittle Index Network For Restless Bandits Via Deep RL

The Whittle index policy is a powerful tool for obtaining asymptotically optimal solutions to the notoriously intractable problem of restless bandits. However, finding the Whittle indices remains difficult for many practical restless bandits with convoluted transition kernels. This paper proposes NeurWIN, a neural Whittle index network that seeks to learn the Whittle indices of any restless bandit by leveraging their mathematical properties. We show that a neural network that produces the Whittle index is also one that produces the optimal control for a set of Markov decision problems. This property motivates the use of deep reinforcement learning to train NeurWIN. We demonstrate the utility of NeurWIN by evaluating its performance on three recently studied restless bandit problems. Our experimental results show that NeurWIN performs significantly better than other RL algorithms.
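
The link between the Whittle index and a set of Markov decision problems can be illustrated on a small tabular arm. The sketch below is not NeurWIN and does not use a neural network. It assumes a single indexable arm with known transitions, a discounted reward criterion, and a subsidy `lam` paid whenever the arm is left passive; the Whittle index of state `s` is then the subsidy at which activating and resting are equally good. The function names `q_values` and `whittle_index`, the discount factor, and the search bounds are all hypothetical choices made for illustration.

```python
import numpy as np

def q_values(P, r, lam, beta=0.9, iters=2000, tol=1e-10):
    """Value iteration for one arm when the passive action (a=0) earns subsidy lam.

    P: array of shape (2, n, n), P[a] is the transition matrix under action a.
    r: array of shape (2, n), r[a, s] is the reward in state s under action a.
    Returns Q of shape (2, n).
    """
    n = r.shape[1]
    V = np.zeros(n)
    bonus = np.array([lam, 0.0])[:, None]   # subsidy applies to the passive action only
    for _ in range(iters):
        Q = r + bonus + beta * (P @ V)       # (2, n) Bellman backup
        V_new = Q.max(axis=0)
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break
    return r + bonus + beta * (P @ V)

def whittle_index(P, r, s, lo=-10.0, hi=10.0, beta=0.9, eps=1e-6):
    """Binary search for the subsidy that makes both actions equally good in state s.

    Assumes the arm is indexable and that the index lies inside [lo, hi].
    """
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        Q = q_values(P, r, mid, beta)
        if Q[1, s] > Q[0, s]:   # activating is still preferred: subsidy too small
            lo = mid
        else:                   # resting is preferred: subsidy large enough
            hi = mid
    return 0.5 * (lo + hi)

# Degenerate toy arm whose state never changes under either action.
P_static = np.stack([np.eye(2), np.eye(2)])
r_static = np.array([[0.0, 0.0],    # passive rewards
                     [0.2, 1.0]])   # active rewards
print([whittle_index(P_static, r_static, s) for s in range(2)])
```

For this degenerate arm the future value is the same under both actions, so the index of each state reduces to the reward gap `r[1, s] - r[0, s]`, which gives a simple sanity check. For any fixed subsidy, the policy "activate exactly when the index of the current state exceeds the subsidy" is optimal for the subsidized problem, which is the property the abstract refers to.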