Robust Restless Bandits: Tackling Interval Uncertainty with Deep Reinforcement Learning

We introduce Robust Restless Bandits, a challenging generalization of restless multi-armed bandits (RMABs). RMABs have been widely studied for intervention planning with limited resources. However, most works make the unrealistic assumption that the transition dynamics are known perfectly, restricting the applicability of existing methods to real-world scenarios. To make RMABs more useful in settings with uncertain dynamics, we make the following contributions: (i) We introduce the Robust RMAB problem and develop solutions for a minimax regret objective when transitions are given by interval uncertainties; (ii) We develop a double oracle algorithm for solving Robust RMABs and demonstrate its effectiveness on three experimental domains; (iii) To enable our double oracle approach, we introduce RMABPPO, a novel deep reinforcement learning algorithm for solving RMABs. RMABPPO hinges on learning an auxiliary “λ-network” that allows each arm’s learning to decouple, greatly reducing the sample complexity required for training; (iv) Under minimax regret, the adversary in the double oracle approach is notoriously difficult to implement due to non-stationarity. To address this, we formulate the adversary oracle as a multi-agent reinforcement learning problem and solve it with a multi-agent extension of RMABPPO, which may be of independent interest as the first known algorithm for this setting. Code is available at https://github.com/killian-34/RobustRMAB.
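Two of the components mentioned above are standard enough to sketch from first principles. First, the per-arm decoupling that a λ-network exploits builds on the classical Lagrangian relaxation of the RMAB budget constraint; in our own notation (not necessarily the paper's exact formulation), the relaxed objective is

\min_{\lambda \ge 0} \; \frac{\lambda B}{1-\gamma} \;+\; \sum_{n=1}^{N} \max_{\pi_n} \, \mathbb{E}_{\pi_n}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\big(R_n(s_n^t) - \lambda\, c_n(a_n^t)\big)\right],

where B is the per-round action budget, γ the discount factor, and R_n, c_n the reward and action cost of arm n. For a fixed penalty λ the sum splits into N independent per-arm problems, which is what lets each arm's policy be trained separately; a λ-network can be read as predicting this penalty from the current state rather than solving for it exactly.

Second, the double oracle procedure has a generic structure that is independent of the RL machinery. The Python sketch below shows that loop for a minimax-regret objective on a deliberately tiny toy game: regret(), agent_oracle(), and nature_oracle() are hypothetical placeholders (simple grid searches over a scalar) standing in for the trained agent policies, the interval-bounded transition parameters, and the RL-based oracles described in the abstract, and the restricted game is solved as a zero-sum LP.

```python
import numpy as np
from scipy.optimize import linprog


def solve_matrix_game(payoffs: np.ndarray):
    """Mixed equilibrium of a zero-sum matrix game (row player maximizes).

    Uses the classical LP formulation; returns (row_mix, col_mix, value).
    """
    m, n = payoffs.shape
    shift = 1.0 - payoffs.min()  # make every entry strictly positive
    A = payoffs + shift
    # Row player: min 1'p  s.t.  A'p >= 1, p >= 0  =>  row_mix = p / sum(p)
    row = linprog(np.ones(m), A_ub=-A.T, b_ub=-np.ones(n), bounds=[(0, None)] * m)
    # Column player: max 1'q  s.t.  Aq <= 1, q >= 0  =>  col_mix = q / sum(q)
    col = linprog(-np.ones(n), A_ub=A, b_ub=np.ones(m), bounds=[(0, None)] * n)
    value = 1.0 / row.x.sum() - shift
    return row.x / row.x.sum(), col.x / col.x.sum(), value


# Toy stand-ins for the RL-based oracles: a "policy" and an "environment" are
# each just a scalar in [0, 1] and regret is an analytic function. In the
# paper's setting these would be trained policies, interval-bounded transition
# parameters, and simulated regret estimates (optimal minus achieved return).
GRID = np.linspace(0.0, 1.0, 21)


def regret(policy: float, env: float) -> float:
    return (policy - env) ** 2  # toy regret surface


def agent_oracle(env_set, env_mix) -> float:
    """Best response of the agent: minimize expected regret vs nature's mix."""
    scores = [sum(q * regret(a, e) for q, e in zip(env_mix, env_set)) for a in GRID]
    return GRID[int(np.argmin(scores))]


def nature_oracle(policy_set, policy_mix) -> float:
    """Best response of nature: maximize expected regret vs the agent's mix."""
    scores = [sum(p * regret(a, e) for p, a in zip(policy_mix, policy_set)) for e in GRID]
    return GRID[int(np.argmax(scores))]


def double_oracle(tol: float = 1e-6, max_iters: int = 50):
    policies, envs = [GRID[0]], [GRID[-1]]  # arbitrary initial pure strategies
    for _ in range(max_iters):
        # Restricted regret game over the strategies found so far.
        R = np.array([[regret(a, e) for e in envs] for a in policies])
        # The agent is the row player and minimizes regret, so pass -R.
        p, q, neg_value = solve_matrix_game(-R)
        value = -neg_value
        # Expand each player's strategy set with a best response to the mix.
        new_policy = agent_oracle(envs, q)
        new_env = nature_oracle(policies, p)
        agent_gain = value - sum(qq * regret(new_policy, e) for qq, e in zip(q, envs))
        nature_gain = sum(pp * regret(a, new_env) for pp, a in zip(p, policies)) - value
        if agent_gain <= tol and nature_gain <= tol:
            return policies, p, envs, q, value
        policies.append(new_policy)
        envs.append(new_env)
    return policies, p, envs, q, value


if __name__ == "__main__":
    policies, p, envs, q, value = double_oracle()
    print("minimax regret of the toy game:", round(value, 4))
```

On the toy game the loop terminates with a minimax regret of about 0.25, the agent settling near the midpoint of nature's interval. The same skeleton applies in the Robust RMAB setting, with each oracle call replaced by an RL training run and each payoff entry by a simulated regret estimate.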
