Towards Q-learning the Whittle Index for Restless Bandits

We consider the restless multi-armed bandit problem (RMABP) with an infinite-horizon average-cost objective. Each arm of the RMABP is associated with a Markov process that operates in two modes: active and passive. At each time slot, a controller must designate a subset of the arms as active; the processes of active arms evolve differently from those of passive arms. Treated as an optimal control problem, the RMABP is known to be computationally intractable to solve exactly. In many cases, the Whittle index policy achieves near-optimal performance and can be computed tractably. Nevertheless, computing the Whittle indices requires knowledge of the transition matrices of the underlying processes, which are sometimes hidden from the decision maker. In this paper, we take first steps towards a tractable and efficient reinforcement learning algorithm for controlling such a system. We set up parallel Q-learning recursions, one per candidate value of the Whittle index, and update these recursions as we control the system, learning an approximation of the Whittle index as time evolves. Tested on several examples, our controller outperforms naive priority allocations and approaches the performance of the fully informed Whittle index policy.
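To make the idea concrete, here is a minimal sketch (not the paper's exact algorithm) of how Q-learning can recover a Whittle index for a single restless arm. It exploits Whittle's defining property: the index of a state is the passivity subsidy that makes the controller indifferent between the active and passive actions in that state. A fast timescale runs tabular Q-learning on the subsidized arm, while a slow timescale drives the subsidy `lam` towards the indifference point. The two-state arm, its transition matrices, rewards, discounting (used here as a surrogate for the average-cost objective), and step-size schedules are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-state restless arm (these dynamics are assumptions):
# action 1 = active, action 0 = passive.
P = {1: np.array([[0.3, 0.7], [0.2, 0.8]]),   # active transitions
     0: np.array([[0.9, 0.1], [0.6, 0.4]])}   # passive transitions
r = {1: np.array([0.0, 1.0]),                 # reward per state when active
     0: np.array([0.0, 1.0])}                 # reward per state when passive
n_states = 2

def whittle_q(s_ref, n_iters=200_000, beta=0.9):
    """Estimate the Whittle index of state s_ref via two-timescale Q-learning.

    The passive action earns an extra subsidy lam; lam is adjusted on a
    slower timescale towards the value equalising Q(s_ref, 0) and
    Q(s_ref, 1) -- Whittle's indifference condition.
    """
    Q = np.zeros((n_states, 2))
    lam = 0.0
    s = 0
    for t in range(1, n_iters + 1):
        a = int(rng.integers(2))               # uniform exploration
        s_next = int(rng.choice(n_states, p=P[a][s]))
        reward = r[a][s] + (lam if a == 0 else 0.0)
        alpha = 1.0 / (1 + t // 100)           # fast Q-learning step size
        Q[s, a] += alpha * (reward + beta * Q[s_next].max() - Q[s, a])
        eta = 1.0 / (1 + t)                    # slow subsidy step size
        lam += eta * (Q[s_ref, 1] - Q[s_ref, 0])
        s = s_next
    return lam, Q

lam0, Q0 = whittle_q(s_ref=0)
```

The paper's scheme differs in that it maintains parallel recursions over a grid of candidate index values rather than a single adapting subsidy, but the indifference condition being targeted is the same.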
