Learning Unknown Service Rates in Queues: A Multiarmed Bandit Approach

Consider a queueing system consisting of multiple servers. Jobs arrive over time and enter a queue for service; the goal is to minimize the size of this queue. At each opportunity for service, at most one server can be chosen, and at most one job can be served. Service is successful with a probability (the service probability) that is a priori unknown for each server. An algorithm that knows the service probabilities (the "genie") can always choose the server with the highest service probability. We study algorithms that learn the unknown service probabilities. Our goal is to minimize queue-regret: the (expected) difference between the queue-lengths obtained by the algorithm and those obtained by the "genie." Since queue-regret cannot be larger than classical regret, results for the standard multi-armed bandit problem give algorithms for which queue-regret increases no more than logarithmically in time. Our paper shows that queue-regret exhibits surprisingly more complex behavior. In particular, as long as the bandit algorithm's queues have relatively long regenerative cycles, queue-regret is similar to cumulative regret and scales (essentially) logarithmically. However, we show that this "early stage" of the queueing bandit eventually gives way to a "late stage," where the optimal queue-regret scaling is $O(1/t)$. We demonstrate an algorithm that (order-wise) achieves this asymptotic queue-regret in the late stage. Our results are developed in a more general model that allows for multiple job classes as well.
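To make the model concrete, here is a minimal simulation sketch of the single-class queueing bandit and its queue-regret. Everything in it is an illustrative assumption rather than the paper's construction: the arrival rate `lam`, the service probabilities `mus`, the horizon, and the choice of plain UCB1 as the learning policy (the paper's late-stage algorithm is more structured than UCB1). We also assume the algorithm observes the service outcome at every slot, which may differ from the paper's exact feedback model.

```python
import math
import random


class UCB1:
    """Standard UCB1 index policy over the servers' service probabilities."""

    def __init__(self, n_servers):
        self.counts = [0] * n_servers
        self.sums = [0.0] * n_servers

    def choose(self, t):
        # Play each server once before trusting the index.
        for k, c in enumerate(self.counts):
            if c == 0:
                return k
        return max(
            range(len(self.counts)),
            key=lambda k: self.sums[k] / self.counts[k]
            + math.sqrt(2.0 * math.log(t + 1) / self.counts[k]),
        )

    def update(self, k, success):
        self.counts[k] += 1
        self.sums[k] += success


class Genie:
    """Knows the true service probabilities; always picks the best server."""

    def __init__(self, mus):
        self.best = max(range(len(mus)), key=lambda k: mus[k])

    def choose(self, t):
        return self.best

    def update(self, k, success):
        pass


def simulate(mus, lam, horizon, policy, seed=0):
    """Single queue, Bernoulli(lam) arrivals, one service attempt per slot.

    The chosen server k succeeds with probability mus[k]; on success the
    head-of-line job (if any) departs. Returns the queue-length path.
    """
    rng = random.Random(seed)
    q, path = 0, []
    for t in range(horizon):
        if rng.random() < lam:           # Bernoulli arrival
            q += 1
        k = policy.choose(t)
        success = rng.random() < mus[k]  # Bernoulli service outcome
        if success and q > 0:
            q -= 1
        policy.update(k, success)        # bandit feedback
        path.append(q)
    return path


# Estimate queue-regret E[Q_alg(t) - Q_genie(t)] by averaging over runs.
mus, lam, horizon, runs = [0.9, 0.6, 0.5], 0.5, 20_000, 100
regret = [0.0] * horizon
for s in range(runs):
    q_alg = simulate(mus, lam, horizon, UCB1(len(mus)), seed=s)
    q_gen = simulate(mus, lam, horizon, Genie(mus), seed=s)
    for t in range(horizon):
        regret[t] += (q_alg[t] - q_gen[t]) / runs
print(regret[100], regret[-1])  # early-stage vs late-stage queue-regret
```

Passing the same seed to both runs couples the arrival streams (each run consumes exactly two random draws per slot), which reduces the variance of the regret estimate; with enough runs one should see the early-stage growth followed by the late-stage decay described in the abstract.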
