On Learning the cμ Rule: Single and Multiserver Settings

We consider learning-based variants of the $c \mu$ rule -- a classic and well-studied scheduling policy -- in single- and multi-server settings for multi-class queueing systems. In the single-server setting, the $c \mu$ rule is known to minimize the expected holding cost (queue lengths weighted by per-class costs, summed over both classes and time). We focus on the setting where the service rates $\mu$ are unknown, and are interested in the holding-cost regret -- the difference between the expected holding cost induced by a learning-based rule (which must learn $\mu$) and that of the $c \mu$ rule (which knows the service rates), over any fixed time horizon. We first show that empirically learning the service rates and then scheduling using these learned values results in a holding-cost regret that does not depend on the time horizon. The key insight behind this constant regret bound is that any work-conserving scheduling policy in this setting allows explore-free learning: no penalty is incurred for exploring and learning the service rates. We next consider the multi-server setting. We show that, in general, the $c \mu$ rule is not stabilizing (i.e., there are stabilizable arrival and service rate parameters for which the multi-server $c \mu$ rule results in unstable queues). We then characterize sufficient conditions for stability (and also concentration bounds on busy periods). Using these results, we show that learning-based variants of the $c\mu$ rule again achieve constant regret (i.e., regret that does not depend on the time horizon). This result hinges on (i) the busy-period concentration bounds for the multi-server $c \mu$ rule, and (ii) the fact that our learning-based rule is designed to dynamically explore server rates, but in such a manner that it eventually satisfies an explore-free condition.
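To make the single-server policy concrete, here is a minimal discrete-time simulation sketch of the learning-based $c\mu$ rule described above. It is a hypothetical illustration, not the paper's exact model: it assumes Bernoulli arrivals, geometric (success-probability $\mu_k$) services, and an optimistic initial estimate $\hat\mu_k = 1$ for classes never yet served; the function name and parameters are invented for this example.

```python
import random

def learned_cmu_sim(arrival_rates, service_rates, costs, horizon, seed=0):
    """Sketch of the learning-based c-mu rule in discrete time.

    Each step: Bernoulli arrivals per class, then serve the nonempty
    class k maximizing costs[k] * mu_hat[k], where mu_hat[k] is the
    empirical success rate of past service attempts on class k.
    Returns the cumulative holding cost sum_t sum_k costs[k] * Q_k(t).
    """
    rng = random.Random(seed)
    K = len(costs)
    queues = [0] * K
    successes = [0] * K   # completed services per class
    attempts = [0] * K    # service attempts per class
    total_cost = 0.0
    for _ in range(horizon):
        # Bernoulli arrivals
        for k in range(K):
            if rng.random() < arrival_rates[k]:
                queues[k] += 1
        nonempty = [k for k in range(K) if queues[k] > 0]
        if nonempty:
            # empirical c-mu index; optimistic mu_hat = 1 before any data
            def index(k):
                mu_hat = successes[k] / attempts[k] if attempts[k] else 1.0
                return costs[k] * mu_hat
            served = max(nonempty, key=index)
            attempts[served] += 1
            # service completes with probability mu_k (geometric service)
            if rng.random() < service_rates[served]:
                successes[served] += 1
                queues[served] -= 1
        total_cost += sum(c * q for c, q in zip(costs, queues))
    return total_cost
```

Because every busy slot serves *some* class, the policy is work-conserving: the empirical estimates $\hat\mu_k$ are refined as a free by-product of scheduling, which is the explore-free property underlying the constant-regret result.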
