Threshold-based rerouting and replication for resolving job-server affinity relations

We consider a system with several job types and two parallel server pools. Servers are homogeneous within each pool but possibly heterogeneous across pools, in the sense that the service speed of a job may depend on both its type and the server pool. Jobs are assigned to a server pool immediately upon arrival. This assignment could be based on (partial) knowledge of the job type, but such knowledge might not be available. Information about the job type can, however, be obtained while the job is in service: as the service progresses, the likelihood that the service speed of this job type is low increases, creating an incentive to execute the job on different, possibly faster, server(s). We consider two policies: reroute the job to the other server pool, or replicate it there. We determine the effective load per server under both the rerouting and the replication policy, for completely unknown as well as partly known job types. We also examine the impact of these policies on the stability bound, and find that uncertainty in job types may significantly degrade performance. For (highly) unbalanced service speeds, full replication achieves the largest stability bound, while for (nearly) balanced service speeds, no replication maximizes the stability bound. Finally, we discuss how threshold-based policies can help improve the expected latency for completely or partly unknown job types.
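
The observation that elapsed service time is informative about the job type can be illustrated with a small Bayesian calculation. The sketch below is only an illustration under assumptions that are not taken from the text: exactly two job types ("slow" and "fast") on the current pool, exponentially distributed service times with hypothetical rates `mu_slow < mu_fast`, and a hypothetical prior probability `p_slow`. It computes the posterior probability that a job still in service after time `t` is of the slow type, and the elapsed time at which this belief reaches a target level, which is one possible way to pick a rerouting or replication threshold.

```python
import math


def posterior_slow(t, p_slow, mu_slow, mu_fast):
    """P(job is slow | job is still in service after time t).

    Assumes (for illustration only) exponential service times with
    rate mu_slow for slow jobs and mu_fast for fast jobs.
    """
    w_slow = p_slow * math.exp(-mu_slow * t)          # prior * P(S > t | slow)
    w_fast = (1.0 - p_slow) * math.exp(-mu_fast * t)  # prior * P(S > t | fast)
    return w_slow / (w_slow + w_fast)


def threshold_for_belief(target, p_slow, mu_slow, mu_fast):
    """Smallest elapsed time at which posterior_slow reaches `target`.

    Follows from the posterior odds p/(1-p) * exp((mu_fast - mu_slow) * t).
    """
    if not (0.0 < target < 1.0 and 0.0 < p_slow < 1.0):
        raise ValueError("target and p_slow must lie strictly between 0 and 1")
    if mu_fast <= mu_slow:
        raise ValueError("requires mu_fast > mu_slow")
    if target <= p_slow:
        return 0.0  # the prior belief already meets the target
    odds_ratio = (target / (1.0 - target)) * ((1.0 - p_slow) / p_slow)
    return math.log(odds_ratio) / (mu_fast - mu_slow)


if __name__ == "__main__":
    # Hypothetical parameters: half of the jobs are slow, fast jobs run twice as fast.
    for t in (0.0, 1.0, 2.0, 4.0):
        print(t, round(posterior_slow(t, 0.5, 1.0, 2.0), 4))
    print("threshold for 90% belief:", threshold_for_belief(0.9, 0.5, 1.0, 2.0))
```

Under these assumptions the posterior increases monotonically in `t`, so a single time threshold corresponds to a single belief level. The effective load and stability bound under such a threshold are not captured by this sketch.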
