Improving the Performance of Heterogeneous Data Centers through Redundancy

We analyze the performance of redundancy in a multi-type job and multi-type server system. We assume the job dispatcher is unaware of the servers' capacities, and we study under which circumstances redundancy improves performance. Under redundancy, an arriving job dispatches copies to all its compatible servers and departs as soon as one of its copies completes service. As a benchmark, we take the non-redundant system in which an arriving job is routed to a single compatible server selected at random. Service times are generally distributed and all copies of a job are identical, i.e., have the same service requirement. In our first main result, we characterize the necessary and sufficient stability condition of the redundancy system. This condition coincides with that of a system in which each job type dispatches copies only to its least-loaded servers, and those copies must be fully served. In our second result, we compare the stability region of the system under redundancy with that of the system without redundancy. We show that if the servers' capacities are sufficiently heterogeneous, the stability region under redundancy can be much larger than that without redundancy. We apply the general solution to particular classes of systems, including redundancy-d and nested models, to derive simple conditions on the degree of heterogeneity required for redundancy to improve stability. As such, our result is the first to show that redundancy can improve the stability, and hence the performance, of a system when copies are non-i.i.d.
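To make the benchmark concrete, the following minimal sketch computes the largest total arrival rate for which the non-redundant system is stable. It assumes, beyond what is stated above, that the random selection is uniform over the compatible servers, that the mean service requirement is normalized to one, and that each server is stable exactly when its offered load is below its capacity. The function name and the data layout are illustrative only. The stability condition of the redundancy system is not reproduced here.

```python
def max_stable_rate_no_redundancy(type_probs, compat, capacities):
    """Largest total arrival rate keeping every server's load below capacity.

    type_probs : dict, job type -> probability that an arrival has this type
    compat     : dict, job type -> list of compatible servers
    capacities : dict, server -> capacity (work served per unit time)

    Assumptions (not taken from the source): uniform random routing among
    compatible servers and unit mean service requirement.
    """
    # Fraction of all arrivals that ends up at each server.
    share = {s: 0.0 for s in capacities}
    for job_type, prob in type_probs.items():
        servers = compat[job_type]
        for s in servers:
            share[s] += prob / len(servers)

    # Server s is stable iff rate * share[s] < capacities[s];
    # the binding constraint is the smallest capacity-to-share ratio.
    return min(capacities[s] / share[s] for s in capacities if share[s] > 0)


# One job type, two compatible servers with capacities 1 and 4.
limit = max_stable_rate_no_redundancy(
    {"a": 1.0}, {"a": ["s1", "s2"]}, {"s1": 1.0, "s2": 4.0}
)
print(limit)  # 2.0
```

In this example the slow server receives half of the traffic, so the benchmark is limited to a total rate of 2 although the combined capacity is 5. This is the kind of heterogeneity under which the comparison with redundancy becomes interesting.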
