Achievable Stability in Redundancy Systems

We consider a system with $N$ parallel servers where each incoming job is immediately replicated to $d$ servers. Each of the $N$ servers has its own queue and follows an FCFS discipline. As soon as the first replica of a job completes, the remaining replicas are abandoned. We investigate the achievable stability region for a general workload model with different job types and heterogeneous servers, reflecting job-server affinity relations that may arise from data locality and soft compatibility constraints. Under the assumption that job types are known beforehand, we show that for New-Better-than-Used (NBU) distributed speed variations, no replication ($d=1$) gives a strictly larger stability region than replication ($d>1$). Strikingly, this result does not depend on the underlying distribution of the intrinsic job sizes, but observing the job types is essential for it to hold. When job types are not observable, we show that for New-Worse-than-Used (NWU) distributed speed variations, full replication ($d=N$) gives a larger stability region than no replication ($d=1$).
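
For reference, the NBU and NWU classes named above have the following standard definitions. The notation is an assumption introduced here and does not come from the abstract: $X$ is a nonnegative random variable standing in for a speed variation, and $\bar{F}(t) = P(X > t)$ is its survival function.

$$
\text{NBU:}\quad \bar{F}(s+t) \le \bar{F}(s)\,\bar{F}(t) \qquad \text{for all } s, t \ge 0,
$$
$$
\text{NWU:}\quad \bar{F}(s+t) \ge \bar{F}(s)\,\bar{F}(t) \qquad \text{for all } s, t \ge 0.
$$

The exponential distribution satisfies both with equality, since $\bar{F}(s+t) = e^{-\mu(s+t)} = \bar{F}(s)\,\bar{F}(t)$, so it sits on the boundary between the two classes.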
