A Survey of Stability Results for Redundancy Systems

Redundancy mechanisms send several copies of the same job to a subset of servers. They are among the most promising ways to exploit diversity in multi-server applications. However, their pros and cons are still not sufficiently understood in realistic models with generic service-time distributions and generic correlation structures among copies. We survey recent results on the stability (arguably the first benchmark of performance) of systems with cancel-on-completion redundancy. We also point out open questions and conjectures.
