Contrasting Effects of Replication in Parallel Systems: From Overload to Underload and Back

Task replication has recently been advocated as a practical way to reduce latencies in parallel systems. In addition to several convincing empirical studies, analytical results have been provided, yet under strong assumptions such as independent service times of the replicas, which can produce contrasting and perhaps counterintuitive behavior. For instance, under the independence assumption, an overloaded system can be stabilized by a suitable replication factor, but can be pushed back into overload through further replication. Motivated by the need to dispense with such common and restrictive assumptions, which may cause unexpected behavior, we develop a unified and general theoretical framework to compute tight bounds on the distribution of response times in general replication systems. These results directly yield the number of replicas that minimizes response time quantiles, as a function of the system parameters (e.g., the degree of correlation amongst replicas).
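
The overload-to-underload-and-back effect under independence can be illustrated with a small numerical sketch. The model below is an assumption made for illustration and is not taken from the paper: servers are partitioned into groups of k, each job occupies all k servers of one group until its fastest replica finishes, and service times are i.i.d. Pareto with scale x_m and shape alpha. The function names and parameter values are hypothetical. Under these assumptions the minimum of k replicas is Pareto with shape k*alpha, and the per-group load is k * (arrival rate per server) * E[min].

```python
def pareto_min_mean(k, alpha, x_m=1.0):
    """Mean of the minimum of k i.i.d. Pareto(x_m, alpha) service times.

    The minimum is Pareto(x_m, k * alpha); its mean is infinite
    unless k * alpha > 1.
    """
    shape = k * alpha
    if shape <= 1:
        return float("inf")
    return x_m * shape / (shape - 1)


def load(k, rate_per_server, alpha, x_m=1.0):
    """Load of one group of k servers (assumed model, see lead-in).

    A group receives k * rate_per_server jobs per unit time, and each
    job holds the whole group for the minimum of k replica service times.
    """
    return k * rate_per_server * pareto_min_mean(k, alpha, x_m)


if __name__ == "__main__":
    # Hypothetical parameters chosen to make the effect visible.
    for k in range(1, 7):
        rho = load(k, rate_per_server=0.2, alpha=1.1)
        status = "stable" if rho < 1 else "overloaded"
        print(f"k={k}: load={rho:.3f} ({status})")
```

With these illustrative parameters the system is overloaded without replication (k = 1), stable for k = 2 and k = 3, and overloaded again from k = 4 onward, because the expected minimum stops shrinking fast enough to offset the k-fold increase in work per job.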
