Delay tails in MapReduce scheduling

Job processing times in production MapReduce/Hadoop clusters are heavy-tailed. This behavior results from both the workload characteristics and the scheduling algorithm in use. An analytical understanding of the delays under different MapReduce schedulers can therefore inform the design and deployment of large Hadoop clusters. The analysis is complicated by the fact that the map and reduce tasks of a MapReduce job differ fundamentally yet depend tightly on each other. This dependence also causes a starvation problem in the widely used Fair Scheduler, which launches reduce tasks greedily. To address this issue, we design and implement Coupling Scheduler, which launches reduce tasks gradually, in step with the progress of the map tasks. Experiments on a real cluster show improvements in job response times of up to an order of magnitude. Based on extensive measurements and source code investigation, we propose analytical models for the default FIFO Scheduler, for Fair Scheduler, and for our Coupling Scheduler. For a class of heavy-tailed map service time distributions, namely those that are regularly varying of index -a, we derive the tail of the job processing delay distribution under each of the three schedulers. Under the default FIFO Scheduler, the delay is regularly varying of index -a+1. For Fair Scheduler we discover a criticality phenomenon: the delay can change from regularly varying of index -a to regularly varying of index -a+1, depending on the maximum number of reduce tasks of a job. Other, more subtle behaviors also exist. In contrast, the delay distribution tail under Coupling Scheduler can be one order lower than under Fair Scheduler under some conditions, implying better performance.
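The idea of launching reduce tasks gradually, rather than greedily, can be illustrated with a minimal sketch. The abstract does not give the actual launch rule used by Coupling Scheduler, so the proportional rule below, the function names `target_launched_reduces` and `reduces_to_launch`, and all parameters are assumptions made purely for illustration.

```python
import math


def target_launched_reduces(map_progress: float, total_reduces: int) -> int:
    """Hypothetical coupling rule (an assumption, not the paper's rule):
    allow a fraction of the job's reduce tasks equal to the fraction of
    map work completed, rounded up."""
    if not 0.0 <= map_progress <= 1.0:
        raise ValueError("map_progress must lie in [0, 1]")
    return min(total_reduces, math.ceil(map_progress * total_reduces))


def reduces_to_launch(map_progress: float, total_reduces: int,
                      already_launched: int, free_slots: int) -> int:
    """Number of additional reduce tasks to start right now.

    A greedy scheduler would instead return
    min(total_reduces - already_launched, free_slots) regardless of
    map progress, which lets one job occupy reduce slots while its
    map tasks are still far from finished.
    """
    target = target_launched_reduces(map_progress, total_reduces)
    return max(0, min(target - already_launched, free_slots))


if __name__ == "__main__":
    # A job with 10 reduce tasks whose map phase is 25% complete,
    # with 1 reduce task already running and 5 free reduce slots.
    print(reduces_to_launch(0.25, 10, already_launched=1, free_slots=5))
```

Under this assumed rule, a job whose map phase has barely started holds few reduce slots, leaving the rest available to other jobs; this is the intuition behind avoiding the starvation described above, not a reproduction of the implemented scheduler.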
