A Reduce Task Scheduler for MapReduce with Minimum Transmission Cost Based on Sampling Evaluation

MapReduce is a popular framework for processing large datasets in parallel over a cluster. It has gained wide attention for its high scalability, reliability and low cost. However, its performance can be degraded by excessive network traffic during job processing, which stems from two problems: poor data locality in reduce task scheduling and partitioning skew. We propose a Minimum Transmission Cost Reduce task Scheduler (MTCRS), based on sampling evaluation, to address both problems. MTCRS uses the waiting time of each reduce task and the transmission cost set as indicators to decide an appropriate launch location for each reduce task. The transmission cost set is computed by a mathematical model whose parameters are the sizes and locations of the intermediate data partitions, obtained with the Average Reservoir Sampling (ARS) algorithm. Experiments show that MTCRS reduces network traffic by 8.4% compared with the Fair scheduler.
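The idea of picking a launch location from sampled partition sizes can be illustrated with a minimal sketch. It is not the MTCRS implementation: `reservoir_sample` is the classic reservoir sampling algorithm (Algorithm R), not the paper's ARS variant, and the cost model in `transmission_cost` (all bytes of a partition that are not already on the candidate node) is an assumption made for illustration. The function names, node names, and sizes are hypothetical, and the waiting-time indicator is left out.

```python
import random
from typing import Dict, Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Classic reservoir sampling (Algorithm R): a uniform sample of up to
    k items from a stream of unknown length. This is a stand-in for the
    paper's ARS algorithm, whose details the abstract does not give."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def transmission_cost(size_on_node: Dict[str, float], candidate: str) -> float:
    """Assumed cost model: data that must cross the network if the reduce
    task for this partition runs on `candidate`, i.e. everything that is
    not already stored on that node."""
    return sum(size for node, size in size_on_node.items() if node != candidate)


def cheapest_node(size_on_node: Dict[str, float], candidates: Iterable[str]) -> str:
    """Pick the candidate node with the lowest assumed transmission cost."""
    return min(candidates, key=lambda n: transmission_cost(size_on_node, n))


# Hypothetical estimate (e.g. in MB) of where one partition's
# intermediate data sits after the map phase.
estimated_sizes = {"node-a": 40.0, "node-b": 10.0, "node-c": 25.0}
best = cheapest_node(estimated_sizes, estimated_sizes.keys())
sample = reservoir_sample(range(1000), 10, random.Random(0))
```

Under this assumed model, the node that already holds the largest share of a partition is the cheapest place to run its reduce task. A scheduler in the spirit of MTCRS would additionally weigh how long the reduce task has been waiting.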
