rTuner: A Performance Enhancement of MapReduce Job

In this paper, we present a novel task scheduling algorithm called rTuner. The key objective of rTuner is to reduce the execution time of reduce tasks in heterogeneous environments, because the reduce task is a very expensive process. Unlike a map task, a reduce task comprises three phases: the copy phase, the shuffle phase, and the reduce phase. Rescheduling a straggling reduce task can therefore hurt performance if the scheduling algorithm does not analyze the underlying situation. rTuner analyzes why a reduce task is straggling and tunes the task accordingly. If a reduce task becomes a straggler, rTuner reschedules it on a suitable node depending on the situation. Our benchmark results show that improving reduce tasks significantly improves CPU elapsed time. Moreover, we demonstrate the efficacy of rTuner through extensive experiments on low-cost commodity hardware. rTuner significantly improves the total execution time of a MapReduce job in both heterogeneous and homogeneous environments. Compared with Longest Approximate Time to End (LATE), rTuner reduces the execution time by 86.86 seconds and 100.67 seconds on average in homogeneous and heterogeneous environments, respectively. In the best case, rTuner improves the execution time over LATE by 142.44 seconds and 132.52 seconds in homogeneous and heterogeneous environments, respectively.
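For context on the baseline, the following is a minimal sketch of how a LATE-style scheduler might estimate the remaining time of a reduce task and pick a straggler candidate. It is not rTuner's algorithm, which the abstract does not specify. It assumes the heuristic commonly attributed to LATE (time left = (1 - progress score) / progress rate) and assumes each of the three reduce phases named above contributes one third of the progress score. The names `ReduceTask`, `progress_score`, `estimated_time_left`, and `pick_straggler` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Phase names follow the abstract; equal one-third weighting is an assumption.
PHASES = ("copy", "shuffle", "reduce")


@dataclass
class ReduceTask:
    task_id: str
    phase: str             # current phase, one of PHASES
    phase_fraction: float  # fraction of the current phase completed, in [0, 1]
    elapsed_s: float       # seconds since the task started


def progress_score(task: ReduceTask) -> float:
    """Overall progress in [0, 1]: completed phases plus the partial current one."""
    completed_phases = PHASES.index(task.phase)
    return (completed_phases + task.phase_fraction) / len(PHASES)


def estimated_time_left(task: ReduceTask) -> float:
    """LATE-style estimate: remaining progress divided by observed progress rate."""
    score = progress_score(task)
    if score <= 0.0 or task.elapsed_s <= 0.0:
        return float("inf")  # no progress observed yet, so no finite estimate
    rate = score / task.elapsed_s
    return (1.0 - score) / rate


def pick_straggler(tasks: List[ReduceTask]) -> Optional[ReduceTask]:
    """Return the task with the longest estimated time to end, or None if empty."""
    return max(tasks, key=estimated_time_left, default=None)
```

A scheduler in the spirit of the abstract would go one step further than this baseline and inspect why the selected task is slow before deciding where, or whether, to reschedule it.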
