Impact of MapReduce Task Re-execution Policy on Job Completion Reliability and Job Completion Time

MapReduce is a widely accepted framework for running data-intensive applications. To prevent MapReduce jobs from being interrupted by node failures, which occur frequently in a large-scale MapReduce cluster, current MapReduce implementations, e.g., Hadoop, employ a task re-execution policy (TR policy for short): when a map/reduce task of a job fails due to a node failure, the policy re-executes the task on another node. However, the impact of the TR policy on job completion reliability and job completion time has not been studied from a theoretical viewpoint, especially for jobs with different characteristics, e.g., different input data sizes, different numbers of reduce tasks, and different intermediate data sizes. In this study, we derive the job completion reliability (JCR for short) of a MapReduce job based on Poisson distributions and analyze the expected job completion time (JCT for short) based on the universal generating function. We use nine settings of the task re-execution factor (TR factor for short) to explore the impact of the TR policy on the JCR and JCT of jobs. The results show that the TR policy can effectively improve JCR without significantly prolonging JCT. However, no single TR factor allows all jobs to achieve a high JCR.
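To make the idea of a Poisson-based reliability estimate concrete, the following is a minimal sketch, not the derivation used in this study. It assumes that node failures on each node follow a Poisson process with a constant rate, that task attempts fail independently, and that the TR factor can be read as a cap on the number of attempts per task. The function names, parameters, and this reading of the TR factor are illustrative assumptions.

```python
import math

def task_success_probability(failure_rate, task_duration, max_attempts):
    """Probability that at least one of `max_attempts` attempts finishes.

    Assumption: failures form a Poisson process with rate `failure_rate`,
    so one attempt survives with probability exp(-failure_rate * task_duration).
    """
    p_attempt = math.exp(-failure_rate * task_duration)
    # The task is lost only if every attempt is hit by a node failure.
    return 1.0 - (1.0 - p_attempt) ** max_attempts

def job_completion_reliability(failure_rate, task_durations, max_attempts):
    """Hypothetical JCR: the job completes only if every task succeeds.

    Assumption: tasks succeed or fail independently of one another.
    """
    return math.prod(
        task_success_probability(failure_rate, d, max_attempts)
        for d in task_durations
    )

# Illustrative comparison: five equal tasks, with and without re-execution.
durations = [10.0] * 5
no_retry = job_completion_reliability(0.01, durations, max_attempts=1)
with_retry = job_completion_reliability(0.01, durations, max_attempts=3)
```

Under these assumptions, raising the attempt cap increases the estimated reliability, while longer tasks or more tasks lower it, which is consistent with the observation that one TR factor may not suit every job.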
