McTAR: A Multi-Trigger Checkpointing Tactic for Fast Task Recovery in MapReduce

Cloud computing and big data technologies have gained great popularity in recent years, and MapReduce remains one of the most efficient and widely adopted computing paradigms for providing big data services. MapReduce applications must run on cloud platforms, where failures are inevitable. Hadoop is the de facto implementation of MapReduce, but it provides only coarse-grained and unsatisfactory fault tolerance: a failed task is rescheduled and re-executed from the very beginning, which adds considerable recovery overhead and heavily delays the whole job when failures occur. In this paper, we propose McTAR (a Multi-trigger Checkpointing Tactic for fAst TAsk Recovery), a novel multi-trigger checkpointing approach for fast recovery of MapReduce tasks. As a finer-grained fault tolerance tactic, McTAR combines multi-trigger checkpoint generation, push-pull combined intermediate data distribution, and optimized failed-task prediction, so that a recovery task attempt can start from a specific point of progress determined by the valid checkpoint of its intermediate data. In this way, McTAR can effectively speed up the recovery of MapReduce jobs and greatly reduce task recovery delay.
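
The core idea, resuming a task attempt from its last valid checkpoint instead of from scratch, can be illustrated with a minimal Python sketch. This is not McTAR's implementation. The abstract does not specify which triggers McTAR uses, so the two triggers here (records processed and intermediate pairs produced since the last checkpoint) are assumptions, as are all names (`Checkpoint`, `run_map_attempt`, `max_records`, `max_pairs`, `fail_at`) and the in-memory `store` that stands in for durable checkpoint storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, int]


@dataclass
class Checkpoint:
    offset: int = 0                                   # input records fully processed
    output: List[Pair] = field(default_factory=list)  # intermediate pairs so far


class TaskFailure(Exception):
    """Simulated failure of a task attempt."""


def run_map_attempt(
    records: List[str],
    map_fn: Callable[[str], List[Pair]],
    store: Dict[str, Checkpoint],      # hypothetical stand-in for durable storage
    task_id: str,
    max_records: int = 3,              # assumed trigger 1: records since last checkpoint
    max_pairs: int = 4,                # assumed trigger 2: new pairs since last checkpoint
    fail_at: Optional[int] = None,     # inject a failure before this record index
) -> List[Pair]:
    # Resume from the last valid checkpoint, or start from the beginning.
    ckpt = store.get(task_id, Checkpoint())
    offset = ckpt.offset
    output = list(ckpt.output)
    last_offset, last_len = offset, len(output)

    for i in range(offset, len(records)):
        if fail_at is not None and i == fail_at:
            raise TaskFailure(f"attempt failed before record {i}")
        output.extend(map_fn(records[i]))
        offset = i + 1
        # Write a checkpoint as soon as ANY trigger fires.
        if (offset - last_offset >= max_records
                or len(output) - last_len >= max_pairs):
            store[task_id] = Checkpoint(offset, list(output))
            last_offset, last_len = offset, len(output)
    return output
```

If an attempt fails partway through, calling `run_map_attempt` again with the same `store` and `task_id` skips every record covered by the saved checkpoint and reprocesses only the records after it, which is the source of the reduced recovery delay.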
