Fast Recovery MapReduce (FAR-MR) to accelerate failure recovery in big data applications

The existing fault tolerance strategy of Hadoop MapReduce imposes a high performance penalty on computing jobs during failure recovery. In this paper, we propose Fast Recovery MapReduce (FAR-MR) to improve MapReduce performance during failure recovery. FAR-MR includes a novel fault tolerance strategy that combines distributed checkpointing with a proactive push mechanism to support fast recovery from both task failure and node failure. With distributed checkpointing, the computing progress of each task is periodically recorded as checkpoints and kept in distributed data storage. During failure recovery, the recovered task can obtain the last recorded progress of the failed task from the distributed storage. In addition, the proactive push mechanism transmits the computing results of map tasks ahead of time to the nodes hosting the reduce tasks of the same computing job. When a failure happens, the partial output already pushed to the reducer nodes can be used by the reduce tasks without recomputation. FAR-MR allows a failed task to be recovered efficiently at any node in the cluster. The performance evaluation shows that FAR-MR improves computing job performance by up to 62% and 45% compared to Hadoop MapReduce in the case of task failure recovery and node failure recovery, respectively.
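The interplay of the two mechanisms can be illustrated with a minimal, single-process sketch. It is not the FAR-MR implementation: the names `CheckpointStore`, `ReducerInbox`, `run_map_task`, the `interval` parameter, and the word-count workload are all hypothetical, and plain in-memory objects stand in for the distributed storage and the reducer nodes. The sketch assumes that pushing partial output and recording the checkpoint happen together, so a recovered task resumes from the last checkpoint without duplicating output that was already pushed.

```python
from collections import defaultdict


class CheckpointStore:
    """In-memory stand-in for the distributed data storage (assumption)."""

    def __init__(self):
        self._progress = {}

    def save(self, task_id, next_record):
        self._progress[task_id] = next_record

    def load(self, task_id):
        # A task with no checkpoint starts from the first record.
        return self._progress.get(task_id, 0)


class ReducerInbox:
    """Stand-in for reducer nodes that receive pushed map output (assumption)."""

    def __init__(self, num_reducers):
        self.partitions = [defaultdict(int) for _ in range(num_reducers)]

    def push(self, key, value):
        # Deterministic partitioner so a key always reaches the same reducer.
        index = sum(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index][key] += value


def run_map_task(task_id, records, store, inbox, interval=2, fail_at=None):
    """Word-count map task that checkpoints every `interval` records.

    `fail_at` injects a failure before the record with that index is processed.
    """
    start = store.load(task_id)  # resume from the last recorded progress
    pending = []                 # map output not yet pushed to reducers
    for i in range(start, len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"injected failure in {task_id} at record {i}")
        pending.extend((word, 1) for word in records[i].split())
        done = i + 1
        if done % interval == 0 or done == len(records):
            # Push partial output, then record progress as a checkpoint.
            for key, value in pending:
                inbox.push(key, value)
            pending.clear()
            store.save(task_id, done)


def total_counts(inbox):
    """Merge all reducer partitions into one dictionary for inspection."""
    merged = {}
    for partition in inbox.partitions:
        merged.update(partition)
    return merged


if __name__ == "__main__":
    records = ["a b", "b c", "c a", "a a", "b"]
    store, inbox = CheckpointStore(), ReducerInbox(num_reducers=2)
    try:
        run_map_task("map-0", records, store, inbox, fail_at=3)
    except RuntimeError:
        pass  # the task fails after its first checkpoint
    # The recovered task could run on any node: it only needs the store.
    run_map_task("map-0", records, store, inbox)
    print(total_counts(inbox))
```

In this sketch the injected failure discards only the output produced since the last checkpoint; the output of the first two records, already pushed to the reducer side, is not recomputed when the task is rerun.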
