Fault tolerant MapReduce-MPI for HPC clusters

Building MapReduce applications using the Message-Passing Interface (MPI) enables us to exploit the performance of large HPC clusters for big data analytics. However, due to the lacking of native fault tolerance support in MPI and the incompatibility between the MapReduce fault tolerance model and HPC schedulers, it is very hard to provide a fault tolerant MapReduce runtime for HPC clusters. We propose and develop FT-MRMPI, the first fault tolerant MapReduce framework on MPI for HPC clusters. We discover a unique way to perform failure detection and recovery by exploiting the current MPI semantics and the new proposal of user-level failure mitigation. We design and develop the checkpoint/restart model for fault tolerant MapReduce in MPI. We further tailor the detect/resume model to conserve work for more efficient fault tolerance. The experimental results on a 256-node HPC cluster show that FT-MRMPI effectively masks failures and reduces the job completion time by 39%.

[1]  Franck Cappello,et al.  Fault Tolerance in Petascale/ Exascale Systems: Current Knowledge, Challenges and Research Opportunities , 2009, Int. J. High Perform. Comput. Appl..

[2]  Indranil Gupta,et al.  Making cloud intermediate data fault-tolerant , 2010, SoCC '10.

[3]  Xiaobo Zhou,et al.  iShuffle: Improving Hadoop Performance with Shuffle-on-Write , 2013, ICAC 2013.

[4]  Andrey Tovchigrechko,et al.  Parallelizing BLAST and SOM Algorithms with MapReduce-MPI Library , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[5]  Message P Forum,et al.  MPI: A Message-Passing Interface Standard , 1994 .

[6]  Y. Charlie Hu,et al.  PIKACHU: How to Rebalance Load in Optimizing MapReduce On Heterogeneous Clusters , 2013, USENIX Annual Technical Conference.

[7]  Jack Dongarra,et al.  A Proposal for User-Level Failure Mitigation in the MPI-3 Standard , 2012 .

[8]  Jack J. Dongarra,et al.  FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.

[9]  Anthony Skjellum,et al.  MPI/RT-an emerging standard for high-performance real-time systems , 1998, Proceedings of the Thirty-First Hawaii International Conference on System Sciences.

[10]  Morris A. Jette Performance Characteristics of Gang Scheduling in Multiprogrammed Environments , 1997, SC.

[11]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[12]  Bianca Schroeder,et al.  Understanding failures in petascale computers , 2007 .

[13]  Miguel Correia,et al.  Byzantine Fault-Tolerant MapReduce: Faults are Not Just Crashes , 2011, 2011 IEEE Third International Conference on Cloud Computing Technology and Science.

[14]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[15]  Steven J. Plimpton,et al.  MapReduce in MPI for Large-scale graph algorithms , 2011, Parallel Comput..

[16]  Changjun Jiang,et al.  FlexSlot: Moving Hadoop Into the Cloud with Flexible Slot Management , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[17]  Mendel Rosenblum,et al.  The design and implementation of a log-structured file system , 1991, SOSP '91.

[18]  B R de Supinski,et al.  Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System , 2010 .

[19]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[20]  Magdalena Balazinska,et al.  SkewTune: mitigating skew in mapreduce applications , 2012, SIGMOD Conference.

[21]  Franck Cappello,et al.  FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[22]  Sara Bouchenak,et al.  MRBS: Towards Dependability Benchmarking for Hadoop MapReduce , 2012, Euro-Par Workshops.

[23]  Message Passing Interface Forum MPI: A message - passing interface standard , 1994 .

[24]  Ravishankar K. Iyer,et al.  Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[25]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[26]  Larry Rudolph,et al.  Gang Scheduling Performance Benefits for Fine-Grain Synchronization , 1992, J. Parallel Distributed Comput..

[27]  Hui Liu,et al.  High performance linpack benchmark: a fault tolerant implementation without checkpointing , 2011, ICS '11.

[28]  Roy H. Campbell,et al.  ARIA: automatic resource inference and allocation for mapreduce environments , 2011, ICAC '11.

[29]  Andrew Lumsdaine,et al.  The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[30]  William Gropp,et al.  Fault Tolerance in Message Passing Interface Programs , 2004, Int. J. High Perform. Comput. Appl..

[31]  Jason Duell,et al.  The design and implementation of Berkeley Lab's linuxcheckpoint/restart , 2005 .

[32]  Anand Raghunathan,et al.  ShuffleWatcher: Shuffle-aware Scheduling in Multi-tenant MapReduce Clusters , 2014, USENIX Annual Technical Conference.

[33]  Xiaobo Zhou,et al.  iShuffle: Improving Hadoop Performance with Shuffle-on-Write , 2017, IEEE Transactions on Parallel and Distributed Systems.

[34]  Thomas Hérault,et al.  A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI , 2012, Euro-Par.

[35]  Satish K. Tripathi,et al.  Parallel and Distributed Computing Handbook , 1995 .

[36]  M. Balazinska,et al.  An analysis of Hadoop usage in scientific workloads , 2013 .

[37]  Roy H. Campbell,et al.  Resource Provisioning Framework for MapReduce Jobs with Performance Goals , 2011, Middleware.