FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing

As large-scale computer systems grow, their mean time between failures (MTBF) is becoming significantly shorter than the execution time of many current scientific applications. To run to completion, these applications must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of any such protocol is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications, called the fault-tolerant parallel algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of the failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. To ease this burden, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool that automates the FTPA implementation. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of the NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that FTPA performs better than the traditional checkpointing approach.
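
The parallel-recomputing idea described above can be illustrated with a minimal sketch. This is not the FTPA or GiFT implementation: the function names `split_lost_work` and `matmul_rows`, the row-block data layout, and the choice of rank 2 as the failed process are all assumptions made for illustration, and the "processes" are simulated in a single Python loop rather than run under MPI.

```python
def split_lost_work(lost_items, survivors):
    """Partition the failed process's work items into near-equal
    contiguous chunks, one chunk per surviving rank (hypothetical helper)."""
    base, extra = divmod(len(lost_items), len(survivors))
    plan, start = {}, 0
    for i, rank in enumerate(survivors):
        size = base + (1 if i < extra else 0)
        plan[rank] = lost_items[start:start + size]
        start += size
    return plan


def matmul_rows(A, B, rows):
    """Compute the given rows of the product A x B (hypothetical helper)."""
    return {
        r: [sum(A[r][k] * B[k][j] for k in range(len(B)))
            for j in range(len(B[0]))]
        for r in rows
    }


# Assumed setup: an 8x3 matrix A, a 3x2 matrix B, and 4 ranks that each
# own two rows of the result.
A = [[i + j for j in range(3)] for i in range(8)]
B = [[1, 0], [0, 1], [1, 1]]
owner_rows = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}

# Assume rank 2 fails (fail-stop) and the failure has been detected.
failed = 2
survivors = [rank for rank in owner_rows if rank != failed]

# Each survivor recomputes a share of the lost rows. In a real system these
# shares would run concurrently; here they run one after another.
plan = split_lost_work(owner_rows[failed], survivors)
recovered = {}
for rank, rows in plan.items():
    recovered.update(matmul_rows(A, B, rows))
```

With this division, each survivor redoes only a fraction of the lost work, whereas single-processor rollback recovery repeats all of it on one processor. The sketch ignores the cost of detecting the failure and redistributing data.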
