HADAB: Enabling Fault Tolerance in Parallel Applications Running in Distributed Environments

Developing reliable and efficient scientific software for distributed computing environments requires identifying and analyzing the issues related to the design and deployment of algorithms for high-performance computing architectures and to their integration in distributed contexts. In these environments, resource efficiency and availability can change unexpectedly because of overloading or failure of both the computing nodes and the interconnection network. This scenario requires mechanisms that enable the software to "survive" such unexpected events while still making effective use of the computing resources. Although researchers have been working on these problems for years, fault tolerance remains an open issue for some classes of applications. Here we focus on the design and deployment of a checkpointing/migration system that enables fault tolerance in parallel applications running in distributed environments. In particular, we describe HADAB, a new hybrid checkpointing strategy, and its deployment in a meaningful case study: the PETSc Conjugate Gradient algorithm implementation. Testing was performed on the University of Naples distributed infrastructure (the S.Co.P.E. infrastructure).
