Practical Fault-Tolerant Framework for eScience Infrastructure

Many areas of science now rely on computing resources as an important part of their research, and many research groups adopt a cluster architecture to use those resources efficiently and manage them easily. Fault tolerance therefore becomes a very important property of these computing resources. However, fault-tolerant systems have not yet been widely adopted because they are hard to deploy, use, manage, maintain, or justify. This paper proposes a practical fault-tolerant system for eScience infrastructures. Our system uses a checkpoint/restart mechanism for fault tolerance and provides an easy way to integrate with the Grid services widely used in eScience. Additionally, we run rigorous tests with scientific applications to verify that our system can be used in clusters. We also describe the improvements made to our system to solve various problems that arose when deploying it on a cluster. The experimental results show that our system not only adapts well to various types of runtime environments but can also be practically deployed in clusters.
