SYSTEM SUPPORT FOR CHECKPOINT AND RESTART OF CHARM++ AND AMPI APPLICATIONS

As modern supercomputers and new-generation scientific computing applications grow in size and complexity, the probability of system failure rises commensurately, and making parallel computing fault tolerant has become increasingly important. A checkpoint/restart mechanism provides fault tolerance as well as other benefits for parallel programmers. This thesis describes the on-disk checkpoint/restart mechanism for the Charm++ and Adaptive MPI (AMPI) programming frameworks: its motivation, design, and implementation. The mechanism has proven useful in practice and can be extended to implement other fault-tolerance techniques.
