The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

As high-performance clusters continue to grow in size and popularity, fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system that provides coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, so the system can be used for cluster maintenance and scheduling as well as for fault tolerance. Experimental results show that incorporating checkpoint support into LAM/MPI has a negligible impact on communication performance.
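The idea of coordinated checkpointing with rollback recovery can be illustrated with a small single-machine simulation. This is only a sketch under stated assumptions and is not code from LAM/MPI or BLCR: the names `Proc`, `Cluster`, `coordinated_checkpoint`, and `restart` are hypothetical, and draining channels by comparing per-peer message counts is assumed here as one simple way to reach a consistent global state before every process is saved.

```python
import copy
from collections import deque


class Proc:
    """One simulated process (hypothetical stand-in for an MPI rank)."""

    def __init__(self, rank, nprocs):
        self.rank = rank
        self.state = 0                 # application state: a running sum
        self.sent = [0] * nprocs       # messages sent to each peer
        self.received = [0] * nprocs   # messages received from each peer


class Cluster:
    """Simulated job: processes plus the messages still in flight."""

    def __init__(self, nprocs):
        self.procs = [Proc(r, nprocs) for r in range(nprocs)]
        # channels[(src, dst)] holds messages sent but not yet delivered.
        self.channels = {
            (s, d): deque()
            for s in range(nprocs)
            for d in range(nprocs)
            if s != d
        }

    def send(self, src, dst, value):
        self.channels[(src, dst)].append(value)
        self.procs[src].sent[dst] += 1

    def deliver(self, src, dst):
        value = self.channels[(src, dst)].popleft()
        receiver = self.procs[dst]
        receiver.state += value
        receiver.received[src] += 1

    def coordinated_checkpoint(self):
        # Assumption: every receiver drains its channel until its receive
        # count matches the sender's send count, so no message is in flight.
        for (src, dst) in self.channels:
            while self.procs[dst].received[src] < self.procs[src].sent[dst]:
                self.deliver(src, dst)
        # With all channels empty, the per-process states form a consistent
        # global snapshot; the application code took no part in saving it.
        return copy.deepcopy(self.procs)

    def restart(self, snapshot):
        # Rollback recovery: every process returns to the same checkpoint,
        # and anything sent after that checkpoint is discarded.
        self.procs = copy.deepcopy(snapshot)
        for channel in self.channels.values():
            channel.clear()
```

In this sketch the checkpoint is requested from outside the simulated application (the caller of `coordinated_checkpoint`), which mirrors the system-initiated, application-transparent style described in the abstract; a real implementation would save full process images rather than copying Python objects.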
