Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand

As more and more clusters with thousands of nodes are being deployed for high performance computing (HPC), fault tolerance in cluster environments has become a critical requirement. Checkpointing and rollback recovery is a common approach to achieve fault tolerance. Although widely adopted in practice, coordinated checkpointing has a known limitation on scalability. Severe contention for bandwidth to storage system can occur as a large number of processes take a checkpoint at the same time, resulting in an extremely long checkpointing delay for large parallel applications. In this paper, we propose a novel group-based checkpointing design to alleviate this scalability limitation. By carefully scheduling the MPI processes to take checkpoints in smaller groups, our design reduces the number of processes simultaneously taking checkpoints, while allowing those processes not taking checkpoints to proceed with computation. We implement our design and carry out a detailed evaluation with micro-benchmarks, HPL, and the parallel version of a data mining toolkit, MotifMiner. Experimental results show our group-based checkpointing design can reduce the effective delay for checkpointing significantly, up to 78% for HPL and up to 70% for MotifMiner.

[1]  Chao Wang,et al.  A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[2]  Jack Dongarra,et al.  Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems , 2004 .

[3]  Dhabaleswar K. Panda,et al.  Design and implementation of MPICH2 over InfiniBand with RDMA support , 2003, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[4]  Jeffrey S. Vetter,et al.  Communication characteristics of large-scale scientific applications for contemporary cluster architectures , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.

[5]  L. Alvisi,et al.  A Survey of Rollback-Recovery Protocols , 2002 .

[6]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[7]  Andrew Lumsdaine,et al.  The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[8]  Jason Duell,et al.  The design and implementation of Berkeley Lab's linuxcheckpoint/restart , 2005 .

[9]  F. Cappello,et al.  Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[10]  Song Jiang,et al.  Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[11]  Dhabaleswar K. Panda,et al.  Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand , 2006, 2006 International Conference on Parallel Processing (ICPP'06).

[12]  J. Duell The design and implementation of Berkeley Lab's linux checkpoint/restart , 2005 .

[13]  Jason Duell,et al.  The Lam/Mpi Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..

[14]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[15]  Srinivasan Parthasarathy,et al.  Parallel algorithms for mining frequent structural motifs in scientific data , 2004, ICS '04.

[16]  B. Bouteiller,et al.  MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[17]  Thomas Hérault,et al.  Improved message logging versus improved coordinated checkpointing for fault tolerant MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).

[18]  Thomas Hérault,et al.  Impact of event logger on causal message logging protocols for fault tolerant MPI , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[19]  Wednesday September,et al.  2007 International Conference on Parallel Processing , 2007 .