A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI

The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation, which provides the functionality required by the new collective without any additional messages. It also describes error handling mechanisms that preserve the fault tolerance properties while maintaining overall scalability.
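As a rough illustration of the idea of carrying a reduction on the messages of a tree-based two-phase commit, the following single-process simulation may help. It is a minimal sketch under stated assumptions and not the algorithm from the paper: the function name `agree`, the binary tree layout, and the bitwise AND reduction are hypothetical choices, and failed ranks are assumed to be known before the round starts, so the error handling for failures during the protocol is not modeled.

```python
def agree(values, failed=frozenset()):
    """Simulate one agreement round over the surviving ranks.

    values: dict mapping rank -> integer flag contribution.
    failed: ranks assumed (hypothetically) to be known as failed up front.
    Returns (decisions, messages, depth).
    """
    alive = sorted(r for r in values if r not in failed)
    if not alive:
        raise ValueError("no surviving ranks")
    n = len(alive)

    # Implicit binary tree over positions in `alive`:
    # the children of position i are 2*i + 1 and 2*i + 2.
    partial = {}
    messages = 0

    # Phase 1 (vote): each node ANDs its own value with the partial
    # results of its children and passes the result to its parent.
    # The vote message itself carries the partial reduction.
    for i in range(n - 1, -1, -1):
        acc = values[alive[i]]
        for c in (2 * i + 1, 2 * i + 2):
            if c < n:
                acc &= partial[c]
                messages += 1  # one child-to-parent message
        partial[i] = acc

    # Phase 2 (commit): the root broadcasts the decided value down the tree.
    decisions = {alive[0]: partial[0]}
    for i in range(n):
        for c in (2 * i + 1, 2 * i + 2):
            if c < n:
                decisions[alive[c]] = decisions[alive[i]]
                messages += 1  # one parent-to-child message

    depth = n.bit_length() - 1  # floor(log2(n)) levels below the root
    return decisions, messages, depth


if __name__ == "__main__":
    flags = {0: 0b111, 1: 0b101, 2: 0b011, 3: 0b110}
    print(agree(flags))
    print(agree(flags, failed={2}))
```

In this sketch every surviving rank ends up with the same value, the message count is 2(n - 1) for n survivors whether or not a reduction is attached, and the number of tree levels grows logarithmically with n.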
