Scalable techniques for fault tolerant high performance computing

As the number of processors in today's parallel systems continues to grow, the mean time to failure of these systems is becoming significantly shorter than the execution time of many parallel applications. It is therefore increasingly important for large parallel applications to be able to continue executing despite the failure of some components in the system. Today's long-running scientific applications typically tolerate failures by checkpoint/restart, in which all process states of an application are periodically saved to stable storage. However, as the number of processors in a system increases, the amount of data that needs to be saved to stable storage increases linearly, so the classical checkpoint/restart approach has a potential scalability problem for large parallel systems. In this research, we explore scalable techniques to tolerate a small number of process failures in large-scale parallel computing. The goal of this research is to develop scalable fault tolerance techniques that help make future high performance computing applications self-adaptive and fault survivable. The fundamental challenge in this research is scalability. To address this challenge, this research (1) extends existing diskless checkpointing techniques to enable them to scale better in large high performance computing systems; (2) designs checkpoint-free fault tolerance techniques for linear algebra computations that survive process failures without checkpointing or rollback recovery; and (3) develops coding approaches and novel erasure-correcting codes that help applications survive multiple simultaneous process failures. The fault tolerance schemes we introduce in this dissertation are scalable in the sense that the overhead of tolerating the failure of a fixed number of processes does not increase as the total number of processes in a parallel system increases. Two prototype examples have been developed to demonstrate the effectiveness of our techniques. In the first example, we developed a fault-survivable conjugate gradient solver that is able to survive multiple simultaneous process failures with negligible overhead. In the second example, we incorporated our fault tolerance technique into the ScaLAPACK/PBLAS matrix-matrix multiplication code to evaluate the overhead, survivability, and scalability. Theoretical analysis indicates that, to survive a fixed number of process failures, the fault tolerance overhead (without recovery) for matrix-matrix multiplication decreases toward zero as the total number of processes (assuming a fixed amount of data per process) increases to infinity. Experimental results demonstrate that the checkpoint-free fault tolerance technique introduces surprisingly low overhead even when the total number of processes used in the application is small.
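To make the first of these ideas concrete, here is a minimal single-machine sketch of diskless checkpointing with simple XOR (N+1) parity. The "processes" are simulated as NumPy arrays, and all names are illustrative rather than taken from the dissertation's code; a real implementation keeps each state on a separate node and computes the parity with a network reduction, avoiding stable storage entirely.

```python
import numpy as np

# A minimal sketch of diskless checkpointing with N+1 parity, assuming
# simulated processes (real implementations keep each state on a separate
# node and reduce the parity over the network, e.g. with MPI).

nprocs = 4
rng = np.random.default_rng(1)

# Each "process" holds some local state (here: a raw byte buffer).
states = [rng.integers(0, 256, size=1024, dtype=np.uint8) for _ in range(nprocs)]

# Checkpoint: a dedicated checkpoint process keeps the bitwise XOR of all
# local states in memory -- no disk or stable storage is touched.
parity = np.bitwise_xor.reduce(states)

# Failure: one process loses its state.
failed = 2
survivors = [s for i, s in enumerate(states) if i != failed]

# Recovery: XOR of the parity with all surviving states restores the lost one.
restored = np.bitwise_xor.reduce(survivors + [parity])
print(np.array_equal(restored, states[failed]))  # True
```

A single parity process tolerates one failure; the erasure-correcting codes developed in the dissertation generalize this scheme so that a small number of encoding processes can recover multiple simultaneous failures.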
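The checkpoint-free approach for linear algebra can be illustrated the same way. The following is a minimal single-process sketch, in the spirit of algorithm-based fault tolerance, of checksum-encoded matrix-matrix multiplication: because the checksum relationship is preserved by the multiplication itself, a lost block of the product can be rebuilt without any checkpoint or rollback. The block layout only simulates a process grid, and the simple sum code shown here tolerates one failure per block column.

```python
import numpy as np

# A minimal sketch of checksum-based fault tolerance for matrix-matrix
# multiplication. In the distributed setting each block lives on a
# different process; here the block structure only simulates that layout.

n, p = 4, 2          # p x p logical process grid, n x n block per process
N = n * p            # global matrix dimension

rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))

# Encode: append a checksum block row to A and a checksum block column to B
# (the sum of the block rows / block columns).
Ac = np.vstack([A, sum(A[i*n:(i+1)*n, :] for i in range(p))])
Bc = np.hstack([B, sum(B[:, j*n:(j+1)*n] for j in range(p))])

# Key property: the checksum relationship is preserved by multiplication,
# so the product is already encoded -- no checkpoint is ever taken.
Cc = Ac @ Bc

# Simulate the failure of one block of the product (one process's data lost).
lost = Cc[0:n, 0:n].copy()
Cc[0:n, 0:n] = 0.0

# Recover the lost block: checksum block minus the surviving blocks in the
# same block column.
recovered = Cc[p*n:(p+1)*n, 0:n] - sum(Cc[i*n:(i+1)*n, 0:n] for i in range(1, p))
print(np.allclose(recovered, lost))  # True, up to floating-point roundoff
```

Note how the overhead scales: the encoded multiplication costs one extra block row and column regardless of how many processes there are, which is the sense in which the overhead per tolerated failure shrinks as the process count grows.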
