Disaster Survival Guide in Petascale Computing: An Algorithmic Approach

Jack J. Dongarra, Zizhong Chen, George Bosilca, and Julien Langou

1.1 FT-MPI: A Fault Tolerant MPI Implementation
    1.1.1 FT-MPI Overview
    1.1.2 FT-MPI: A Fault Tolerant MPI Implementation
    1.1.3 FT-MPI Usage
1.2 Application Level Diskless Checkpointing
    1.2.1 Neighbor-Based Checkpointing
    1.2.2 Checksum-Based Checkpointing
    1.2.3 Weighted-Checksum-Based Checkpointing
1.3 A Fault Survivable Iterative Equation Solver
    1.3.1 Preconditioned Conjugate Gradient Algorithm
    1.3.2 Incorporating Fault Tolerance into PCG
1.4 Experimental Evaluation
    1.4.1 Performance of PCG with Different MPI Implementations
    1.4.2 Performance Overhead of Taking Checkpoint
    1.4.3 Performance Overhead of Performing Recovery
    1.4.4 Numerical Impact of Round-Off Errors in Recovery
1.5 Discussion
1.6 Conclusion and Future Work
