Fault tolerant linear algebra: Recovering from fail-stop failures without checkpointing

Today's long-running high-performance computing applications typically tolerate fail-stop failures by checkpointing. Although checkpointing is a very general technique that applies to a wide range of applications, it often introduces considerable overhead, especially when applications modify a large amount of memory between checkpoints. In this research, we will design highly scalable, low-overhead fault-tolerant schemes tailored to the specific characteristics of an application. We will focus on linear algebra operations and redesign selected algorithms to tolerate fail-stop failures without checkpointing. We will also incorporate the developed techniques into the widely used numerical linear algebra library ScaLAPACK.
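
As a rough illustration of recovering from a fail-stop failure without checkpointing, the sketch below uses a checksum encoding in the spirit of algorithm-based fault tolerance. This is an assumption made for illustration: the abstract does not state which encoding the proposed schemes use. The row-block layout, the extra checksum "process", and the helper names (`matvec`, `add_blocks`, `recover_lost`) are all hypothetical, and plain Python lists stand in for distributed memory.

```python
def matvec(block, x):
    """Multiply a row block (a list of rows) by the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in block]


def add_blocks(blocks):
    """Element-wise sum of equally shaped row blocks (the checksum block)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*blocks)]


def recover_lost(results, checksum_result, lost):
    """Rebuild the failed process's result from the survivors and the checksum."""
    survivors = [r for i, r in enumerate(results) if i != lost]
    return [c - sum(vals) for c, vals in zip(checksum_result, zip(*survivors))]


# Three hypothetical "processes", each owning a 2x2 row block of A.
blocks = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
x = [2, 1]

# Assumption: one extra process holds the sum of all row blocks.
checksum_block = add_blocks(blocks)

# Every process, including the checksum process, multiplies its own block by x.
results = [matvec(b, x) for b in blocks]
checksum_result = matvec(checksum_block, x)

# Simulate a fail-stop failure: process 1 loses its local result.
lost = 1
expected = results[lost]
results[lost] = None

# Matrix-vector multiplication is linear, so the checksum relationship still
# holds after the computation and the lost data can be reconstructed from it.
recovered = recover_lost(results, checksum_result, lost)
print(recovered)
```

The recovery works only because the operation preserves the checksum relationship. The example uses integers so that the reconstruction is exact; with floating-point data it would be subject to rounding error.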
