Algorithm-based diskless checkpointing for fault tolerant matrix operations

This paper explores diskless checkpointing for distributed scientific computations. With the widespread use of the "network of workstations" (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and preconditioned conjugate gradient. These implementations run on PVM networks of at least N processors and complete with low overhead as long as any N processors remain functional. We discuss in detail how the algorithms are tuned for fault tolerance, and present performance results on a PVM network of SUN workstations and on the IBM SP2.
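
The idea of completing "as long as any N processors remain functional" can be illustrated with a small checksum sketch. This is not taken from the paper: it assumes one common form of diskless checkpointing, in which an extra processor keeps the element-wise sum of the other processors' in-memory data so that a single lost block can be rebuilt without touching disk. The function names `make_checksum` and `recover_block` and the sample data are hypothetical.

```python
from typing import List, Sequence


def make_checksum(blocks: Sequence[Sequence[int]]) -> List[int]:
    """Element-wise sum of every processor's local block (assumed encoding)."""
    return [sum(column) for column in zip(*blocks)]


def recover_block(survivors: Sequence[Sequence[int]],
                  checksum: Sequence[int]) -> List[int]:
    """Rebuild the single lost block as checksum minus the surviving blocks."""
    if survivors:
        survivor_sum = make_checksum(survivors)
    else:
        survivor_sum = [0] * len(checksum)
    return [c - s for c, s in zip(checksum, survivor_sum)]


# Hypothetical in-memory state of N = 3 compute processors.
blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The checksum would live in the memory of an extra, (N+1)-th processor.
checksum = make_checksum(blocks)

# Simulate the failure of processor 1 and rebuild its data from the rest.
lost = 1
survivors = [b for i, b in enumerate(blocks) if i != lost]
restored = recover_block(survivors, checksum)

print(restored == blocks[lost])
```

Integers are used here so that recovery is exact; with floating-point data the subtraction would introduce rounding error, which a real implementation would have to account for.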
