Fault-tolerant parallel applications using queues and actions

Many techniques support the execution of large computations over a network of workstations (NOW), but data-intensive computations are usually run on high-performance parallel machines. A NOW comprising individual users' machines typically has a low-performance interconnect and suffers arbitrary changes in availability. Exploiting such resources to execute data-intensive computations is difficult, but even in a more constrained environment there is an unfulfilled need for fault tolerance. The structuring approach presented fulfills this need. Performance exceeding 100 Mflop/s is demonstrated for large, fault-tolerant, out-of-core examples of matrix multiplication and Cholesky factorisation using five 133 MHz Pentium compute machines.
