Paper information - Fault-tolerant parallel applications using queues and actions

Fault-tolerant parallel applications using queues and actions

Many techniques support the execution of large computations over a network of workstations (NOW), but data-intensive computations are usually run on high-performance parallel machines. A NOW comprising individual users' machines typically has a low-performance interconnect and suffers arbitrary changes of availability. Exploiting such resources to execute data-intensive computations is difficult, but even in a more constrained environment there is an unfulfilled need for fault tolerance. The structuring approach presented fulfills this need. Performance exceeding 100 Mflop/s is demonstrated for large fault-tolerant out-of-core examples of matrix multiplication and Cholesky factorisation using five 133 MHz Pentium compute machines.

J. A. Smith | Santosh K. Shrivastava
