A System for Fault-Tolerant Execution of Data and Compute Intensive Programs over a Network of Workstations

A well-known structuring technique for a wide class of parallel applications is the bag of tasks, which allows a computation to be partitioned dynamically among a collection of concurrent processes. This paper describes a fault-tolerant implementation of this structure that uses atomic actions (atomic transactions) to operate on persistent objects, which are accessed in a distributed setting via Remote Procedure Call (RPC). The system is suited to the parallel execution of data- and compute-intensive programs that require persistent storage and fault-tolerance facilities. Its suitability is examined through the measured performance of three applications: ray tracing, matrix multiplication, and Cholesky factorization. The system runs on stock hardware and software platforms, specifically UNIX and C++.
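
The bag-of-tasks structure with atomic actions can be illustrated by a minimal, single-process sketch. It is not the system described in the paper: the names `TaskBag`, `atomic_take`, `worker`, and `run` are hypothetical, threads stand in for workstations, a lock stands in for transactional isolation, and a raised exception stands in for a worker failure. Persistence and RPC are not modelled, so a real node crash would not be recovered by this code.

```python
import threading
from contextlib import contextmanager


class TaskBag:
    """In-memory bag of tasks (hypothetical; tasks must be hashable)."""

    def __init__(self, tasks):
        self._tasks = list(tasks)
        self._results = {}
        self._lock = threading.Lock()

    @contextmanager
    def atomic_take(self):
        """Take one task; record its result on success, restore the task on failure."""
        with self._lock:
            task = self._tasks.pop() if self._tasks else None
        outcome = {}
        try:
            yield task, outcome
        except Exception:
            # Abort: undo the removal so that the task can be retried.
            if task is not None:
                with self._lock:
                    self._tasks.append(task)
            raise
        else:
            # Commit: the result is recorded only if the body completed.
            if task is not None:
                with self._lock:
                    self._results[task] = outcome["value"]

    def pending(self):
        with self._lock:
            return list(self._tasks)

    def results(self):
        with self._lock:
            return dict(self._results)


def worker(bag, compute, fail_on=None):
    """Repeatedly take a task and compute it; fail once on task `fail_on`."""
    while True:
        try:
            with bag.atomic_take() as (task, outcome):
                if task is None:
                    return  # bag is empty, nothing left to do
                if task == fail_on:
                    fail_on = None  # simulate a single failure only
                    raise RuntimeError("simulated worker failure")
                outcome["value"] = compute(task)
        except RuntimeError:
            continue  # the aborted task is back in the bag; try again


def run(tasks, compute, n_workers=3, fail_on=None):
    """Run `n_workers` threads over the bag and return all results."""
    bag = TaskBag(tasks)
    threads = [
        threading.Thread(
            target=worker,
            args=(bag, compute, fail_on if i == 0 else None),
        )
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bag.results()
```

Calling `run(range(8), lambda x: x * x, fail_on=3)` returns one result per task even if the first worker fails while holding task 3, because the aborted action puts the task back in the bag. Work is partitioned dynamically: each worker simply takes the next available task, so faster workers process more tasks.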
