Fault Tolerance via Replication in Coarse Grain Data-Flow

Abstract : Recent advances in network technology promise to make gigabit-per-second bandwidth between remote hosts a reality in the near future. This increase in bandwidth paves the way for increased exploitation of distributed computing resources. Coupled with advances in distributed memory parallel compiler technology, there is strong reason to believe that wide-area distributed parallel processing will be an increasingly popular and important programming paradigm. Parallelizing and distributing program sub-tasks has the potential to increase performance for many applications while also improving the overall utilization of system resources. Unfortunately, there is a downside. When a program is partitioned into sub-tasks, each sub-task is distributed to potentially a different processor. As the number of processors employed by an application increases so does the chance that the application will fail due to a host/ processor failure. At the University of Virginia, we have experienced first hand the problems caused by host failures in distributed systems while developing and using a prototype for the Legion project [13][14]. The objective of Legion is to construct the software environment to enable a nation-wide or world-wide virtual computer capable of supporting distributed and parallel applications. Our current prototype, which we call the Campus-Wide Virtual Computer (CWVC), contains a mix of over 90 workstations and an IBM SP-2 multicomputer. Even in this relatively small environment, we are frequently experiencing host failures. On the scale of the envisioned nation-wide system, host failures will simply be a fact of life and must be dealt with accordingly. User applications, especially those that are critical or are composed of many distributed components, must be resilient to host failures. Fortunately developing fault tolerant parallel applications does not need to be difficult.

[1]  John W. Stoughton,et al.  A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration. Ph.D. Thesis Report, 1 Jan. - 31 Dec. 1992 , 1993 .

[2]  I. Bey,et al.  Delta-4: A Generic Architecture for Dependable Distributed Computing , 1991, Research Reports ESPRIT.

[3]  Robert G. Babb,et al.  Parallel Processing with Large-Grain Data Flow Techniques , 1984, Computer.

[4]  Partha Dasgupta,et al.  CALYPSO: a novel software system for fault-tolerant parallel processing on distributed platforms , 1995, Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing.

[5]  Tilak Agerwala,et al.  Data Flow Systems: Guest Editors' Introduction , 1982, Computer.

[6]  Peter Steenkiste,et al.  Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery , 1993 .

[7]  A. Grimshaw The Mentat Computation Model-Data-Driven Support for Dynamic Object-Oriented Parallel Processing 1 , 1993 .

[8]  James C. French,et al.  Legion: The Next Logical Step Toward a Nationwide Virtual Computer , 1994 .

[9]  Richard D. Schlichting,et al.  Supporting Fault-Tolerant Parallel Programming in Linda , 1995, IEEE Trans. Parallel Distributed Syst..

[10]  Andrew S. Grimshaw,et al.  Portable run-time support for dynamic object-oriented parallel processing , 1996, TOCS.

[11]  Lorenzo Alvisi,et al.  Paralex: an environment for parallel programming in distributed systems , 1991, ICS '92.

[12]  Henri E. Bal,et al.  Transparent fault-tolerance in parallel Orca programs , 1992 .

[13]  A. S. Grimshaw,et al.  FALCON: A distributed scheduler for MIMD architectures , 1991 .

[14]  Andrew S. Grimshaw,et al.  Campus-Wide Computing : Early Results Using Legion At the University of Virginia , 1997, Int. J. High Perform. Comput. Appl..

[15]  Santosh K. Shrivastava,et al.  Replicated K-resilient objects in Arjuna , 1990, [1990] Proceedings. Workshop on the Management of Replicated Data.

[16]  Arthur H. Veen,et al.  Dataflow machine architecture , 1986, CSUR.

[17]  James C. Browne,et al.  Experimental Evaluation of a Reusability-Oriented Parallel Programming Environment , 1990, IEEE Trans. Software Eng..

[18]  Dennis Shasha,et al.  PLinda 2.0: a transactional/checkpointing approach to fault tolerant Linda , 1994, Proceedings of IEEE 13th Symposium on Reliable Distributed Systems.

[19]  Andrew S. Grimshaw,et al.  Easy-to-use object-oriented parallel processing with Mentat , 1993, Computer.

[20]  Amr El Abbadi,et al.  Implementing Fault-Tolerant Distributed Objects , 1985, IEEE Transactions on Software Engineering.

[21]  Jack Dongarra,et al.  HeNCE: graphical development tools for network-based concurrent computing , 1992, Proceedings Scalable High Performance Computing Conference SHPCC-92..