Exploiting Data-Flow for Fault-Tolerance in a Wide-Area Parallel System

Wide-area parallel processing systems will soon be available to researchers to solve a range of problems. In these systems, host failures and other faults are certain to be a common occurrence. Unfortunately, most parallel processing systems have not been designed with fault-tolerance in mind. Mentat is a high-performance object-oriented parallel processing system that is based on an extension of the data-flow model. The functional nature of data-flow enables both parallelism and fault-tolerance. In this paper, we exploit the data-flow underpinning of Mentat to provide easy-to-use and transparent fault-tolerance. We present results on both a small-scale network and a wide-area heterogeneous environment that consists of three sites: the National Center for Supercomputing Applications, the University of Virginia, and the NASA Langley Research Center.
