On-line error detection through data duplication in distributed-memory systems

Abstract The scalability of distributed-memory systems makes them attractive for massively parallel computations. However, fault detection and fault tolerance are critical in such systems, since the probability of having faulty components increases with the number of processors. We propose a methodology for fault detection on distributed-memory systems through compiler support. The single-program multiple-data (SPMD) execution model is extended to execute programs in which selected data items are duplicated on different processors. During execution, whenever the values of duplicated data are computed, they are compared for the purpose of error detection. In other words, fault detection is controlled by the duplication of data. The proposed compiler-assisted fault-detection technique does not require any specialized hardware and attempts to exploit the idle capacity of the system whenever possible. After presenting the principles of duplicated computation through data duplication and the corresponding compiler algorithms, we focus on regular loops, where idle processors can be exploited for fault detection. We present experimental results that demonstrate the feasibility of the proposed approach.
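The idea of detecting errors by comparing duplicated data can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the compiler algorithm of the paper: the function name `run_with_duplication`, the cyclic data distribution, the choice of the neighbouring processor as holder of the duplicate, and the `faulty` map used to inject errors are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def run_with_duplication(
    n_procs: int,
    inputs: List[int],
    compute: Callable[[int], int],
    faulty: Dict[int, Callable[[int], int]],
) -> Tuple[List[int], List[int]]:
    """Simulate SPMD execution in which every data item has one duplicate.

    Returns the values computed by the owners and the indices of the
    items whose two copies disagreed (a detected error).
    """
    if n_procs < 2:
        raise ValueError("duplication needs at least two processors")

    results: List[int] = []
    mismatches: List[int] = []
    for i, x in enumerate(inputs):
        owner = i % n_procs             # assumption: cyclic distribution
        shadow = (owner + 1) % n_procs  # assumption: duplicate on the neighbour
        # Each processor computes its own copy; a faulty processor
        # (hypothetical fault injection) uses a corrupted function instead.
        primary = faulty.get(owner, compute)(x)
        duplicate = faulty.get(shadow, compute)(x)
        # Comparing the two copies is the error-detection step.
        if primary != duplicate:
            mismatches.append(i)
        results.append(primary)
    return results, mismatches


def square_plus_one(x: int) -> int:
    return x * x + 1


if __name__ == "__main__":
    # Processor 2 is modelled as faulty: its results are off by one.
    injected = {2: lambda x: x * x + 2}
    values, detected = run_with_duplication(4, list(range(8)), square_plus_one, injected)
    print(values)
    print(detected)
```

In this sketch a fault on one processor shows up both for the items it owns and for the items it shadows, which is why a single faulty processor produces mismatches on more than one item. Comparison detects the disagreement but does not by itself say which copy is wrong.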
