Replicating statement execution for fault detection on distributed memory multiprocessors

A compiler-assisted methodology is proposed for fault detection on distributed-memory systems. Selected instances of program statements are replicated so that the required fault coverage is ensured. Replication strategies are presented for detecting both permanent and transient faults. Whenever possible, these strategies use idle processor time to execute the replicated statements. Two approaches are discussed for implementing the proposed strategies on single-program multiple-data (SPMD) parallel execution platforms. The first approach replicates program statements directly through source-to-source program transformations, while the second achieves the replication indirectly by replicating data on multiple processors.
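The basic idea of replicating statement instances can be illustrated with a small simulation. This is a minimal sketch, not the paper's algorithm. The function name `run_with_replication`, the cyclic assignment of instances to processors, the choice of the next processor as the checker, and the fault model (a faulty processor adds 1 to every result) are all assumptions made for illustration. Each statement instance runs on its owner and again on a different processor, and the two results are compared. Running the replica on a different processor matters for permanent faults, since a permanently faulty processor that repeats its own work would likely reproduce the same wrong value.

```python
from typing import Callable, List, Tuple


def run_with_replication(
    data: List[int],
    num_procs: int,
    stmt: Callable[[int], int],
    faulty_proc: int = -1,
) -> Tuple[List[int], List[int]]:
    """Simulate a loop whose instance i runs on an owner processor and is
    replicated on a neighbouring processor; return results and mismatches."""
    if num_procs < 2:
        raise ValueError("replication on a different processor needs >= 2 processors")

    def execute(proc: int, x: int) -> int:
        result = stmt(x)
        # Hypothetical fault model: the faulty processor corrupts every result.
        return result + 1 if proc == faulty_proc else result

    results: List[int] = []
    mismatches: List[int] = []
    for i, x in enumerate(data):
        owner = i % num_procs              # cyclic distribution (assumption)
        checker = (owner + 1) % num_procs  # replica runs on a different processor
        primary = execute(owner, x)
        replica = execute(checker, x)
        if primary != replica:
            mismatches.append(i)           # disagreement signals a fault
        results.append(primary)
    return results, mismatches


if __name__ == "__main__":
    square = lambda x: x * x
    print(run_with_replication([1, 2, 3, 4], 4, square))
    print(run_with_replication([1, 2, 3, 4], 4, square, faulty_proc=1))
```

In this toy model every instance is replicated, so any instance that touches the faulty processor, either as owner or as checker, is flagged. The abstract describes replicating only selected instances and placing them in idle processor time, which this sketch does not attempt to model.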
