Compiler-assisted generation of error-detecting parallel programs

We have developed an automated compile-time approach to generating error-detecting parallel programs. The compiler identifies statements in the program that implement affine transformations and automatically inserts code to compute, manipulate, and compare checksums in order to detect data errors at runtime. Statements that do not implement affine transformations are checked by duplication. Where possible, checksums are reused from one loop to the next rather than being recomputed for every statement. A global dataflow analysis determines the points at which checksums must be recomputed. We also use a novel method of specifying the data distributions of the check data through data distribution directives, so that the computations on the original data and the corresponding check computations are performed on different processors. We present results on the time overhead and error coverage of the error-detecting parallel programs, relative to the original programs, measured on an Intel Paragon distributed-memory multicomputer.
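As a rough illustration of the checksum idea for an affine statement, the sketch below checks y = A x + b by comparing the sum of the computed elements of y against a value predicted from the column checksums of A. It is a minimal single-process sketch, not the code the compiler generates: the function names, the floating-point tolerance, and the injected error are all assumptions made for this example, and the placement of check computations on different processors is not modeled.

```python
def affine(A, b, x):
    """Compute y = A x + b for a row-major matrix A (list of rows)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]


def column_checksum(A):
    """Return the vector of column sums of A (the checksum row 1^T A)."""
    return [sum(col) for col in zip(*A)]


def predicted_checksum(A, b, x):
    """Predict sum(y) without computing y: (1^T A) x + sum(b)."""
    c = column_checksum(A)
    return sum(c_j * x_j for c_j, x_j in zip(c, x)) + sum(b)


def check(y, expected, tol=1e-9):
    """Compare the checksum of y with the prediction.

    The relative tolerance is an assumption of this sketch; it absorbs
    floating-point round-off between the two computation orders.
    """
    return abs(sum(y) - expected) <= tol * max(1.0, abs(expected))


A = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -0.5]
x = [2.0, 1.0]

y = affine(A, b, x)
expected = predicted_checksum(A, b, x)
ok_clean = check(y, expected)       # error-free run: checksums agree

y_bad = list(y)
y_bad[1] += 1.0                     # simulate a data error in one element
ok_faulty = check(y_bad, expected)  # corrupted run: checksums disagree
```

The predicted checksum costs one dot product rather than a second full matrix-vector product, which is why checksum checking can be cheaper than duplicating the statement when the transformation is affine.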
