A data-centric approach to checksum reuse for array-intensive applications

Soft errors are transient faults that occur in VLSI circuits due to external radiation and alter the logic states of sensitive components. Many systems implement hardware-based protection techniques such as ECC to ensure an acceptable level of robustness against these errors, but such solutions are generally rigid and costly. Recent research has explored checksum-based software solutions for array-intensive computations. While a checksum-based scheme can be more flexible than a hardware-based approach to reliability, it can also incur significant runtime overhead. Focusing on array-intensive applications, this paper proposes a compiler-directed, data-centric strategy that maximizes the reuse of checksums. A distinguishing characteristic of the proposed scheme is that it works with a given checksum assignment and, under compiler guidance, automatically restructures the entire application code to maximize checksum reuse. Inter-procedural checksum reuse reduces checksum recomputation even further. Our experiments show that the proposed approach requires fewer checksum calculations than previous work.
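To make the idea of checksum reuse concrete, the following is a minimal sketch, not the compiler transformation proposed in the paper. It assumes a simple additive checksum over one array and counts how many checksum computations are needed when two read-only loop nests each verify the array separately, versus when the code is restructured so that both uses share one verification. All names (`ChecksumCache`, `protect`, `verify`, `naive`, `restructured`) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; names and checksum choice are assumptions,
# not the scheme defined in the paper.

def checksum(values):
    """Simple additive checksum over a list of integers (32-bit wrap)."""
    return sum(values) & 0xFFFFFFFF


class ChecksumCache:
    """Stores one reference checksum per array and counts computations."""

    def __init__(self):
        self.stored = {}        # array name -> reference checksum
        self.computations = 0   # number of checksum calculations performed

    def _compute(self, values):
        self.computations += 1
        return checksum(values)

    def protect(self, name, values):
        """Record a reference checksum after the array has been written."""
        self.stored[name] = self._compute(values)

    def verify(self, name, values):
        """Recompute the checksum and compare it with the stored one."""
        return self._compute(values) == self.stored[name]


def naive(a):
    """Verify the array before each of two separate read-only loop nests."""
    cache = ChecksumCache()
    cache.protect("a", a)
    assert cache.verify("a", a)
    total = sum(x * 2 for x in a)       # first loop nest
    assert cache.verify("a", a)
    total += sum(x + 1 for x in a)      # second loop nest
    return total, cache.computations


def restructured(a):
    """Same result, with both uses clustered behind a single verification."""
    cache = ChecksumCache()
    cache.protect("a", a)
    assert cache.verify("a", a)
    total = sum(x * 2 + (x + 1) for x in a)  # merged loop nest
    return total, cache.computations
```

In this toy setting both functions return the same total, but `restructured` performs one fewer checksum calculation than `naive`, because the accesses covered by the same checksum have been brought together.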
