Performance evaluation of checksum-based ABFT

In algorithm-based fault tolerance (ABFT), the fault-tolerance mechanism is tailored to the algorithm being executed. Most previous studies comparing ABFT schemes considered only their error detection and correction capabilities. Some examined overhead, but none compared recovery schemes for data processing applications with throughput as the main metric. We compare the performance of two recovery schemes, recomputation and ABFT correction, at different error rates. We consider errors that occur during the computation as well as errors that occur during the error detection, location, and correction processes. We define a metric for evaluating the performance of different design alternatives. Results show that multiple-error correction using ABFT performs worse than single-error correction, even at high error rates. We also present, implement, and evaluate early detection in ABFT, which attempts to detect errors in the checksum calculation before the actual computation starts. Early detection improves throughput for computation-intensive workloads and at high error rates.
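To make the two recovery schemes concrete, the following is a minimal sketch of checksum-based ABFT for matrix multiplication, in the style of the classic row/column checksum encoding. It is an illustration under stated assumptions, not the implementation evaluated here: the function names are hypothetical, integer matrices are assumed so that checksum comparisons are exact, and an error in a checksum element itself is simply handed to recomputation rather than repaired.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row[:] + [sum(row)] for row in b]


def find_inconsistencies(c_full):
    """Return the data rows and data columns whose checksum does not match."""
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols


def recover(c_full):
    """Choose a recovery action for a full-checksum product matrix.

    Returns "clean" if all checksums match, "corrected" if a single
    erroneous data element was located and fixed in place, and
    "recompute" otherwise (assumption: any other pattern, including a
    corrupted checksum element, falls back to recomputation).
    """
    bad_rows, bad_cols = find_inconsistencies(c_full)
    if not bad_rows and not bad_cols:
        return "clean"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        m = len(c_full[0]) - 1
        # Rebuild the faulty element from the row checksum and the
        # remaining (assumed correct) elements of that row.
        others = sum(c_full[i][k] for k in range(m) if k != j)
        c_full[i][j] = c_full[i][m] - others
        return "corrected"
    return "recompute"
```

Multiplying `with_column_checksum(A)` by `with_row_checksum(B)` yields the product with an extra checksum row and column. A single corrupted data element makes exactly one row and one column inconsistent, which locates it; more complex patterns are where the trade-off between ABFT correction and recomputation arises.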
