Coding Techniques for Fault-Tolerant Parallel Prefix Computations in Abelian Groups

This paper presents coding techniques for providing fault tolerance to parallel prefix computations performed on a binary tree of processing nodes. More specifically, we show how a parallel prefix computation in an arbitrary Abelian group can be protected using group homomorphisms. The approach is general enough to handle a variety of group operations of interest and allows for designs ranging from simple parity schemes to full replication. Error-detecting and error-correcting mechanisms are placed solely at the leaf nodes, yet they can capture faults at any node or link of the binary tree architecture on which the computation is performed. Furthermore, by tracking how errors propagate through the tree, the method can identify a processing node that has permanently failed, using only information from simple error-detecting mechanisms at the leaf nodes.
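
As a rough illustration of the idea, and not the construction developed in the paper, the sketch below protects a prefix sum in the Abelian group (Z, +) with the homomorphism phi(g) = g mod 7. Every operand travels through the tree as a pair (g, phi(g)), every node combines pairs componentwise, and the only check is made on the outputs. The modulus, the function names, the recursive pairing schedule, and the fault model (one node adds 1 to the data component) are all assumptions made for this example.

```python
from typing import Callable, List, Optional, Tuple

MOD = 7  # assumed check modulus; phi below is a homomorphism (Z, +) -> (Z_7, +)
Pair = Tuple[int, int]  # (data component g, check component h)


def phi(g: int) -> int:
    """Group homomorphism used as the check symbol (illustrative choice)."""
    return g % MOD


def make_node(fault_at: Optional[int]) -> Callable[[Pair, Pair], Pair]:
    """Return a node operation; the fault_at-th invocation corrupts its data output."""
    count = 0

    def node(a: Pair, b: Pair) -> Pair:
        nonlocal count
        out = (a[0] + b[0], (a[1] + b[1]) % MOD)  # componentwise group operation
        if count == fault_at:
            out = (out[0] + 1, out[1])  # hypothetical fault: data part off by one
        count += 1
        return out

    return node


def scan(xs: List[Pair], node: Callable[[Pair, Pair], Pair]) -> List[Pair]:
    """Inclusive prefix by recursive pairing; len(xs) must be a power of two."""
    if len(xs) == 1:
        return xs
    pairs = [node(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
    sub = scan(pairs, node)  # prefixes of the pairwise results
    out = [xs[0]]
    for i in range(1, len(xs)):
        if i % 2 == 1:
            out.append(sub[i // 2])  # odd positions come straight from the sub-scan
        else:
            out.append(node(sub[i // 2 - 1], xs[i]))  # even positions need one more op
    return out


def protected_prefix(values: List[int], fault_at: Optional[int] = None) -> List[Pair]:
    """Encode each input as (g, phi(g)) and run the scan on the encoded pairs."""
    return scan([(g, phi(g)) for g in values], make_node(fault_at))


def check(outputs: List[Pair]) -> List[bool]:
    """Output-side check: an output is consistent when phi(g) equals h."""
    return [phi(g) == h for g, h in outputs]
```

With inputs `[3, 1, 4, 1, 5, 9, 2, 6]` and no fault, every output passes the check. Corrupting the first node (`fault_at=0`) makes outputs 1 through 7 fail while output 0 still passes, so the pattern of failed checks narrows down where the fault occurred. An error that is a multiple of 7 lies in the kernel of phi and would go undetected by this particular check.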
