Coding approaches to fault tolerance in dynamic systems

A fault-tolerant system tolerates internal failures while preserving desirable overall behavior. Fault tolerance is necessary in life-critical or inaccessible applications, and also enables the design of reliable systems out of unreliable, less expensive components. This thesis discusses fault tolerance in dynamic systems, such as finite-state controllers or computer simulations, whose internal state influences their future behavior. Modular redundancy (system replication) and other traditional techniques for fault tolerance are expensive, and rely heavily, particularly in the case of dynamic systems operating over extended time horizons, on the assumption that the error-correcting mechanism (e.g., voting) is fault-free. The thesis develops a systematic methodology for adding structured redundancy to a dynamic system and introducing associated fault tolerance. Our approach exposes a wide range of possibilities between no redundancy and full replication. Assuming that the error-correcting mechanism is fault-free, we parameterize the different possibilities in various settings, including algebraic machines, linear dynamic systems and Petri nets. By adopting specific error models and, in some cases, by making explicit connections with hardware implementations, we demonstrate how the redundant systems can be designed to allow detection and correction of a fixed number of failures. We do not explicitly address optimization criteria that could be used in choosing among different redundant implementations, but our examples illustrate how such criteria can be investigated in future work. The last part of the thesis relaxes the traditional assumption that error correction be fault-free. We use unreliable system replicas and unreliable voters to construct redundant dynamic systems that evolve in time with low probability of failure. Our approach generalizes modular redundancy by using distributed voting schemes.
Combining these techniques with low-complexity error-correcting coding, we are able to efficiently protect identical unreliable linear finite-state machines that operate in parallel on distinct input sequences. The approach requires only a constant amount of redundant hardware per machine to achieve a probability of failure that remains below any pre-specified bound over any given finite time interval. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
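To make the idea of structured redundancy between no redundancy and full replication concrete, here is a minimal sketch for a linear dynamic system. It is an illustration under stated assumptions, not the construction developed in the thesis: the matrices `A`, `B`, the checksum row `C`, and the helpers `step`, `syndrome`, and `run` are all hypothetical. The two-variable system q[t+1] = A q[t] + B x[t] gets one extra state variable c that should always equal C q, and a checker assumed to be fault-free tests that invariant after every step.

```python
# Hypothetical example: one redundant checksum variable protecting a
# two-variable linear system over the integers.

def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mat_mul(X, Y):
    """Multiply matrices X and Y (lists of rows)."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

# Assumed original system: q[t+1] = A q[t] + B x[t]
A = [[1, 1], [0, 1]]
B = [[0], [1]]
# Assumed checksum row: the redundant variable is c = q0 + q1
C = [[1, 1]]

CA = mat_mul(C, A)  # update of c driven by the current q
CB = mat_mul(C, B)  # update of c driven by the current input

def step(state, x):
    """Advance the redundant state [q0, q1, c] by one time step."""
    q = state[:2]
    q_next = vec_add(mat_vec(A, q), mat_vec(B, x))
    c_next = vec_add(mat_vec(CA, q), mat_vec(CB, x))
    return q_next + c_next

def syndrome(state):
    """Parity check c - C q; all zeros when the state is consistent."""
    q, c = state[:2], state[2:]
    return [ci - pi for ci, pi in zip(c, mat_vec(C, q))]

def run(inputs, fault_at=None):
    """Simulate from the zero state, optionally corrupting q0 at one step."""
    state = [0, 0, 0]
    syndromes = []
    for t, x in enumerate(inputs):
        state = step(state, [x])
        if t == fault_at:
            state[0] += 1  # injected additive fault in one state variable
        syndromes.append(syndrome(state))
    return state, syndromes

if __name__ == "__main__":
    print(run([1, 2, 3]))              # fault-free run
    print(run([1, 2, 3], fault_at=1))  # fault injected at step 1
```

In this sketch the checksum is recomputed from q at every step, so a corrupted state variable produces a nonzero syndrome only in the step where the fault occurs; the check therefore has to run after each step. A single checksum variable detects one additive fault but cannot locate it, and correction would require more redundant variables.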
