Fault-Tolerant Computing Systems

The concept of responsive computer systems is presented. Emerging responsive systems demand fault-tolerant, real-time performance in parallel and distributed computing environments. A new design framework for responsive systems is introduced, based on the fundamental problem of consensus. A new measure of responsiveness for specifying fault-tolerance and real-time requirements is then described. Finally, design methodologies for fault-tolerant, real-time, and responsive systems are discussed, and novel techniques for introducing redundancy to improve performance and dependability are illustrated.
