Concurrent Error Detection Using Watchdog Processors - A Survey

Concurrent system-level error detection techniques using a watchdog processor are surveyed. A watchdog processor is a small and simple coprocessor that detects errors by monitoring the behavior of a system. Like replication, it does not depend on any fault model for error detection, but it requires less hardware than replication. It is shown that a large number of errors can be detected by monitoring the control flow and memory-access behavior. Two techniques for control-flow checking are discussed and compared with current error-detection techniques. A scheme for memory-access checking based on capability-based addressing is described. The design of a watchdog for performing reasonableness checks on the output of a main processor by executing assertions is discussed.
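
As an illustration of the control-flow checking idea mentioned in the abstract, the following is a minimal sketch and not the scheme of any surveyed paper. It assumes a program split into named blocks of instruction words, a reference table of block signatures computed ahead of time, and a table of legal successor blocks. The names `block_signature` and `Watchdog`, the 16-bit rotate-and-XOR signature, and the block labels are all hypothetical choices made for this example.

```python
from typing import Dict, Iterable, List, Optional, Set


def block_signature(instructions: Iterable[int]) -> int:
    """Fold instruction words into a 16-bit signature (toy rotate-and-XOR hash)."""
    sig = 0
    for word in instructions:
        sig = ((sig << 1) | (sig >> 15)) & 0xFFFF  # rotate left by one bit
        sig ^= word & 0xFFFF                       # mix in the next word
    return sig


class Watchdog:
    """Hypothetical monitor: checks block signatures and block-to-block transitions."""

    def __init__(
        self,
        program: Dict[str, List[int]],
        successors: Dict[str, Set[str]],
        entry: str,
    ) -> None:
        # Reference signatures are computed once from the fault-free program.
        self.expected = {name: block_signature(words) for name, words in program.items()}
        self.successors = successors
        self.entry = entry
        self.previous: Optional[str] = None

    def observe(self, block: str, executed: List[int]) -> bool:
        """Return True if the observed block matches the reference data."""
        if self.previous is None:
            legal = block == self.entry
        else:
            legal = block in self.successors.get(self.previous, set())
        # An unknown block has no reference signature, so the comparison fails.
        intact = self.expected.get(block) == block_signature(executed)
        self.previous = block
        return legal and intact
```

In this sketch the watchdog flags an error when the observed instruction words do not reproduce the stored signature, or when a block follows a predecessor that does not list it as a successor. A real watchdog would observe the bus of the main processor rather than receive block names directly; passing the label in is a simplification for the example.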