A general procedure for error detection in complex systems, called the data block capture and analysis monitoring process, is described and analyzed. It is assumed that, in addition to being exposed to potential external fault sources, a complex system will in general contain embedded hardware and software fault mechanisms that can cause it to perform incorrect computations and/or produce incorrect output. Thus, in operation, the system continually moves back and forth between error and no-error states. Both the external fault sources and the internal fault mechanisms are extremely difficult to detect. The data block capture and analysis monitoring process is concerned with detecting errors, that is, deviations from the normal performance of the system that are symptomatic of fault conditions. The process consists of repeatedly recording a fixed amount of data from a set of predetermined observation lines of the system being monitored (i.e., capturing a block of data) and then analyzing the captured block in an attempt to determine whether the system is functioning correctly.
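The capture-then-analyze cycle described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `capture_block`, `monitor`, and `is_valid`, the block size, and the sample data are all hypothetical, and the per-sample validity check stands in for whatever analysis a real monitor would apply to a captured block.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence

Sample = Sequence[int]  # one reading of all observation lines (assumed representation)


def capture_block(stream: Iterator[Sample], block_size: int) -> List[Sample]:
    """Record up to block_size consecutive samples from the observation lines."""
    return list(islice(stream, block_size))


def monitor(
    lines: Iterable[Sample],
    block_size: int,
    is_valid: Callable[[Sample], bool],
) -> Iterator[bool]:
    """Yield True for each captured block in which an error is detected."""
    stream = iter(lines)
    while True:
        block = capture_block(stream, block_size)
        if len(block) < block_size:
            return  # discard an incomplete trailing block
        # Analysis step (assumed): flag the block if any sample breaks the check.
        yield any(not is_valid(sample) for sample in block)


# Hypothetical example: two observation lines that should always agree.
samples = [(0, 0), (1, 1), (1, 1), (0, 1), (1, 1), (0, 0), (0, 0), (1, 1)]
flags = list(monitor(samples, block_size=4, is_valid=lambda s: s[0] == s[1]))
print(flags)  # [True, False]
```

In this sketch the first block contains the mismatched sample `(0, 1)` and is flagged, while the second block is clean. A fixed block size keeps each capture bounded, which mirrors the "fixed amount of data" in the description; the choice of check is entirely an assumption.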