The Effect of Program Behavior on Fault Observability

This paper studies fault observability in terms of the behavior of memory references. Traditional studies view memory as one monolithic entity that must work completely to be considered reliable; the emphasis here is on how a particular program actually uses its memory. By extending a cache memory performance model, the paper develops a new model for the successful execution of a program that takes the usage of its data into account. Three variations are presented, based on well-known allocation schemes: the program's storage is preallocated, dynamically allocated, or constrained in allocation. The model is contrasted with traditional memory reliability calculations to show that the actual mean time to failure may be more optimistic when program behavior is considered. Expressions for the probability of unobserved faults are also developed. Because several studies report correlations between increased workloads and increased failure rates, a new theory is proposed that explains this behavior. Applying the model to several program traces demonstrates that increased workloads could cause an increase in observed failure rates in the range of 32% to 53%.
