Error and failure analysis of a UNIX server

This paper presents a measurement-based dependability study of a UNIX server. Event logs from the server, in the form of message logs spanning approximately eleven months, were collected to form the basis of the dependability data. The log data are classified and categorized to calculate parameters such as mean time between failures (MTBF) and availability. Component analysis is also performed to identify the modules in the system that are prone to errors. Next, the system error activity preceding each system failure is analyzed to identify error patterns that may be precursors of the observed failure events. Lastly, the error and failure results from the measurement are reviewed from the perspective of the fault and error assumptions made in several popular fault injection studies.
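As an illustration of how parameters such as MTBF and availability can be derived from classified log data, the following is a minimal sketch and not the procedure used in the study. It assumes the log has already been reduced to a list of outages, each a (failure time, recovery time) pair, inside a known observation window. The function name, the data layout, and the sample timestamps are all hypothetical. MTBF is taken here as total uptime divided by the number of failures, and availability as uptime divided by the length of the observation window; other definitions are in common use.

```python
from datetime import datetime, timedelta


def dependability_metrics(window_start, window_end, outages):
    """Return (MTBF in hours, MTTR in hours, availability).

    outages: list of (failure_time, recovery_time) datetime pairs,
    assumed non-overlapping and fully inside the observation window.
    """
    total_s = (window_end - window_start).total_seconds()
    down_s = sum((up - down).total_seconds() for down, up in outages)
    up_s = total_s - down_s
    n = len(outages)

    # With no failures observed, MTBF is unbounded and MTTR is undefined;
    # infinity and 0.0 are used here as placeholder conventions.
    mtbf_h = up_s / n / 3600 if n else float("inf")
    mttr_h = down_s / n / 3600 if n else 0.0
    availability = up_s / total_s
    return mtbf_h, mttr_h, availability


if __name__ == "__main__":
    # Hypothetical 720-hour window with two outages of 2 h and 4 h.
    start = datetime(2000, 1, 1)
    end = start + timedelta(hours=720)
    outages = [
        (start + timedelta(hours=100), start + timedelta(hours=102)),
        (start + timedelta(hours=400), start + timedelta(hours=404)),
    ]
    mtbf, mttr, avail = dependability_metrics(start, end, outages)
    print(f"MTBF = {mtbf:.1f} h, MTTR = {mttr:.1f} h, availability = {avail:.4f}")
```

In practice, most of the effort lies in the step this sketch skips: deciding which log messages mark a failure and which mark recovery.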
