Measurement-based Analysis of Networked System Availability

The dependability of a system can be experimentally evaluated at different phases of its life cycle. In the design phase, computer-aided design (CAD) environments are used to evaluate the design via simulation, including simulated fault injection. Such fault injection tests the effectiveness of fault-tolerant mechanisms and evaluates system dependability, providing timely feedback to system designers. Simulation, however, requires accurate input parameters and validation of output results. Although parameter estimates can be obtained from past measurements, this is often complicated by design and technology changes. In the prototype phase, the system runs under controlled workload conditions. In this phase, controlled physical fault injection is used to evaluate the system behavior under faults, including the detection coverage and the recovery capability of various fault tolerance mechanisms. Fault injection on the real system can provide information about the failure process, from fault occurrence to system recovery, including error latency, propagation, detection, and recovery (which may involve reconfiguration). However, this type of fault injection can only study artificial faults; it cannot provide certain important dependability measures, such as mean time between failures (MTBF) and availability. In the operational phase, a direct measurement-based approach can be used to measure systems in the field under real workloads. The collected data contain a large amount of information about naturally occurring errors and failures.
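As a rough illustration of how field data can yield measures such as MTBF and availability, the sketch below summarizes a list of observed outages. It is only a minimal example under stated assumptions, not the analysis method of this work: the `Outage` record, the `summarize` function, the hour-based timestamps, and the convention that MTBF is the mean interval between successive failure onsets are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outage:
    """One observed failure: hypothetical record, times in hours since logging began."""
    start: float  # when the failure was observed
    end: float    # when the system was back in service


def summarize(outages: list[Outage], window_hours: float) -> dict[str, Optional[float]]:
    """Estimate availability, MTBF, and MTTR from outages within an observation window.

    Assumed conventions (not taken from the source text):
      availability = uptime / total observation time
      MTTR         = mean outage duration
      MTBF         = mean time between successive failure onsets
    """
    ordered = sorted(outages, key=lambda o: o.start)
    downtime = sum(o.end - o.start for o in ordered)
    availability = (window_hours - downtime) / window_hours

    # MTTR is undefined when no outage was observed.
    mttr = downtime / len(ordered) if ordered else None

    # MTBF needs at least two failures to form one interval.
    gaps = [b.start - a.start for a, b in zip(ordered, ordered[1:])]
    mtbf = sum(gaps) / len(gaps) if gaps else None

    return {"availability": availability, "mtbf_hours": mtbf, "mttr_hours": mttr}


# Made-up example data: three outages in a 1000-hour observation window.
example = [Outage(100.0, 102.0), Outage(400.0, 401.0), Outage(700.0, 703.0)]
result = summarize(example, window_hours=1000.0)
```

With very few observed failures these estimates are statistically weak, which is one reason field measurement typically requires long observation periods.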