Measurement of failure rate in widely distributed software

A problem that continues to plague researchers and practitioners in empirical failure rate measurement is how to measure the customer-perceived failure rate of commercial software. Even order-of-magnitude measures of failure rate are not truly available for widely distributed commercial software. Despite repeated reports on the criticality and significance of software, the industry still lacks real baselines. This paper reports the failure rate of a commercial software product of several million lines of code, distributed to hundreds of thousands of customers. To a first-order approximation, the MTBF is around 4 years and 2 years for successive releases of the software. The changes in failure rate as a function of severity, release, and time are also provided. The measurement technique establishes a direct link between failures and faults, providing an opportunity to study and describe the failure process. Two metrics are defined and characterized: the fault weight, the number of failures due to a fault, and the failure window, the length of time between the first and last failure caused by a fault. Both metrics are found to be higher for higher-severity faults, consistently across all severities and releases. At the same time, the window-to-weight ratio is invariant with severity. The fault weight and failure window are natural measures that give an intuitive picture of the failure process: the fault weight measures the impact of a fault on the overall failure rate, and the failure window measures the dispersion of that impact over time. Together they provide a new basis for discussion and an opportunity to better understand the processes involved.
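The two metrics defined above can be illustrated with a minimal Python sketch. The record layout (a fault identifier paired with a report date), the sample data, the choice of days as the time unit, and the function name `fault_metrics` are all assumptions made for illustration; the abstract does not describe how the underlying data were stored or processed.

```python
from collections import defaultdict
from datetime import date

# Hypothetical failure records: (fault id, date the failure was reported).
# These values are invented for illustration and are not data from the paper.
failures = [
    ("F1", date(1993, 1, 10)),
    ("F1", date(1993, 3, 1)),
    ("F1", date(1993, 6, 15)),
    ("F2", date(1993, 2, 5)),
]

def fault_metrics(records):
    """Return {fault_id: (fault_weight, failure_window_days)}.

    fault_weight   = number of failures attributed to the fault
    failure_window = days between the first and last such failure
    """
    dates_by_fault = defaultdict(list)
    for fault_id, reported_on in records:
        dates_by_fault[fault_id].append(reported_on)

    metrics = {}
    for fault_id, dates in dates_by_fault.items():
        weight = len(dates)
        window_days = (max(dates) - min(dates)).days
        metrics[fault_id] = (weight, window_days)
    return metrics

metrics = fault_metrics(failures)
for fault_id, (weight, window_days) in sorted(metrics.items()):
    # Window-to-weight ratio, the quantity the abstract reports as
    # invariant with severity.
    ratio = window_days / weight
    print(fault_id, weight, window_days, ratio)
```

In this sketch a fault that produced only one failure has a window of zero days, which is one possible convention; the paper may treat such cases differently.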
