Software defects and their impact on system availability: a study of field failures in operating systems

Defects reported between 1986 and 1989 in the MVS operating system are studied to gain the insight needed for a clear strategy to avoid or tolerate them. Typical defects (regular) are compared with those that corrupt a program's memory (overlay), because field services consider overlays particularly hard to find and fix. It is shown that the impact of an overlay defect is, on average, much higher than that of a regular defect; that boundary conditions and allocation management, not timing, are the major causes of overlay defects; and that most overlays are small and corrupt data near the data that the programmer meant to update. Further analysis is provided on defects in fixes to other defects, failure symptoms, and the impact of defects on customers.
