Failure data analysis of a large-scale heterogeneous server environment

The growing complexity of hardware and software mandates the recognition of fault occurrence in system deployment and management. While there are several techniques to prevent and/or handle faults, there continues to be a growing need for an in-depth understanding of system errors and failures and their empirical and statistical properties. This understanding can help evaluate the effectiveness of different techniques for improving system availability, in addition to developing new solutions. In this paper, we analyze the empirical and statistical properties of system errors and failures from a network of nearly 400 heterogeneous servers running a diverse workload over a year. While improvements in system robustness continue to limit the number of actual failures to a very small fraction of the recorded errors, the failure rates are significant and highly variable. Our results also show that the system error and failure patterns are comprised of time-varying behavior containing long stationary intervals. These stationary intervals exhibit various strong correlation structures and periodic patterns, which impact performance but also can be exploited to address such performance issues.

[1]  Ravishankar K. Iyer,et al.  Failure analysis and modeling of a VAXcluster system , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[2]  Richard P. Martin,et al.  Improving cluster availability using workstation validation , 2002, SIGMETRICS '02.

[3]  Ravishankar K. Iyer,et al.  Effect of System Workload on Operating System Reliability: A Study on IBM 3081 , 1985, IEEE Transactions on Software Engineering.

[4]  Mark Sullivan,et al.  Software defects and their impact on system availability-a study of field failures in operating systems , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[5]  James F. Ziegler,et al.  Terrestrial cosmic rays , 1996, IBM J. Res. Dev..

[6]  J. Kingman A FIRST COURSE IN STOCHASTIC PROCESSES , 1967 .

[7]  Kishor S. Trivedi,et al.  Analysis and implementation of software rejuvenation in cluster systems , 2001, SIGMETRICS '01.

[8]  Ravishankar K. Iyer,et al.  Software Dependability in the Tandem GUARDIAN System , 1995, IEEE Trans. Software Eng..

[9]  Lu Wei,et al.  Analysis of workload influence on dependability , 1988, [1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[10]  Michael R. Lyu,et al.  Software fault tolerance in a clustered architecture: techniques and reliability modeling , 1999, 1999 IEEE Aerospace Conference. Proceedings (Cat. No.99TH8403).

[11]  Samuel Karlin,et al.  A First Course on Stochastic Processes , 1968 .

[12]  Ravishankar K. Iyer,et al.  Measurement and modeling of computer reliability as affected by system activity , 1986, TOCS.

[13]  Ravishankar K. Iyer,et al.  Networked Windows NT system field failure data analysis , 1999, Proceedings 1999 Pacific Rim International Symposium on Dependable Computing.

[14]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[15]  Ravishankar K. Iyer,et al.  Impact of Correlated Failures on Dependability in a VAXcluster System , 1992 .

[16]  J. Raz,et al.  Automatic Statistical Analysis of Bivariate Nonstationary Time Series , 2001 .

[17]  Samiha Mourad,et al.  On the Reliability of the IBM MVS/XA Operating System , 1987, IEEE Transactions on Software Engineering.