A measurement-based model for estimation of resource exhaustion in operational software systems

Software systems are known to suffer from outages due to transient errors. Recently, the phenomenon of "software aging", in which the state of a software system degrades with time, has been reported (S. Garg et al., 1998). The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation. This degradation may eventually lead to performance loss, to crash or hang failures, or to both. Earlier work on detecting aging and estimating its effect on system resources did not take the system workload into account. In this paper, we propose a measurement-based model that estimates the rate of exhaustion of operating system resources as a function of both time and the system workload state. A semi-Markov reward model is constructed from workload and resource usage data collected from the UNIX operating system. We first identify distinct workload states using statistical cluster analysis and build a state-space model. For each resource, a reward function is then defined on the model based on the rate of resource exhaustion in each state. The model is then solved to obtain trends, estimated exhaustion rates, and the time-to-exhaustion for each resource. With these estimates, proactive fault management techniques such as "software rejuvenation" (Y. Huang et al., 1995) may be employed to prevent unexpected outages.
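The modeling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the workload states, the embedded transition matrix `P`, the mean sojourn times, and the per-state exhaustion rates have already been estimated from measurements; all function names and numbers below are hypothetical. The sketch computes the long-run fraction of time spent in each state of a semi-Markov process, weights the per-state reward (exhaustion) rates by those fractions, and divides the remaining amount of a resource by the resulting overall rate.

```python
def embedded_stationary(P, iters=10_000, tol=1e-12):
    """Stationary distribution of the embedded chain with transition matrix P.

    Uses a damped ("lazy") power iteration so that periodic chains, which are
    common when self-transitions are excluded, still converge.
    """
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iters):
        vp = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        nxt = [0.5 * a + 0.5 * b for a, b in zip(v, vp)]
        if max(abs(a - b) for a, b in zip(v, nxt)) < tol:
            return nxt
        v = nxt
    return v


def time_fractions(P, sojourn):
    """Long-run fraction of time in each state: v_i * h_i, normalized."""
    v = embedded_stationary(P)
    weighted = [vi * hi for vi, hi in zip(v, sojourn)]
    total = sum(weighted)
    return [w / total for w in weighted]


def exhaustion_estimate(P, sojourn, rates, available):
    """Overall exhaustion rate and a rough time-to-exhaustion.

    rates[i] is the assumed resource consumption rate while in state i;
    available is the amount of the resource still free.
    """
    pi = time_fractions(P, sojourn)
    overall = sum(p * r for p, r in zip(pi, rates))
    tte = available / overall if overall > 0 else float("inf")
    return overall, tte


# Hypothetical three-state workload model (illustrative values only).
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
sojourn = [10.0, 5.0, 20.0]   # mean time spent per visit, in seconds
rates = [1.0, 4.0, 8.0]       # resource units consumed per second, by state
overall, tte = exhaustion_estimate(P, sojourn, rates, available=10_500.0)
print(overall, tte)
```

Dividing the free amount by a constant overall rate is the simplest possible reading of "time-to-exhaustion"; the paper's solution of the reward model may differ in detail.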

[1]  Ravishankar K. Iyer,et al.  Effect of System Workload on Operating System Reliability: A Study on IBM 3081 , 1985, IEEE Transactions on Software Engineering.

[2]  S K Trivedi,et al.  The Analysis of Computer Systems Using Markov Reward Processes , 1987 .

[3]  P. Sen Estimates of the Regression Coefficient Based on Kendall's Tau , 1968 .

[4]  Yennun Huang,et al.  Software rejuvenation: analysis, module and applications , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[5]  Mark Sullivan,et al.  Software defects and their impact on system availability-a study of field failures in operating systems , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[6]  Ram Chillarege,et al.  Measurement of failure rate in widely distributed software , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[7]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[8]  Frank Feather,et al.  A case study of Ethernet anomalies in a distributed computing environment , 1990 .

[9]  Ravishankar K. Iyer,et al.  Analyze-NOW-an environment for collection and analysis of failures in a network of workstations , 1996, IEEE Trans. Reliab..

[10]  Ravishankar K. Iyer,et al.  Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data , 1990, IEEE Trans. Computers.

[11]  Kishor S. Trivedi,et al.  A unified performance reliability analysis of a system with a cumulative down time constraint , 1992 .

[12]  Ravishankar K. Iyer,et al.  Dependability Measurement and Modeling of a Multicomputer System , 1993, IEEE Trans. Computers.

[13]  Kishor S. Trivedi,et al.  Performability Modeling Based on Real Data: A Case Study , 1988, IEEE Trans. Computers.

[14]  Ravishankar K. Iyer,et al.  Predictability of Process Resource Usage: A Measurement-Based Study on UNIX , 1989, IEEE Trans. Software Eng..

[15]  Ravishankar K. Iyer,et al.  Identifying software problems using symptoms , 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.

[16]  Daniel P. Siewiorek,et al.  High-availability computer systems , 1991, Computer.

[17]  Sheldon M. Ross,et al.  Stochastic Processes , 2018, Gauge Integral Structures for Stochastic Calculus and Quantum Electrodynamics.

[18]  R. O. Gilbert Statistical Methods for Environmental Pollution Monitoring , 1987 .

[19]  Kishor S. Trivedi,et al.  A methodology for detection and estimation of software aging , 1998, Proceedings Ninth International Symposium on Software Reliability Engineering (Cat. No.98TB100257).

[20]  Kishor S. Trivedi,et al.  Performance And Reliability Analysis Of Computer Systems (an Example-based Approach Using The Sharpe Software Package) , 1997, IEEE Transactions on Reliability.

[21]  Yennun Huang,et al.  Two Techniques for Transient Software Error Recovery , 1994, Hardware and Software Architectures for Fault Tolerance.

[22]  Raymond Mariez,et al.  Performability Analysis Using Semi-Markov Reward Processes , 1990 .

[23]  Giuseppe Serazzi,et al.  Measurement and Tuning of Computer Systems , 1984, Int. CMG Conference.

[24]  Boudewijn R. Haverkort,et al.  Performance and reliability analysis of computer systems: An example-based approach using the sharpe software package , 1998 .