A workload-based analysis of software aging and rejuvenation

We present a hierarchical model for the analysis of proactive fault management in the presence of system resource leaks. At the low level of the model hierarchy is a degradation model in which we use a nonhomogeneous Markov chain to establish an explicit connection between resource leaks and the failure rate. With the degradation model, we prove that the failure rate is asymptotically constant in the absence of resource leaks and increases as leaks occur and accumulate, which confirms resource leaks as a source of aging. Proactive fault management (PFM) is modeled at the higher level as a semi-Markov process. The PFM model takes as input the degradation analysis from the low-level model and allows us to determine optimal rejuvenation schedules with respect to various system measures.
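To make the idea of an optimal rejuvenation schedule concrete, the following is a minimal sketch and not the paper's model. It assumes a Weibull time-to-failure as a stand-in for the failure rate produced by the degradation model (shape 1 gives a constant failure rate, shape above 1 an increasing one), a fixed rejuvenation interval `T`, and steady-state availability as the system measure, computed with a standard renewal-reward argument. All function names and parameter values are hypothetical.

```python
import math


def survival(t, scale, shape):
    """Weibull survival function R(t) = exp(-(t/scale)^shape) (assumed form)."""
    return math.exp(-((t / scale) ** shape))


def availability(T, scale, shape, repair_time, rejuv_time, steps=2000):
    """Steady-state availability when rejuvenating every T time units.

    One cycle ends either with a failure before T (followed by a repair)
    or with a planned rejuvenation at T. Availability is expected uptime
    divided by expected cycle length.
    """
    # Expected uptime per cycle: integral of R(t) over [0, T], trapezoid rule.
    h = T / steps
    up = 0.5 * (survival(0.0, scale, shape) + survival(T, scale, shape))
    for i in range(1, steps):
        up += survival(i * h, scale, shape)
    up *= h

    # Expected downtime per cycle: repair if the system failed before T,
    # otherwise the (shorter) planned rejuvenation.
    r_T = survival(T, scale, shape)
    down = (1.0 - r_T) * repair_time + r_T * rejuv_time
    return up / (up + down)


def best_interval(candidates, **params):
    """Return the candidate interval with the highest availability."""
    return max(candidates, key=lambda T: availability(T, **params))


if __name__ == "__main__":
    candidates = list(range(10, 301, 10))
    # Hypothetical numbers: repair takes ten times longer than rejuvenation.
    costs = dict(scale=100.0, repair_time=10.0, rejuv_time=1.0)
    print("increasing failure rate:", best_interval(candidates, shape=3.0, **costs))
    print("constant failure rate:  ", best_interval(candidates, shape=1.0, **costs))
```

Under these assumptions the sketch mirrors the qualitative message of the abstract: with a constant failure rate, rejuvenation only adds downtime, so the longest candidate interval wins, whereas with an increasing failure rate a finite interval maximizes availability.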
