A comprehensive model for software rejuvenation

Recently, the phenomenon of software aging, in which the state of a software system degrades with time, has been reported. This phenomenon, which may eventually lead to performance degradation or to crash/hang failure, results from exhaustion of operating system resources, data corruption, and accumulation of numerical errors. To counteract software aging, a technique called software rejuvenation has been proposed. It involves occasionally terminating an application or a system, cleaning its internal state and/or its environment, and restarting it. Since rejuvenation incurs an overhead, an important research issue is to determine the optimal times to initiate this action. In this paper, we first describe how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and study the treatment and recovery strategies for each of the fault classes. We then construct a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system. We identify different workload states using statistical cluster analysis, and estimate transition probabilities and sojourn time distributions from the data. For each resource, a reward function is then defined for the model based on the rate of resource depletion in each state. The model is then solved to obtain the estimated time to exhaustion for each resource. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal rejuvenation schedules that maximize availability or minimize downtime cost.
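
As a rough illustration of the reward-model step, the following Python sketch estimates a time to exhaustion from a semi-Markov workload model. It is not the paper's solution method: it assumes a long-run (steady-state) approximation in which the average depletion rate is the sojourn-time-weighted mean of the per-state rates, and all names (`P`, `sojourn`, `rate`, `available`) and the numbers in the usage note are hypothetical.

```python
def stationary_distribution(P, iterations=100_000, tol=1e-12):
    """Stationary distribution of the embedded discrete-time Markov chain.

    Uses a "lazy" power iteration (half stay, half move), which has the
    same stationary distribution as P but also converges when P is periodic.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        moved = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        nxt = [0.5 * a + 0.5 * b for a, b in zip(pi, moved)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


def time_to_exhaustion(P, sojourn, rate, available):
    """Approximate time until `available` units of a resource are used up.

    P         -- transition probabilities between workload states
    sojourn   -- mean sojourn time in each state
    rate      -- resource depletion rate (reward rate) in each state
    available -- amount of the resource currently free

    Assumption: the long-run average depletion rate is used, so short-term
    fluctuations and the initial state are ignored.
    """
    pi = stationary_distribution(P)
    # Fraction of time spent in each state is proportional to pi_i * h_i.
    time_weight = [p * h for p, h in zip(pi, sojourn)]
    avg_rate = sum(w * r for w, r in zip(time_weight, rate)) / sum(time_weight)
    if avg_rate <= 0:
        return float("inf")  # the resource is not being depleted on average
    return available / avg_rate
```

For example, with two hypothetical workload states that alternate (`P = [[0, 1], [1, 0]]`), mean sojourn times `[1, 3]`, depletion rates `[2, 6]`, and 100 free units, the system spends a quarter of its time in the first state, so the average rate is 5 units per time unit and the estimate is 20 time units.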
