Using machine learning for non-intrusive modeling and prediction of software aging

The widespread phenomenon of software (running image) aging is known to cause performance degradation, transient failures, or even crashes of applications. In this work we first describe a method for monitoring and modeling performance degradation in SOA applications, particularly application servers. This method works for a large class of aging processes caused by resource depletion (e.g., memory leaks). It can be deployed non-intrusively in a production environment, under arbitrary service request distributions. Building on this scheme, in the second part of the paper we investigate how machine learning (classification) algorithms can be used for proactive detection of performance degradation or sudden drops caused by aging. We complement the predictive power of these algorithms with several techniques that make the measurement-based aging models more adaptive and more robust against transient failures. We evaluate several state-of-the-art classification methods for their accuracy and computational efficiency in this scenario. The studies are performed on a data set generated by a TPC-W benchmark instrumented with a memory leak injector. The results show that the probing method yields accurate aging models with low overhead and that the machine learning approach gives statistically significant short-term predictions of degrading application performance. Both approaches can be used directly to fight aging via adaptive software rejuvenation (restart of the application), for operator alerting, or for short-term capacity planning.
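To make the idea of a measurement-based aging model concrete, the following is a minimal sketch, not the method of the paper. It assumes that periodic probe requests yield response-time samples, and that aging shows up as a roughly linear upward trend in those samples. The function names `fit_trend` and `time_to_threshold`, the millisecond units, and the threshold value are hypothetical choices made for illustration only.

```python
from statistics import mean


def fit_trend(times, values):
    """Fit an ordinary least-squares line; return (slope, intercept).

    Assumption: `times` are probe timestamps and `values` are the
    probe response times measured at those timestamps.
    """
    t_bar, v_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, v_bar - slope * t_bar


def time_to_threshold(times, values, threshold):
    """Extrapolate the fitted line to the time where it reaches `threshold`.

    Returns None when the trend is flat or improving, i.e. when the
    linear model gives no evidence of aging.
    """
    slope, intercept = fit_trend(times, values)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


# Hypothetical probe data: response time (ms) sampled once per hour.
hours = [0, 1, 2, 3]
response_ms = [100, 110, 120, 130]
predicted_hour = time_to_threshold(hours, response_ms, threshold=200)
```

An estimate of this kind could feed an operator alert or a rejuvenation schedule. A linear trend is only the simplest possible stand-in for an aging model, and real probe data would likely need smoothing against transient spikes.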
