Self-Caring IT Systems: A Proof-of-Concept Implementation in Virtualized Environments

In self-caring IT systems, faults are handled proactively, for example by slowing the deterioration of system health, thereby avoiding or delaying system failures. This requires health management, which entails health monitoring, diagnosis, prognosis, and the planning of recovery and remediation actions. We give a brief overview of our prior work, which proposes a general methodology for capturing system properties and incorporating health management using Petri nets. We then describe in detail an application of this formal method to the design and development of middleware that manages the health of a batch-based job submission system on a virtualized platform. First, we describe how a real-world job submission IT system is converted to a Petri net model. Second, we show how this model is used for system validation and analysis to understand the resource needs of different activities in the IT chain. Third, we describe how the executable model is used as a system manager to control the operation and health management of a virtualized test bed. Fourth, we illustrate the use of a feedback controller to manage health deterioration due to resource depletion in the job-execution stage of the modeled IT chain. Using a proof-of-concept implementation, we show that early detection and handling of health deterioration yields significant benefits in terms of cost savings and downtime reduction. Experimental results show that our health management framework can effectively prevent job failures while imposing low overhead on the managed system. For a typical workload consisting of jobs that suffer from potential resource depletion faults, our feedback controller gains the useful life needed for critical planning and remediation actions in up to 82% of the jobs.
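To make the idea of an executable Petri net model concrete, the following is a minimal sketch of a place/transition net for a job submission pipeline. It is not the model from the paper: the place names (`submitted`, `free_vm`, `running`, `done`), the transition names, and the helper `build_job_net` are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PetriNet:
    """A tiny place/transition net: marking maps place name -> token count."""

    marking: dict[str, int]
    # transition name -> (input arc weights, output arc weights)
    transitions: dict[str, tuple[dict[str, int], dict[str, int]]] = field(
        default_factory=dict
    )

    def add_transition(self, name: str, inputs: dict[str, int],
                       outputs: dict[str, int]) -> None:
        self.transitions[name] = (inputs, outputs)

    def enabled(self, name: str) -> bool:
        """A transition is enabled when every input place holds enough tokens."""
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) >= n for p, n in inputs.items())

    def fire(self, name: str) -> None:
        """Consume tokens from input places and produce tokens in output places."""
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place, n in inputs.items():
            self.marking[place] -= n
        for place, n in outputs.items():
            self.marking[place] = self.marking.get(place, 0) + n


def build_job_net(jobs: int, vms: int) -> PetriNet:
    """Hypothetical job pipeline: a job needs a free VM to start running."""
    net = PetriNet({"submitted": jobs, "free_vm": vms, "running": 0, "done": 0})
    net.add_transition("start", {"submitted": 1, "free_vm": 1}, {"running": 1})
    net.add_transition("finish", {"running": 1}, {"done": 1, "free_vm": 1})
    return net


if __name__ == "__main__":
    net = build_job_net(jobs=2, vms=1)
    net.fire("start")
    print(net.enabled("start"))  # the only VM is busy, so no second job can start
    net.fire("finish")
    print(net.marking)
```

In this sketch, resource contention shows up as a transition that is not enabled: with a single token in `free_vm`, a second job cannot start until `finish` returns the token. A system manager built on such a model could, under these assumptions, inspect the marking to decide which actions are currently permissible.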
