Diagnosing architectural run-time failures

Self-diagnosis is a fundamental capability of self-adaptive systems. To recover from faults, a system needs to know which of its parts is responsible for the incorrect behavior. In previous work we showed how to apply a design-time diagnosis technique at run time to identify faults at the architectural level of a system. This paper addresses three major shortcomings of that work: 1) we present an expressive, hierarchical language for describing system behavior, which can be used to detect when a system deviates from its expected behavior and which facilitates mapping low-level system events to architecture-level events; 2) we provide an automatic way to determine how much data to collect before an accurate diagnosis can be produced; and 3) we develop a technique that detects correlated faults between components. We validate our results experimentally by injecting several failures into a system and accurately diagnosing them with our algorithm.
