Architecture-Based Run-Time Fault Diagnosis

An important step in achieving robustness to run-time faults is the ability to detect and repair problems when they arise in a running system. Effective fault detection and repair could be greatly enhanced by run-time fault diagnosis and localization, since these would allow repair mechanisms to focus adaptation effort on the parts of the system most in need of attention. In this paper we describe an approach to run-time fault diagnosis that combines architectural models with spectrum-based reasoning for multiple fault localization. Spectrum-based reasoning is a lightweight technique that takes a form of trace abstraction as input and produces a list of likely fault candidates, ordered by probability. We show how this technique can be combined with architectural models to support run-time diagnosis that can (a) scale to modern distributed software systems; (b) accommodate black-box components and proprietary infrastructure for which neither a specification nor source code is available; and (c) handle inherent uncertainty about the probable cause of a problem, even in the presence of transient faults and of faults that arise only when certain combinations of system components interact.
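The two-step idea behind spectrum-based multiple fault localization can be illustrated with a small sketch. This is not the paper's implementation. It assumes a toy spectrum in which each row is a hypothetical architecture-level transaction (for example, one traced request), each column is a component, and an error vector marks which transactions failed. Step one generates diagnosis candidates as minimal hitting sets of the components involved in failed transactions. Step two ranks each candidate with Bayes' rule. Two simplifying assumptions are made here: a uniform per-component fault prior, and a fixed `health` probability that a faulty component still behaves correctly. Published spectrum-based reasoning estimates such health parameters (for example, by maximum likelihood) rather than fixing them. All names (`COMPONENTS`, `SPECTRUM`, `ERRORS`, `minimal_hitting_sets`, `likelihood`, `rank`, `prior`, `health`) are hypothetical.

```python
from itertools import combinations

# Hypothetical spectrum: rows are observed transactions, columns are components.
# A 1 means the component took part in that transaction.
COMPONENTS = ["web", "app", "db", "cache"]
SPECTRUM = [
    [1, 1, 1, 0],  # failed
    [1, 1, 0, 1],  # passed
    [1, 0, 1, 0],  # failed
    [1, 1, 0, 0],  # passed
]
ERRORS = [1, 0, 1, 0]  # 1 = transaction failed, 0 = transaction passed


def minimal_hitting_sets(spectrum, errors, max_size=3):
    """Candidates: minimal component sets that intersect every failed transaction."""
    n = len(spectrum[0])
    conflicts = [
        {j for j in range(n) if row[j]}
        for row, e in zip(spectrum, errors) if e
    ]
    if not conflicts:
        return [set()]  # no failures: the empty diagnosis explains everything
    candidates = []
    # Enumerate by increasing size, so any superset of an earlier hit is non-minimal.
    for size in range(1, max_size + 1):
        for combo in combinations(range(n), size):
            s = set(combo)
            if all(s & c for c in conflicts) and not any(p <= s for p in candidates):
                candidates.append(s)
    return candidates


def likelihood(candidate, spectrum, errors, health=0.5):
    """Pr(observations | candidate), assuming each faulty component behaves
    correctly with a fixed probability `health` (assumption: no MLE step)."""
    result = 1.0
    for row, e in zip(spectrum, errors):
        involved = sum(1 for j in candidate if row[j])
        p_pass = health ** involved  # transaction passes only if all involved faulty parts behave
        result *= (1.0 - p_pass) if e else p_pass
    return result


def rank(spectrum, errors, prior=0.01, health=0.5):
    """Return (posterior probability, candidate) pairs, most probable first."""
    n = len(spectrum[0])
    scored = []
    for cand in minimal_hitting_sets(spectrum, errors):
        p_prior = prior ** len(cand) * (1.0 - prior) ** (n - len(cand))
        scored.append((p_prior * likelihood(cand, spectrum, errors, health), cand))
    total = sum(score for score, _ in scored)
    if total == 0:
        return []
    return sorted(((score / total, c) for score, c in scored), key=lambda t: -t[0])


if __name__ == "__main__":
    for prob, cand in rank(SPECTRUM, ERRORS):
        print(f"{prob:.2f}", sorted(COMPONENTS[j] for j in cand))
```

With this toy data both `web` and `db` appear in every failed transaction, but `web` also appears in the passing ones, so the sketch ranks `db` ahead of `web`. Passing transactions that involve a candidate lower its posterior, which is how the ranking expresses uncertainty rather than a single verdict.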
