Grid Application Fault Diagnosis Using Wrapper Services and Machine Learning

With the increasing size and complexity of Grids, manual diagnosis of individual application faults becomes impractical and time-consuming. Quick and accurate identification of the root cause of failures is an important prerequisite for building reliable systems. We describe a pragmatic model-based technique for application-specific fault diagnosis based on indicators, symptoms, and rules. Customized wrapper services then apply this knowledge to reason about the root causes of failures. In addition to user-provided diagnosis models, we show that, given a set of previously classified fault events, it is possible to learn new models that can diagnose new faults. We investigated and compared supervised classification learning and cluster analysis algorithms. Our approach was implemented as part of the Otho Toolkit, which 'service-enables' legacy applications by synthesizing wrapper services.
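
The indicator, symptom, and rule layering described above can be illustrated with a minimal sketch. It is not the Otho Toolkit's implementation: the class names (`Symptom`, `Rule`), the `diagnose` function, the indicator keys (`exit_code`, `stderr`, `output_file_exists`), and the example root causes are all hypothetical, chosen only to show how observed indicators might be mapped to symptoms and how rules over symptoms might yield candidate root causes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

# Indicators: raw observations collected around an application run (assumed keys).
Indicators = Dict[str, object]


@dataclass(frozen=True)
class Symptom:
    """A named predicate evaluated over the observed indicators."""
    name: str
    holds: Callable[[Indicators], bool]


@dataclass(frozen=True)
class Rule:
    """Maps a set of required symptoms to a hypothesized root cause."""
    required: FrozenSet[str]
    root_cause: str


def diagnose(indicators: Indicators,
             symptoms: List[Symptom],
             rules: List[Rule]) -> List[str]:
    """Return the root causes of all rules whose required symptoms are present."""
    active = {s.name for s in symptoms if s.holds(indicators)}
    return [r.root_cause for r in rules if r.required <= active]


# Hypothetical application-specific diagnosis model.
SYMPTOMS = [
    Symptom("nonzero_exit", lambda i: i.get("exit_code", 0) != 0),
    Symptom("missing_output", lambda i: not i.get("output_file_exists", True)),
    Symptom("disk_full_message",
            lambda i: "No space left" in str(i.get("stderr", ""))),
]

RULES = [
    Rule(frozenset({"nonzero_exit", "disk_full_message"}),
         "disk quota exceeded"),
    Rule(frozenset({"nonzero_exit", "missing_output"}),
         "application crashed before writing output"),
]

observed = {
    "exit_code": 1,
    "stderr": "write error: No space left on device",
    "output_file_exists": True,
}
result = diagnose(observed, SYMPTOMS, RULES)
print(result)
```

In this sketch the model is plain data, so a user-provided model and a learned one could in principle share the same `diagnose` function; how the actual system represents and learns its models is not specified in this abstract.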
