Diagnosis of recurrent faults using log files

Enterprise software systems (ESS) are becoming larger and increasingly complex. Failures in business-critical systems are expensive, with consequences such as loss of critical data, lost sales, customer dissatisfaction, and even lawsuits. Detecting failures and diagnosing their root causes in a timely manner is therefore essential. Many studies suggest that a large fraction of failures encountered in practice are recurrent (i.e., they have been seen before). Fast and accurate detection of these failures can accelerate problem determination and thereby improve system reliability. To this end, we explore machine learning techniques, including the Naïve Bayes classifier, partially supervised learning, and decision trees (using C4.5), to automatically recognize symptoms of recurrent faults and to derive detection rules from samples of log data. This work focuses on log files because they are readily available and place no additional computational burden on the component generating the data. The methods explored in this work can aid the development of tools that assist support personnel in problem determination tasks. Instead of requiring operators to manually define patterns for identifying recurrent problems, such tools can be trained on prior cases, both solved and unsolved, from existing support databases.
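
To make the classification idea concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier that assigns a log line to a previously seen fault class. It is an illustration under stated assumptions, not the implementation evaluated in this work: the class name `NaiveBayesLogClassifier`, the tokenizer, the fault labels, and the sample log lines are all hypothetical, and Laplace (add-one) smoothing is one common choice among several.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(line):
    # Hypothetical tokenizer: lowercase alphabetic runs only, so numeric
    # fields such as timestamps and counters do not become features.
    return re.findall(r"[a-z]+", line.lower())


class NaiveBayesLogClassifier:
    """Multinomial Naive Bayes over log tokens with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()             # labeled samples per fault class
        self.token_counts = defaultdict(Counter)  # token frequencies per fault class
        self.vocab = set()                        # all tokens seen during training

    def fit(self, samples):
        # samples: iterable of (log_text, fault_label) pairs, e.g. solved cases
        for text, label in samples:
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data standing in for labeled cases from a support database.
TRAINING = [
    ("ERROR db connection timeout after 30s", "db_outage"),
    ("ERROR db pool exhausted, connection refused", "db_outage"),
    ("WARN disk usage at 95 percent on /var", "disk_full"),
    ("ERROR write failed: no space left on disk", "disk_full"),
]

clf = NaiveBayesLogClassifier()
clf.fit(TRAINING)
label = clf.predict("ERROR connection to db refused")
print(label)
```

In this sketch a new log line is matched to the fault class whose training logs share the most characteristic tokens, which mirrors the idea of recognizing symptoms of recurrent faults from prior cases rather than from hand-written patterns.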
