Towards Detecting Patterns in Failure Logs of Large-Scale Distributed Systems

The ability to automatically detect faults or fault patterns to enhance system reliability is important for system administrators in reducing system failures. To achieve this objective, the message logs from cluster system are augmented with failure information, i.e., The raw log data is labelled. However, tagging or labelling of raw log data is very costly. In this paper, our objective is to detect failure patterns in the message logs using unlabelled data. To achieve our aim, we propose a methodology whereby a pre-processing step is first performed where redundant data is removed. A clustering algorithm is then executed on the resulting logs, and we further developed an unsupervised algorithm to detect failure patterns in the clustered log by harnessing the characteristics of these sequences. We evaluated our methodology on large production data, and results shows that, on average, an f-measure of 78% can be obtained without having data labels. The implication of our methodology is that a system administrator with little knowledge of the system can detect failure runs with reasonably high accuracy.

[1]  Jon Stearley,et al.  What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[2]  Miroslaw Malek,et al.  Comprehensive logfiles for autonomic systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[3]  Mark Crovella,et al.  Mining anomalies using traffic feature distributions , 2005, SIGCOMM '05.

[4]  Michal Aharon,et al.  One Graph Is Worth a Thousand Logs: Uncovering Hidden Structures in Massive System Event Logs , 2009, ECML/PKDD.

[5]  P. N. Suganthan,et al.  Differential Evolution: A Survey of the State-of-the-Art , 2011, IEEE Transactions on Evolutionary Computation.

[6]  Edward Chuah,et al.  Diagnosing the root-causes of failures from cluster log files , 2010, 2010 International Conference on High Performance Computing.

[7]  Ravishankar K. Iyer,et al.  Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data , 1990, IEEE Trans. Computers.

[8]  Saharon Rosset,et al.  Analyzing system logs: a new view of what's important , 2007 .

[9]  Michael I. Jordan,et al.  Detecting large-scale system problems by mining console logs , 2009, SOSP '09.

[10]  Anand Sivasubramaniam,et al.  Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[11]  Daniel P. Siewiorek,et al.  Models for time coalescence in event logs , 1992, [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing.

[12]  Edward Chuah,et al.  Establishing Hypothesis for Recurrent System Failures from Cluster Log Files , 2011, 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing.

[13]  Jianfeng Zhan,et al.  LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems , 2010, 2012 IEEE 31st Symposium on Reliable Distributed Systems.

[14]  Tommy Minyard,et al.  End-to-end framework for fault management for open source clusters: Ranger , 2010, TG.

[15]  Ravishankar K. Iyer,et al.  Recognition of Error Symptoms in Large Systems , 1986, FJCC.

[16]  Franck Cappello,et al.  Adaptive event prediction strategy with dynamic time window for large-scale HPC systems , 2011, SLAML '11.

[17]  Carl E. Landwehr,et al.  Basic concepts and taxonomy of dependable and secure computing , 2004, IEEE Transactions on Dependable and Secure Computing.

[18]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[19]  Jon Stearley,et al.  Bad Words: Finding Faults in Spirit's Syslogs , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[20]  Alexander Aiken,et al.  Alert Detection in System Logs , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[21]  AvizienisAlgirdas,et al.  Basic Concepts and Taxonomy of Dependable and Secure Computing , 2004 .

[22]  Ling Huang,et al.  Mining Console Logs for Large-Scale System Problem Detection , 2008, SysML.

[23]  Elizabeth R. Jessup,et al.  Matrices, Vector Spaces, and Information Retrieval , 1999, SIAM Rev..

[24]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[25]  Anand Sivasubramaniam,et al.  BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[26]  Zhiling Lan,et al.  System log pre-processing to improve failure prediction , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[27]  Glenn A. Fink,et al.  Predicting Computer System Failures Using Support Vector Machines , 2008, WASL.

[28]  Franck Cappello,et al.  Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.