Diagnosing the root-causes of failures from cluster log files

System event logs are often the primary source of information for diagnosing (and predicting) the causes of failures for cluster systems. Due to interactions among the system hardware and software components, the system event logs for large cluster systems are comprised of streams of interleaved events, and only a small fraction of the events over a small time span are relevant to the diagnosis of a given failure. Furthermore, the process of troubleshooting the causes of failures is largely manual and ad-hoc. In this paper, we present a systematic methodology for reconstructing event order and establishing correlations among events which indicate the root-causes of a given failure from very large syslogs. We developed a diagnostics tool, FDiag, to extract the log entries as structured message templates and uses statistical correlation analysis to establish probable cause and effect relationships for the fault being analyzed. We applied FDiag to analyze failures due to br eakdowns in interactions between the Lustre file system and its clients on the Ranger supercomputer at the Texas Advanced Computing Center (TACC). The results are positive. FDiag is able to identify the dates and the time periods that contain the significant events which eventually led to the occurrence of compute node soft lockups.

[1]  Wei Peng,et al.  A Clustering Model Based on Matrix Approximation with Applications to Cluster System Log Files , 2005, ECML.

[2]  Zhiling Lan,et al.  Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.

[3]  Heikki Mannila,et al.  Discovering Frequent Episodes in Sequences , 1995, KDD.

[4]  Risto Vaarandi,et al.  Mining event logs with SLCT and LogHound , 2008, NOMS 2008 - 2008 IEEE Network Operations and Management Symposium.

[5]  Alan Agresti,et al.  Statistics: The Art and Science of Learning from Data , 2005 .

[6]  Ling Huang,et al.  Mining Console Logs for Large-Scale System Problem Detection , 2008, SysML.

[7]  Anand Sivasubramaniam,et al.  BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[8]  Felix Salfner,et al.  Error Log Processing for Accurate Failure Prediction , 2008, WASL.

[9]  Zhiling Lan,et al.  System log pre-processing to improve failure prediction , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[10]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.

[11]  Yin Zhang,et al.  Troubleshooting chronic conditions in large IP networks , 2008, CoNEXT '08.

[12]  Qi Zhao,et al.  Towards automated performance diagnosis in a large IPTV network , 2009, SIGCOMM '09.

[13]  Ravishankar K. Iyer,et al.  Recognition of Error Symptoms in Large Systems , 1986, FJCC.

[14]  Jon Stearley,et al.  Bad Words: Finding Faults in Spirit's Syslogs , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[15]  Alexander Aiken,et al.  Alert Detection in System Logs , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[16]  Zhiling Lan,et al.  Toward Automated Anomaly Identification in Large-Scale Systems , 2010, IEEE Transactions on Parallel and Distributed Systems.

[17]  Daniel P. Siewiorek,et al.  Models for time coalescence in event logs , 1992, [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing.

[18]  John Stearley,et al.  Towards informatic analysis of syslogs , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).

[19]  Zhi-Li Zhang,et al.  Extracting the textual and temporal structure of supercomputing logs , 2009, 2009 International Conference on High Performance Computing (HiPC).

[20]  Jon Stearley,et al.  What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[21]  Michal Aharon,et al.  One Graph Is Worth a Thousand Logs: Uncovering Hidden Structures in Massive System Event Logs , 2009, ECML/PKDD.

[22]  Christopher D. Carothers,et al.  An analysis of clustered failures on large supercomputing systems , 2009, J. Parallel Distributed Comput..

[23]  Evangelos E. Milios,et al.  Clustering event logs using iterative partitioning , 2009, KDD.

[24]  John P. Rouillard Real-time Log File Analysis Using the Simple Event Correlator (SEC) , 2004, LISA.

[25]  Anand Sivasubramaniam,et al.  Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[26]  Saharon Rosset,et al.  Analyzing system logs: a new view of what's important , 2007 .

[27]  Stephen E. Hansen,et al.  Automated System Monitoring and Notification with Swatch , 1993, LISA.

[28]  Sébastien Tricaud,et al.  Picviz: Finding a Needle in a Haystack , 2008, WASL.