Establishing Hypothesis for Recurrent System Failures from Cluster Log Files

A goal of supercomputer log analysis is to establish causal relationships among events that reflect significant state changes in the system. Establishing these relationships is at the heart of failure diagnosis. In principle, a log analysis tool could automate many of the manual steps that systems administrators currently perform to diagnose system failures. However, supercomputer logs are unstructured, incomplete, and highly ambiguous, which makes direct discovery of causal relationships difficult. This paper describes the second-generation FDiag log-based failure diagnostics framework, which automates the manual failure diagnosis process and determines, with high confidence, the likely cause of a failure, the components involved, and the event sequences that contain the times of the causal and terminal events. FDiag extracts relevant events from the system logs, performs correlation analysis on these events, and from these correlations determines the components involved and the event sequences. The diagnostic capabilities of FDiag are validated by comparing its assessments against known instances of recurrent failures on the Ranger supercomputer at the University of Texas at Austin. We believe FDiag is the first log analyzer to demonstrate this level of diagnostic capability from the system logs of an open source software stack incorporating Linux and the Lustre file system. FDiag will be put into production use to support failure diagnosis on Ranger in September 2011.
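
The extract-then-correlate idea described above can be illustrated with a small sketch. This is not FDiag's implementation: the line format, the keyword list, the fixed-width time windows, and the use of Pearson correlation are all assumptions chosen for illustration, and the sample log lines are invented.

```python
import re
from math import sqrt

# Assumed (hypothetical) line format: "<epoch_seconds> <node> <message>"
LINE_RE = re.compile(r"^(\d+)\s+(\S+)\s+(.*)$")


def extract_events(lines, keywords):
    """Return (timestamp, node, keyword) for each line whose message contains a keyword."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        ts, node, msg = int(m.group(1)), m.group(2), m.group(3)
        for kw in keywords:
            if kw in msg:
                events.append((ts, node, kw))
    return events


def window_counts(events, keyword, start, end, width):
    """Count occurrences of one keyword in consecutive fixed-width time windows."""
    n_windows = (end - start + width - 1) // width
    counts = [0] * n_windows
    for ts, _node, kw in events:
        if kw == keyword and start <= ts < end:
            counts[(ts - start) // width] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length count series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


# Invented sample log, for illustration only.
log = [
    "0 oss1 LustreError: request timed out",
    "5 c101 kernel: soft lockup detected",
    "12 c102 sshd: session opened",
    "60 oss1 LustreError: request timed out",
    "61 oss1 LustreError: request timed out",
    "65 c101 kernel: soft lockup detected",
    "70 c103 kernel: soft lockup detected",
    "130 c104 heartbeat ok",
]

events = extract_events(log, ["LustreError", "soft lockup"])
lustre = window_counts(events, "LustreError", start=0, end=180, width=60)
lockup = window_counts(events, "soft lockup", start=0, end=180, width=60)
r = pearson(lustre, lockup)

# Nodes that reported either event type are candidate "components involved".
nodes = sorted({node for _ts, node, _kw in events})
print(lustre, lockup, r, nodes)
```

A high correlation between two event types in this sketch only flags them as candidates for closer inspection; it does not by itself establish which event is causal.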
