Using Message Logs and Resource Use Data for Cluster Failure Diagnosis

Failure diagnosis for large compute clusters using only message logs is known to be incomplete. The recent availability of resource use data provides another potentially useful source of data for failure detection and diagnosis. Early work combining message logs and resource use data for failure diagnosis has shown promising results. This paper describes the CRUMEL framework, which implements a new approach to combining rationalized message logs and resource use data for failure diagnosis. CRUMEL identifies patterns of errors and resource use and correlates these patterns by time with system failures. Application of CRUMEL to data from the Ranger supercomputer has yielded improved diagnoses over previous research. CRUMEL has: (i) shown that additional events correlated with system failures can be identified only by applying different correlation algorithms, (ii) confirmed six groups of errors, (iii) identified Lustre I/O resource use counters that are correlated with the occurrence of Lustre faults and are therefore potential flags for online detection of failures, (iv) matched the dates of correlated error events and correlated resource use with the dates of compute node hang-ups, and (v) identified two more error groups associated with compute node hang-ups. The pre-processed data will be placed in the public domain in September 2016.
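
To illustrate what correlating a pattern "by time" with system failures can mean, the following is a minimal Python sketch. It is not CRUMEL's algorithm: the abstract does not specify which correlation algorithms are used, so a plain Pearson coefficient with a time lag stands in here as an assumption. The function names (`pearson`, `lagged_correlation`) and the hourly counts are hypothetical and exist only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(vx * vy)


def lagged_correlation(signal, failures, lag):
    """Correlate signal at time t with failures at time t + lag (lag >= 0)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(signal, failures)
    # Drop the last `lag` signal samples and the first `lag` failure samples
    # so that signal[t] is paired with failures[t + lag].
    return pearson(signal[:-lag], failures[lag:])


# Hypothetical hourly data (not taken from the Ranger logs):
# counts of one error event type, and whether a node hang-up was recorded.
lustre_errors = [0, 1, 0, 7, 9, 0, 0, 8, 0, 0]
node_hangs = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]

same_hour = lagged_correlation(lustre_errors, node_hangs, lag=0)
one_hour_ahead = lagged_correlation(lustre_errors, node_hangs, lag=1)
print(f"lag 0: {same_hour:.2f}, lag 1: {one_hour_ahead:.2f}")
```

In this toy data the error bursts precede the hang-ups by one hour, so the lag-1 coefficient is much larger than the lag-0 coefficient. A signal that leads failures in this way is the kind of pattern that could serve as a flag for online detection; the same function could be applied to a resource use counter in place of the error counts.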
