CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems

Console logs have proven useful to system administrators for detecting errors in large-scale distributed systems. However, such logs are typically redundant and incomplete, which makes accurate detection very difficult. To increase accuracy, we complement these console logs with resource usage data, which captures the resource utilisation of every job in the system. We then develop a novel error detection methodology, the CRUDE approach, that uses both the resource usage data and the console logs. We make the following specific technical contributions: we develop (i) a clustering algorithm to group nodes with similar behaviour, (ii) an anomaly detection algorithm to identify jobs with anomalous resource usage, and (iii) an algorithm that links jobs with anomalous resource usage to erroneous nodes. We evaluate our approach using console logs and resource usage data from the Ranger supercomputer. Our results are positive: (i) our approach detects errors with a true positive rate of about 80%, and (ii) compared with the well-known Nodeinfo error detection algorithm, it provides an average improvement of around 85%, with a best-case improvement of 250%.
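
As an illustration only, contributions (ii) and (iii) can be sketched as follows. The abstract does not specify the algorithms used, so this sketch substitutes a simple median/MAD outlier test for the anomaly detector and a set intersection for the linking step; the function names, the threshold, and all data values are hypothetical and are not taken from the paper or from the Ranger data.

```python
from statistics import median

def anomalous_jobs(usage, threshold=3.5):
    """Return ids of jobs whose usage has a modified z-score above threshold.

    Stand-in detector (median/MAD), NOT the paper's algorithm.
    """
    values = list(usage.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return set()  # no spread, so nothing can be called an outlier
    return {job for job, v in usage.items()
            if 0.6745 * abs(v - med) / mad > threshold}

def link_to_nodes(flagged, job_nodes, error_nodes):
    """Map each flagged job to the nodes it ran on that also logged errors."""
    links = {}
    for job in flagged:
        hit = set(job_nodes.get(job, ())) & error_nodes
        if hit:
            links[job] = hit
    return links

# Hypothetical inputs: one usage figure per job, job-to-node placement,
# and the set of nodes with error entries in the console logs.
usage = {"j1": 10.0, "j2": 11.0, "j3": 9.5, "j4": 10.5, "j5": 48.0}
job_nodes = {"j1": ["n1"], "j2": ["n2"], "j3": ["n1"],
             "j4": ["n3"], "j5": ["n2", "n4"]}
error_nodes = {"n4", "n7"}

flagged = anomalous_jobs(usage)
links = link_to_nodes(flagged, job_nodes, error_nodes)
print(flagged, links)
```

With these made-up inputs, only job `j5` is flagged, and it is linked to node `n4`, the one node that both ran the job and logged an error. A real pipeline would work on multi-dimensional usage vectors and on node clusters, as in contribution (i).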
