Failure Diagnosis for Cluster Systems using Partial Correlations

Failures in HPC (High-Performance Computing) systems are expensive. Effective diagnosis of system failures is therefore needed to improve system reliability from both a remedial and a preventive perspective. As HPC systems log resource usage and system events extensively, parsing this data is an oft-advocated basis for failure diagnosis. However, the high levels of concurrency in HPC systems cause system events to interleave frequently in time, so that certain interactions appear to be, or become, indirect, and such interactions are missed by current failure diagnostics techniques. To help uncover such indirect interactions, in this paper we develop a novel approach that leverages the concept of partial correlation. The resulting failure diagnostics workflow, called IFADE, extracts partial correlations of resource use counters and partial correlations of system errors. As part of our contributions, we (a) compare our diagnostics approach with current ones, (b) identify two previously unknown causes of system failures, validated by system designers, and (c) provide insights into Lustre I/O and segmentation faults. IFADE has been released into the public domain to support system administrators in failure diagnosis.
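
To make the notion of partial correlation concrete, the following is a minimal sketch and not IFADE's implementation. It assumes three synthetic, hypothetical counters (io_wait, mem_use, net_err), where io_wait drives the other two, and computes the partial correlation of mem_use and net_err by correlating the residuals left after regressing each on io_wait. The helper name partial_correlation and all data are assumptions made for illustration.

```python
import numpy as np

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z.

    Illustrative helper (not from IFADE): regress x and y on z with an
    intercept, then correlate the two residual series.
    """
    design = np.column_stack([np.ones(len(x)), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Synthetic, hypothetical counters: io_wait influences both other series,
# which have no direct link to each other.
rng = np.random.default_rng(0)
n = 5000
io_wait = rng.normal(size=n)
mem_use = io_wait + 0.5 * rng.normal(size=n)
net_err = io_wait + 0.5 * rng.normal(size=n)

plain = float(np.corrcoef(mem_use, net_err)[0, 1])
partial = partial_correlation(mem_use, net_err, io_wait)

print(f"plain correlation:   {plain:.2f}")
print(f"partial correlation: {partial:.2f}")
```

In this synthetic setup the plain correlation between mem_use and net_err is high, while the partial correlation given io_wait is close to zero, which indicates that the two counters are related only through the third one.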
