Failure prediction for HPC systems and applications

As large-scale systems evolve towards post-petascale computing, it is crucial to provide fault-tolerance strategies that minimize the effects of faults on applications. By far the most popular technique is the checkpoint–restart strategy. A complement to this classical approach is failure avoidance, in which the occurrence of a fault is predicted and proactive measures are taken. This requires a reliable prediction system that anticipates failures and their locations. One way of providing prediction is to analyse the system logs that large-scale systems generate during production. Current approaches in this field have a number of limitations that make them unusable on real production high-performance computing (HPC) systems. Based on our observation that different failures have different distributions and behaviours, we propose a novel hybrid approach that combines signal analysis with data mining to overcome these limitations. We show that, by analysing each event according to its specific behaviour, our prediction achieves a precision of over 90% and is able to discover about 50% of all failures in a system, a result that allows its integration into proactive fault-tolerance protocols.
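The two figures quoted above correspond to the usual notions of precision (the fraction of predictions that are followed by a real failure) and recall (the fraction of real failures that were predicted). The following is a minimal sketch of how such figures could be computed for a predictor that reports a time and a location. It is not the evaluation procedure of this work: the `Event` type, the `evaluate` function, and the 600-second matching window are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: float  # seconds since the start of the log (assumed unit)
    node: str    # location identifier (hypothetical naming)


def evaluate(predictions, failures, window=600.0):
    """Return (precision, recall) for a list of predicted and actual failures.

    A prediction matches a failure when both refer to the same node and the
    failure occurs no more than `window` seconds after the prediction.
    The window length is an illustrative assumption.
    """
    def matches(pred, fail):
        return pred.node == fail.node and 0.0 <= fail.time - pred.time <= window

    # Predictions that were followed by at least one matching failure.
    correct = sum(1 for p in predictions if any(matches(p, f) for f in failures))
    # Failures that were announced by at least one matching prediction.
    caught = sum(1 for f in failures if any(matches(p, f) for p in predictions))

    precision = correct / len(predictions) if predictions else 0.0
    recall = caught / len(failures) if failures else 0.0
    return precision, recall


# Toy data, invented for this example only.
predictions = [Event(100.0, "n1"), Event(500.0, "n2")]
failures = [
    Event(300.0, "n1"),   # predicted (same node, 200 s later)
    Event(2000.0, "n2"),  # missed (outside the window)
    Event(50.0, "n3"),    # missed (no prediction for this node)
    Event(900.0, "n2"),   # predicted (same node, 400 s later)
]
precision, recall = evaluate(predictions, failures)
```

In this toy data every prediction is followed by a failure while half of the failures go unannounced, which mirrors the high-precision, moderate-recall profile described above.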
