Fault prediction under the microscope: A closer look into HPC systems

A large percentage of computing capacity in today's large high-performance computing systems is wasted because of failures. Consequently current research is focusing on providing fault tolerance strategies that aim to minimize fault's effects on applications. By far the most popular technique is the checkpointrestart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. Thus far, research in this field has used ideal predictors that were not implemented in real HPC systems. In this paper, we merge signal analysis concepts with data mining techniques to extend the ELSA (Event Log Signal Analyzer) toolkit and offer an adaptive and more efficient prediction module. Our goal is to provide models that characterize the normal behavior of a system and the way faults affect it. Being able to detect deviations from normality quickly is the foundation of accurate fault prediction. However, this is challenging because component failure dynamics are heterogeneous in space and time. To this end, a large part of the paper is focused on a detailed analysis of the prediction method, by applying it to two large-scale systems and by investigating the characteristics and bottlenecks of each step of the prediction process. Furthermore, we analyze the prediction's precision and recall impact on current checkpointing strategies and highlight future improvements and directions for research in this field.

[1]  Josep Torrellas,et al.  Rebound: Scalable checkpointing for coherent shared memory , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[2]  Franck Cappello,et al.  Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[3]  Anand Sivasubramaniam,et al.  BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[4]  Zhiling Lan,et al.  A practical failure prediction with location and lead time for Blue Gene/P , 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).

[5]  Zhiling Lan,et al.  System log pre-processing to improve failure prediction , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[6]  Jianfeng Zhan,et al.  LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems , 2010, 2012 IEEE 31st Symposium on Reliable Distributed Systems.

[7]  Jon Stearley,et al.  What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[8]  Ling Huang,et al.  Online System Problem Detection by Mining Patterns of Console Logs , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[9]  Franck Cappello,et al.  Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[10]  Maguelonne Teisseire,et al.  Mining Frequent Gradual Itemsets from Large Databases , 2009, IDA.

[11]  Domenico Cotroneo,et al.  Improving Log-based Field Failure Data Analysis of multi-node computing systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).

[12]  Roy C. Milton,et al.  An Extended Table of Critical Values for the Mann-Whitney (Wilcoxon) Two-Sample Statistic , 1964 .

[13]  Laxmikant V. Kalé,et al.  FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).

[14]  Alexandru Iosup,et al.  Analysis and modeling of time-correlated failures in large-scale distributed systems , 2010, 2010 11th IEEE/ACM International Conference on Grid Computing.

[15]  Franck Cappello,et al.  Event Log Mining Tool for Large Scale HPC Systems , 2011, Euro-Par.

[16]  Franck Cappello,et al.  Adaptive event prediction strategy with dynamic time window for large-scale HPC systems , 2011, SLAML '11.

[17]  Anand Sivasubramaniam,et al.  Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.

[18]  Franck Cappello,et al.  Checkpointing vs. Migration for Post-Petascale Supercomputers , 2010, 2010 39th International Conference on Parallel Processing.

[19]  Franck Cappello,et al.  FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[20]  Christian Engelmann,et al.  Blue Gene/L Log Analysis and Time to Interrupt Estimation , 2009, 2009 International Conference on Availability, Reliability and Security.

[21]  Christian Engelmann,et al.  Proactive process-level live migration in HPC environments , 2008, HiPC 2008.

[22]  Malgorzata Steinder,et al.  A survey of fault localization techniques in computer networks , 2004, Sci. Comput. Program..

[23]  Qiang Fu,et al.  Mining dependency in distributed systems through unstructured logs analysis , 2010, OPSR.

[24]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[25]  Zhiling Lan,et al.  Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.

[26]  Nithin Nakka,et al.  Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[27]  Alexandre Termier,et al.  PGP-mc: Towards a Multicore Parallel Approach for Mining Gradual Patterns , 2010, DASFAA.

[28]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.

[29]  Ravishankar K. Iyer,et al.  Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data , 1990, IEEE Trans. Computers.