Failure prediction : what to do with unpredicted failures ?

As large parallel systems increase in size and complexity, failures are inevitable and exhibit complex space and time dynamics. Several key results have demonstrated that recent advances in event log analysis can provide precise failure prediction. The state of the art in failure prediction provides a ratio of correctly identified failures to the number of all predicted failures of over 90% and able to discover around 50% of all failures in a system. However, large parts of failures are not predicted and are considered as false negative alerts. Therefore, developing efficient fault tolerance strategies to tolerate failures requires a good perception and understanding of failure prediction characteristics. To understand the properties of false negative alerts, we conducted a statistical analysis of the probability distribution of such alerts and their impact on fault tolerance techniques. specifically we studied failures logs from different HPC production systems. We show that (i) the false negative distribution has the same nature as the failure distribution (ii) After adding failure prediction, we were able to infer statistical models that describe the inter-arrival time between false negative alerts and hence current fault tolerance can be applied to these systems. Moreover, we show that the current failures traces have a high correlation between the failure inter-arrival time that can be used to improve the failure prediction mechanism. Another important result is that checkpoint intervals can still be computed from an existing first-order formula.

[1]  Franck Cappello,et al.  Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[2]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.

[3]  John T. Daly,et al.  Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters , 2010, HPDC '10.

[4]  James D. Knoke Testing for randomness against autocorrelation: Alternative tests , 1977 .

[5]  J. Wolfowitz,et al.  On a Test Whether Two Samples are from the Same Population , 1940 .

[6]  Jean-Marc Vincent,et al.  Discovering Statistical Models of Availability in Large Distributed Systems: An Empirical Study of SETI@home , 2011, IEEE Transactions on Parallel and Distributed Systems.

[7]  Zhiling Lan,et al.  Adaptive Fault Management of Parallel Applications for High-Performance Computing , 2008, IEEE Transactions on Computers.

[8]  John T. Daly,et al.  A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..

[9]  Christopher D. Carothers,et al.  An analysis of clustered failures on large supercomputing systems , 2009, J. Parallel Distributed Comput..

[10]  Mark S. Squillante,et al.  Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.

[11]  J. V. Bradley Distribution-Free Statistical Tests , 1968 .

[12]  E. L. Lehmann,et al.  Theory of point estimation , 1950 .

[13]  Christian Engelmann,et al.  Blue Gene/L Log Analysis and Time to Interrupt Estimation , 2009, 2009 International Conference on Availability, Reliability and Security.

[14]  Franck Cappello,et al.  Event Log Mining Tool for Large Scale HPC Systems , 2011, Euro-Par.

[15]  Dhabaleswar K. Panda,et al.  Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[16]  Nithin Nakka,et al.  Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[17]  Yuanyuan Zhou,et al.  Understanding Customer Problem Troubleshooting from Storage System Logs , 2009, FAST.

[18]  Emmanuel Jeannot,et al.  Optimizing performance and reliability on heterogeneous parallel systems: Approximation algorithms and heuristics , 2012, J. Parallel Distributed Comput..

[19]  F. Massey The Kolmogorov-Smirnov Test for Goodness of Fit , 1951 .

[20]  Miroslaw Malek,et al.  A survey of online failure prediction methods , 2010, CSUR.

[21]  D. B. Owen,et al.  Confidence intervals for the coefficient of variation for the normal and log normal distributions , 1964 .

[22]  Denis Trystram,et al.  On the Scheduling of Checkpoints in Desktop Grids , 2011, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.

[23]  Franck Cappello,et al.  Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[24]  Dror G. Feitelson,et al.  Workload Modeling for Performance Evaluation , 2002, Performance.

[25]  Jianfeng Zhan,et al.  LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems , 2010, 2012 IEEE 31st Symposium on Reliable Distributed Systems.

[26]  Robert L. Wolpert,et al.  Statistical Inference , 2019, Encyclopedia of Social Network Analysis and Mining.

[27]  Narayan Desai,et al.  Co-analysis of RAS Log and Job Log on Blue Gene/P , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[28]  Zhiling Lan,et al.  Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.

[29]  Franck Cappello,et al.  Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[30]  Franck Cappello,et al.  Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[31]  John W. Young,et al.  A first order approximation to the optimum checkpoint interval , 1974, CACM.

[32]  Jon Stearley,et al.  A State-Machine Approach to Disambiguating Supercomputer Event Logs , 2012, MAD.