Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems

HPC systems are complex machines that generate a huge volume of system state data called "events". Events are generated without following a general, consistent rule, and the different hardware and software components of such systems have different failure rates. Distinguishing between normal system behaviour and faulty situations relies on event analysis. Being able to quickly detect deviations from normality is essential for system administration and is the foundation of fault prediction. As HPC systems continue to grow in size and complexity, mining event flows becomes more challenging, and with the upcoming 10 Petaflop systems there is a lot of interest in this topic. Current event mining approaches do not take into consideration the specific behaviour of each type of event and, as a consequence, fail to analyze events according to their characteristics. In this paper we propose a novel way of characterizing the normal and faulty behaviour of the system by using signal analysis concepts. Together, the analysis modules form ELSA (Event Log Signal Analyzer), a toolkit whose purpose is to model the normal flow of each state event during an HPC system's lifetime and how that flow is affected when a failure hits the system. We show that the extracted models provide an accurate view of the system output, which improves the effectiveness of proactive fault tolerance algorithms. Specifically, we implemented a filtering algorithm and a short-term fault prediction methodology based on the extracted models and tested them against real failure traces from a large-scale system. We show that by analyzing each event according to its specific behaviour, we get a more realistic overview of the entire system.
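
To make the idea of treating an event type as a signal concrete, the following is a minimal sketch and not ELSA's actual implementation. It assumes that an event's signal is simply its occurrence count per fixed time window, and that a deviation is a window whose count lies far from the median (a median/MAD rule chosen here only for illustration). The function names `event_signal` and `flag_deviations`, the window size, and the threshold `k` are all hypothetical.

```python
import math
from collections import Counter
from statistics import median


def event_signal(timestamps, window, duration):
    """Turn the occurrence times (in seconds) of one event type into a
    signal: the number of occurrences in each consecutive time window."""
    n_windows = math.ceil(duration / window)
    counts = Counter(int(t // window) for t in timestamps if 0 <= t < duration)
    return [counts.get(i, 0) for i in range(n_windows)]


def flag_deviations(signal, k=3.0):
    """Return the indices of windows whose count differs from the median
    by more than k times the median absolute deviation (MAD).

    Assumption for this sketch: when the MAD is zero (a perfectly regular
    signal), a scale of 1.0 is used so that small jitter is not flagged.
    """
    med = median(signal)
    mad = median([abs(x - med) for x in signal])
    scale = mad if mad > 0 else 1.0
    return [i for i, x in enumerate(signal) if abs(x - med) > k * scale]
```

For instance, a periodic event that normally fires twice per window yields a flat signal, and a burst of extra occurrences in one window is the only index that `flag_deviations` reports. A silent event or a noisy one would need a different normality model, which is the kind of per-event treatment the abstract argues for.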
