Spatiotemporal Real-Time Anomaly Detection for Supercomputing Systems

The demands of increasingly large scientific application workflows create a need for more powerful supercomputers. As the scale of supercomputing systems has grown, failure prediction has become an increasingly critical part of fault tolerance research, since predicting system failures allows checkpoints to be saved in advance and thereby improves performance. We propose a real-time failure detection algorithm that adopts an event-based prediction model. The prediction model is a convolutional neural network that uses both traditional event attributes and additional spatiotemporal features. We present a case study that applies the proposed method to six years of reliability, availability, and serviceability (RAS) event logs recorded by Mira, a Blue Gene/Q supercomputer at Argonne National Laboratory. The case study shows that our failure prediction model is not limited to predicting the occurrence of failures in general: it can also accurately detect specific types of critical failures, such as coolant and power problems, within reasonable lead time ranges. The proposed method achieves an F1 score of 0.56 for general failures, 0.97 for coolant failures, and 0.86 for power failures.
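
To make the reported metric concrete, the following is a minimal sketch of how an F1 score could be computed for event-based failure prediction with a lead-time window. It is an illustration only, not the evaluation procedure used in the case study: the function name `f1_with_lead_time`, the parameters `min_lead` and `max_lead`, and the matching rule (a warning is correct if a failure follows it within the window) are all assumptions made for this example.

```python
from bisect import bisect_left


def f1_with_lead_time(warning_times, failure_times, min_lead, max_lead):
    """Return (precision, recall, f1) for warnings scored against failures.

    Assumed matching rule (hypothetical, not taken from the paper):
    a failure at time t is "caught" if some warning w satisfies
    min_lead <= t - w <= max_lead, and a warning is "correct" if some
    failure falls inside that same window after it.
    """
    warnings = sorted(warning_times)
    failures = sorted(failure_times)

    def any_in_range(sorted_values, low, high):
        # True if at least one value lies in the closed interval [low, high].
        i = bisect_left(sorted_values, low)
        return i < len(sorted_values) and sorted_values[i] <= high

    correct_warnings = sum(
        any_in_range(failures, w + min_lead, w + max_lead) for w in warnings
    )
    caught_failures = sum(
        any_in_range(warnings, t - max_lead, t - min_lead) for t in failures
    )

    precision = correct_warnings / len(warnings) if warnings else 0.0
    recall = caught_failures / len(failures) if failures else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up timestamps in seconds, for illustration only.
    p, r, f1 = f1_with_lead_time([10, 50, 200], [40, 300], min_lead=5, max_lead=60)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a high score such as the 0.97 reported for coolant failures requires both few false warnings and few missed failures under whatever matching rule is applied.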
